Thibauld - Imagination and Execution -

24Sep/100

postgresql, date input format and black magic

RT @thibauld postgresql, date input format and black magic

I hate when I have the impression that black magic took the control of my PC. If you're a developer, you know it happens sometimes... when something is obviously wrong but you just can't figure out *what* is wrong. Hopefully, we work in IT, a field where we can reasonably assume that *everything* happens for a reason. So after digging a while, you eventually end up finding where the issue comes from.

It happened this week while trying to figure out why Postgresql had a different behavior on apparently completely similar platforms. I'm born April 1st, let's say I want to save my birthday in a table, I'd go with the following SQL statement :
INSERT INTO birthdays (date) VALUES ('01-04-1981');

As I'm french, each time I deal with dates, I tend to verify that they were understood correctly. Indeed, the above date format was french ('dd-mm-YYYY') but postgresql could have interpreted as a US date format ('mm-dd-YYY'). So let's check with the following statement:
SELECT date_part('month',date) FROM birthdays;

Here comes the problem. On 2 different machines with (apparently) the same setup, this statement would return 2 different results. One would return '04' (correct) and the other one would return '01' (wrong).

I took me a while to find out that postgresql had datestyle option (in postgresql.conf) which defines the rule to interpret ambiguous date input values.

So if datestyle is set to 'iso, dmy', it means dates will be interpreted as 'date-month-year' and if it is set to 'iso-mdy', dates will be interpreted as 'month-date-year'. It is that simple... once you found it :)

4Jan/096

Web application step 5: implement a “relevant” search engine

RT @thibauld Web application step 5: implement a “relevant” search engine

For this very first post of 2009, let me first wish you all the best for the coming year! I worked *a lot* in 2008 and 2009 will be the year where all this underground work will finally make sense so I'm really looking forward to it... and I hope you do too! Let's make 2009 rock :)

Let's now go back to the subject of this post! Search is a very important feature for most web applications (read this book in case you need to be convinced) but, unfortunately, it is also one of the most overlooked feature. Why ? Probably because implementing a good search is a difficult thing: Placing a search form at the top right of your web app is not difficult, having it display some results when a user searches for something is not a problem either... but displaying search results which are actually relevant and useful for the end user is much harder!

Implementing a relevant search engine for your web app is a task that cannot be delegated to a third party web service or to an external library, you have to do it by yourself for a simple reason: a search can only be relevant if it leverages the information contained in your business objects with intelligence. This is why technologies like the new Google Custom Search will not help you to implement a relevant search engine. Google Custom Search is ok for your blog or any other editorial website... but what if you want, for example, your users to be able to look for the freelancers that have the longest experience with SAP FI CO ?

Implementing a good search begins with pen and paper:

  • On which business objects can the end user perform a search ?
  • And for each business objects:
    • How to rank results by relevancy for any given search ?
    • How should I weight each information ?
    • What information do I need to rank my results ?
    • Is this information already present in my database ? Where ?
    • Which filters would be relevant ?
    • What are the relevant way to sort the search results ?

Once you've thought about all this, you did the most important part... but not the most difficult. Now let's implement it!

The main problem you'll be facing when implementing a relevant search engine is performance. Being relevant means processing a lot of information, which is very time consuming, even for today's fast processors and hard drives. This is why you should let the database handle most of the searching work, using well constructed indexes and queries, and avoid as much as possible doing work outside the database.

Use full-text search

Obviously, using the LIKE sql operator (or any other regexp matching...) will not help much in implementing a good search. It is way too slow because LIKE queries cannot use table indexes and requires a full sequential scan of your table for each query. The only exception to this is when you know the first characters you're looking for (i.e. WHERE name LIKE 'query%'), in this case, the database will leverage the btree index on column name (if available) and you're likely to get acceptable performance with very small and targeted queries.

In all other cases, you will have to use fulltext search mecanisms. Fulltext search refers to special type of indexes which are specialized in string pattern matching. Fulltext search has been implemented completely differently in MySQL and PostgreSQL. I will now give you an overview on how to use fulltext search with both databases.

Full-text search with MySQL

As I've said in a previous post, MySQL's fulltext search mechanism focuses on ease-of-use. Indeed, with MySQL you just have to create a 'fulltext' index on the relevant columns to enable fulltext search. Example:

CREATE FULLTEXT INDEX description ON experiences (description);

It creates a fulltext index named description on the column description in the table experiences. To leverage this new index, you need to write your queries using the following syntax:

SELECT id, MATCH(description) AGAINST('') AS score
FROM experiences
WHERE MATCH(description) AGAINST('')
ORDER BY score DESC
OFFSET 0 LIMIT 10;

Using the fulltext index, MySQL will quickly determine the best matching descriptions for the input query. For each row, MySQL returns its relevancy score. You can influence the calculation of this score by using parameters in the AGAINST() operator.

As you can see, it is really easy to implement fulltext search in MySQL... maybe too easy as there are common search related issues that are not addressed. For example, a very classical requirement is perform a search on multiple columns of a table (i.e. title,summary and description) and with different weights assigned to each column.. Now how do I implement that in MySQL? Well... I'm sorry to announce that I did not figure out a beautiful way to address the issue. On Freelance Business Club, we solved the issue by constructing queries like the following one:

SELECT e.id, e.title, e.desc, SUM(s.score) AS score
FROM experiences AS e
INNER JOIN (
SELECT id, <title_weight>+MATCH(title) AGAINST (query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(title) AGAINST ('query' IN BOOLEAN MODE)
UNION ALL
SELECT id, <summary_weight>+MATCH(summary) AGAINST('query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(summary) AGAINST('query' IN BOOLEAN MODE)
UNION ALL
SELECT id, <description_weight>+MATCH(description) AGAINST('query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(description) AGAINST('query' IN BOOLEAN MODE)
) AS s ON a.id=s.id
ORDER BY score DESC
OFFSET 0 LIMIT 10;

Even if I'm not particularly proud of this query, it gives us the flexibility we were looking for (searching different columns with different weight for each) and offers at the same time rather good performance. I was really hoping MySQL would offer a standard way to perform this kind of search but I was unable to found an 'official' solution. I would be really interested if someone here had a more elegant / performant solution, so please let me know!

It is important to note that you cannot use fulltext indexes in MySQL on InnoDB tables, you can only use it with MyISAM tables. For more information, please check MySQL documentation for full-text search.

Full-text search with PostgreSQL

Fulltext search support has been integrated in PostgreSQL official distribution since v8.3. (before this version, it was distributed as a plugin). I personally think that PostgreSQL implementation of fulltext search is way better than the MySQL one. It may be a little harder to apprehend at first sight but it is really worth the effort!

Like MySQL, PostgreSQL provides a special index (called GIN) optimized for fulltext searching. But to leverage the real power of this index, you will need to store the text to be searched in a special column of type 'tsvector'. An example will make it easier to understand:

CREATE TABLE experiences_desc (
id_exp integer NOT NULL REFERENCES experiences(id) ON DELETE CASCADE,
locale regconfig NOT NULL DEFAULT 'english',
title text NOT NULL,
summary text NOT NULL DEFAULT '',
description text NOT NULL DEFAULT '',
search_field tsvector
);
INSERT INTO experiences (id_exp,title,summary,description,search_field) VALUES (
1,
'title',
'summary',
'description',
setweight(to_tsvector('english','title'),'A') ||
setweight(to_tsvector('english','summary'),'B') ||
setweight(to_tsvector('english','description'),'C')
);
CREATE INDEX experiences_idx ON experiences USING gin(search_field);

What is important to note here is the construction of the tsvector search_field ('||' is the string concatenation operator):

  • A tsvector tokenizes your string according to the given locale. This is why you need to pass it the locale in which your string is written. The function to_tsvector() uses a dictionary in the given locale to tokenize the input string. This way, it tokenizes intelligently the string (by detecting plurals, verbs...) and prevents common words from being tokenized. It means that the end user will not have to enter the exact words to have your search engine return relevant results.
  • You can assign different weights to the different parts of your tsvector. PostgreSQL defines 4 weights (1 >= A > B > C > D >= 0) that you can assign to a tsvector and that will help PostgreSQL rank your search results automatically and also filter your search by limiting your search to a certain weight category. Of course, you can change the default weight values assign to each letter.

Now to perform a search, you can use the following:

SELECT title, summary, description, ts_rank_cd(search_field, to_tsquery('english','search query')) AS score
FROM experiences_desc
WHERE to_tsquery('english','search query') @@ search_field
ORDER BY score DESC
OFFSET 0 LIMIT 10;

Important things to note are:

  • you need to use to_tsquery() to prepare your search query and the operator @@ to make it search using the specific GIN index.
  • ranking is done automatically by PostgreSQL thanks to the ts_rank_cd() operator. Several other ranking algorithms are available (using different functions / parameters )

This is really just a quick overview of the tools PostgreSQL provides to perform fulltext search. You can also ask PostgreSQL to highlight the matching patterns in the search results, gather documents statistics... it is really flexible and powerful! To find out more on these features, here's a link to PostgreSQL 8.3 Full Text Search documentation. Needless to say that I find PostgreSQL to be much more powerful than MySQL on the fulltext search aspect.

Use SQL jointures with caution

In order to keep performance acceptable, you should also try to keep the number of SQL jointures in the search query as low as possible. Also, don't forget to perform your jointures after you have found your results and not before. To illustrate this, consider the 2 following queries, the first one:

SELECT e.id, e.title, e.desc, SUM(s.score) AS score
FROM experiences AS e
INNER JOIN (
SELECT id, MATCH(title) AGAINST (query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(title) AGAINST ('query' IN BOOLEAN MODE)
) AS s ON a.id=s.id
INNER JOIN some_other_table AS sot ON (sot.id=s.id)
ORDER BY score DESC
OFFSET 0 LIMIT 10;

and the second one:

SELECT e.id, e.title, e.desc, SUM(s.score) AS score
FROM experiences AS e
INNER JOIN (
SELECT id, MATCH(title) AGAINST ('query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(title) AGAINST ('query' IN BOOLEAN MODE)
ORDER BY score DESC
OFFSET 0 LIMIT 10;
) AS s ON a.id=s.id
INNER JOIN some_other_table AS sot ON (sot.id=s.id)

The 2 queries give the exact same results but the second one will take only a tiny fraction of the time required by the second one to perform the search. This may seem obvious here as the query is small, but this is the kind of stupid errors that can slip easily in bigger queries (especially if queries are dynamically built).

Dedicated search tables

If you still cannot reach decent performance with all the above, you might have too many jointures in your query and should consider building a dedicated table to perform your searches. This method might also fit your needs in case you want full text search on an InnoDB table. The problem with these dedicated search tables is that they are never completely up-to-date. So it's up to you to know, given your business requirements, if you can use this method or not. We use this method in Freelance Business Club, this is why it may take up to 1 day for new members to appear in members search results.

Limit computation in your search queries

To achieve good performance, you should also limit as much as possible the use of user defined functions or calculus in your queries. Example: If you want to search the most experienced freelancers on SAP, then the total number of months a guy spent working on SAP is a relevant indicator. The problem is that you only have a start date and an end date at your disposal. In this case, instead of calculating the total number of month directly in your query, you'd better use a batch process and set up a cron job which would pre-calculate the total number of month for each experience every <put_a_frequence_here>. This way, it will not slow down every one of your search queries!

Filter and sort your search

It is very convenient for the end user to be able to filter and sort the search results. You should think about this possibility when building your search engine. Sorting is quite easy but beware of filtering. In the best case, you will just have to add another constraint in the WHERE clause or in an existing jointure but some filters might require you to add another jointure to your search which might slow it down.

Calculate the total number of results for your search

Finally, in order to paginate your search results well, you probably need the total number of results for a given search query. To get this value, I know no other method than performing a new SQL search query (without any ORDER BY / OFFSET / LIMIT and without any useless jointure) with a brutal COUNT(*) in the SELECT statement. If anybody has another / better method, please let me know!

To avoid this additional query (COUNT(*) can be a very time-consuming query!), some sites just don't give the total of number of results. See Facebook for example, they don't give the exact number of results and their pagination displays '1 2 3 next'. I guess this is one the reasons (but this is just pure speculation...).

Search implementation methodology

I wanted to finish this long post with the method I use to implement search. In my presentation layer, I build a query array which looks like this :
array(
'text'=>'<search_query>',
'filter'=>array(
'filter_type1'=>'<filter_value1',
'filter_type2'=>'<filter_value2>'
),
'order_by'=>'<order_by>',
'limit'=> <limit>,
'start'=> <start>
);

This query structure is the same no matter what type of search is performed, this is very flexible and facilitate code reuse enormously.
Then, in the business layer, I have a method <object_type>_search() which:

  • transform the array query values into values understandable by the data layer (when necessary).
  • calls <object_type>_dbSearch()
  • computes the search results to return a result array which looks like:
    array(
    'results'=> <search_results>,
    'nb_pages'=> <nb_pages>,
    'nb_results'=> <nb_results>
    );

Finally, in the data layer, I have a method  <object_type>_dbSearch() which calls a method <object_type>_buildSearchQuery() twice : the first time to build the search query and the second time to build the search count query (the one that counts the total number of results for the search query). It returns the raw search results.

This was definitely a long post but I hope nevertheless that it will be intelligible and useful for you!

See you next post!

15Nov/084

Web application implementation step 1: choose your weapons!

RT @thibauld Web application implementation step 1: choose your weapons!

During the next series of posts, I will describe the different steps involved in building a web application. These posts will only deal with the technical aspects of implementing a web application that has already been designed. Web application design is another subject which I'll develop in another series of posts.

I'll take the new FBC website as an example. It is a good example as we almost had to start from scratch to build the new version. Indeed, the architecture in place was extremely inflexible, the worst being :

  • Webpage layout relied on field values in SQL tables
  • 1 url <=> 1 php page containing everything from HTML to SQL

Functionally speaking, their website was working well but it had been at the expense of countless hours spent testing and testing and retesting that every process was functional. But now they were stuck... Only changing the graphical chart would have needed to go change *every* php file and restest the whole thing entirely... When I saw that, it was obvious that it would take us less time to rebuild the entire architecture from scratch and then add the new features than trying to work on this extremely rigid and non evolutive code organization. Of course, we would need to reutilize pieces from the old code in order not to re-implement the business process, but it was pretty much everything we would be able to reutilize.

So the first step when building a web application (and not a website) is choosing your weapons, that is the technology stack you're going to build your application upon :

  • Which OS are you going to use ?
  • Which webserver ?
  • Which database ?
  • Which programming language and framework ?

Choosing the OS

I'll not elaborate on this one, use Linux. The only reason I can see for not using Linux are :

  1. you know nothing about Linux and don't want to learn
  2. you were asked explicitely not to use Linux
  3. there are obvious reasons which require your app run on another system

If none of the above matches, choose Linux. And among all Linuxes, if you have no preference, I'd personally go for a Ubuntu Server LTS (Long Term Support) which is free and provides security updates during 5 years. Just check that the LTS does not end next year :) The current Ubuntu Server LTS is version 8.04 (code name Hardy Heron).

Choosing the webserver

Again, I'll not elaborate much: Apache 2 as it has proven capabilities to run high traffic websites, great documentation (official and non official), lots of modules availables... etc... There are other webservers but, unless you want to restrain it to very specific tasks, use Apache to serve your dynamic pages.

Choosing the database

I assume you want an Open Source database here. It's hard to choose here between MySQL and PostgreSQL . PostgreSQL is probably more powerful but it is also probably a little harder to get into. For example, the fulltext search in MySQL in a lot easier to use than in PostgreSQL but PostgreSQL's fulltext search allows you to perform more complex queries. In the same spirit, MySQL provides handy but limited functions (ex: GROUP_CONCAT) where PostgreSQL offers you less handy but more powerful equivalents (ex: array).

Both database has a good official documentation but I think MySQL benefits from a lot more unofficial documentation compared to PostgreSQL. In terms of proven capabilities, both have examples of usage in very demanding contexts...

Finally, the choice is up to you BUT if you use MySQL, do not forget to use InnoDB tables instead of MyISAM tables. By default, tables are created using MyISAM. Using MyISAM table can be a good choice but it must be a known choice. Indeed, MyISAM table does not support FOREIGN KEYS nor does it support sql transactions so it is very limited and will inevitably result in added programming overhead if you build a complex web application upon MyISAM tables.

Choosing a framework

Rule 1: be skeptical

First, you should not use a framework for the sake of using a framework. Instead, you should really think at what you expect from your framework. I'm personally a bit skeptical about a lot of frameworks because, in my opinion, a lot of them end up being bloated, trying to reinvent the wheel and with finally a limited number of really useful functionalities. Keep in mind that using a framework is a binding investment: First it needs time to learn how to use it efficiently and, second, it generally deeply affects your code. So you'd better choose your framework well if you don't want to loose a lot of time !

Rule 2: know why you need a framework

Beware of frameworks that try to provide a solution for every aspect of your application as they are likely to belong to the "bloated" category! I personnally prefer a framework which focuses on doing 1 thing right than a framework that tries to simplify everything. For each functionality, ask yourself : "Do I really need my framework to do this ? Will it really help me be faster in my work ?

Rule 3: Beware of marketing crap

Here are examples of marketing functionalities I always consider with skepticism :

  • i18n support (internationalization): In which way is it better than gettext ?
  • RESTful URLs: You mean Apache URL Rewriting (and optionally a routing index.php) ?
  • Easy Ajax: I did not know Ajax was hard to implement on the server side! Are you looking for a Javascript framework ?
  • Full database abstraction: This one must be considered really really carefully. Database design is one of the most crucial part when you build a web application. In the same manner, how you query your database has an enormous impact on your application performance. This is why I am really really skeptical when it comes to delegating the database management to the framework. Some frameworks even create the database for you! Unless you're talking of a very simple web app, I think it just does not make sense... The only explanation I can think of is that a lot of web developers have a very little knowledge of SQL and, for them, it's easier/better to have the framework handle the database. On the other hand, it is also true that that database abstraction (especially ORM) are useful to handle all the simple queries. As a conclusion, concerning database abstraction, I would only use one if I'm sure that it's flexible enough to :
    1. Work on top of the database schema I designed
    2. Let me perform raw SQL queries when needed

I see a framework as a toolbox to help you concentrate your efforts on implementing real business code. So what can you expect from a framework ? A templating system, a caching system, a testing / debugging framework... All of these are very useful functionalities that a framework can offer because coding them would sure be a waste of time!

Choosing a programming language

There are a lot of very good languages for web programming (PHP, Python, Ruby...) and I can see a few differences:

  • Framework compatibility: Once you've chosen your framework, you probably already know the programming language you'll be using.
  • Quality and quantity of documentation: I think PHP has the best and most comprehensive documentation, followed by Python and Ruby.
  • Library exhaustivity: Again, in my opinion, PHP has the deepest library available, followed by Python and Ruby. You should check beforehand that a language has all the required library for the app you want to implement. For example, if you plan to use SOAP based webservices, you probably want to avoid Python as (last time I checked) it has one of the most limited SOAP library. PHP, on the contrary, has a good gsoap compatible SOAP implementation.
  • Familiarity with the language: One of the most important criteria of course! How productive are you with each language ? Even if all of these languages are relatively easy to get into, mastering a language takes time... so if you're a Ruby guru and have never used PHP, then you're probably better off with Ruby...
  • Language beauty: Sorry guys but this is unfortunately NOT a good criteria. Had it been, then I would have rated Ruby first, then Python and then.... far behind, PHP :)

To conclude this long post, I think the most important rule is :

Use production ready technologies which have proven to be performant in similar contexts

Each time you have to choose a technology stack, no matter what it is and what bloggers and/or developers tell you about it, find relevant examples of web applications using this very stack to be sure, provided you do things well, the chosen stack will be able to fit your needs (performance, scalability...).

Stay tuned for the next post where I'll talk about the technology stack I've chosen for FBC!