Thibauld - Imagination and Execution -

26Jan/091

Introducing allmyapps!

RT @thibauld Introducing allmyapps!

If you follow this blog, you probably know that, last august, I accepted to help develop freelancebusinessclub's new website. Three months later, on November 3rd, the new website was launched! Since then, after having squashed the small little bugs left, I started working on another project, my project. It's something that I've been thinking about a looooong time ago, it took me several years (really!) to mature it and, having completely recovered my technical skills, I thought it was the right time to make it become a reality.

Feeling that the market has reached a point where my project finally makes sense, I'm proud to introduce you to my new baby : allmyapps, the first linux app store designed to help users find and install new applications on their Ubuntu Linux system.

To fully understand this project, you need to know that I'm a long time Linux user and evangelist. I started using Linux in 1998 (with a slackware) while still in high school and I've always been convinced that Linux would find its way to the desktop sooner or later. There are 2 factors in particular that make me think that this time has come:

  • Canonical and Mark Shuttleworth's vision is, by far, the most important point. In my opinion, the biggest reason of the Linux Desktop relative failure til now is lack of willingness from the major Linux players. Let's face it... who, today, really works to make the Linux Desktop a reality? Red Hat? They purely and simply abandonned this market. Novell? They only considers the enterprise desktop... The only company really working towards making the Linux Desktop a reality is Canonical, Mark Shuttleworth (its founder) being committed to put the Linux Desktop into everybody's hands. And it's not only words! In the past months, Canonical has hired usability experts, desktop experience engineers... and the company is really working, both technically and commercially, to make the Linux Desktop happen. Unlike everybody else, they're not waiting for the market to move but are working to make it move!
  • The netbook opportunity. Did you see the laptop market last year? People are fond of netbooks: 14 millions of netbooks were sold in 2008, with around 30% of them preinstalled with Linux! For the first time, normal people (unlike you and me) are using Linux and loving it. Do not trust the rumor that says that the return rate is higher with linux netbooks as it is simply not true. Linux price and versatility is a strong asset when it comes to run machines with low spec and sell them at a low price. Netbooks are a great market opportunity for the Linux desktop! For the first time, "normal users" are buying machines with a Linux pre-installed on it.

With the rise of the netbooks, more and more users today are confronted to a linux desktop. While using the system in itself is not a problem for most users, I've experienced that people tend to have a problem when it comes to finding new applications to install on their system. On the contraty to Windows systems, Linux systems historically have a very easy, quick and reliable installation mechanism, thanks to their centralized repositories and their associated package management systems. The only drawback of this mechanism is that it is different from the one users are used to on Windows... and it's a big drawback because users are getting lost... and lost users quickly turn into angry users! And the last thing you want is your users becoming angry! With allmyapps, we want to address this very specific issue.

I will probably not be blogging much about allmyapps on this blog. If you want to follow the development of allmyapps, I advise you to subscribe to allmyapps' blog. I will continue to write posts on my personal blog (this blog), I already have a few posts waiting about web application debugging / profiling and web application performance tuning... so stay tuned!

To conclude this post, I just wanted to say that we will be attending Recked in Amsterdam on monday 26th of january (today!) and then FOSDEM in Brussels on february 6th and 7th. If you'll be there too and want to meet, just leave me a comment!

4Jan/096

Web application step 5: implement a “relevant” search engine

RT @thibauld Web application step 5: implement a “relevant” search engine

For this very first post of 2009, let me first wish you all the best for the coming year! I worked *a lot* in 2008 and 2009 will be the year where all this underground work will finally make sense so I'm really looking forward to it... and I hope you do too! Let's make 2009 rock :)

Let's now go back to the subject of this post! Search is a very important feature for most web applications (read this book in case you need to be convinced) but, unfortunately, it is also one of the most overlooked feature. Why ? Probably because implementing a good search is a difficult thing: Placing a search form at the top right of your web app is not difficult, having it display some results when a user searches for something is not a problem either... but displaying search results which are actually relevant and useful for the end user is much harder!

Implementing a relevant search engine for your web app is a task that cannot be delegated to a third party web service or to an external library, you have to do it by yourself for a simple reason: a search can only be relevant if it leverages the information contained in your business objects with intelligence. This is why technologies like the new Google Custom Search will not help you to implement a relevant search engine. Google Custom Search is ok for your blog or any other editorial website... but what if you want, for example, your users to be able to look for the freelancers that have the longest experience with SAP FI CO ?

Implementing a good search begins with pen and paper:

  • On which business objects can the end user perform a search ?
  • And for each business objects:
    • How to rank results by relevancy for any given search ?
    • How should I weight each information ?
    • What information do I need to rank my results ?
    • Is this information already present in my database ? Where ?
    • Which filters would be relevant ?
    • What are the relevant way to sort the search results ?

Once you've thought about all this, you did the most important part... but not the most difficult. Now let's implement it!

The main problem you'll be facing when implementing a relevant search engine is performance. Being relevant means processing a lot of information, which is very time consuming, even for today's fast processors and hard drives. This is why you should let the database handle most of the searching work, using well constructed indexes and queries, and avoid as much as possible doing work outside the database.

Use full-text search

Obviously, using the LIKE sql operator (or any other regexp matching...) will not help much in implementing a good search. It is way too slow because LIKE queries cannot use table indexes and requires a full sequential scan of your table for each query. The only exception to this is when you know the first characters you're looking for (i.e. WHERE name LIKE 'query%'), in this case, the database will leverage the btree index on column name (if available) and you're likely to get acceptable performance with very small and targeted queries.

In all other cases, you will have to use fulltext search mecanisms. Fulltext search refers to special type of indexes which are specialized in string pattern matching. Fulltext search has been implemented completely differently in MySQL and PostgreSQL. I will now give you an overview on how to use fulltext search with both databases.

Full-text search with MySQL

As I've said in a previous post, MySQL's fulltext search mechanism focuses on ease-of-use. Indeed, with MySQL you just have to create a 'fulltext' index on the relevant columns to enable fulltext search. Example:

CREATE FULLTEXT INDEX description ON experiences (description);

It creates a fulltext index named description on the column description in the table experiences. To leverage this new index, you need to write your queries using the following syntax:

SELECT id, MATCH(description) AGAINST('') AS score
FROM experiences
WHERE MATCH(description) AGAINST('')
ORDER BY score DESC
OFFSET 0 LIMIT 10;

Using the fulltext index, MySQL will quickly determine the best matching descriptions for the input query. For each row, MySQL returns its relevancy score. You can influence the calculation of this score by using parameters in the AGAINST() operator.

As you can see, it is really easy to implement fulltext search in MySQL... maybe too easy as there are common search related issues that are not addressed. For example, a very classical requirement is perform a search on multiple columns of a table (i.e. title,summary and description) and with different weights assigned to each column.. Now how do I implement that in MySQL? Well... I'm sorry to announce that I did not figure out a beautiful way to address the issue. On Freelance Business Club, we solved the issue by constructing queries like the following one:

SELECT e.id, e.title, e.desc, SUM(s.score) AS score
FROM experiences AS e
INNER JOIN (
SELECT id, <title_weight>+MATCH(title) AGAINST (query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(title) AGAINST ('query' IN BOOLEAN MODE)
UNION ALL
SELECT id, <summary_weight>+MATCH(summary) AGAINST('query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(summary) AGAINST('query' IN BOOLEAN MODE)
UNION ALL
SELECT id, <description_weight>+MATCH(description) AGAINST('query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(description) AGAINST('query' IN BOOLEAN MODE)
) AS s ON a.id=s.id
ORDER BY score DESC
OFFSET 0 LIMIT 10;

Even if I'm not particularly proud of this query, it gives us the flexibility we were looking for (searching different columns with different weight for each) and offers at the same time rather good performance. I was really hoping MySQL would offer a standard way to perform this kind of search but I was unable to found an 'official' solution. I would be really interested if someone here had a more elegant / performant solution, so please let me know!

It is important to note that you cannot use fulltext indexes in MySQL on InnoDB tables, you can only use it with MyISAM tables. For more information, please check MySQL documentation for full-text search.

Full-text search with PostgreSQL

Fulltext search support has been integrated in PostgreSQL official distribution since v8.3. (before this version, it was distributed as a plugin). I personally think that PostgreSQL implementation of fulltext search is way better than the MySQL one. It may be a little harder to apprehend at first sight but it is really worth the effort!

Like MySQL, PostgreSQL provides a special index (called GIN) optimized for fulltext searching. But to leverage the real power of this index, you will need to store the text to be searched in a special column of type 'tsvector'. An example will make it easier to understand:

CREATE TABLE experiences_desc (
id_exp integer NOT NULL REFERENCES experiences(id) ON DELETE CASCADE,
locale regconfig NOT NULL DEFAULT 'english',
title text NOT NULL,
summary text NOT NULL DEFAULT '',
description text NOT NULL DEFAULT '',
search_field tsvector
);
INSERT INTO experiences (id_exp,title,summary,description,search_field) VALUES (
1,
'title',
'summary',
'description',
setweight(to_tsvector('english','title'),'A') ||
setweight(to_tsvector('english','summary'),'B') ||
setweight(to_tsvector('english','description'),'C')
);
CREATE INDEX experiences_idx ON experiences USING gin(search_field);

What is important to note here is the construction of the tsvector search_field ('||' is the string concatenation operator):

  • A tsvector tokenizes your string according to the given locale. This is why you need to pass it the locale in which your string is written. The function to_tsvector() uses a dictionary in the given locale to tokenize the input string. This way, it tokenizes intelligently the string (by detecting plurals, verbs...) and prevents common words from being tokenized. It means that the end user will not have to enter the exact words to have your search engine return relevant results.
  • You can assign different weights to the different parts of your tsvector. PostgreSQL defines 4 weights (1 >= A > B > C > D >= 0) that you can assign to a tsvector and that will help PostgreSQL rank your search results automatically and also filter your search by limiting your search to a certain weight category. Of course, you can change the default weight values assign to each letter.

Now to perform a search, you can use the following:

SELECT title, summary, description, ts_rank_cd(search_field, to_tsquery('english','search query')) AS score
FROM experiences_desc
WHERE to_tsquery('english','search query') @@ search_field
ORDER BY score DESC
OFFSET 0 LIMIT 10;

Important things to note are:

  • you need to use to_tsquery() to prepare your search query and the operator @@ to make it search using the specific GIN index.
  • ranking is done automatically by PostgreSQL thanks to the ts_rank_cd() operator. Several other ranking algorithms are available (using different functions / parameters )

This is really just a quick overview of the tools PostgreSQL provides to perform fulltext search. You can also ask PostgreSQL to highlight the matching patterns in the search results, gather documents statistics... it is really flexible and powerful! To find out more on these features, here's a link to PostgreSQL 8.3 Full Text Search documentation. Needless to say that I find PostgreSQL to be much more powerful than MySQL on the fulltext search aspect.

Use SQL jointures with caution

In order to keep performance acceptable, you should also try to keep the number of SQL jointures in the search query as low as possible. Also, don't forget to perform your jointures after you have found your results and not before. To illustrate this, consider the 2 following queries, the first one:

SELECT e.id, e.title, e.desc, SUM(s.score) AS score
FROM experiences AS e
INNER JOIN (
SELECT id, MATCH(title) AGAINST (query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(title) AGAINST ('query' IN BOOLEAN MODE)
) AS s ON a.id=s.id
INNER JOIN some_other_table AS sot ON (sot.id=s.id)
ORDER BY score DESC
OFFSET 0 LIMIT 10;

and the second one:

SELECT e.id, e.title, e.desc, SUM(s.score) AS score
FROM experiences AS e
INNER JOIN (
SELECT id, MATCH(title) AGAINST ('query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(title) AGAINST ('query' IN BOOLEAN MODE)
ORDER BY score DESC
OFFSET 0 LIMIT 10;
) AS s ON a.id=s.id
INNER JOIN some_other_table AS sot ON (sot.id=s.id)

The 2 queries give the exact same results but the second one will take only a tiny fraction of the time required by the second one to perform the search. This may seem obvious here as the query is small, but this is the kind of stupid errors that can slip easily in bigger queries (especially if queries are dynamically built).

Dedicated search tables

If you still cannot reach decent performance with all the above, you might have too many jointures in your query and should consider building a dedicated table to perform your searches. This method might also fit your needs in case you want full text search on an InnoDB table. The problem with these dedicated search tables is that they are never completely up-to-date. So it's up to you to know, given your business requirements, if you can use this method or not. We use this method in Freelance Business Club, this is why it may take up to 1 day for new members to appear in members search results.

Limit computation in your search queries

To achieve good performance, you should also limit as much as possible the use of user defined functions or calculus in your queries. Example: If you want to search the most experienced freelancers on SAP, then the total number of months a guy spent working on SAP is a relevant indicator. The problem is that you only have a start date and an end date at your disposal. In this case, instead of calculating the total number of month directly in your query, you'd better use a batch process and set up a cron job which would pre-calculate the total number of month for each experience every <put_a_frequence_here>. This way, it will not slow down every one of your search queries!

Filter and sort your search

It is very convenient for the end user to be able to filter and sort the search results. You should think about this possibility when building your search engine. Sorting is quite easy but beware of filtering. In the best case, you will just have to add another constraint in the WHERE clause or in an existing jointure but some filters might require you to add another jointure to your search which might slow it down.

Calculate the total number of results for your search

Finally, in order to paginate your search results well, you probably need the total number of results for a given search query. To get this value, I know no other method than performing a new SQL search query (without any ORDER BY / OFFSET / LIMIT and without any useless jointure) with a brutal COUNT(*) in the SELECT statement. If anybody has another / better method, please let me know!

To avoid this additional query (COUNT(*) can be a very time-consuming query!), some sites just don't give the total of number of results. See Facebook for example, they don't give the exact number of results and their pagination displays '1 2 3 next'. I guess this is one the reasons (but this is just pure speculation...).

Search implementation methodology

I wanted to finish this long post with the method I use to implement search. In my presentation layer, I build a query array which looks like this :
array(
'text'=>'<search_query>',
'filter'=>array(
'filter_type1'=>'<filter_value1',
'filter_type2'=>'<filter_value2>'
),
'order_by'=>'<order_by>',
'limit'=> <limit>,
'start'=> <start>
);

This query structure is the same no matter what type of search is performed, this is very flexible and facilitate code reuse enormously.
Then, in the business layer, I have a method <object_type>_search() which:

  • transform the array query values into values understandable by the data layer (when necessary).
  • calls <object_type>_dbSearch()
  • computes the search results to return a result array which looks like:
    array(
    'results'=> <search_results>,
    'nb_pages'=> <nb_pages>,
    'nb_results'=> <nb_results>
    );

Finally, in the data layer, I have a method  <object_type>_dbSearch() which calls a method <object_type>_buildSearchQuery() twice : the first time to build the search query and the second time to build the search count query (the one that counts the total number of results for the search query). It returns the raw search results.

This was definitely a long post but I hope nevertheless that it will be intelligible and useful for you!

See you next post!