Thibauld - Imagination and Execution -

4Jan/096

Web application step 5: implement a “relevant” search engine

RT @thibauld Web application step 5: implement a “relevant” search engine

For this very first post of 2009, let me first wish you all the best for the coming year! I worked *a lot* in 2008 and 2009 will be the year where all this underground work will finally make sense so I'm really looking forward to it... and I hope you do too! Let's make 2009 rock :)

Let's now go back to the subject of this post! Search is a very important feature for most web applications (read this book in case you need to be convinced) but, unfortunately, it is also one of the most overlooked feature. Why ? Probably because implementing a good search is a difficult thing: Placing a search form at the top right of your web app is not difficult, having it display some results when a user searches for something is not a problem either... but displaying search results which are actually relevant and useful for the end user is much harder!

Implementing a relevant search engine for your web app is a task that cannot be delegated to a third party web service or to an external library, you have to do it by yourself for a simple reason: a search can only be relevant if it leverages the information contained in your business objects with intelligence. This is why technologies like the new Google Custom Search will not help you to implement a relevant search engine. Google Custom Search is ok for your blog or any other editorial website... but what if you want, for example, your users to be able to look for the freelancers that have the longest experience with SAP FI CO ?

Implementing a good search begins with pen and paper:

  • On which business objects can the end user perform a search ?
  • And for each business objects:
    • How to rank results by relevancy for any given search ?
    • How should I weight each information ?
    • What information do I need to rank my results ?
    • Is this information already present in my database ? Where ?
    • Which filters would be relevant ?
    • What are the relevant way to sort the search results ?

Once you've thought about all this, you did the most important part... but not the most difficult. Now let's implement it!

The main problem you'll be facing when implementing a relevant search engine is performance. Being relevant means processing a lot of information, which is very time consuming, even for today's fast processors and hard drives. This is why you should let the database handle most of the searching work, using well constructed indexes and queries, and avoid as much as possible doing work outside the database.

Use full-text search

Obviously, using the LIKE sql operator (or any other regexp matching...) will not help much in implementing a good search. It is way too slow because LIKE queries cannot use table indexes and requires a full sequential scan of your table for each query. The only exception to this is when you know the first characters you're looking for (i.e. WHERE name LIKE 'query%'), in this case, the database will leverage the btree index on column name (if available) and you're likely to get acceptable performance with very small and targeted queries.

In all other cases, you will have to use fulltext search mecanisms. Fulltext search refers to special type of indexes which are specialized in string pattern matching. Fulltext search has been implemented completely differently in MySQL and PostgreSQL. I will now give you an overview on how to use fulltext search with both databases.

Full-text search with MySQL

As I've said in a previous post, MySQL's fulltext search mechanism focuses on ease-of-use. Indeed, with MySQL you just have to create a 'fulltext' index on the relevant columns to enable fulltext search. Example:

CREATE FULLTEXT INDEX description ON experiences (description);

It creates a fulltext index named description on the column description in the table experiences. To leverage this new index, you need to write your queries using the following syntax:

SELECT id, MATCH(description) AGAINST('') AS score
FROM experiences
WHERE MATCH(description) AGAINST('')
ORDER BY score DESC
OFFSET 0 LIMIT 10;

Using the fulltext index, MySQL will quickly determine the best matching descriptions for the input query. For each row, MySQL returns its relevancy score. You can influence the calculation of this score by using parameters in the AGAINST() operator.

As you can see, it is really easy to implement fulltext search in MySQL... maybe too easy as there are common search related issues that are not addressed. For example, a very classical requirement is perform a search on multiple columns of a table (i.e. title,summary and description) and with different weights assigned to each column.. Now how do I implement that in MySQL? Well... I'm sorry to announce that I did not figure out a beautiful way to address the issue. On Freelance Business Club, we solved the issue by constructing queries like the following one:

SELECT e.id, e.title, e.desc, SUM(s.score) AS score
FROM experiences AS e
INNER JOIN (
SELECT id, <title_weight>+MATCH(title) AGAINST (query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(title) AGAINST ('query' IN BOOLEAN MODE)
UNION ALL
SELECT id, <summary_weight>+MATCH(summary) AGAINST('query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(summary) AGAINST('query' IN BOOLEAN MODE)
UNION ALL
SELECT id, <description_weight>+MATCH(description) AGAINST('query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(description) AGAINST('query' IN BOOLEAN MODE)
) AS s ON a.id=s.id
ORDER BY score DESC
OFFSET 0 LIMIT 10;

Even if I'm not particularly proud of this query, it gives us the flexibility we were looking for (searching different columns with different weight for each) and offers at the same time rather good performance. I was really hoping MySQL would offer a standard way to perform this kind of search but I was unable to found an 'official' solution. I would be really interested if someone here had a more elegant / performant solution, so please let me know!

It is important to note that you cannot use fulltext indexes in MySQL on InnoDB tables, you can only use it with MyISAM tables. For more information, please check MySQL documentation for full-text search.

Full-text search with PostgreSQL

Fulltext search support has been integrated in PostgreSQL official distribution since v8.3. (before this version, it was distributed as a plugin). I personally think that PostgreSQL implementation of fulltext search is way better than the MySQL one. It may be a little harder to apprehend at first sight but it is really worth the effort!

Like MySQL, PostgreSQL provides a special index (called GIN) optimized for fulltext searching. But to leverage the real power of this index, you will need to store the text to be searched in a special column of type 'tsvector'. An example will make it easier to understand:

CREATE TABLE experiences_desc (
id_exp integer NOT NULL REFERENCES experiences(id) ON DELETE CASCADE,
locale regconfig NOT NULL DEFAULT 'english',
title text NOT NULL,
summary text NOT NULL DEFAULT '',
description text NOT NULL DEFAULT '',
search_field tsvector
);
INSERT INTO experiences (id_exp,title,summary,description,search_field) VALUES (
1,
'title',
'summary',
'description',
setweight(to_tsvector('english','title'),'A') ||
setweight(to_tsvector('english','summary'),'B') ||
setweight(to_tsvector('english','description'),'C')
);
CREATE INDEX experiences_idx ON experiences USING gin(search_field);

What is important to note here is the construction of the tsvector search_field ('||' is the string concatenation operator):

  • A tsvector tokenizes your string according to the given locale. This is why you need to pass it the locale in which your string is written. The function to_tsvector() uses a dictionary in the given locale to tokenize the input string. This way, it tokenizes intelligently the string (by detecting plurals, verbs...) and prevents common words from being tokenized. It means that the end user will not have to enter the exact words to have your search engine return relevant results.
  • You can assign different weights to the different parts of your tsvector. PostgreSQL defines 4 weights (1 >= A > B > C > D >= 0) that you can assign to a tsvector and that will help PostgreSQL rank your search results automatically and also filter your search by limiting your search to a certain weight category. Of course, you can change the default weight values assign to each letter.

Now to perform a search, you can use the following:

SELECT title, summary, description, ts_rank_cd(search_field, to_tsquery('english','search query')) AS score
FROM experiences_desc
WHERE to_tsquery('english','search query') @@ search_field
ORDER BY score DESC
OFFSET 0 LIMIT 10;

Important things to note are:

  • you need to use to_tsquery() to prepare your search query and the operator @@ to make it search using the specific GIN index.
  • ranking is done automatically by PostgreSQL thanks to the ts_rank_cd() operator. Several other ranking algorithms are available (using different functions / parameters )

This is really just a quick overview of the tools PostgreSQL provides to perform fulltext search. You can also ask PostgreSQL to highlight the matching patterns in the search results, gather documents statistics... it is really flexible and powerful! To find out more on these features, here's a link to PostgreSQL 8.3 Full Text Search documentation. Needless to say that I find PostgreSQL to be much more powerful than MySQL on the fulltext search aspect.

Use SQL jointures with caution

In order to keep performance acceptable, you should also try to keep the number of SQL jointures in the search query as low as possible. Also, don't forget to perform your jointures after you have found your results and not before. To illustrate this, consider the 2 following queries, the first one:

SELECT e.id, e.title, e.desc, SUM(s.score) AS score
FROM experiences AS e
INNER JOIN (
SELECT id, MATCH(title) AGAINST (query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(title) AGAINST ('query' IN BOOLEAN MODE)
) AS s ON a.id=s.id
INNER JOIN some_other_table AS sot ON (sot.id=s.id)
ORDER BY score DESC
OFFSET 0 LIMIT 10;

and the second one:

SELECT e.id, e.title, e.desc, SUM(s.score) AS score
FROM experiences AS e
INNER JOIN (
SELECT id, MATCH(title) AGAINST ('query' IN BOOLEAN MODE) AS score
FROM experiences
WHERE MATCH(title) AGAINST ('query' IN BOOLEAN MODE)
ORDER BY score DESC
OFFSET 0 LIMIT 10;
) AS s ON a.id=s.id
INNER JOIN some_other_table AS sot ON (sot.id=s.id)

The 2 queries give the exact same results but the second one will take only a tiny fraction of the time required by the second one to perform the search. This may seem obvious here as the query is small, but this is the kind of stupid errors that can slip easily in bigger queries (especially if queries are dynamically built).

Dedicated search tables

If you still cannot reach decent performance with all the above, you might have too many jointures in your query and should consider building a dedicated table to perform your searches. This method might also fit your needs in case you want full text search on an InnoDB table. The problem with these dedicated search tables is that they are never completely up-to-date. So it's up to you to know, given your business requirements, if you can use this method or not. We use this method in Freelance Business Club, this is why it may take up to 1 day for new members to appear in members search results.

Limit computation in your search queries

To achieve good performance, you should also limit as much as possible the use of user defined functions or calculus in your queries. Example: If you want to search the most experienced freelancers on SAP, then the total number of months a guy spent working on SAP is a relevant indicator. The problem is that you only have a start date and an end date at your disposal. In this case, instead of calculating the total number of month directly in your query, you'd better use a batch process and set up a cron job which would pre-calculate the total number of month for each experience every <put_a_frequence_here>. This way, it will not slow down every one of your search queries!

Filter and sort your search

It is very convenient for the end user to be able to filter and sort the search results. You should think about this possibility when building your search engine. Sorting is quite easy but beware of filtering. In the best case, you will just have to add another constraint in the WHERE clause or in an existing jointure but some filters might require you to add another jointure to your search which might slow it down.

Calculate the total number of results for your search

Finally, in order to paginate your search results well, you probably need the total number of results for a given search query. To get this value, I know no other method than performing a new SQL search query (without any ORDER BY / OFFSET / LIMIT and without any useless jointure) with a brutal COUNT(*) in the SELECT statement. If anybody has another / better method, please let me know!

To avoid this additional query (COUNT(*) can be a very time-consuming query!), some sites just don't give the total of number of results. See Facebook for example, they don't give the exact number of results and their pagination displays '1 2 3 next'. I guess this is one the reasons (but this is just pure speculation...).

Search implementation methodology

I wanted to finish this long post with the method I use to implement search. In my presentation layer, I build a query array which looks like this :
array(
'text'=>'<search_query>',
'filter'=>array(
'filter_type1'=>'<filter_value1',
'filter_type2'=>'<filter_value2>'
),
'order_by'=>'<order_by>',
'limit'=> <limit>,
'start'=> <start>
);

This query structure is the same no matter what type of search is performed, this is very flexible and facilitate code reuse enormously.
Then, in the business layer, I have a method <object_type>_search() which:

  • transform the array query values into values understandable by the data layer (when necessary).
  • calls <object_type>_dbSearch()
  • computes the search results to return a result array which looks like:
    array(
    'results'=> <search_results>,
    'nb_pages'=> <nb_pages>,
    'nb_results'=> <nb_results>
    );

Finally, in the data layer, I have a method  <object_type>_dbSearch() which calls a method <object_type>_buildSearchQuery() twice : the first time to build the search query and the second time to build the search count query (the one that counts the total number of results for the search query). It returns the raw search results.

This was definitely a long post but I hope nevertheless that it will be intelligible and useful for you!

See you next post!

Comments (6) Trackbacks (0)
  1. Nice writeup about using the full-text capabilities for search. For our custom search-engine we’ve used Sphinx as the backend, it is highly configurable and performs very well. You should definitely check that out as well, as there are many good interfaces for ruby and other languages available.

  2. I did not know about Sphinx, thanks for the info. It looks like an attempt to bring MySQL fulltext search capabilities close to PostgreSQL’s ;) No kidding, next time I have to deal with MySQL, I will definitely consider using it.

  3. Thi,

    I’ve worked with keyword index tables that worked pretty nice. You need to populate the index table on insert/update of the field that will be searched for.

    Working with LIKE is totally fail, as it slows down and return unrelevant data. With index tables you can have density and thus relevancy!

    It’s quite neat and easy to mantain, realtime and quick!

    I’ve wrote about it @ my blog about effective data searching, take a look if interested.

    []s!

  4. You should give Yahoo BOSS a fair try. No easiest way to search through ALL of your online content (tweets, delicious bookmarks, sites …)

  5. Yahoo BOSS looks very interesting indeed, I had never heard about it before! Thanks for the info, I will definitely investigate on this.

  6. The Yahoo BOSS team came to paris a couple months ago. http://ysearchblog.com/2008/11/17/boss-hack-world-tour-heads-to-europe/

    It was a really interesting talk & convinced me that BOSS is the best tool for data aggregation on the web right now. Google doesn’t compete yet. (besides, Google custom search is fairly poor when it comes to features … Google, your turn now !)


Leave a comment


No trackbacks yet.