Update [December 17, 2009]:
- You can now see these live applications:
- This post is the basis for a few Rails plugins:
We wanted to give users the ability to perform a search. It is fairly easily set up a small search engine on a site or allow users to perform a search from a major search engine such as Google directly from the site itself. Our site needed something a bit more sophisticated and useful.
We needed some extra functionality: we needed a single search feature to search multiple attributes of a model and, upon selection, populate a number of form fields with a number of attributes and associations. It actually gets even more complex than this, and perhaps I'll write more about it in a future article.
Typically, a search looks for exact matches of what is being typed. However, many important results are missed when an exact match is required: people often misspell, and related words that share the same word root will often not be found. So we wanted to allow users to find things that did not match perfectly with what was being typed. Furthermore, we wanted the results to be sorted such that the better the match, the higher a result is located on the drop down list.
One option for Rails developers is to use Ferret, a high-performance text search engine library based on Apache Lucene. Ferret is a reasonable option, and we may eventually go this route, but we wanted something simpler that did not require installing a library - we only needed the fuzzy search capability. Ferret's code is largely in C, rather than Ruby, and its fuzzy search is based on Levenshtein edit distance formula which calculates the "minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character".
Instead, we decided it should be rather simple to create an n-gram distance algorithm and leverage database indexing to create a fast n-gram distance fuzzy search, purely in Ruby. What is n-gram distance? It is the measure of the similarity of two strings by comparing the number or significance of matching substrings of length n. It is particularly useful when searching a database of strings, for example a directory of names, rather than lengthy text (though that may certainly be possible as well). Our focus will be on a database of strings.
As an example, let's take the most popular name in Europe, and two of it's spellings: Muhammed and Mohammed (there are also many other spellings). Typical live searches will not be able to help someone who has started to type one rather than the other, as the second letter is different. Live searches need at least the first two or three letters of the input to match. First, let's add a word start character (for code, we prepend a space) to the beginning of the words because word boundaries constitute useful information. Next, let's break each word into sliding frames of three letters (trigrams). Muhammed becomes: *mu, muh, uha, ham, amm, mme, med. Mohammed becomes *mo, moh, oha, ham, amm, mme, med. The two names share 4 out of 7 trigrams.
First, migrate a trigrams table with columns: tg (trigram, a string), person_id (the key in the people table, an integer), and score (an integer, though you may not need it). Make an index on the first two columns in that order. Next, generate a trigrams model. Then you must populate the trigrams table with trigrams from the people table. This is easy to do as a separate migration:
Here, we are are storing trigrams from each person's first and last names, and in this example, we are scoring matches to the last name at twice the score of the first name. This is just an example, you can score it any way you want. If you don't need to distinguish scores, you can eliminate the score column. Anyway, this migration will populate the trigrams table, and since the trigram and person_id are indexed, any search is easy on the database - and fast.
The autocomplete action method in the controller might look like this:
Do you see what's going on here? The entry is split into trigrams, then the scores of all matching trigrams in a person's name are summed. People are sorted based on their total score, and the top matches are sent to the partial. Very simple. The partial might look like this:
The ID is included, though hidden, so user selection of a person automatically includes the id of the person from the database table for further use.
It works beautifully, and it seems very fast so far - though with a huge number of items, database storage may become an issue. Any further ideas, suggestions, requests?