Introduction

Affinax is a novel universal targeting technology that anticipates a person’s preferences in any domain of life, with much greater speed, without popularity bias, and using far less input about users and items than existing technologies require. Much of commerce and discovery on the internet is currently powered by relatively primitive targeting technologies. With its substantial advancements, Affinax should be able to make a significant impact on internet advertising and discovery, information filtration, and online and mobile commerce. Join us.


Dec 23, 2007

Live fuzzy search using n-grams in Rails

Update [July 31, 2014]:

  • This post is the basis for a few Rails plugins and gems:
    • Fuzzily - a Ruby gem for blazing-fast fuzzy text search
    • no_fuzz - Simple Rails-plugin that provides offline fuzzy-search for ActiveRecord [see my fork of this as well]
    • act_as_fuzzy_search - add a fuzzy search to model for DataMapper
    • rails-fuzzy-search - and more recent fork of the above.
  • You can now see these live applications:

We wanted to give users the ability to perform a search. It is fairly easily set up a small search engine on a site or allow users to perform a search from a major search engine such as Google directly from the site itself. Our site needed something a bit more sophisticated and useful.

First, we wanted to allow users to start typing in the search box, and have the site suggest results from the database as the user types and display them in a drop-down list under the search box. This is sometimes referred to as auto-complete, auto-suggest or "Live Search". Rails accomplishes this typically through its JavaScript macros helpers that wrap the Prototype JavaScript framework and the Scriptaculous JavaScript controls and visual effects library. See also:

We needed some extra functionality: we needed a single search feature to search multiple attributes of a model and, upon selection, populate a number of form fields with a number of attributes and associations. It actually gets even more complex than this, and perhaps I'll write more about it in a future article.

Typically, a search looks for exact matches of what is being typed. However, many important results are missed when an exact match is required: people often misspell, and related words that share the same word root will often not be found. So we wanted to allow users to find things that did not match perfectly with what was being typed. Furthermore, we wanted the results to be sorted such that the better the match, the higher a result is located on the drop down list.

One option for Rails developers is to use Ferret, a high-performance text search engine library based on Apache Lucene. Ferret is a reasonable option, and we may eventually go this route, but we wanted something simpler that did not require installing a library - we only needed the fuzzy search capability. Ferret's code is largely in C, rather than Ruby, and its fuzzy search is based on Levenshtein edit distance formula which calculates the "minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character".

Instead, we decided it should be rather simple to create an n-gram distance algorithm and leverage database indexing to create a fast n-gram distance fuzzy search, purely in Ruby. What is n-gram distance? It is the measure of the similarity of two strings by comparing the number or significance of matching substrings of length n. It is particularly useful when searching a database of strings, for example a directory of names, rather than lengthy text (though that may certainly be possible as well). Our focus will be on a database of strings.

As an example, let's take the most popular name in Europe, and two of it's spellings: Muhammed and Mohammed (there are also many other spellings). Typical live searches will not be able to help someone who has started to type one rather than the other, as the second letter is different. Live searches need at least the first two or three letters of the input to match. First, let's add a word start character (for code, we prepend a space) to the beginning of the words because word boundaries constitute useful information. Next, let's break each word into sliding frames of three letters (trigrams). Muhammed becomes: *mu, muh, uha, ham, amm, mme, med. Mohammed becomes *mo, moh, oha, ham, amm, mme, med. The two names share 4 out of 7 trigrams.

First, migrate a trigrams table with columns: tg (trigram, a string), person_id (the key in the people table, an integer), and score (an integer, though you may not need it). Make an index on the first two columns in that order. Next, generate a trigrams model. Then you must populate the trigrams table with trigrams from the people table. This is easy to do as a separate migration:

Here, we are are storing trigrams from each person's first and last names, and in this example, we are scoring matches to the last name at twice the score of the first name. This is just an example, you can score it any way you want. If you don't need to distinguish scores, you can eliminate the score column. Anyway, this migration will populate the trigrams table, and since the trigram and person_id are indexed, any search is easy on the database - and fast.

The autocomplete action method in the controller might look like this:

Do you see what's going on here? The entry is split into trigrams, then the scores of all matching trigrams in a person's name are summed. People are sorted based on their total score, and the top matches are sent to the partial. Very simple. The partial might look like this:

The ID is included, though hidden, so user selection of a person automatically includes the id of the person from the database table for further use.

It works beautifully, and it seems very fast so far - though with a huge number of items, database storage may become an issue. Any further ideas, suggestions, requests?

10 comments:

Vitaly Gorodetsky said...

Wow man,
A really good post..
I will try your method tomorrow..
Did you implemented this somewhere?
Where from the idea for Rails?

Steven Ruttenberg said...

Thanks. If you've made an autocomplete before - it should be fairly simple to 'upgrade' it. Yes, it's implemented on our 'stealth' test site. What was your last question?

nadav said...

Cool !! Thanks.

hboon said...

How many items did you manage to take this to? I'm digging around for a lightweight n-gram lookup library/tool. Sphinx seems lightweight and scalable enough, but it doesn't do n-gram unfortunately.

Steven Ruttenberg said...

hboon: I've refactored and enhanced the autocomplete code a bit, and am now using jquery. For a current project I am testing with 54,000 trigram rows and it is blazing fast with one hit on the trigram table, and 10 hits on the object table per autocomplete ajax request. I'll test it with more data shortly.

Steven Ruttenberg said...

Anyone want to help me turn this into a plugin with proper generators to create the migration and possibly the autocomplete code?

Bjørn Arild Mæland said...

A bit late maybe, but I've created a plugin for this with generators etc. Its available at GitHub: http://github.com/Chrononaut/no_fuzz/tree/master

I'm sure you'll have ideas for improvement etc - I'll gladly add you to the repos if you want to add things.

Babar said...

Check No Fuzz, a Rails-plugin for ActiveRecord inspired by this post (
http://github.com/Chrononaut/no_fuzz/tree/master).

Sandy said...

Thanks for the great post Steve! I have a website listing music events where searching for odd band/artist names, venues and towns used to be a real problem (eg. is that 'Deadmaus' or 'Deadmau5' your looking for..?)

Anyways, using trigrams works a treat even with over 200,000 rows. I also found with some help that I could reduce the number of hits on the object table for each request down to 1 and still preserve the rank order.

You can see it in action with the 'Browse artists, bands and events' and 'Find events near you' search boxes at http://www.gigcatcher.co.uk

Cheers!

Steven Ruttenberg said...

Looks great Sandy! And great find regarding database query reduction using the detect method to sort the objects. I already tested it on an app in development and changed the sample code in the post accordingly. Awesome!