Dec 23, 2007

Live fuzzy search using n-grams in Rails

Update [July 31, 2014]:

  • This post is the basis for a few Rails plugins and gems:
    • Fuzzily - a Ruby gem for blazing-fast fuzzy text search
    • no_fuzz - Simple Rails-plugin that provides offline fuzzy-search for ActiveRecord [see my fork of this as well]
    • act_as_fuzzy_search - add a fuzzy search to model for DataMapper
    • rails-fuzzy-search - a more recent fork of the above.
  • You can now see these live applications:

We wanted to give users the ability to perform a search. It is fairly easy to set up a small search engine on a site, or to let users query a major search engine such as Google directly from the site itself. Our site, however, needed something a bit more sophisticated and useful.

First, we wanted to allow users to start typing in the search box, and have the site suggest results from the database as the user types, displaying them in a drop-down list under the search box. This is sometimes referred to as auto-complete, auto-suggest or "Live Search". Rails typically accomplishes this through its JavaScript macro helpers, which wrap the Prototype JavaScript framework and the Scriptaculous JavaScript controls and visual effects library. See also:

We needed some extra functionality: we needed a single search feature to search multiple attributes of a model and, upon selection, populate a number of form fields with a number of attributes and associations. It actually gets even more complex than this, and perhaps I'll write more about it in a future article.

Typically, a search looks for exact matches of what is being typed. However, many important results are missed when an exact match is required: people often misspell, and related words that share the same word root will often not be found. So we wanted to allow users to find things that did not match perfectly with what was being typed. Furthermore, we wanted the results sorted so that the better the match, the higher it appears in the drop-down list.

One option for Rails developers is to use Ferret, a high-performance text search engine library based on Apache Lucene. Ferret is a reasonable option, and we may eventually go this route, but we wanted something simpler that did not require installing a library - we only needed the fuzzy search capability. Ferret's code is largely in C, rather than Ruby, and its fuzzy search is based on the Levenshtein edit distance, which calculates the "minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character".

Instead, we decided it should be rather simple to create an n-gram distance algorithm and leverage database indexing to create a fast n-gram distance fuzzy search, purely in Ruby. What is n-gram distance? It is the measure of the similarity of two strings by comparing the number or significance of matching substrings of length n. It is particularly useful when searching a database of strings, for example a directory of names, rather than lengthy text (though that may certainly be possible as well). Our focus will be on a database of strings.

As an example, let's take the most popular name in Europe, and two of its spellings: Muhammed and Mohammed (there are also many other spellings). Typical live searches will not be able to help someone who has started to type one rather than the other, as the second letter is different. Live searches need at least the first two or three letters of the input to match. First, let's add a word-start character (in code, we prepend a space) to the beginning of each word, because word boundaries constitute useful information. Next, let's break each word into sliding frames of three letters (trigrams). Muhammed becomes: *mu, muh, uha, ham, amm, mme, med. Mohammed becomes: *mo, moh, oha, ham, amm, mme, med. The two names share 4 out of 7 trigrams.
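The trigram split above takes only a few lines of plain Ruby (the method name `trigrams` is mine, not from our original code):

```ruby
# Break a word into sliding trigrams, prepending a space as the
# word-start marker (rendered as "*" in the text above).
def trigrams(word)
  padded = " #{word.downcase}"
  (0..padded.length - 3).map { |i| padded[i, 3] }
end

trigrams("Muhammed")  # => [" mu", "muh", "uha", "ham", "amm", "mme", "med"]

# The two spellings share 4 of their 7 trigrams:
(trigrams("Muhammed") & trigrams("Mohammed")).size  # => 4
```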

First, write a migration that creates a trigrams table with three columns: tg (the trigram, a string), person_id (the foreign key into the people table, an integer), and score (an integer, though you may not need it). Add an index on the first two columns, in that order. Next, generate a Trigram model. Then you must populate the trigrams table with trigrams from the people table. This is easy to do as a separate migration:
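The original migrations aren't preserved here, but based on the description they might look roughly like this - a sketch only, and the model and column names (Person, Trigram, first_name, last_name) are assumptions:

```ruby
# db/migrate/xxx_create_trigrams.rb -- a sketch, not the original code
class CreateTrigrams < ActiveRecord::Migration
  def self.up
    create_table :trigrams do |t|
      t.column :tg,        :string,  :limit => 3
      t.column :person_id, :integer
      t.column :score,     :integer, :default => 1
    end
    # Index (tg, person_id) in that order, so trigram lookups are fast.
    add_index :trigrams, [:tg, :person_id]
  end

  def self.down
    drop_table :trigrams
  end
end

# db/migrate/xxx_populate_trigrams.rb -- fills the table from people
class PopulateTrigrams < ActiveRecord::Migration
  def self.up
    Person.find(:all).each do |person|
      # Example weighting: last-name trigrams score double first-name ones.
      { person.first_name => 1, person.last_name => 2 }.each do |name, score|
        padded = " #{name.to_s.downcase}"
        (0..padded.length - 3).each do |i|
          Trigram.create(:tg => padded[i, 3],
                         :person_id => person.id,
                         :score => score)
        end
      end
    end
  end

  def self.down
    Trigram.delete_all
  end
end
```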

Here, we are storing trigrams from each person's first and last names, and in this example we are scoring matches to the last name at twice the score of the first name. This is just an example; you can score it any way you want. If you don't need to distinguish scores, you can eliminate the score column. Anyway, this migration will populate the trigrams table, and since the trigram and person_id columns are indexed, any search is easy on the database - and fast.

The autocomplete action method in the controller might look like this:
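The original action isn't reproduced here, but from the description it would be along these lines. Everything named below (the controller action, Person and Trigram models, the person[name] parameter, the limit of 10) is an assumption, and the GROUP BY/SUM query assumes a database such as MySQL that permits selecting people.* alongside the aggregate:

```ruby
# app/controllers/people_controller.rb -- a hypothetical reconstruction
def auto_complete_for_person_name
  padded = " #{params[:person][:name].to_s.downcase}"
  tgs = (0..padded.length - 3).map { |i| padded[i, 3] }.uniq

  # Sum the scores of every matching trigram per person and take the
  # best matches; the (tg, person_id) index keeps this query fast.
  @people = Person.find_by_sql(
    ["SELECT people.*, SUM(trigrams.score) AS total_score
        FROM people INNER JOIN trigrams ON trigrams.person_id = people.id
       WHERE trigrams.tg IN (?)
       GROUP BY trigrams.person_id
       ORDER BY total_score DESC
       LIMIT 10", tgs])

  render :partial => 'auto_complete'
end
```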

Do you see what's going on here? The entry is split into trigrams, then the scores of all matching trigrams in a person's name are summed. People are sorted based on their total score, and the top matches are sent to the partial. Very simple. The partial might look like this:
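The partial itself isn't preserved either, but it would presumably be the unordered list that the Scriptaculous Ajax.Autocompleter expects, something like the following (file name and person attributes are assumptions):

```erb
<%# app/views/people/_auto_complete.rhtml -- a sketch %>
<ul class="people">
  <% for person in @people %>
    <li>
      <%= h "#{person.first_name} #{person.last_name}" %>
      <%# The id rides along, hidden, so a selection carries
          the database key for further use. %>
      <span style="display: none"><%= person.id %></span>
    </li>
  <% end %>
</ul>
```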

The ID is included, though hidden, so user selection of a person automatically includes the id of the person from the database table for further use.

It works beautifully, and it seems very fast so far - though with a huge number of items, database storage may become an issue. Any further ideas, suggestions, requests?

Dec 22, 2007

Adventures in Self-Referential Database Associations

Rails makes object associations easy and has a number of different ways to configure associations and even store extra information about each relationship. Most of the time, those relationships are between business objects that wrap rows in two separate database tables. For example, let's say we wanted to be able to record, read and update grades and attendance of students in a particular school class. We might create a student model and a school class model, and the association between them may contain the grades and attendance information. We use many of these kinds of model object associations, and the procedure is fairly straightforward and written about extensively. See the following links to learn more:

For our project, we also needed a way to store various qualities about relationships between pairs of objects of the same model type, i.e. between rows of the same database table. We would need this for a number of different purposes. Those creating 'social net' applications know this kind of association well; they must document, for example, who are the direct friends of a particular user, and possibly further information about each such relationship. All those users are stored in the same database table, so a separate "join" table is created to store the keys of both users, plus any other information about the relationship.

To accomplish this, we used self-referential has_many :through associations that use a join model and methods to simplify the reading and updating of attributes of the association. To read more about these more specialized associations, here are some sites:
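As a minimal sketch of the technique - none of these class or column names are from our actual project - a self-referential has_many :through association pairs a join model with two associations back to the same table:

```ruby
# The join model stores each pairwise relationship between rows of the
# same table, plus any extra attributes about it (status, etc.).
class Friendship < ActiveRecord::Base
  belongs_to :user
  belongs_to :friend, :class_name => 'User', :foreign_key => 'friend_id'
end

class User < ActiveRecord::Base
  has_many :friendships
  has_many :friends, :through => :friendships
end

# user.friends           -- the related users
# user.friendships       -- the relationship rows themselves, with attributes
```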

More details on this later.

Dec 19, 2007

Why Ruby on Rails?

Ruby is a beautiful, purely object oriented, interpreted programming language influenced by Perl and Smalltalk, among others. Rails is a web application framework built on top of Ruby that is designed to speed and simplify the development of database-driven web applications. Rails is relatively new and not yet widely used compared with other web programming languages, but its growth has nonetheless been tremendous. Rails has seen so much growth because of its reputation as the fastest and easiest web development framework. Also, to achieve similar functionality Rails typically requires far fewer lines of code than the more popular web development languages.

When we first started out, we were fairly familiar with Perl and PHP - so it was a difficult decision to attempt to bring the team into the Rails community. We expected that the delay required to learn Rails would be later compensated by faster development. So we decided to learn Ruby and Rails and at least build the Builders' Site prototype with Rails.

The Israeli Rails community is rather small, but I nonetheless tried to attract co-founders and developers for a previous related project through the israel.rb group. There are lots of startups in Israel, and I'm not surprised that demand exceeds the supply of capable developers. At the time I was not able to interest any qualified Rails hackers in joining the Israel team - part of the problem may have been that none of them knew what the project was really about, only that some bombastic American had a stealth startup.

Despite the lack of Rails experts on the team, the learning curve was not excessive and we were quickly creating some fairly complex functionality. The Rails principle of "Convention over Configuration" helps to speed development of more typical functionality. We encountered some difficulty when our functionality quickly exceeded the convention. Luckily there are some real heroes on the internet who publish their experiences with similar issues and try to help those trying to learn and develop along the same lines.

Critics of Ruby and/or Rails argue that Ruby is slow and that web applications developed in Rails do not 'scale'. Most of this criticism appears to be sour grapes. It is true that there are no blockbuster applications written in Rails, only medium-sized ones, but reaching "medium size" exceeds the achievements of the vast majority of web startups. Rails defenders argue that scaling a Rails application simply involves purchasing more servers and hardware, and that the greatest limitations are database-related - something all web development languages suffer from. They argue that the savings in developer-hours more than make up for the extra hardware costs.

The debates continue, as does our development. Perhaps I will post some of the obstacles and specific solutions we found. So far we like Rails, and have no plans to go elsewhere. When we are ready to increase our development or scale the application, perhaps we will consider again all our options. The decision will, in part, depend on the availability of capable developers.

Dec 18, 2007

The Future of the Internet

I found an interesting analysis of existing applications and ideas about the future of the internet on the CNN website by Jeffrey M. O'Brien, writing for Fortune Magazine. He says:

But there is no go-to discovery engine - yet. Building a personalized discovery mechanism will mean tapping into all the manners of expression, categorization, and opinions that exist on the Web today. It's no easy feat, but if a company can pull it off and make the formula portable so it works on your mobile phone - well, such a tool could change not just marketing, but all of commerce.

"The effect of recommender systems will be one of the most important changes in the next decade," says University of Minnesota computer science professor John Riedl, who built one of the first recommendation engines in the mid-1990s. "The social web is going to be driven by these systems."

I see a growing interest in recommender systems everywhere I look, and I tend to agree with the above that the next big thing likely involves personalized discovery and recommendations. It seems clear that to improve the current state of the internet, both information overload and the burden of choice must be reduced - and this will require advances in recommender technology. One can also assume that the internet will continue to expand the benefits of "Web 2.0" and the collaborative internet. These benefits are, like those of most products and services, increased ease of use and increased usefulness.

Some feel that the "Semantic Web" or similar technologies are the future. Essentially this involves "teaching the machine" about content through the creation of universal, machine-readable formats. Those critical of the limited scope of "folksonomies" champion a more standardized approach, thinking that if only people could be compelled to be disciplined in their approach toward organizing content, some kind of utopia would emerge. There are as many critics of the Semantic Web as there are of folksonomy, and the fact that people have been talking about schemas and such for years seems to indicate that - short of some radically new approach - there will be no utopia there. Interestingly, it may be a phenomenon called "social network fatigue" that finally forces services to adopt a standardized portable social/data/identity ontology.

The new revolution must also bring people together in much better ways than those that employ the "degrees of separation" format. Online social networks are extremely popular because they allow people to keep in virtual contact with their existing friends and contacts, and they can facilitate personal expression and PR. Unfortunately, social nets are not really social, and the network - as degrees of separation - serves mostly to separate. Registering for one of these sites is a rather cold experience unless you already have tons of friends and contacts registered. Otherwise you have to spam your friends and nag them to register - something they may not appreciate. The burden of registration and of establishing relationships is compounded as more and more social nets are created, each requiring the time-consuming input of the same personal data.

The next big thing must facilitate discovery of new people based only on similarities of their core identities, rather than focusing only on a user's primary social network. Two identical twins, separated at birth and raised in countries on opposite sides of the planet, speaking different languages, should be able to find each other. Bringing people together must include the separate objectives of: romance, friendship, business relationships, work team formation, roommates, travel and recreation buddies, etc.

In addition to people finding other people of similar core identity, the next big thing will need to do the same for non-human entities, like music, movies, TV, books, news, web pages, articles, games, products, services, vocations, jobs, travel destinations, politics, religion, advertising, or any other activity, enterprise, product, service, endeavor, idea, belief, passion or item.

In order for such matching to be accurate and satisfying for the user, there must be low false positives ("trust busters"), low false negatives ("missed opportunities") and sufficient true positives ("new favorites"). No current recommender system comes close to even approaching this goal. Matching cannot be based on identical answers or choices, as there can never be enough questions to encompass all of human individuality. More importantly, people's existing interests in music, books, movies, etc. are not strictly linked to their core identities, but often depend more on prevailing cultural and social influences.

In our short lives, we are unlikely to ever find the people and things that we would most enjoy and appreciate. This is unfortunate.

The new internet revolution must match people to other people and things based solely on that core identity. It must do it with only a brief registration, where, upon registering, you are immediately presented with your ultimate best friends and soulmates, ideal potential business partners, as well as new favorites in every area of life. It should do it without language barriers (i.e. folksonomy or schemas), without the need for a large staff (i.e. Pandora), without a tedious registration (i.e. eHarmony), without having to observe your history (i.e. Amazon.com, Netflix, Last.fm, etc.), without requiring existing friends or contacts (i.e. Facebook, MySpace, LinkedIn, etc.), and it must be completely free for users but able to convert many of those recommendations into revenue.