Monday, February 11, 2013

Solr and Lucene Fuzzy Search - A closer look

What is Fuzzy Search?
In a sentence, Fuzzy Search allows one to submit a query against an index, and get results that are close to the desired query, but not necessarily match the query exactly. Lucene (and as such Solr) offers a very effective way (from 4.0) for quickly evaluating such fuzzy queries.
Lucene Fuzzy Search
Lucene (and as such Solr) supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar in spelling to "roam" use the fuzzy search:
This search will find terms like foam and roams.
Starting with Lucene 1.9 an additional (optional) parameter can specify the required similarity. The value is between 0 and 1, with a value closer to 1 only terms with a higher similarity will be matched. For example:
The default that is used if the parameter is not given is 0.5.
Under the Hood
The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Than given an unknown query, how does Lucene finds all the terms in the index that are at distance <= than the specified required similarity? Well... this depends on the Solr/Lucene version you are using.

You can take a look at the warning that appears at Lucene 3.2.0 Javadoc
Warning: this query is not very scalable with its default prefix length of 0 - in this case, *every* term will be enumerated and cause an edit score calculation.  
Moreover, prior to 4.0 release Lucene implementation to compute this distance was done for each query for EACH term in the index. You really don't want to use this. So my advice to you is to upgrade - the faster the better.

The Lucene 4.0 Fuzzy took a very different approach. The search now works with FuzzyQuery. The underlying implementation has changed in 4.0 drastically, which lead to significant complexity improvements. Current implementation uses the Levenshtein Automata. This automaton is based on the work of Klaus U. Schulz and Stoyan Mihov "Fast string correction with Levenshtein automata". To make a very long story short this paper shows how to recognize the set of all words V in an index where the Levenshtein distance between V and the query does not exceed a distance d, which is exactly what one wants with Fuzzy Search. For a deeper look see here and here.
So from 4.0 and above one can use Fuzzy Search on very large indexes and fill comfortable about it. Of course there are other ways we look for similar values in a query such as: