What
is Fuzzy Search?
In
a sentence, Fuzzy Search allows one to submit a query against an index, and get
results that are close to the desired query, but not necessarily match the query
exactly. Lucene (and as such Solr) offers a very effective way (from 4.0) for
quickly evaluating such fuzzy queries.
Lucene
Fuzzy Search
From
the Apache Lucene Documentation:
Lucene
(and as such Solr) supports fuzzy searches based on the Levenshtein Distance,
or Edit Distance algorithm. To do a fuzzy search use the tilde, "~",
symbol at the end of a Single word Term. For example to search for a term
similar in spelling to "roam" use the fuzzy search:
roam~
This
search will find terms like foam and roams.
Starting
with Lucene 1.9 an additional (optional) parameter can specify the required
similarity. The value is between 0 and 1, with a value closer to 1 only terms
with a higher similarity will be matched. For example:
roam~0.8
The
default that is used if the parameter is not given is 0.5.
Under
the Hood
The
Levenshtein distance between two words is the minimal number of insertions,
deletions or substitutions that are needed to transform one word into the
other. Than given an unknown query, how does Lucene finds all the terms in the
index that are at distance <= than the specified required similarity?
Well... this depends on the Solr/Lucene version you are using.
You
can take a look at the warning that appears at Lucene 3.2.0 Javadoc:
Warning:
this query is not very scalable with its default prefix length of 0 - in this
case, *every* term will be enumerated and cause an edit score calculation.
Moreover,
prior to 4.0 release Lucene implementation to compute this distance was done
for each query for EACH term in the index. You really don't want to use this.
So my advice to you is to upgrade - the faster the better.
The
Lucene 4.0 Fuzzy took a very different approach. The search now works with FuzzyQuery. The underlying
implementation has changed in 4.0 drastically, which lead to significant
complexity improvements. Current implementation uses the Levenshtein Automata. This
automaton is based on the work of Klaus U. Schulz and Stoyan Mihov "Fast string correction with
Levenshtein automata". To make a very long story short this
paper shows how to recognize the set of all words V in an index where
the Levenshtein distance between V and the query does not exceed
a distance d, which is exactly what one wants with Fuzzy Search. For
a deeper look see here and here.
Conclusion
So
from 4.0 and above one can use Fuzzy Search on very large indexes and fill
comfortable about it. Of course there are other ways we look for similar values
in a query such as:
- Solr SpellCheckComponent
- Solr Suggester
- Use phonetic search by elevating DoubleMetaphoneFilterFactory.
Nice to see your blog and most beautiful images. Am waiting your next valuable blog.
ReplyDeleteI want like to share some information with you About the On Job Support.