you - spell correction algorithm python




How does the Google “Did you mean?” Algorithm work? (12)

Apart from the above answers, in case you want to implement something by yourself quickly, here is a suggestion -

Algorithm

You can find the implementation and detailed documentation of this algorithm on GitHub.

  • Create a Priority Queue with a comparator.
  • Create a Ternay Search Tree and insert all english words (from Norvig's post) along with their frequencies.
  • Start traversing the TST and for every word encountered in TST, calculate its Levenshtein Distance(LD) from input_word
  • If LD ≤ 3 then put it in a Priority Queue.
  • At Last extract 10 words from the Priority Queue and display.

I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I've been really impressed with some search engines ability to very quickly respond to queries with "Did you mean: xxxx".

I need to be able to intelligently take a user query and respond with not only raw search results but also with a "Did you mean?" response when there is a highly likely alternative answer etc

[I'm developing in ASP.NET (VB - don't hold it against me! )]

UPDATE: OK, how can I mimic this without the millions of 'unpaid users'?

  • Generate typos for each 'known' or 'correct' term and perform lookups?
  • Some other more elegant method?

As a guess... it could

  1. search for words
  2. if it is not found use some algorithm to try to "guess" the word.

Could be something from AI like Hopfield network or back propagation network, or something else "identifying fingerprints", restoring broken data, or spelling corrections as Davide mentioned already ...


For the theory of "did you mean" algorithm you can refer to Chapter 3 of Introduction to Information Retrieval. It is available online for free. Section 3.3 (page 52) exactly answers your question. And to specifically answer your update you only need a dictionary of words and nothing else (including millions of users).


Google apparently suggests queries with best results, not with those which are spelled correctly. But in this case, probably a spell-corrector would be more feasible, Of course you could store some value for every query, based on some metric of how good results it returns.

So,

  1. You need a dictionary (english or based on your data)

  2. Generate a word trellis and calculate probabilities for the transitions using your dictionary.

  3. Add a decoder to calculate minimum error distance using your trellis. Of course you should take care of insertions and deletions when calculating distances. Fun thing is that QWERTY keyboard maximizes the distance if you hit keys close to each other.(cae would turn car, cay would turn cat)

  4. Return the word which has the minimum distance.

  5. Then you could compare that to your query database and check if there is better results for other close matches.


Here's the explanation directly from the source ( almost )

Search 101!

at min 22:03

Worth watching!

Basically and according to Douglas Merrill former CTO of Google it is like this:

1) You write a ( misspelled ) word in google

2) You don't find what you wanted ( don't click on any results )

3) You realize you misspelled the word so you rewrite the word in the search box.

4) You find what you want ( you click in the first links )

This pattern multiplied millions of times, shows what are the most common misspells and what are the most "common" corrections.

This way Google can almost instantaneously, offer spell correction in every language.

Also this means if overnight everyone start to spell night as "nigth" google would suggest that word instead.

EDIT

@ThomasRutter: Douglas describe it as "statistical machine learning".

They know who correct the query, because they know which query comes from which user ( using cookies )

If the users perform a query, and only 10% of the users click on a result and 90% goes back and type another query ( with the corrected word ) and this time that 90% clicks on a result, then they know they have found a correction.

They can also know if those are "related" queries of two different, because they have information of all the links they show.

Furthermore, they are now including the context into the spell check, so they can even suggest different word depending on the context.

See this demo of google wave ( @ 44m 06s ) that shows how the context is taken into account to automatically correct the spelling.

Here it is explained how that natural language processing works.

And finally here is an awesome demo of what can be done adding automatic machine translation ( @ 1h 12m 47s ) to the mix.

I've added anchors of minute and seconds to the videos to skip directly to the content, if they don't work, try reloading the page or scrolling by hand to the mark.


Hmm... I thought that google used their vast corpus of data (the internet) to do some serious NLP (Natural Language Processing).

For example, they have so much data from the entire internet that they can count the number of times a three-word sequence occurs (known as a trigram). So if they see a sentence like: "pink frugr concert", they could see it has few hits, then find the most likely "pink * concert" in their corpus.

They apparently just do a variation of what Davide Gualano was saying, though, so definitely read that link. Google does of course use all web-pages it knows as a corpus, so that makes its algorithm particularly effective.


I saw something on this a few years back, so may have changed since, but apparently they started it by analysing their logs for the same users submitting very similar queries in a short space of time, and used machine learning based on how users had corrected themselves.


My guess is that they use a combination of a Levenshtein distance algorithm and the masses of data they collect regarding the searches that are run. They could pull a set of searches that have the shortest Levenshtein distance from the entered search string, then pick the one with the most results.


Simple. They have tons of data. They have statistics for every possible term, based on how often it is queried, and what variations of it usually yield results the users click... so, when they see you typed a frequent misspelling for a search term, they go ahead and propose the more usual answer.

Actually, if the misspelling is in effect the most frequent searched term, the algorythm will take it for the right one.


There is a specific data structure - ternary search tree - that naturally supports partial matches and near-neighbor matches.


Use Levenshtein distance, then create a Metric Tree (or Slim tree) to index words. Then run a 1-Nearest Neighbour query, and you got the result.


You mean to say spell checker? If it is a spell checker rather than a whole phrase then I've got a link about the spell checking where the algorithm is developed in python. Check this link

Meanwhile, I am also working on project that includes searching databases using text. I guess this would solve your problem





text-search