Have you ever wondered how Google Spell corrections correct most of our spelling mistakes very easily? Just by a simple right click, we can see the list of suitable corrected words of the misspelled word. Well, in simple terms, it's all about statistics. After reading this blog, you will be able to understand how spelling correction can be done, and how probabilistic models can be used to solve this problem.
Following topics are being addressed in this blog
Introduction
Why Do Spelling Mistakes Occur
A Probability Model
How the spell correction works
Java Code for Spell Corrector
Python Code for Spell Corrector
1) Introduction
When it comes to online resources, there is no clear mechanism mentioned for programmers to understand and develop their own spell correction systems. Until 2005, Google had provided a spell checking API, that can easily detect spelling mistakes with few lines of code. It would print out all the corrections available for the input text. But soon after it was shutdown in 2005, Peter Norvig's Spelling Corrector system became very popular, and many variations of it were published. This blog addresses the important concepts behind Norvig's spelling corrector system, and tries to give the reader a mathematical understanding on how it works.
2) Why Do Spelling Mistakes Occur?
Most of the spelling mistakes happen due to the confusion between the pronunciation and how it is actually being spelled. Even since the time we were kids, it was meant to be constructive for us to think and write the words that we hear, which was also known as dictation in classrooms. Grade one and two, encouraged constructive spelling from sounds, even if they were wrong. But with time, spellings became important, and as our vocabulary grew, spellings became a concept that defined the language's semantic and syntactic structure. With each sound and syllable we pronounce, we always try to match them with the corresponding letters in order to identify the correct spellings of those words. These are generally called homophonic mistakes.
Sometimes, the lack of concentration can cause spelling mistakes in our documents. The faster we create documents, the less focus we give on spellings and more focus on getting the actual content out. This can be reduced by having proper practice and by leaving some time at the end, for proof reading the document you created.
When it comes to the field of information technology and computers, people make huge amounts of mistakes in spellings due to various reasons. I myself make quite a lot of spelling mistakes due to fast typing, and expect my automatic spelling correction system to fix them as I type. Similarly, some others make mistakes due to slips happening on the keyboard button and ending up typing multiple letters or wrong letters at various times. These kind of issues can be fixed using the "edit distances." The edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.
3) A Probability Model
Statistics and Probability are two powerful tools. What if we can say that the likelihood of something occurring is greater than the other. For example, if we have some statistics that show, what i eat on rainy days and on sunny days. If it is a sunny day, i would eat ice-cream and if it is a rainy day, I would eat a pudding. Based on this information, I have a higher likelihood of eating pudding on a day that is cloudy and dark.
Similarly, if certain information is given (statistics), we can give a likelihood (probability) of a spell correction for a given word. Before moving on, let's get familiar with the notation I will be using in this section.
w = The original word (this is the word that needs to be checked)
c = The candidate for the spell correction of word w.
Example:
w = wird
c = word , weird, wired etc.
The problem with probabilistic models is that, we can never know for sure, the correct spelling correction needed for a given word. It is always about the most likelihood that we consider in most of these cases. So, we are supposed to pick the correction (c) out of all possible candidates, which would maximize the probability.
Hope you are familiar with the Bayes' Theorem. If we apply the Bayes' Theorem to this scenario, we need to figure out how to get the maximum out of P (c|w), which means, the probability of C being the correction term, given w.
So what we need to calculate is the following probability.
Considering P(c|w) is harder, rather than separating it into two models like P(c) and P(w|c), as given in the Bayes' Theorem. For example, consider the word w = "thew". The candidate corrections would be c = "the" and c = "thaw". "Thaw" seems acceptable because then the edit distance between c and w would be simply one. But on the other hand, since c= "the" is a very common word, "the" could be a much better candidate. But estimating P(c|w) is a bit harder than it seems.
Due to this reason, we decompose the equation into two models, using the Bayes' Theorem. This is as follows.
No matter which candidate correction term we use, P(w) remains the same as a constant. So we can omit this out from the above equation and rewrite the equation as follows.
4) How the spell correction works
The final expression derived above, has the following important components. In this model, we will be using a text corpus, known as Big Text. This contains a huge set of English words that can easily represent the English Language. The Big Text document will be herein after referred to as English text.
P(c) - Language Model
This is the probability of "c" appearing in the English Text. If the word "butter" covers up 2 percent of the English text, P(butter) = 0.02
P (w|c) - Error Model
Peter Norvig explains that the error model is the probability that w would be types, when the author actually intended to type c. For example, when you actually want to type the word "you," we might type "yoi" as a typo. Then the probability of this happening is represented here.
Candidates
This is the set of candidate words that can be considered as correct words for w.
5) Java Code for Spell Corrector
Steps:
Download the Big Text file using this link.
Clone the Github Project. (Using a variation of Rael Gugelmin Cunha's Code)
Copy the downloaded Big Text file to the main Github Project Folder. (Java-Spell-Corrector/)
Open the "Spelling.java" file from the "src" folder. (Java-Spell-Corrector/src/)
Locate the main method and add any misspelled word you want, for String variable named "wrongWord".
Now compile and run the java file. The corrected word will be printed out.
6) Python Code for Spell Corrector
Steps:
Download the Big Text file using this link.
Clone the Github Project. (Using a variation of Peter Norvigs's Code)
Copy the downloaded Big Text file to the main Github Project Folder. (Python-Spell-Corrector/)
Open the "Spelling.py" file. (Python-Spell-Corrector/)
Locate the last line and add any misspelled word you want.
Now run the python file. The corrected word will be printed out.
References