1. Distances
Compute the Hamming and Levenshtein (edit) distances between these word pairs, first by hand and then using a Java (or Javascript) implementation:
Further Reading
- https://www.geeksforgeeks.org/dynamic-programming-set-5-edit-distance/
- https://www.geeksforgeeks.org/?p=12635
- https://www.geeksforgeeks.org/?p=12819
2. Spelling Correction
For spelling correction, we will use prior knowledge, to put some learning into our system.
The underlying idea is the Noisy Channel Model, that is: The user intends to write a word w
, but through some noise in the process, happens to type the word x
.
The correct word $\hat{w}$ is that word, that is a valid candidate and has the highest probability:
\[\begin{eqnarray} \DeclareMathOperator*{\argmax}{argmax} \hat{w} & = & \argmax_{w \in V} P(w | x) \\ & = & \argmax_{w \in V} \frac{P(x|w) P(w)}{P(x)} \\ & = & \argmax_{w \in V} P(x|w) P(w) \end{eqnarray}\]- The candidates $V$ can be obtained from a vocabulary.
- The probability $P(w)$ of a word $w$ can be learned (counted) from data.
- The probability $P(x|w)$ is more complicated… It could be learned from data, but we could also use a heuristic that relates to the edit distance, e.g. rank by distance.
You can find word statistics and training data at: http://norvig.com/ngrams/
Further Reading
- http://norvig.com/spell-correct.html
- Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 23(5), 517–522. (IBM)
- Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205-210. (Bell Labs)
3. Efficient Implementation
Start working on this in class, finish at home.
How can you implement a spell correction so that it works fast even for large lexica? For now, restrict to isolated words.
Food for thought:
- How could you optimize the distance computation for words that share a prefix?
- What about unknown words?
- How would you couple your algorithm to a user interface? Would functional programming be an option?