1. Distances

Compute the Hamming and Levenshtein (edit) distances between these word pairs, first by hand and then using a Java (or JavaScript) implementation:

kühler schrank, schüler krank

nicht ausgeloggt, licht ausgenockt

gurken schaben, schurkengaben
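A straightforward reference implementation could look like the following sketch (plain Java; the class and method names are my own). Note that the Hamming distance is only defined for strings of equal length, which matters for some of the pairs above.

```java
public class Distances {

    // Hamming distance: number of positions at which the two strings
    // differ; defined only for strings of equal length.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }

    // Levenshtein distance via the standard dynamic-programming table:
    // dp[i][j] = distance between the prefixes a[0..i) and b[0..j).
    static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1,      // deletion
                                             dp[i][j - 1] + 1),     // insertion
                                    dp[i - 1][j - 1] + cost);       // substitution
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kühler schrank", "schüler krank"));
    }
}
```

Use the hand-computed values from above to check the implementation.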


2. Spelling Correction

For spelling correction, we will bring in prior knowledge, i.e., put some learning into our system.

The underlying idea is the Noisy Channel Model: the user intends to write a word $w$ but, through some noise in the channel, actually types the word $x$.

The correction $\hat{w}$ is the valid candidate with the highest probability given the observed input $x$:

$\hat{w} = \arg\max_{w \in V} P(w \mid x) = \arg\max_{w \in V} P(x \mid w)\,P(w)$

(The second step is Bayes' rule; $P(x)$ is the same for all candidates and can be dropped.)

  1. The candidates $V$ can be obtained from a vocabulary.
  2. The probability $P(w)$ of a word $w$ can be learned (counted) from data.
  3. The probability $P(x|w)$ is more complicated. It could be learned from error data, but we could also use a heuristic based on edit distance, e.g. rank candidates by their distance to $x$.

You can find word statistics and training data at: http://norvig.com/ngrams/
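Putting the three ingredients together, a minimal corrector in the spirit of Norvig's essay (linked below) might look like this. This is a sketch, not a finished solution: the class name, the tiny inline training text, and the use of edit-distance-1 candidates as a crude stand-in for $P(x|w)$ are all my own choices.

```java
import java.util.*;

public class SpellCorrector {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    // Learn P(w) by counting words in a training text (assumption: the
    // text is already lower-cased and separated by whitespace).
    void train(String text) {
        for (String w : text.split("\\s+")) {
            counts.merge(w, 1, Integer::sum);
            total++;
        }
    }

    double p(String w) {
        return counts.getOrDefault(w, 0) / (double) total;
    }

    // All strings at edit distance 1 from w: deletions, transpositions,
    // substitutions, and insertions over a fixed alphabet.
    Set<String> edits1(String w) {
        Set<String> result = new HashSet<>();
        String alphabet = "abcdefghijklmnopqrstuvwxyzäöüß";
        for (int i = 0; i <= w.length(); i++) {
            String left = w.substring(0, i), right = w.substring(i);
            if (!right.isEmpty())
                result.add(left + right.substring(1));                 // deletion
            if (right.length() > 1)
                result.add(left + right.charAt(1) + right.charAt(0)
                           + right.substring(2));                      // transposition
            for (char c : alphabet.toCharArray()) {
                if (!right.isEmpty())
                    result.add(left + c + right.substring(1));         // substitution
                result.add(left + c + right);                          // insertion
            }
        }
        return result;
    }

    // Known words beat unknown ones; among known distance-1 candidates,
    // pick the one with the highest P(w).
    String correct(String x) {
        if (counts.containsKey(x)) return x;
        return edits1(x).stream()
                .filter(counts::containsKey)
                .max(Comparator.comparingDouble(this::p))
                .orElse(x);
    }

    public static void main(String[] args) {
        SpellCorrector sc = new SpellCorrector();
        sc.train("nicht licht nicht nicht ausgeloggt");
        System.out.println(sc.correct("nichd"));
    }
}
```

For realistic results, replace the inline training text with the word statistics linked above.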

Further Reading

  • http://norvig.com/spell-correct.html
  • Mays, Eric, Fred J. Damerau, and Robert L. Mercer. 1991. Context based spelling correction. Information Processing & Management, 27(5), 517–522. (IBM)
  • Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205-210. (Bell Labs)

3. Efficient Implementation

Start working on this in class, finish at home.

How can you implement spelling correction so that it is fast even for large lexica? For now, restrict yourself to isolated words.

Food for thought:

  1. How could you optimize the distance computation for words that share a prefix?
  2. What about unknown words?
  3. How would you couple your algorithm to a user interface? Would functional programming be an option?
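For question 1, one common answer (a sketch, not the only possibility) is to store the lexicon in a trie and extend the Levenshtein DP table by one row per trie edge: all words sharing a prefix then reuse the rows already computed for that prefix, and a whole subtree can be pruned as soon as every cell of its row exceeds the distance threshold. The class and method names below are my own.

```java
import java.util.*;

public class TrieSearch {
    static class Node {
        Map<Character, Node> children = new TreeMap<>();
        String word; // non-null iff a lexicon entry ends at this node
    }

    final Node root = new Node();

    void add(String w) {
        Node n = root;
        for (char c : w.toCharArray())
            n = n.children.computeIfAbsent(c, k -> new Node());
        n.word = w;
    }

    // Collect all lexicon words within maxDist of the query. Each trie
    // edge adds exactly one DP row, so rows for a shared prefix are
    // computed only once.
    List<String> search(String query, int maxDist) {
        List<String> hits = new ArrayList<>();
        int[] row0 = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); j++) row0[j] = j;
        for (Map.Entry<Character, Node> e : root.children.entrySet())
            walk(e.getValue(), e.getKey(), query, row0, maxDist, hits);
        return hits;
    }

    private void walk(Node n, char c, String query, int[] prev,
                      int maxDist, List<String> hits) {
        int[] row = new int[query.length() + 1];
        row[0] = prev[0] + 1;
        for (int j = 1; j <= query.length(); j++) {
            int cost = query.charAt(j - 1) == c ? 0 : 1;
            row[j] = Math.min(Math.min(row[j - 1] + 1,   // insertion
                                       prev[j] + 1),     // deletion
                              prev[j - 1] + cost);       // substitution
        }
        if (n.word != null && row[query.length()] <= maxDist)
            hits.add(n.word);
        // Prune: if every cell exceeds maxDist, no extension can recover.
        if (Arrays.stream(row).min().getAsInt() <= maxDist)
            for (Map.Entry<Character, Node> e : n.children.entrySet())
                walk(e.getValue(), e.getKey(), query, row, maxDist, hits);
    }
}
```

The pruning step is what makes this fast for large lexica: most of the trie is never visited for small thresholds.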