# 1. Distances

Compute the Hamming and Levenshtein (edit) distances between these word pairs, first *by hand* and then using a Java (or JavaScript) implementation:
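For checking your hand computations, here is one possible Java implementation of both distances (class and method names are our own choice, not prescribed by the exercise):

```java
public class Distances {

    // Hamming distance: the number of positions at which two
    // equal-length strings differ. Undefined for unequal lengths.
    public static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }

    // Levenshtein distance via the classic dynamic-programming table,
    // kept to two rows since each cell only depends on the previous row.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1,       // deletion
                                            curr[j - 1] + 1),  // insertion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(hamming("karolin", "kathrin"));    // 3
        System.out.println(levenshtein("kitten", "sitting")); // 3
    }
}
```

Note that only two rows of the DP table are kept; if you also want the *alignment* (the actual edit operations), you need the full table so you can trace back from the bottom-right cell.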

### Further Reading

- https://www.geeksforgeeks.org/dynamic-programming-set-5-edit-distance/
- https://www.geeksforgeeks.org/?p=12635
- https://www.geeksforgeeks.org/?p=12819

# 2. Spelling Correction

For spelling correction, we will use prior knowledge to put some *learning* into our system.

The underlying idea is the *Noisy Channel Model*, that is: the user *intends* to write a word $w$, but through some noise in the process happens to type the word $x$.

The correct word $\hat{w}$ is the valid candidate with the highest probability given the observed $x$:

$$\hat{w} = \arg\max_{w \in V} P(w \mid x) = \arg\max_{w \in V} P(x \mid w)\, P(w)$$

- The candidates $V$ can be obtained from a *vocabulary*.
- The probability $P(w)$ of a word $w$ can be *learned (counted) from data*.
- The probability $P(x|w)$ is more complicated: it could be learned from data, but we could also use a *heuristic* that relates to the edit distance, e.g. rank candidates by distance.

You can find word statistics and training data at: http://norvig.com/ngrams/
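The components above can be sketched in a few lines of Java. This is a minimal, illustrative version: the tiny hand-made count table stands in for $P(w)$ learned from real data, and edit distance replaces a proper $P(x|w)$ model, with ties broken by the higher word count:

```java
import java.util.Map;

public class NoisyChannel {

    // Toy unigram counts standing in for P(w); in practice you would
    // load real counts, e.g. from the norvig.com/ngrams data sets.
    static final Map<String, Integer> COUNTS = Map.of(
            "the", 5000, "they", 800, "then", 600, "thaw", 2);

    // Heuristic stand-in for P(x|w): a smaller edit distance means a
    // more plausible typo. The correction is the candidate with the
    // smallest distance; ties go to the more frequent word.
    public static String correct(String x) {
        String best = x;
        int bestDist = Integer.MAX_VALUE;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : COUNTS.entrySet()) {
            int d = levenshtein(x, e.getKey());
            if (d < bestDist || (d == bestDist && e.getValue() > bestCount)) {
                bestDist = d;
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(correct("thex")); // "the", "they", "then" all
                                             // have distance 1; count wins
    }
}
```

Scanning the whole vocabulary and computing a full edit distance per word is fine for a toy table, but clearly does not scale, which is exactly the problem of part 3.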

### Further Reading

- http://norvig.com/spell-correct.html
- Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context based spelling correction. *Information Processing and Management*, 23(5), 517–522. (IBM)
- Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. *Proceedings of COLING* 1990, 205–210. (Bell Labs)

# 3. Efficient Implementation

Start working on this in class, finish at home.

How can you implement spell correction so that it works fast even for large lexica? For now, restrict yourself to isolated words.

Food for thought:

- How could you optimize the distance computation for words that share a prefix?
- What about unknown words?
- How would you couple your algorithm to a user interface? Would functional programming be an option?
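One possible answer to the prefix question, sketched under our own design choices: store the lexicon in a trie and walk it while carrying the DP row of the edit-distance table. Words that share a prefix then share all DP rows for that prefix, and whole subtrees can be pruned as soon as the row minimum exceeds the distance budget:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TrieSearch {

    static class Node {
        Map<Character, Node> children = new TreeMap<>(); // sorted for stable output
        String word; // non-null iff a lexicon word ends at this node
    }

    final Node root = new Node();

    void insert(String w) {
        Node n = root;
        for (char c : w.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.word = w;
    }

    // All lexicon words within maxDist (Levenshtein) of the query.
    List<String> search(String query, int maxDist) {
        List<String> results = new ArrayList<>();
        int[] firstRow = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); j++) firstRow[j] = j;
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            walk(e.getValue(), e.getKey(), query, firstRow, maxDist, results);
        }
        return results;
    }

    // Extend the parent's DP row by one character; shared prefixes are
    // thus computed only once, no matter how many words hang below them.
    private void walk(Node node, char c, String query, int[] prevRow,
                      int maxDist, List<String> results) {
        int cols = query.length() + 1;
        int[] row = new int[cols];
        row[0] = prevRow[0] + 1;
        int rowMin = row[0];
        for (int j = 1; j < cols; j++) {
            int cost = (query.charAt(j - 1) == c) ? 0 : 1;
            row[j] = Math.min(Math.min(row[j - 1] + 1, prevRow[j] + 1),
                              prevRow[j - 1] + cost);
            rowMin = Math.min(rowMin, row[j]);
        }
        if (node.word != null && row[cols - 1] <= maxDist) {
            results.add(node.word);
        }
        if (rowMin <= maxDist) { // prune: distance can only grow below this node
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                walk(e.getValue(), e.getKey(), query, row, maxDist, results);
            }
        }
    }

    public static void main(String[] args) {
        TrieSearch t = new TrieSearch();
        for (String w : new String[]{"cat", "cart", "card", "dog"}) t.insert(w);
        System.out.println(t.search("cart", 1)); // [card, cart, cat]
    }
}
```

The pruning step also hints at an answer to the unknown-word question: if the search with a small distance budget returns nothing, the word is either unknown or too badly garbled, and you can fall back to a larger budget or leave it untouched.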