Cost and States

In this assignment, we will explore different cost functions and ways to model state.

Keyboard Aware Auto-Correct

In the previous assignment, we applied uniform cost to all substitutions. This does not really make sense if you look at a keyboard: the QWERTY layout will favor certain substitutions (eg. a and s), while others are fairly unlikely (eg. a and k).

Implement a distance metric that computes a weight for a given character substitution. Hint: Think of the keyboard as a checkerboard, where the keys are uniformly distributed; a simple heuristic is to map the keys to their offsets on the board, and compute the euclidean distance (compare eg. here).
Integrate this distance to the basic edit distance from last week.

Discuss your Implementation

What could be better heuristics for cost of substitution than the one above?
What about swipe-to-type?
What about capitalization, punctiation and special characters?

Isolated Word Recognition

Acutally, this is really old, but worth the exercise :-)

Getting Started with JSTK (again)

For the following two assignments, we will need the JSTK up and running, ideally with the libraries added to your classpath:

$ export CLASSPATH=$(find path/to/jstk/jstk/build/libs -name '*.jar' | xargs echo | tr ' ' ':')
$ java com.github.sikoried.jstk.app.Version  # does this work?!

Extracting Features

DP can be used to implemented isolated word recognition. However, due to the relatively large sample number (e.g. 8kHz), performing DTW (edit distance with uniform cost) on the raw audio signal is not advised (feel free to try!). A better solution is to compute a set of features; here we will extract mel-frequency cepstral coefficients over windows of 25ms length, shifted by 10ms.

Here’s an example to record audio from using the JSTK, convert it to WAV, and then compute a feature file (in frame format):

# query for audio mixers
$ java com.github.sikoried.jstk.sampled.AudioCapture -L
Default Audio Device
Built-in Microphone

# reord 16khz/mono, convert to WAV
$ java com.github.sikoried.jstk.sampled.AudioCapture -m "Built-in Microphone" | sox -t s16 -r 16000 - rec.wav

# Features berechnen
for i in *.wav; do
	o=${i%%.wav}.ft
	java com.github.sikoried.jstk.app.Mfcc -f t:wav/16 -w hamm,25,10 --turn-wise-mvn -i $i -o $o
done

# ...alternativ: in-out-liste (deutlich schneller!)
for i in `/bin/ls *.wav`; do echo $i ${i%%.wav}.ft; done > inout
java com.github.sikoried.jstk.app.Mfcc -f t:wav/16 -w hamm,25,10 --turn-wise-mvn --in-out-list inout

These feature files can be read using the class com.github.sikoried.jstk.io.FrameInputStream, for example

FrameSource fs = new FrameInputStream(inFile);
int fd = fs.getFrameSize();

double[] buf = new double [fd];

while (fs.read(buf)) {
	// ...
}

Implement Isolated Word Recognitio Using DP

Record the numbers from one to ten to form a training set; try to articulate clearly!
Record a few numbers to form a test set.
Implement isolated word recognition. a. Adapt your DTW implementation from the previous exercise, if necessary. b. For each test word, compute the (normalized) distance to the training words c. Select the closest word as classification result.

# Dateilisten erstellen: "training" (Referenz) und test
for i in one two three four five six seven eight nine zero yes no; do 
	/bin/ls ${i}_*.ft | head -n 9 > $i.lst.train
	/bin/ls ${i}_*.ft | tail -n 1 > $i.lst.test
done

Discuss your Implementation

How can you extend this idea to continuous speech?
How does this algorithm scale with a larger vocabulary, how can it be improved?
What are inherent issues of this approach?

DP and States: DTMF Decoding

You probably know about Dual-tone multi-frequencey signaling (DTMF), which is (was?) used to transmit numbers over the phone in a machine understandable way.

Getting Familiar

Go to audiocheck.net and create a few sequences. An example DTMF sequence (123A456B789C*0#D) is provided here: dtmf_123A_456B_789C_s0pD.wav.

Use a tool such as Wavesurfer or Audacity to get an idea on how the spectrum looks like.

Decoding DTMF Sequences

The idea is now to model DTMF sequences a small, fully corrected graph, that has 13 states: 0-9, A-D, *, # and silence. The mapping from symbol to frequency pair is given as [(1,697,1209), (2,697,1336), (3,697,1477), (A,697,1633), (4,770,1209), (5,770,1336), (6,770,1477), (B,770,1633), (7,852,1209), (8,852,1336), (9,852,1477), (C,852,1633), (*,941,1209), (0,941,1336), (#,941,1477), (D,941,1633)] (see also com.github.sikoried.jstk.framed.DTMF).

Use the DTMF program to extract DTMF features, either using it as a FrameSource, or using its main function:

$ java com.github.sikoried.jstk.framed.DTMF dtmf_123A_456B_789C_s0pD.wav | head -n10
[main] INFO com.github.sikoried.jstk.framed.DTMF  - FFT resolution (Hz/bin) = 15.625
[main] INFO com.github.sikoried.jstk.framed.DTMF  - frequencies = [697, 770, 852, 941, 1209, 1336, 1477, 1633]
[main] INFO com.github.sikoried.jstk.framed.DTMF  - corresponding FFT bins = [45, 49, 55, 60, 77, 86, 95, 105]
88 0.50 0.01 0.00 0.00 0.50 0.00 0.00 0.00 1
80 0.50 0.00 0.00 0.00 0.50 0.00 0.00 0.00 1
79 0.50 0.00 0.00 0.00 0.50 0.00 0.00 0.00 1
37 0.50 0.00 0.00 0.00 0.50 0.00 0.00 0.00 1
66 0.50 0.00 0.00 0.00 0.50 0.00 0.00 0.00 1
33 0.50 0.00 0.00 0.00 0.50 0.00 0.00 0.00 1
45 0.50 0.00 0.00 0.00 0.50 0.00 0.00 0.00 1
...
# energy, 697Hz, 770Hz, ..., dialed key (energy + max-search)

As you can see, it’s not really a hard problem, since a simple maximum search already finds the correct keys most of the time. As a matter of fact, if you filter by a minimal energy (first column), and uniq the output, you get the punched keys:

$ java de.fau.cs.jstk.framed.DTMF dtmf_123A_456B_789C_s0pD.wav | awk '{if ($1 > 100) print $NF}' | uniq
1
2
3
A
4
5
6
B
7
8
9
C
*
0
#
D

Implementing DP with States

We will now abstract edit distance in a way, that we compute the “distance” of an observation sequence to a series of state occupation (ie. key presses).

For the decoding matrix, you will use the number of feature vectors in one dimension, and the number of states in the other.
Define prototype vectors for each state; see DTMF table, use 0.5 for each of the active frequencies, or all 0 for silence.
Remember: you can now “go backwards” on one axis, thus you need to compute the minimum for the complete row (or column, depending on your choice).
Use uniform costs and the distance between the input vector and the state’s prototype vector.
Compute the backtrace and uniq the sequence to get the punched keys.

Discuss your Implementation:

How could it be improved in terms of computation and memory?
Discuss limits of this modeling approach.
Could you extend your isolated word recognition algorithm to this approach? Would it scale?