Hidden Markov Models

In this exercise, we’ll build an isolated word recognizer using Hidden Markov Models. We will use the following classes of the package com.github.sikoried.jstk.stat:

Mixture: The implementation of the emission densities. There is a constructor that takes an InputStream, as well as a write method to save models to file.
hmm.HMM: The basic implementation of hidden Markov models. There is a constructor that takes an InputStream, as well as a write method to save models to file.
hmm.SCState and hmm.CState: Implementations of states with semi-continuous and continuous emission probabilities, respectively.
hmm.Alignment: Computes and stores HMM state alignments.

Preliminaries

We’ll be using JSTK as well as our own code, so make sure to set up your CLASSPATH:

cd path/to/sl-examples
$(gradle -q env)  # will exec export CLASSPATH=...

Data Prep

Clone the Free Spoken Digit Dataset.
Create file lists for training and testing; we’ll use the *_?.wav files (single-character recording IDs) for testing and the *_??.wav files (two-character recording IDs) for training.
Compute MFCC features, using first derivatives and per-file normalization.
Train a Gaussian mixture model (128 diagonal densities), which we will use as the codebook later.

# clone data set
git clone https://github.com/Jakobovski/free-spoken-digit-dataset.git
cd free-spoken-digit-dataset/recordings

# make lists
/bin/ls [0-9]_{jackson,theo,nicolas,yweweler}_??.wav > list.train
/bin/ls [0-9]_{jackson,theo,nicolas,yweweler}_?.wav > list.test
cat list.{train,test} > list.all

# compute features
mkdir ft
java com.github.sikoried.jstk.app.Mfcc \
	-f t:wav/8 -w hamm,25,10 \
	-b 0,4000,-1,24 -d 5:1 \
	--turn-wise-mvn \
	--in-list list.all ft

# init and train GMM
N=128
mkdir mdl
java com.github.sikoried.jstk.app.Initializer \
	--list list.train --dir ft \
	--gmm mdl/init${N}.mdl -n $N -s g-ev
java com.github.sikoried.jstk.app.GaussEM \
	-i mdl/init${N}.mdl \
	-o mdl/em${N}.mdl \
	-l list.train -d ft

Training

To keep things simple, we’ll make a few assumptions:

The classes (words) are the numbers 0 through 9.
Filenames always follow the scheme {class}_{speaker}_{rec-id}.wav, i.e., the first part is the class label.
Feature files and models are stored in ft/ and mdl/, respectively (hard-code path names).
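
Since the class label is encoded in the file name, both the trainer and the classifier need to extract it. The following is a minimal sketch of that step; LabelParser and classOf are hypothetical names chosen for illustration and are not part of JSTK.

```java
import java.nio.file.Paths;

/** Hypothetical helper (not part of JSTK): extracts the class label from a file name. */
final class LabelParser {
    /**
     * Returns the part of the base name before the first underscore,
     * following the scheme {class}_{speaker}_{rec-id}.wav.
     */
    static String classOf(String path) {
        // drop any directory prefix such as ft/
        String name = Paths.get(path).getFileName().toString();
        int cut = name.indexOf('_');
        if (cut <= 0) {
            throw new IllegalArgumentException("unexpected file name: " + path);
        }
        return name.substring(0, cut);
    }

    public static void main(String[] args) {
        System.out.println(classOf("ft/7_jackson_12.wav"));
    }
}
```

Rejecting names without an underscore early keeps a malformed list entry from silently ending up in the wrong class.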

The overall training routine will be:

1. Align all examples using a linear alignment
2. Accumulate the statistics and re-estimate
3. Force-align the training data
4. Accumulate the statistics and re-estimate; optionally repeat this step
5. Repeat from step 3
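
A linear alignment simply distributes the frames of an utterance evenly over the states of the model. The sketch below shows one way to compute it with plain arrays; LinearAlignment and align are hypothetical names, and JSTK's hmm.Alignment may organize this differently.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of JSTK): uniform frame-to-state assignment. */
final class LinearAlignment {
    /**
     * Splits numFrames frames into numStates contiguous segments of nearly
     * equal length and returns the state index for every frame.
     */
    static int[] align(int numFrames, int numStates) {
        if (numStates < 1 || numFrames < numStates) {
            throw new IllegalArgumentException("need at least one frame per state");
        }
        int[] states = new int[numFrames];
        for (int t = 0; t < numFrames; t++) {
            // integer division yields floor(t * numStates / numFrames)
            states[t] = (int) ((long) t * numStates / numFrames);
        }
        return states;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(align(10, 4)));
    }
}
```

For 10 frames and 4 states this gives the assignment 0, 0, 0, 1, 1, 2, 2, 2, 3, 3. Requiring at least as many frames as states guarantees that every state receives data for the initial estimate.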

Complete the binary iw.Trainer, which accepts a properties file (read via Commons Configuration) containing the following variables (defaults in parentheses):

list that contains the training files
number of states per class/HMM (4)
classes (0, 1, 2, 3, 4, 5, 6, 7, 8, 9; retrieve with getStringArray); models will be stored as <name>.mdl[.<iter>]
directory where to store the model files (mdl/)
directory where to find the feature files (ft/)
number of overall iterations (10)
iterations when to re-align (1, 2, 5, 8; retrieve with getIntArray)
CState: number of densities (1)
SCState: codebook to use (null; note: copy will be written to model directory)

An example properties file could look like

iw.list = list.train
iw.states = 4
iw.classes = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
iw.mdldir = mdl-batch1
iw.ftdir = ft
iw.iterations = 10
iw.realign = 1, 2, 5, 8
iw.codebook = mdl/em10-128.mdl
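
To illustrate how these keys and their defaults fit together, here is a sketch that reads such a file. Note the substitution: it uses java.util.Properties from the standard library as a stand-in so that it runs without dependencies, whereas the exercise asks for Commons Configuration. TrainerSettings is a hypothetical class name, and the key for the number of CState densities is left out because the example file does not name it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical settings holder; java.util.Properties stands in for Commons Configuration. */
final class TrainerSettings {
    final String list;
    final int states;
    final String[] classes;
    final String mdldir;
    final String ftdir;
    final int iterations;
    final int[] realign;
    final String codebook; // null means: no codebook given, use CState

    TrainerSettings(Reader in) {
        Properties p = new Properties();
        try {
            p.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String l = p.getProperty("iw.list");
        if (l == null) {
            throw new IllegalArgumentException("iw.list is required");
        }
        list = l.trim();
        // defaults follow the values given in parentheses in the text
        states = Integer.parseInt(p.getProperty("iw.states", "4").trim());
        classes = split(p.getProperty("iw.classes", "0, 1, 2, 3, 4, 5, 6, 7, 8, 9"));
        mdldir = p.getProperty("iw.mdldir", "mdl/").trim();
        ftdir = p.getProperty("iw.ftdir", "ft/").trim();
        iterations = Integer.parseInt(p.getProperty("iw.iterations", "10").trim());
        String[] r = split(p.getProperty("iw.realign", "1, 2, 5, 8"));
        realign = new int[r.length];
        for (int i = 0; i < r.length; i++) {
            realign[i] = Integer.parseInt(r[i]);
        }
        codebook = p.getProperty("iw.codebook");
    }

    /** Splits a comma-separated value into trimmed items. */
    private static String[] split(String value) {
        String v = value.trim();
        return v.isEmpty() ? new String[0] : v.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        TrainerSettings s = new TrainerSettings(
                new StringReader("iw.list = list.train\niw.states = 4\n"));
        System.out.println(s.list + " " + s.states + " " + s.iterations);
    }
}
```

Keeping the parsed values in one immutable object means the training loop never has to touch the configuration library directly, so swapping the stand-in for Commons Configuration only affects the constructor.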

Implement the Training Routine

1. for each class, allocate an HMM; use SCState if a codebook is specified, CState otherwise
2. compute an initial estimate by creating linear alignments for each file and class; save the initial estimates as <name>.mdl.0
3. reset the accumulators
4. if (cur_iter in realign): compute the forced alignment
5. accumulate according to the alignments (e.g., accumulateVT)
6. re-estimate the parameters (.reestimate()), save the current estimate
7. if (cur_iter < num_iters): go back to step 3
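
To make the forced alignment step concrete, the following sketch runs a Viterbi pass over a strictly left-to-right model in which each frame either stays in its state or advances to the next one. It works on a plain matrix of per-frame, per-state log emission scores and, as a simplifying assumption, ignores transition probabilities. ForcedAlignmentSketch is a hypothetical name; in the exercise itself, hmm.Alignment is meant to do this work.

```java
import java.util.Arrays;

/** Hypothetical sketch (not JSTK code): Viterbi alignment for a left-to-right model. */
final class ForcedAlignmentSketch {
    /**
     * @param logEmit logEmit[t][s] is the log emission score of frame t in state s
     * @return the best state sequence that starts in state 0, ends in the last
     *         state, and stays or advances by one state per frame
     */
    static int[] align(double[][] logEmit) {
        int numFrames = logEmit.length;
        if (numFrames == 0) {
            throw new IllegalArgumentException("no frames");
        }
        int numStates = logEmit[0].length;
        if (numFrames < numStates) {
            throw new IllegalArgumentException("fewer frames than states");
        }
        double[] prev = new double[numStates];
        Arrays.fill(prev, Double.NEGATIVE_INFINITY);
        prev[0] = logEmit[0][0]; // every path has to start in the first state
        // advanced[t][s] is true if the best predecessor of (t, s) is state s - 1
        boolean[][] advanced = new boolean[numFrames][numStates];
        for (int t = 1; t < numFrames; t++) {
            double[] cur = new double[numStates];
            for (int s = 0; s < numStates; s++) {
                double stay = prev[s];
                double move = s > 0 ? prev[s - 1] : Double.NEGATIVE_INFINITY;
                advanced[t][s] = move > stay; // ties are resolved by staying
                cur[s] = logEmit[t][s] + Math.max(stay, move);
            }
            prev = cur;
        }
        // trace back from the last state at the last frame
        int[] path = new int[numFrames];
        int s = numStates - 1;
        for (int t = numFrames - 1; t >= 0; t--) {
            path[t] = s;
            if (advanced[t][s]) {
                s--;
            }
        }
        return path;
    }

    public static void main(String[] args) {
        double[][] scores = {{0, -5}, {0, -5}, {-5, 0}, {-5, 0}};
        System.out.println(Arrays.toString(align(scores)));
    }
}
```

The returned state sequence has the same shape as a linear alignment, so the accumulation step can treat both kinds of alignment alike.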

Classification

Complete the binary iw.Classifier, which accepts a properties file with the following settings:

file list containing the test data
list of classes to load
directory from which to load the model files (mdl/)
directory where to find the feature files (ft/)
if required, the codebook (relative to model directory)

iw.list = list.test
iw.classes = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
iw.mdldir = mdl-batch1
iw.ftdir = ft
iw.codebook = em10-128.mdl

Implement the Classification Routine

Load the model files (and codebook)
For each feature file, compute the alignment score against each of the models
Normalize the scores with a softmax (to obtain a probability for each class)
Output lines of <file> <best-class> <class-scores ...>
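
Alignment scores are typically large negative log values, so exponentiating them directly underflows to zero. The sketch below subtracts the maximum before exponentiating (the log-sum-exp trick) and then formats one output line. ScoreNormalizer and its methods are hypothetical names, and the four-decimal number format is an arbitrary choice.

```java
import java.util.Locale;

/** Hypothetical helper (not part of JSTK): turns per-class log scores into probabilities. */
final class ScoreNormalizer {
    /** Numerically stable softmax over log-domain scores. */
    static double[] softmax(double[] logScores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logScores) {
            max = Math.max(max, v);
        }
        if (max == Double.NEGATIVE_INFINITY) {
            throw new IllegalArgumentException("no finite score");
        }
        double[] p = new double[logScores.length];
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            p[i] = Math.exp(logScores[i] - max); // the largest term becomes exp(0) = 1
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    /** Index of the largest value; the first one wins on ties. */
    static int argmax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Builds a line of the form <file> <best-class> <class-scores ...>. */
    static String formatLine(String file, String[] classes, double[] logScores) {
        double[] p = softmax(logScores);
        StringBuilder sb = new StringBuilder(file).append(' ').append(classes[argmax(p)]);
        for (double v : p) {
            sb.append(' ').append(String.format(Locale.ROOT, "%.4f", v));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] classes = {"0", "1", "2"};
        double[] scores = {-1200.5, -1190.0, -1210.25};
        System.out.println(formatLine("1_theo_3.wav", classes, scores));
    }
}
```

Using Locale.ROOT keeps the decimal separator a dot regardless of the system locale, which makes the output easier to parse in the evaluation step.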

Evaluation

How does your classifier perform?
Run experiments with different settings (iterations, states, classes)
Can you see patterns of classes that get mixed up?
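
One way to approach these questions is to tabulate the classifier output in a confusion matrix. The sketch below assumes the output format described above (<file> <best-class> <class-scores ...>) and takes the reference label from the file name; ConfusionMatrix is a hypothetical helper, not part of JSTK.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of JSTK): counts reference/hypothesis pairs. */
final class ConfusionMatrix {
    // counts.get(ref).get(hyp) is the number of files of class ref recognized as hyp
    private final Map<String, Map<String, Integer>> counts = new TreeMap<>();
    private int total;
    private int correct;

    /** Adds one classifier output line of the form <file> <best-class> <class-scores ...>. */
    void add(String outputLine) {
        String[] fields = outputLine.trim().split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("malformed line: " + outputLine);
        }
        String name = fields[0].substring(fields[0].lastIndexOf('/') + 1);
        int cut = name.indexOf('_');
        if (cut <= 0) {
            throw new IllegalArgumentException("unexpected file name: " + fields[0]);
        }
        String ref = name.substring(0, cut); // the class label is the first part of the name
        String hyp = fields[1];
        counts.computeIfAbsent(ref, k -> new TreeMap<>()).merge(hyp, 1, Integer::sum);
        total++;
        if (ref.equals(hyp)) {
            correct++;
        }
    }

    int count(String ref, String hyp) {
        return counts.getOrDefault(ref, Map.of()).getOrDefault(hyp, 0);
    }

    double accuracy() {
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        ConfusionMatrix cm = new ConfusionMatrix();
        cm.add("3_theo_1.wav 3 0.9 0.1");
        cm.add("3_theo_2.wav 8 0.2 0.8");
        System.out.println(cm.count("3", "8") + " " + cm.accuracy());
    }
}
```

Large off-diagonal counts, i.e. entries where the reference and the recognized class differ, point to the class pairs that get mixed up.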

Adjust for Silence

Change your trainer and classifier so that they allow for leading and trailing silence
Use 3 states for silence; make sure that all recordings contribute to a single silence model
How does modeling silence affect performance?

Outlook: Decoding

The current classifier allows only single word decisions.
How would you handle word sequences?
Recall the DTMF tone decoder. How would you transfer the idea to isolated word recognition?
Outline an algorithm that allows you to decode arbitrary sequences of digits.