Alignment techniques

Warning

This page is currently out of date with respect to 1.1 and needs updating. Many of the general statements remain true, but the general pipeline for training was significantly overhauled. See What’s new in 1.1 for more details.

This page outlines how the Montreal Forced Aligner actually works, for academics and developers interested in its modeling techniques.

The Montreal Forced Aligner by default uses a Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) technique to perform acoustic model training and subsequent alignment. Training consists of three passes:

  1. First, monophone models, in which each phone is modelled identically regardless of phonological context.
  2. Second, triphone models, in which the phonological context on either side of a phone is taken into account.
  3. Third, speaker-adapted triphone models, which take speaker differences into account and calculate an fMLLR transformation of the features for each speaker.

A schematic diagram of this technique can be found below:

_images/MFA_default.svg

Inputs

The user interacts with the Montreal Forced Aligner through the command line, where the argument structure determines whether the user wants to train an acoustic model on the test corpus or to use a pretrained model. This page assumes the former; the pretrained case follows a largely similar structure, apart from the training itself.

The user’s command to train and align on a dataset is parsed by aligner/command_line/train_and_align.py, whose function align_corpus() instantiates a variety of objects:

  • A Corpus (Corpus API), which contains information about the speech dataset, including Mel-frequency cepstral coefficient features (MFCCs), according to the audio provided;
  • A Dictionary (Dictionary API), which contains pronunciation and orthographic information about the language, according to the dictionary provided;
  • A TrainableAligner (Aligner API), whose functions will perform training and alignment using the Corpus and the Dictionary.

The TrainableAligner then performs passes of training, exporting output TextGrids at the end of each pass.
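
This flow can be sketched in a few lines of Python. The sketch below is illustrative only: the module paths for Corpus and Dictionary and the constructor arguments shown are assumptions, not the actual signatures; only the aligner.aligner.TrainableAligner method names are taken from this page.

    # Hypothetical sketch of the training flow described above.
    # Module paths and constructor arguments are assumptions; the
    # train_* and export_textgrids() method names are documented here.
    from aligner.corpus import Corpus
    from aligner.dictionary import Dictionary
    from aligner.aligner import TrainableAligner

    corpus = Corpus('/data/corpus', '/tmp/mfa')          # computes MFCCs
    dictionary = Dictionary('/data/lexicon.txt', '/tmp/mfa')
    aligner = TrainableAligner(corpus, dictionary, '/data/output')

    aligner.train_mono()        # first (monophone) pass
    aligner.train_tri()         # second (triphone) pass
    aligner.train_tri_fmllr()   # third (speaker-adapted) pass
    aligner.export_textgrids()  # write the aligned TextGrids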

Note

For the pretrained case, the Aligner object created is instead a Pretrained Aligner. In addition, the pretrained acoustic model is instantiated as an AcousticModel object (Model API).

First (Monophone) Pass

The TrainableAligner’s function aligner.aligner.TrainableAligner.train_mono() executes the monophone training.

The monophone training is initialized by aligner.aligner.TrainableAligner._init_mono(), which uses the following multiprocessing functions to set up the monophone system, compile its training graphs, and produce a zeroth alignment that simply assigns equal length to each segment (a standalone sketch follows the function list below).

compile_train_graphs(directory, …[, debug]) Multiprocessing function that compiles training graphs for utterances
mono_align_equal(mono_directory, …) Multiprocessing function that creates equal alignments for base monophone training
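
To make the zeroth alignment concrete, here is a small standalone sketch (not MFA's implementation) of an equal-length split: given an utterance's frame count and its phone sequence, each phone receives the same share of frames.

    def equal_alignment(num_frames, phones):
        """Assign each phone an equal share of the utterance's frames,
        mirroring the equal-length zeroth alignment described above."""
        per_phone = num_frames // len(phones)
        alignment = []
        start = 0
        for i, phone in enumerate(phones):
            # Any leftover frames go to the final phone.
            end = num_frames if i == len(phones) - 1 else start + per_phone
            alignment.append((phone, start, end))
            start = end
        return alignment

    print(equal_alignment(100, ['HH', 'AH', 'L', 'OW']))
    # [('HH', 0, 25), ('AH', 25, 50), ('L', 50, 75), ('OW', 75, 100)]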

Next, monophone training is continued by aligner.aligner.TrainableAligner._do_mono_training(). This function calls aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the monophone pass; _do_training() uses the following multiprocessing functions to train the monophone system in a loop and then produce a first alignment (the loop is sketched after the function list).

align(iteration, directory, split_directory, …) Multiprocessing function that aligns based on the current model
acc_stats(iteration, directory, …) Multiprocessing function that computes stats for GMM training
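
Schematically, the loop inside _do_training() alternates realignment with statistics accumulation and re-estimation. The sketch below is a simplified standalone illustration; the iteration count, realignment schedule, and update step are stand-ins, not MFA's actual values or code.

    # Simplified sketch of the align/accumulate training loop; the
    # function bodies are stubs for the multiprocessing functions above.
    NUM_ITERATIONS = 40
    REALIGN_ITERATIONS = {10, 20, 30}  # illustrative schedule

    def align(iteration):
        print('iter %d: realigning against the current model' % iteration)

    def acc_stats(iteration):
        print('iter %d: accumulating GMM statistics' % iteration)

    def update_model(iteration):
        print('iter %d: re-estimating GMM parameters' % iteration)

    for iteration in range(1, NUM_ITERATIONS + 1):
        if iteration in REALIGN_ITERATIONS:
            align(iteration)      # Viterbi realignment
        acc_stats(iteration)      # accumulate sufficient statistics
        update_model(iteration)   # maximization step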

Finally, aligner.aligner.TrainableAligner.export_textgrids() exports the output aligned TextGrids.

Second (Triphone) Pass

The TrainableAligner’s function aligner.aligner.TrainableAligner.train_tri() executes the triphone training.

The triphone training is initialized by aligner.aligner.TrainableAligner._init_tri(), which uses the following multiprocessing functions to set up the triphone system, construct a decision tree (since not all possible triphones will appear in the dataset), and prepare the alignments from the first (monophone) pass for use in training:

tree_stats(directory, align_directory, …) Multiprocessing function that computes stats for decision tree training
compile_train_graphs(directory, …[, debug]) Multiprocessing function that compiles training graphs for utterances
convert_alignments(directory, …) Multiprocessing function that converts alignments from previous training
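
The sparsity that motivates the decision tree is easy to see in a standalone sketch: even a short utterance realizes only a handful of the thousands of triphones a phone inventory licenses.

    def triphones(phone_seq):
        """Yield (left, center, right) context windows over an
        utterance, padding the edges with silence."""
        padded = ['SIL'] + phone_seq + ['SIL']
        for i in range(1, len(padded) - 1):
            yield (padded[i - 1], padded[i], padded[i + 1])

    # "hello world" yields just 7 triphones; a 40-phone inventory
    # licenses 40 ** 3 = 64,000 in principle.
    print(list(triphones(['HH', 'AH', 'L', 'OW', 'W', 'ER', 'D'])))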

Next, triphone training is continued by aligner.aligner.TrainableAligner._do_tri_training(). This function calls aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the triphone pass; _do_training() uses the following multiprocessing functions to train the triphone system in a loop and then produce a second alignment.

align(iteration, directory, split_directory, …) Multiprocessing function that aligns based on the current model
acc_stats(iteration, directory, …) Multiprocessing function that computes stats for GMM training

Finally, aligner.aligner.TrainableAligner.export_textgrids() exports the output aligned TextGrids.

Third (Speaker-Adapted Triphone) Pass

The TrainableAligner’s function aligner.aligner.TrainableAligner.train_tri_fmllr() executes the speaker-adapted triphone training.

The speaker-adapted triphone training is initialized by aligner.aligner.TrainableAligner._init_tri() with fmllr=True, which uses the following multiprocessing functions to set up the triphone system, construct a decision tree, and prepare the alignments from the second (triphone) pass for use in training:

tree_stats(directory, align_directory, …) Multiprocessing function that computes stats for decision tree training
compile_train_graphs(directory, …[, debug]) Multiprocessing function that compiles training graphs for utterances
convert_alignments(directory, …) Multiprocessing function that converts alignments from previous training

Next, speaker-adapted triphone training is continued by aligner.aligner.TrainableAligner._do_tri_training() with fmllr=True. This function calls aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the speaker-adapted triphone pass; _do_training() uses the following multiprocessing functions to calculate the fMLLR transform, train the speaker-adapted triphone system in a loop, and then produce a third alignment.

align(iteration, directory, split_directory, …) Multiprocessing function that aligns based on the current model
acc_stats(iteration, directory, …) Multiprocessing function that computes stats for GMM training
calc_fmllr(directory, split_directory, …) Multiprocessing function that computes speaker adaptation (fMLLR)
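
An fMLLR transform is simply an affine transform of each speaker's feature vectors, x' = Ax + b. Applying one is a single matrix multiply, as in the standalone numpy sketch below (estimating the transform is the hard part, done by calc_fmllr above).

    import numpy as np

    def apply_fmllr(feats, transform):
        """Apply a per-speaker fMLLR transform to a (frames x dim)
        feature matrix; transform is the (dim x (dim + 1)) matrix
        [A b] estimated for that speaker."""
        num_frames, dim = feats.shape
        # Append a 1 to each frame so the bias column b applies too.
        extended = np.hstack([feats, np.ones((num_frames, 1))])
        return extended @ transform.T

    # Toy example: 5 frames of 13-dimensional features; the identity
    # transform with zero bias leaves the features unchanged.
    feats = np.random.randn(5, 13)
    transform = np.hstack([np.eye(13), np.zeros((13, 1))])
    assert np.allclose(apply_fmllr(feats, transform), feats)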

Finally, aligner.aligner.TrainableAligner.export_textgrids() exports the output aligned TextGrids.

Normally, this is the end of the pipeline: the corpus has now been aligned according to the HMM-GMM framework.

Deep Neural Networks (DNNs)

Note

The DNN framework for the Montreal Forced Aligner is operational, but may not give better results than the alignments produced by the standard HMM-GMM pipeline. Preliminary experiments suggest that results may improve when the DNN model used to produce alignments is pretrained on a corpus similar in quality (conversational vs. clean speech) to, and larger than, the test corpus.

Since the code is newly developed, if you run into any issues, please contact us on the mailing list or on GitHub.

The Montreal Forced Aligner also has the capacity to use DNNs for training, thus creating an HMM-DNN framework on top of the existing HMM-GMM framework. This functionality is based on Kaldi’s nnet2 recipes.

The basic idea behind this functionality is to train a DNN using a subset of the HMM-GMM alignments as the gold-standard data. Although the DNN is technically trained to reproduce the third-pass alignments, it may nonetheless learn a better acoustic model representation than the GMMs were able to achieve.
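
In other words, the DNN is trained as a frame-level classifier: each feature frame's target is the HMM state that the third-pass alignment assigned to it, and the loss is the cross-entropy of the network's softmax output against that target. A standalone numpy sketch of the objective (not MFA's code):

    import numpy as np

    def frame_cross_entropy(logits, target_states):
        """Mean cross-entropy of per-frame state predictions against
        the GMM-derived alignment labels."""
        # Numerically stable log-softmax over the HMM-state outputs.
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        # Negative log-probability of each frame's aligned state.
        return -log_probs[np.arange(len(target_states)), target_states].mean()

    # 4 frames, 3 HMM states; targets come from the third-pass alignment.
    logits = np.random.randn(4, 3)
    targets = np.array([0, 2, 1, 1])
    print(frame_cross_entropy(logits, targets))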

The analogue to speaker adaptation in the HMM-DNN framework is the use of i-vectors, which can be thought of as “speaker embeddings” appended to the acoustic features. These are calculated by an i-vector extractor, which has its own training pipeline. For now, only a pretrained model is available: the default i-vector extractor is trained on a 100-hour subset of LibriSpeech.
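
Appending an i-vector is a plain concatenation: the speaker's i-vector is tiled across that speaker's frames and stacked onto the per-frame features. A standalone sketch (the 13- and 100-dimensional sizes are illustrative):

    import numpy as np

    def append_ivector(feats, ivector):
        """Tile a speaker's i-vector across every frame of that
        speaker's (frames x feat_dim) matrix and concatenate."""
        tiled = np.tile(ivector, (feats.shape[0], 1))
        return np.hstack([feats, tiled])

    mfccs = np.random.randn(200, 13)   # 200 frames of MFCCs
    ivector = np.random.randn(100)     # one i-vector per speaker
    print(append_ivector(mfccs, ivector).shape)  # (200, 113)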

A schematic diagram of this technique can be found below:

_images/MFA_dnn.svg

Fourth (DNN) Pass

The TrainableAligner’s function aligner.aligner.BaseAligner.train_nnet_basic() executes the DNN training.

First, i-vectors are extracted from the test corpus by aligner.aligner.BaseAligner._extract_ivectors(), which uses the following multiprocessing function:

extract_ivectors(config, num_jobs) Multiprocessing function that extracts i-vectors.

Next, an LDA-like transform is applied to the i-vector-appended features in order to decorrelate them, using the following multiprocessing function:

get_lda_nnet(config, align_directory, num_jobs) Multiprocessing function that extracts training examples and does LDA transformation
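
Decorrelation here can be pictured as a whitening-style linear transform estimated from the data. The standalone sketch below uses unsupervised PCA whitening as a stand-in; MFA's actual transform is the supervised, LDA-like estimate computed by get_lda_nnet above.

    import numpy as np

    def whitening_transform(feats):
        """Estimate a linear transform that decorrelates the features
        (PCA whitening; a stand-in for the LDA-like transform)."""
        centered = feats - feats.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        # Scale each eigenvector column by 1/sqrt(its eigenvalue).
        return eigvecs / np.sqrt(eigvals)

    feats = np.random.randn(1000, 113)
    transform = whitening_transform(feats)
    decorrelated = (feats - feats.mean(axis=0)) @ transform
    # The transformed features have identity covariance.
    print(np.allclose(np.cov(decorrelated, rowvar=False), np.eye(113)))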

Then, a subset of training examples is amassed, using the following multiprocessing function:

get_egs(config, ali_dir, valid_uttlist, …) Multiprocessing function that gets training examples for the neural net

Then the DNN is initialized. Following nnet2, it is a DNN with p-norm activation functions (sketched after the function list below) and online preconditioning, and its output component is a softmax nonlinearity. About halfway through training, the DNN “mixes up”: components of the weight matrix are copied and allowed to independently develop probability distributions for different realizations of the same phone. The main training loop works through stochastic gradient descent and uses the following multiprocessing functions:

nnet_train_trans(nnet_dir, align_dir, …) Multiprocessing function that trains transition probabilities and sets priors.
nnet_train(nnet_dir, egs_dir, mdl, i, num_jobs) Multiprocessing function that trains the neural net.
get_average_posteriors(i, nnet_dir, …) Multiprocessing function that gets average posterior for purposes of computing priors (for nnet)
relabel_egs(i, nnet_dir, egs_in, alignments, …) Multiprocessing function that relabels training examples
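
The p-norm unit mentioned above groups its inputs into consecutive blocks and outputs the p-norm of each block, reducing the layer's dimension by the group size (Kaldi's nnet2 recipes default to p = 2). A standalone sketch:

    import numpy as np

    def pnorm(x, group_size, p=2):
        """p-norm activation: reshape a (frames x dim) activation
        matrix into groups of group_size and output each group's
        p-norm, giving (frames x dim / group_size) outputs."""
        frames, dim = x.shape
        grouped = x.reshape(frames, dim // group_size, group_size)
        return (np.abs(grouped) ** p).sum(axis=2) ** (1.0 / p)

    # A 2000-dimensional hidden layer reduced to 200 p-norm outputs.
    hidden = np.random.randn(5, 2000)
    print(pnorm(hidden, group_size=10).shape)  # (5, 200)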

Finally, an alignment is generated, using the following multiprocessing functions:

compile_train_graphs(directory, …[, debug]) Multiprocessing function that compiles training graphs for utterances
nnet_align(i, config, train_directory, …) Multiprocessing function that generates an nnet alignment

The output TextGrids from the DNN alignment are exported by aligner.aligner.TrainableAligner.export_textgrids().

Appendix: I-Vector Extractor Training

Note

This appendix describes the training pipeline for the i-vector extractor. Currently this is not configurable from the command line, and only pretrained models are available. However, for the sake of completeness, its structure is outlined here.

The pipeline consists of three steps:

  1. An LDA + MLLT (Maximum Likelihood Linear Transform) transformation is applied to the features of a corpus.
  2. A diagonal UBM (Universal Background Model) is generated from several GMMs fit to these features.
  3. An i-vector extractor is trained from the corpus data and the UBM.

Then, the i-vector extractor is used during DNN training to extract i-vectors representing the properties of the speaker.

A schematic diagram of this technique can be found below:

_images/MFA_dnn_ivectors.svg

LDA + MLLT

The TrainableAligner’s function aligner.aligner.BaseAligner.train_lda_mllt() executes the LDA + MLLT transformation.

The LDA + MLLT transformation is initialized by aligner.aligner.TrainableAligner._init_lda_mllt(), which uses the following multiprocessing functions to set up the system, construct a decision tree, and prepare the alignments from the previous pass:

lda_acc_stats(directory, split_dir, …) Multiprocessing function that accumulates LDA statistics
tree_stats(directory, align_directory, …) Multiprocessing function that computes stats for decision tree training
convert_alignments(directory, …) Multiprocessing function that converts alignments from previous training
compile_train_graphs(directory, …[, debug]) Multiprocessing function that compiles training graphs for utterances

Next, training is continued by aligner.aligner.TrainableAligner._do_lda_mllt_training() with lda_mllt=True. This function calls aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the LDA + MLLT pass; _do_training() uses the following multiprocessing functions to calculate the LDA + MLLT transform.

align(iteration, directory, split_directory, …) Multiprocessing function that aligns based on the current model
acc_stats(iteration, directory, …) Multiprocessing function that computes stats for GMM training
calc_lda_mllt(directory, split_directory, …) Multiprocessing function that calculates LDA+MLLT transformations

Diagonal UBM

The TrainableAligner’s function aligner.aligner.TrainableAligner.train_diagonal_ubm() executes the Diagonal UBM training, using the following multiprocessing functions:

gmm_gselect(config, num_jobs) Multiprocessing function that stores Gaussian selection indices on disk
acc_global_stats(config, num_jobs, iteration) Multiprocessing function that accumulates global GMM stats
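
A diagonal UBM is one large GMM with diagonal covariances fit to frames pooled across all speakers. The standalone sketch below uses scikit-learn on toy data to illustrate both the model and the Gaussian selection step that gmm_gselect performs (component counts and dimensions are illustrative):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Pool frames from the whole corpus (toy stand-in data) and fit a
    # diagonal-covariance GMM: the universal background model.
    frames = np.random.randn(5000, 40)
    ubm = GaussianMixture(n_components=64, covariance_type='diag',
                          max_iter=20).fit(frames)

    # Gaussian selection keeps only the highest-scoring components
    # for each frame, so later passes can ignore the rest.
    top = np.argsort(ubm.predict_proba(frames), axis=1)[:, -5:]
    print(top.shape)  # (5000, 5): the 5 best Gaussians per frame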

I-Vector Extractor

The TrainableAligner’s function aligner.aligner.TrainableAligner.ivector_extractor() executes the i-vector extractor training.

The i-vector extractor training is initialized and continued by aligner.aligner.TrainableAligner._train_ivector_extractor(), which uses the following multiprocessing functions:

gauss_to_post(config, num_jobs) Multiprocessing function that does Gaussian selection and posterior extraction
acc_ivector_stats(config, num_jobs, iteration) Multiprocessing function that calculates i-vector extractor stats

Note

The i-vector extractor is represented as an IVectorExtractor object (Model API).