Alignment techniques¶
Warning
This page is currently out of date with respect to 1.1 and needs updating. Many of the general statements remain true, but the general pipeline for training was significantly overhauled. See What’s new in 1.1 for more details.
This page outlines the inner workings of the Montreal Forced Aligner, for academics and developers interested in its modeling techniques.
The Montreal Forced Aligner by default uses a Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) technique to perform acoustic model training and subsequent alignment. This consists of three training passes:
- First, using monophone models, where each phone is modelled the same regardless of phonological context.
- Second, using triphone models, where context on either side of a phone is taken into account for acoustic models.
- Third, using speaker-adapted triphone models, which take speaker differences into account and calculate an fMLLR transformation of the features for each speaker.
A schematic diagram of this technique can be found below:
Inputs¶
The user interacts with the Montreal Forced Aligner through the command line, where the argument structure determines whether the user wants to train an acoustic model on the test corpus or use a pretrained model. This page assumes the former; the pretrained case follows a largely similar structure apart from the actual training.
The user’s command to train and align on a dataset is parsed by aligner/command_line/train_and_align.py, whose function align_corpus() instantiates a variety of objects:
- A Corpus (Corpus API), which contains information about the speech dataset, including Mel-frequency cepstral coefficient (MFCC) features computed from the audio provided;
- A Dictionary (Dictionary API), which contains pronunciation and orthographic information about the language, according to the dictionary provided;
- A TrainableAligner (Aligner API), whose functions will perform training and alignment using the Corpus and the Dictionary.
The TrainableAligner then performs passes of training, exporting output TextGrids at the end of each pass.
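Schematically, align_corpus() wires these objects together along the following lines. This is a minimal sketch: the module paths and constructor signatures are assumptions for illustration, while the train_* method names come from the passes described below.

```python
# Rough sketch of the flow inside align_corpus(). The module paths and
# constructor signatures here are illustrative assumptions, not the exact API.
from aligner.corpus import Corpus
from aligner.dictionary import Dictionary
from aligner.aligner import TrainableAligner

corpus_directory = '/path/to/corpus'
dictionary_path = '/path/to/dictionary.txt'
output_directory = '/path/to/output'

corpus = Corpus(corpus_directory, output_directory)         # computes MFCC features
dictionary = Dictionary(dictionary_path, output_directory)  # pronunciation lexicon
aligner = TrainableAligner(corpus, dictionary, output_directory)

aligner.train_mono()       # first pass: monophone models
aligner.train_tri()        # second pass: triphone models
aligner.train_tri_fmllr()  # third pass: speaker-adapted triphone models
```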
Note
For the pretrained case, the Aligner object created is instead a Pretrained Aligner. In addition, the pretrained acoustic model is instantiated as an AcousticModel object (Model API).
First (Monophone) Pass¶
The TrainableAligner’s function aligner.aligner.TrainableAligner.train_mono() executes the monophone training.
The monophone training is initialized by aligner.aligner.TrainableAligner._init_mono(), which uses the following multiprocessing functions to set up the monophone system, compile its training graphs, and produce a zeroth alignment, which simply gives each segment equal length (illustrated in the sketch after the table):
compile_train_graphs (directory, …[, debug]) | Multiprocessing function that compiles training graphs for utterances
mono_align_equal (mono_directory, …) | Multiprocessing function that creates equal alignments for base monophone training
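The equal alignment idea can be shown with a minimal, self-contained sketch (not MFA’s actual code): an utterance’s frames are simply divided evenly among its phones.

```python
def equal_alignment(num_frames, phones):
    """Toy 'zeroth' alignment: split an utterance's frames evenly
    among its phones (illustrative only, not MFA's actual code)."""
    base, rem = divmod(num_frames, len(phones))
    alignment, start = [], 0
    for i, phone in enumerate(phones):
        length = base + (1 if i < rem else 0)  # distribute leftover frames
        alignment.append((phone, start, start + length))
        start += length
    return alignment

# e.g. 100 frames for "cat":
print(equal_alignment(100, ['k', 'ae', 't']))
# [('k', 0, 34), ('ae', 34, 67), ('t', 67, 100)]
```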
Next, monophone training is continued by aligner.aligner.TrainableAligner._do_mono_training(). This function itself calls aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the monophone pass, which uses the following multiprocessing functions to train the monophone system in a loop and then produce a first alignment (a toy version of this loop follows the table):
align (iteration, directory, split_directory, …) | Multiprocessing function that aligns based on the current model
acc_stats (iteration, directory, …) | Multiprocessing function that computes stats for GMM training
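Conceptually, _do_training() alternates alignment with statistics accumulation, an instance of (hard) EM. Here is a minimal, self-contained toy of that alternation, with single 1-D Gaussians standing in for the HMM-GMM states (not MFA’s code, which does this over HMM states via Kaldi, split across parallel jobs):

```python
# Toy version of the align/acc_stats alternation (hard EM) on 1-D features,
# with two "phones" modeled as single Gaussians.
import numpy as np

rng = np.random.default_rng(0)
frames = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])
means = np.array([-0.5, 0.5])  # deliberately poor initial model

for iteration in range(10):
    # "align": assign each frame to the closest phone model
    labels = np.argmin(np.abs(frames[:, None] - means[None, :]), axis=1)
    # "acc_stats" + update: re-estimate each phone's mean from its frames
    means = np.array([frames[labels == k].mean() for k in (0, 1)])

print(means)  # converges toward the true means (-2, 2)
```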
Finally, aligner.aligner.TrainableAligner.export_textgrids() exports the output aligned TextGrids.
Second (Triphone) Pass¶
The TrainableAligner’s function aligner.aligner.TrainableAligner.train_tri() executes the triphone training.
The triphone training is initialized by aligner.aligner.TrainableAligner._init_tri(), which uses the following multiprocessing functions to set up the triphone system, construct a decision tree (since not all possible triphones will appear in the dataset; see the sketch after the table), and prepare the alignments from the first (monophone) pass for use in training:
tree_stats (directory, align_directory, …) | Multiprocessing function that computes stats for decision tree training
compile_train_graphs (directory, …[, debug]) | Multiprocessing function that compiles training graphs for utterances
convert_alignments (directory, …) | Multiprocessing function that converts alignments from previous training
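To get a feel for the problem the decision tree solves, consider a minimal, self-contained sketch (illustrative only; MFA/Kaldi build the tree from the tree_stats output): the number of possible triphones grows cubically with the phone inventory, and most never occur in a given corpus.

```python
# Why triphone models need a decision tree: context-dependent units explode
# combinatorially, and most never occur in the training data.
phones = ['sil', 'k', 'ae', 't', 's']
print(len(phones) ** 3)  # 125 possible triphones even for this tiny inventory

def triphones(sequence):
    """Enumerate left-context/phone/right-context triples for an utterance."""
    padded = ['sil'] + sequence + ['sil']
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

print(triphones(['k', 'ae', 't', 's']))
# [('sil','k','ae'), ('k','ae','t'), ('ae','t','s'), ('t','s','sil')]
```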
Next, triphone training is continued by aligner.aligner.TrainableAligner._do_tri_training(). This function itself calls aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the triphone pass, which uses the following multiprocessing functions to train the triphone system in a loop and then produce a second alignment:
align (iteration, directory, split_directory, …) | Multiprocessing function that aligns based on the current model
acc_stats (iteration, directory, …) | Multiprocessing function that computes stats for GMM training
Finally, aligner.aligner.TrainableAligner.export_textgrids() exports the output aligned TextGrids.
Third (Speaker-Adapted Triphone) Pass¶
The TrainableAligner’s function aligner.aligner.TrainableAligner.train_tri_fmllr() executes the speaker-adapted triphone training.
The speaker-adapted triphone training is initialized by aligner.aligner.TrainableAligner._init_tri() with fmllr=True, which uses the following multiprocessing functions to set up the triphone system, construct a decision tree, and prepare the alignments from the second (triphone) pass for use in training:
tree_stats (directory, align_directory, …) | Multiprocessing function that computes stats for decision tree training
compile_train_graphs (directory, …[, debug]) | Multiprocessing function that compiles training graphs for utterances
convert_alignments (directory, …) | Multiprocessing function that converts alignments from previous training
Next, speaker-adapted triphone training is continued by aligner.aligner.TrainableAligner._do_tri_training() with fmllr=True. This function itself calls aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the speaker-adapted triphone pass, which uses the following multiprocessing functions to calculate the fMLLR transform, train the speaker-adapted triphone system in a loop, and then produce a third alignment (an illustration of the fMLLR transform follows the table):
align (iteration, directory, split_directory, …) | Multiprocessing function that aligns based on the current model
acc_stats (iteration, directory, …) | Multiprocessing function that computes stats for GMM training
calc_fmllr (directory, split_directory, …) | Multiprocessing function that computes speaker adaptation (fMLLR)
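What calc_fmllr ultimately produces is, per speaker, an affine transform of the feature frames. Here is a minimal sketch of applying such a transform; the matrix and bias values are placeholders, since estimating them is the hard part that Kaldi handles.

```python
# Applying a per-speaker fMLLR transform x' = A x + b to every feature frame.
# A and b are placeholder values; in MFA they are estimated by calc_fmllr.
import numpy as np

rng = np.random.default_rng(0)
dim = 3                                 # real features are higher-dimensional (e.g. 39)
features = rng.normal(size=(100, dim))  # one speaker's frames

A = np.eye(dim) * 1.1   # placeholder fMLLR matrix for this speaker
b = np.full(dim, 0.2)   # placeholder bias term

adapted = features @ A.T + b  # x' = A x + b, applied to every frame
print(adapted.shape)          # (100, 3): same shape, speaker-normalized space
```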
Finally, aligner.aligner.TrainableAligner.export_textgrids() exports the output aligned TextGrids.
Normally, this is the end of the pipeline: the corpus has now been aligned according to the HMM-GMM framework.
Deep Neural Networks (DNNs)¶
Note
The DNN framework for the Montreal Forced Aligner is operational, but may not give better results than the alignments produced by the standard HMM-GMM pipeline. Preliminary experiments suggest that results improve when the DNN used to produce alignments is pre-trained on a corpus similar in quality (conversational vs. clean speech) to, and longer than, the test corpus.
Since the code is newly developed, if you run into any issues, please contact us on the mailing list or on GitHub.
The Montreal Forced Aligner also has the capacity to use DNNs for training, thus creating an HMM-DNN framework on top of the existing HMM-GMM framework. This functionality is based on Kaldi’s nnet2 recipes.
The basic idea behind this functionality is to train a DNN using a subset of the HMM-GMM alignments as gold-standard data. Although the DNN is trained to reproduce the predictions of the third-pass alignments, it may, by the nature of DNNs, learn a better acoustic model representation than the GMMs were able to achieve.
The analogue to speaker adaptation in the HMM-DNN framework is the use of i-vectors, which can be thought of as “speaker embeddings” appended to the acoustic features. These are calculated by an i-vector extractor, which has its own training pipeline (see the appendix below). For now, only a pretrained model is available: by default, the i-vector extractor is trained on a 100-hour subset of LibriSpeech.
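A minimal sketch of this appending step follows; the dimensions here are illustrative, not MFA’s defaults.

```python
# Appending an i-vector ("speaker embedding") to each acoustic feature frame.
# Dimensions are illustrative assumptions, not MFA's defaults.
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 40))  # 200 frames of acoustic features
ivector = rng.normal(size=100)       # one i-vector for the whole speaker/utterance

augmented = np.hstack([frames, np.tile(ivector, (len(frames), 1))])
print(augmented.shape)  # (200, 140): every frame now carries speaker information
```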
A schematic diagram of this technique can be found below:
Fourth (DNN) Pass¶
The TrainableAligner’s function aligner.aligner.BaseAligner.train_nnet_basic() executes the DNN training.
First, i-vectors are extracted from the test corpus by aligner.aligner.BaseAligner._extract_ivectors(), which uses the following multiprocessing function:
extract_ivectors (config, num_jobs) | Multiprocessing function that extracts i-vectors
Next, an LDA-like transform is applied to the features with appended i-vectors, in order to decorrelate them, which uses the following multiprocessing function:
get_lda_nnet (config, align_directory, num_jobs) | Multiprocessing function that extracts training examples and does LDA transformation
Then, a subset of training examples is gathered, using the following multiprocessing function:
get_egs (config, ali_dir, valid_uttlist, …) | Multiprocessing function that gets training examples for the neural net
Then the DNN is initialized. Following nnet2, it is a DNN with p-norm activation functions and online preconditioning, and its output component is a softmax nonlinearity. About halfway through training, the DNN “mixes up”, copying components of the weight matrix and allowing them to develop independent probability distributions for different realizations of the same phone. The main training loop works through stochastic gradient descent and uses the following multiprocessing functions (the p-norm activation is illustrated in the sketch after the table):
nnet_train_trans (nnet_dir, align_dir, …) | Multiprocessing function that trains transition probabilities and sets priors
nnet_train (nnet_dir, egs_dir, mdl, i, num_jobs) | Multiprocessing function that trains the neural net
get_average_posteriors (i, nnet_dir, …) | Multiprocessing function that gets average posteriors for purposes of computing priors (for nnet)
relabel_egs (i, nnet_dir, egs_in, alignments, …) | Multiprocessing function that relabels training examples
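For concreteness, here is a minimal sketch of a p-norm activation, assuming p=2 and a group size of 10 (common nnet2 defaults; this is an illustration, not MFA’s code): each output unit is the p-norm of a group of input units.

```python
# Sketch of the nnet2-style p-norm activation: each output unit is the
# p-norm of a consecutive group of input units. p=2 and group_size=10
# are assumed defaults for illustration.
import numpy as np

def pnorm(x, group_size=10, p=2):
    """Reduce each consecutive group of input units to its p-norm."""
    grouped = x.reshape(x.shape[0], x.shape[-1] // group_size, group_size)
    return (np.abs(grouped) ** p).sum(axis=-1) ** (1.0 / p)

hidden = np.random.default_rng(0).normal(size=(4, 3000))  # a batch of pre-activations
print(pnorm(hidden).shape)  # (4, 300): dimensionality reduced by group_size
```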
Finally, an alignment is generated, using the following multiprocessing functions:
compile_train_graphs (directory, …[, debug]) | Multiprocessing function that compiles training graphs for utterances
nnet_align (i, config, train_directory, …) | Multiprocessing function that generates an nnet alignment
The output TextGrids from the DNN alignment are exported by aligner.aligner.TrainableAligner.export_textgrids().
Appendix: I-Vector Extractor Training¶
Note
This appendix describes the training pipeline for the i-vector extractor. Currently this is not configurable from the command line, and only pretrained models are available. However, for the sake of completeness, its structure is outlined here.
The pipeline consists of three steps:
- An LDA + MLLT (Maximum Likelihood Linear Transform) transformation is applied to the features of a corpus.
- A diagonal UBM (Universal Background Model) is generated from several GMMs fit to these features.
- An i-vector extractor is trained from the corpus data and the UBM.
Then, the i-vector extractor is used during DNN training to extract i-vectors representing the properties of the speaker.
A schematic diagram of this technique can be found below:
LDA + MLLT¶
The TrainableAligner’s function aligner.aligner.BaseAligner.train_lda_mllt() executes the LDA + MLLT transformation.
The LDA + MLLT transformation is initialized by aligner.aligner.TrainableAligner._init_lda_mllt(), which uses the following multiprocessing functions to set up the system, construct a decision tree, and prepare the alignments from the previous pass:
lda_acc_stats (directory, split_dir, …) | Multiprocessing function that accumulates LDA statistics
tree_stats (directory, align_directory, …) | Multiprocessing function that computes stats for decision tree training
convert_alignments (directory, …) | Multiprocessing function that converts alignments from previous training
compile_train_graphs (directory, …[, debug]) | Multiprocessing function that compiles training graphs for utterances
Next, training is continued by aligner.aligner.TrainableAligner._do_lda_mllt_training() with lda_mllt=True. This function itself calls aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the LDA + MLLT pass, which uses the following multiprocessing functions to calculate the LDA + MLLT transform (see the sketch after the table):
align (iteration, directory, split_directory, …) | Multiprocessing function that aligns based on the current model
acc_stats (iteration, directory, …) | Multiprocessing function that computes stats for GMM training
calc_lda_mllt (directory, split_directory, …) | Multiprocessing function that calculates LDA+MLLT transformations
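As a rough illustration of the LDA half of this transform, here is a sketch using scikit-learn (an assumption for illustration; MFA uses Kaldi, not scikit-learn, and the labels here stand in for per-frame HMM-state alignments):

```python
# Sketch of the LDA step: project features into a lower-dimensional,
# decorrelated space that separates the HMM-state classes.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
states = rng.integers(0, 5, size=500)                    # per-frame HMM-state labels
features = rng.normal(size=(500, 13)) + states[:, None]  # toy state-dependent features

lda = LinearDiscriminantAnalysis(n_components=4).fit(features, states)
transformed = lda.transform(features)  # decorrelated, class-separating projection
print(transformed.shape)               # (500, 4)
```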
Diagonal UBM¶
The TrainableAligner’s function aligner.aligner.TrainableAligner.train_diagonal_ubm() executes the diagonal UBM training, using the following multiprocessing functions (a toy illustration of a diagonal UBM follows the table):
gmm_gselect (config, num_jobs) | Multiprocessing function that stores Gaussian selection indices on disk
acc_global_stats (config, num_jobs, iteration) | Multiprocessing function that accumulates global GMM stats
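Conceptually, a diagonal UBM is one large GMM with diagonal covariances fit to pooled features from the whole corpus. A minimal sketch via scikit-learn (MFA does this with Kaldi in parallel jobs; the component count and dimensions here are illustrative):

```python
# Toy diagonal UBM: a single diagonal-covariance GMM fit to pooled frames
# from all speakers. Component count and dimensions are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

pooled = np.random.default_rng(0).normal(size=(2000, 13))  # frames from the whole corpus
ubm = GaussianMixture(n_components=64, covariance_type='diag', random_state=0).fit(pooled)
print(ubm.means_.shape)  # (64, 13): one diagonal Gaussian per component
```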
I-Vector Extractor¶
The TrainableAligner’s function aligner.aligner.TrainableAligner.ivector_extractor() executes the i-vector extractor training.
The i-vector extractor training is initialized and continued by aligner.aligner.TrainableAligner._train_ivector_extractor(), which uses the following multiprocessing functions:
gauss_to_post (config, num_jobs) | Multiprocessing function that does Gaussian selection and posterior extraction
acc_ivector_stats (config, num_jobs, iteration) | Multiprocessing function that calculates i-vector extractor stats
Note
The i-vector extractor is represented as an IVectorExtractor object (Model API).