Alignment techniques¶
Warning
This page is currently out of date with respect to 1.1 and 2.0 and needs updating. Many of the general statements remain true, but the general pipeline for training was significantly overhauled. See News for more details.
This page describes how the Montreal Forced Aligner works internally, for academics and developers interested in its modeling techniques.
The Montreal Forced Aligner by default uses a Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) technique to perform acoustic model training and subsequent alignment. This consists of three training passes.
- First, using monophone models, where each phone is modelled the same regardless of phonological context.
- Second, using triphone models, where context on either side of a phone is taken into account for acoustic models.
- Third, using speaker-adapted triphone models, which take speaker differences into account by calculating an fMLLR transformation of the features for each speaker.
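The difference between the monophone and triphone passes can be illustrated with a small sketch. The function name `to_triphones`, the `left-center+right` label notation, and the `sil` boundary symbol are choices made for this illustration only; they are not taken from the Montreal Forced Aligner code.

```python
def to_triphones(phones, boundary="sil"):
    """Label each phone with its left and right neighbours.

    The label format (left-center+right) and the boundary symbol are
    illustrative assumptions, not MFA's internal representation.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# A monophone model sees only the phone itself ...
monophones = ["k", "ae", "t"]
# ... while a triphone model distinguishes the same phone by its context.
print(to_triphones(monophones))
```

In a monophone system every `ae` shares one model; in a triphone system the `ae` in `k-ae+t` may be modelled separately from the `ae` in another context.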
A schematic diagram of this technique can be found below:
Inputs¶
The user interacts with the Montreal Forced Aligner through the command line, where the argument structure determines whether to train an acoustic model on the test corpus or to use a pretrained model. This page assumes the former; apart from the training itself, the pretrained case follows largely the same structure.
The user’s command to train and align on a dataset is parsed by montreal_forced_aligner/command_line/train_and_align.py, whose function align_corpus() instantiates a variety of objects:
- A Corpus (Corpus API), which contains information about the speech dataset, including Mel-frequency cepstral coefficient features (MFCCs), according to the audio provided;
- A Dictionary (Dictionary API), which contains pronunciation and orthographic information about the language, according to the dictionary provided;
- A TrainableAligner (Aligner API), whose functions will perform training and alignment using the Corpus and the Dictionary.
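As a rough illustration of the role the Dictionary plays, the sketch below turns a transcript into a phone sequence. It assumes a plain-text lexicon with one entry per line (a word followed by whitespace-separated phones); the function names and the `spn` symbol for unknown words are hypothetical and do not come from the Dictionary API.

```python
def parse_dictionary(lines):
    """Map each word to a list of pronunciations (each a list of phones)."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, phones = parts[0], parts[1:]
        lexicon.setdefault(word, []).append(phones)
    return lexicon


def transcript_to_phones(words, lexicon, oov="spn"):
    """Use the first listed pronunciation; unknown words map to `oov`."""
    phones = []
    for word in words:
        pronunciations = lexicon.get(word)
        phones.extend(pronunciations[0] if pronunciations else [oov])
    return phones


lexicon = parse_dictionary(["cat k ae t", "the dh ah", "the dh iy"])
print(transcript_to_phones(["the", "cat"], lexicon))
```

Picking the first pronunciation is a simplification for this sketch; it says nothing about how pronunciation variants are actually chosen during alignment.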
The TrainableAligner then performs passes of training, exporting output TextGrids at the end of each pass.
Note
For the pretrained case, the Aligner object created is instead a PretrainedAligner. In addition, the pretrained acoustic model is instantiated as an AcousticModel object (Model API).
First (Monophone) Pass¶
The TrainableAligner’s function montreal_forced_aligner.aligner.TrainableAligner.train_mono() executes the monophone training.
The monophone training is initialized by montreal_forced_aligner.aligner.TrainableAligner._init_mono(), which uses the following multiprocessing functions to set up the monophone system, compile its training graphs, and produce a zeroth alignment, which simply gives equal length to each segment.
- compile_train_graphs(directory, …[, debug]): Multiprocessing function that compiles training graphs for utterances
- mono_align_equal(mono_directory, …): Multiprocessing function that creates equal alignments for base monophone training
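The idea behind the zeroth alignment can be sketched as follows. This is a phone-level simplification with a hypothetical function name; the actual equal alignment operates on the compiled training graphs rather than on a bare phone list.

```python
def equal_alignment(num_frames, phones):
    """Assign each phone a contiguous, near-equal share of the frames."""
    if not phones or num_frames < len(phones):
        raise ValueError("need at least one frame per phone")
    base, extra = divmod(num_frames, len(phones))
    alignment = []
    for i, phone in enumerate(phones):
        # The first `extra` phones absorb the leftover frames.
        alignment.extend([phone] * (base + (1 if i < extra else 0)))
    return alignment


print(equal_alignment(7, ["k", "ae", "t"]))
```

Such an alignment is almost certainly wrong in its details, but it gives every phone some data to initialize its model before the first real alignment pass.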
Next, monophone training is continued by montreal_forced_aligner.aligner.TrainableAligner._do_mono_training(). This function itself calls montreal_forced_aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the monophone pass, which uses the following multiprocessing functions to train the monophone system in a loop and then produce a first alignment.
- align(iteration, directory, split_directory, …): Multiprocessing function that aligns based on the current model
- acc_stats(iteration, directory, …): Multiprocessing function that computes stats for GMM training
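The alternation between aligning and accumulating statistics can be sketched with a toy model. Everything here is an assumption made for illustration: features are single numbers, each phone is one Gaussian mean with fixed variance, and `toy_align`, `toy_acc_stats`, and `toy_train` are stand-ins that do not correspond to the real multiprocessing functions, which work with HMM states and Gaussian mixtures.

```python
def toy_align(frames, phones, means):
    """Forced alignment by dynamic programming: each phone, in order, gets
    one or more consecutive frames, minimizing squared distance to its mean."""
    n, m = len(frames), len(phones)
    if m == 0 or n < m:
        raise ValueError("need at least one frame per phone")
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    stay = [[False] * m for _ in range(n)]  # True: previous frame had the same phone
    for t in range(n):
        for j in range(m):
            local = (frames[t] - means[phones[j]]) ** 2
            if t == 0:
                if j == 0:
                    cost[t][j] = local
                continue
            same = cost[t - 1][j]
            prev = cost[t - 1][j - 1] if j > 0 else inf
            best = min(same, prev)
            if best < inf:
                cost[t][j] = best + local
                stay[t][j] = same <= prev
    # Trace back from the last frame, which must belong to the last phone.
    path, j = [], m - 1
    for t in range(n - 1, -1, -1):
        path.append(phones[j])
        if t > 0 and not stay[t][j]:
            j -= 1
    return path[::-1]


def toy_acc_stats(frames, path, stats):
    """Accumulate per-phone sums and counts from one aligned utterance."""
    for x, phone in zip(frames, path):
        total, count = stats.get(phone, (0.0, 0))
        stats[phone] = (total + x, count + 1)


def toy_train(utterances, means, iterations=5):
    """Repeat: align with the current model, accumulate stats, re-estimate."""
    for _ in range(iterations):
        stats = {}
        for frames, phones in utterances:
            toy_acc_stats(frames, toy_align(frames, phones, means), stats)
        means = {
            p: (stats[p][0] / stats[p][1] if p in stats else mu)
            for p, mu in means.items()
        }
    return means


utterances = [([0.1, 0.0, 0.2, 5.1, 4.9, 5.0], ["a", "b"])]
print(toy_train(utterances, {"a": 1.0, "b": 2.0}))
```

Each iteration uses the current model to decide which frames belong to which phone, then uses that assignment to improve the model, which is the same loop structure the passes on this page follow at a much larger scale.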
Finally, montreal_forced_aligner.aligner.TrainableAligner.export_textgrids() exports the output aligned TextGrids.
Second (Triphone) Pass¶
The TrainableAligner’s function montreal_forced_aligner.aligner.TrainableAligner.train_tri() executes the triphone training.
The triphone training is initialized by montreal_forced_aligner.aligner.TrainableAligner._init_tri(), which uses the following multiprocessing functions to set up the triphone system, construct a decision tree (since not all possible triphones will appear in the dataset), and prepare the alignments from the first (monophone) pass for use in training:
- tree_stats(directory, align_directory, …): Multiprocessing function that computes stats for decision tree training
- compile_train_graphs(directory, …[, debug]): Multiprocessing function that compiles training graphs for utterances
- convert_alignments(directory, …): Multiprocessing function that converts alignments from previous training
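The motivation for the decision tree, that only a fraction of possible triphones ever occur, can be checked with a short count. The function name is hypothetical and the sketch ignores utterance-boundary contexts for simplicity.

```python
def triphone_coverage(utterances, phone_set):
    """Return (distinct triphones observed, triphones theoretically possible)."""
    seen = set()
    for phones in utterances:
        # Only phones with both a left and a right neighbour are counted here.
        for i in range(1, len(phones) - 1):
            seen.add((phones[i - 1], phones[i], phones[i + 1]))
    return len(seen), len(phone_set) ** 3


utterances = [["k", "ae", "t"], ["t", "ae", "k"], ["k", "ae", "t"]]
observed, possible = triphone_coverage(utterances, {"k", "ae", "t"})
print(f"{observed} of {possible} possible triphones observed")
```

With a realistic phone inventory the gap is far larger, so contexts with little or no data need to share parameters with similar contexts, which is what clustering with a decision tree provides.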
Next, triphone training is continued by montreal_forced_aligner.aligner.TrainableAligner._do_tri_training(). This function itself calls montreal_forced_aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the triphone pass, which uses the following multiprocessing functions to train the triphone system in a loop and then produce a second alignment.
- align(iteration, directory, split_directory, …): Multiprocessing function that aligns based on the current model
- acc_stats(iteration, directory, …): Multiprocessing function that computes stats for GMM training
Finally, montreal_forced_aligner.aligner.TrainableAligner.export_textgrids() exports the output aligned TextGrids.
Third (Speaker-Adapted Triphone) Pass¶
The TrainableAligner’s function montreal_forced_aligner.aligner.TrainableAligner.train_tri_fmllr() executes the speaker-adapted triphone training.
The speaker-adapted triphone training is initialized by montreal_forced_aligner.aligner.TrainableAligner._init_tri() with fmllr=True, which uses the following multiprocessing functions to set up the triphone system, construct a decision tree, and prepare the alignments from the second (triphone) pass for use in training:
- tree_stats(directory, align_directory, …): Multiprocessing function that computes stats for decision tree training
- compile_train_graphs(directory, …[, debug]): Multiprocessing function that compiles training graphs for utterances
- convert_alignments(directory, …): Multiprocessing function that converts alignments from previous training
Next, speaker-adapted triphone training is continued by montreal_forced_aligner.aligner.TrainableAligner._do_tri_training() with fmllr=True. This function itself calls montreal_forced_aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the speaker-adapted triphone pass, which uses the following multiprocessing functions to calculate the fMLLR transform, train the speaker-adapted triphone system in a loop, and then produce a third alignment.
- align(iteration, directory, split_directory, …): Multiprocessing function that aligns based on the current model
- acc_stats(iteration, directory, …): Multiprocessing function that computes stats for GMM training
- calc_fmllr(directory, split_directory, …): Multiprocessing function that computes speaker adaptation (fMLLR)
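An fMLLR transform is an affine map applied to every feature vector of a given speaker. The sketch below shows only the application step, assuming the transform is already estimated and stored as rows of `[A | b]`; estimating it from alignments and the acoustic model is the harder part and is not shown. The function names and data layout are hypothetical.

```python
def apply_fmllr(features, transform):
    """Apply x' = A x + b to every frame.

    `transform` has one row per output dimension; each row holds the
    corresponding row of A followed by the offset b (an assumed layout).
    """
    adapted = []
    for frame in features:
        extended = list(frame) + [1.0]  # trailing 1.0 picks up the offset
        adapted.append(
            [sum(w * x for w, x in zip(row, extended)) for row in transform]
        )
    return adapted


def adapt_corpus(features_by_utt, utt_to_speaker, transforms):
    """Transform each utterance with the transform of its speaker."""
    return {
        utt: apply_fmllr(feats, transforms[utt_to_speaker[utt]])
        for utt, feats in features_by_utt.items()
    }


transforms = {"spk1": [[1.0, 0.0, 0.5], [0.0, 2.0, -1.0]]}
features = {"utt1": [[1.0, 2.0]]}
print(adapt_corpus(features, {"utt1": "spk1"}, transforms))
```

Because the transform is shared by all utterances of a speaker, it can absorb speaker-specific characteristics while the acoustic model itself stays speaker-independent.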
Finally, montreal_forced_aligner.aligner.TrainableAligner.export_textgrids() exports the output aligned TextGrids.
Normally, this is the end of the pipeline: the corpus has now been aligned according to the HMM-GMM framework.
Appendix: I-Vector Extractor Training¶
Note
This appendix describes the training pipeline for the i-vector extractor. Currently this is not configurable from the command line, and only pretrained models are available. However, for the sake of completeness, its structure is outlined here.
The pipeline consists of three steps:
- An LDA + MLLT (Maximum Likelihood Linear Transform) transformation is applied to the features of a corpus.
- A diagonal UBM (Universal Background Model) is generated from several GMMs fit to these features.
- An i-vector extractor is trained from the corpus data and the UBM.
Then, the i-vector extractor is used during DNN training to extract i-vectors representing the properties of the speaker.
A schematic diagram of this technique can be found below:
LDA + MLLT¶
The TrainableAligner’s function montreal_forced_aligner.aligner.BaseAligner.train_lda_mllt() executes the LDA + MLLT transformation.
The LDA + MLLT transformation is initialized by montreal_forced_aligner.aligner.TrainableAligner._init_lda_mllt(), which uses the following multiprocessing functions to set up the system, construct a decision tree, and prepare the alignments from the previous pass:
- lda_acc_stats(directory, split_directory, …): Multiprocessing function that accumulates LDA statistics
- tree_stats(directory, align_directory, …): Multiprocessing function that computes stats for decision tree training
- convert_alignments(directory, …): Multiprocessing function that converts alignments from previous training
- compile_train_graphs(directory, …[, debug]): Multiprocessing function that compiles training graphs for utterances
Next, training is continued by montreal_forced_aligner.aligner.TrainableAligner._do_lda_mllt_training() with lda_mllt=True. This function itself calls montreal_forced_aligner.aligner.BaseAligner._do_training() with the appropriate parameters for the LDA + MLLT pass, which uses the following multiprocessing functions to calculate the LDA + MLLT transform.
- align(iteration, directory, split_directory, …): Multiprocessing function that aligns based on the current model
- acc_stats(iteration, directory, …): Multiprocessing function that computes stats for GMM training
- calc_lda_mllt(directory, split_directory, …): Multiprocessing function that calculates LDA+MLLT transformations
Diagonal UBM¶
The TrainableAligner’s function montreal_forced_aligner.aligner.TrainableAligner.train_diagonal_ubm() executes the Diagonal UBM training, using the following multiprocessing functions:
- gmm_gselect(config, num_jobs): Multiprocessing function that stores Gaussian selection indices on disk
- acc_global_stats(config, num_jobs, iteration): Multiprocessing function that accumulates global GMM stats
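Gaussian selection keeps, for each frame, only the few mixture components that score best, so that later steps can skip the rest. The following is a minimal sketch for a diagonal-covariance mixture; the function names and the `(weight, mean, variance)` representation are assumptions for illustration and not the format used on disk.

```python
import math


def diag_log_likelihood(frame, mean, var):
    """Log density of a diagonal-covariance Gaussian at `frame`."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(frame, mean, var)
    )


def gaussian_select(frame, ubm, top_n):
    """Return indices of the `top_n` best-scoring components, best first.

    `ubm` is a list of (weight, mean, variance) tuples (assumed layout).
    """
    scores = [
        math.log(weight) + diag_log_likelihood(frame, mean, var)
        for weight, mean, var in ubm
    ]
    ranked = sorted(range(len(ubm)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]


ubm = [
    (1 / 3, [0.0, 0.0], [1.0, 1.0]),
    (1 / 3, [5.0, 5.0], [1.0, 1.0]),
    (1 / 3, [10.0, 10.0], [1.0, 1.0]),
]
print(gaussian_select([4.0, 4.5], ubm, top_n=2))
```

Storing only the selected indices per frame trades a small loss in accuracy for a large reduction in the number of Gaussians evaluated in subsequent iterations.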
I-Vector Extractor¶
The TrainableAligner’s function montreal_forced_aligner.aligner.TrainableAligner.ivector_extractor() executes the i-vector extractor training.
The i-vector extractor training is initialized and continued by montreal_forced_aligner.aligner.TrainableAligner._train_ivector_extractor(), which uses the following multiprocessing functions:
- gauss_to_post(config, num_jobs): Multiprocessing function that does Gaussian selection and posterior extraction
- acc_ivector_stats(config, num_jobs, iteration): Multiprocessing function that calculates i-vector extractor stats
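Posterior extraction turns the scores of the selected Gaussians into per-frame weights that sum to one. A minimal sketch, assuming the weighted log-likelihoods are already available as a list and using a hypothetical function name:

```python
import math


def component_posteriors(log_scores, selected):
    """Normalize the scores of the selected components into posteriors.

    `log_scores[i]` is the weighted log-likelihood of component i for one
    frame; `selected` lists the component indices kept by Gaussian selection.
    """
    peak = max(log_scores[i] for i in selected)
    # Subtracting the peak before exponentiating avoids numerical underflow.
    unnormalized = {i: math.exp(log_scores[i] - peak) for i in selected}
    total = sum(unnormalized.values())
    return {i: value / total for i, value in unnormalized.items()}


log_scores = [math.log(1.0), math.log(3.0), -50.0]
print(component_posteriors(log_scores, selected=[0, 1]))
```

Components outside `selected` are treated as having zero posterior, which is the approximation that Gaussian selection introduces.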
Note
The i-vector extractor is represented as an IVectorExtractor object (Model API).