# Acoustic model training options
> **Note**
>
> See Global Options for options relating to the alignment steps. Global alignment options can be overridden for each trainer (e.g., different beam settings at different stages of training).
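As a sketch of such an override, the following config tightens the beam for the monophone stage only, while the triphone stage inherits the global value (the values here are illustrative, not recommendations):

```yaml
beam: 10         # global beam used by all alignment steps
retry_beam: 40

training:
  - monophone:
      beam: 6    # override: stricter beam during monophone alignment only
  - triphone:
      subset: 0  # no beam override; inherits the global beam of 10
```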
> **Note**
>
> Subsets are created by sorting the utterances by length, taking a larger pool (10 times the specified subset size), and then randomly sampling the specified number of utterances from that pool. Utterances whose transcriptions are only one word long are ignored.
## Monophone training options

For the Kaldi recipe that monophone training is based on, see `train_mono.sh`.
| Parameter | Default value | Notes |
|---|---|---|
| `subset` | 10000 | Number of utterances to use (0 uses the full corpus) |
| `initial_beam` | 6 | Initial beam width for the first alignment iteration |
| `beam` | 10 | Beam width for alignment iterations after the first |
| `num_iterations` | 40 | Number of training iterations |
| `max_gaussians` | 1000 | Total number of Gaussians |
| `boost_silence` | 1.25 | Factor to boost silence probabilities |
| `power` | 0.25 | Exponent for number of Gaussians based on occurrence counts |
Realignment iterations are scheduled by splitting the total number of training iterations into quarters: the first quarter realigns on every iteration, the second quarter on every other iteration, and the final two quarters on every third iteration.
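Spelling out the table's defaults in config form, a monophone block might look like the following sketch (every value shown is the documented default):

```yaml
training:
  - monophone:
      subset: 10000    # 0 would use the full corpus
      initial_beam: 6  # beam for the first alignment iteration
      beam: 10         # beam for subsequent alignment iterations
      num_iterations: 40
      max_gaussians: 1000
      boost_silence: 1.25
      power: 0.25
```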
## Triphone training options

For the Kaldi recipe that triphone training is based on, see `train_deltas.sh`.
| Parameter | Default value | Notes |
|---|---|---|
| `subset` | 0 | Number of utterances to use (0 uses the full corpus) |
| `num_iterations` | 35 | Number of training iterations |
| `max_gaussians` | 10000 | Total number of Gaussians |
| `power` | 0.25 | Exponent for number of Gaussians based on occurrence counts |
| `num_leaves` | 1000 | Number of states in the decision tree |
| `cluster_threshold` | -1 | Threshold for clustering leaves in decision tree |
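The same options as a config sketch, using the defaults from the table above:

```yaml
training:
  - triphone:
      subset: 0              # use the full corpus
      num_iterations: 35
      max_gaussians: 10000
      power: 0.25
      num_leaves: 1000
      cluster_threshold: -1  # default threshold for clustering decision-tree leaves
```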
## LDA training options

For the Kaldi recipe that LDA training is based on, see `train_lda_mllt.sh`.
| Parameter | Default value | Notes |
|---|---|---|
| `subset` | 0 | Number of utterances to use (0 uses the full corpus) |
| `num_iterations` | 35 | Number of training iterations |
| `max_gaussians` | 10000 | Total number of Gaussians |
| `power` | 0.25 | Exponent for number of Gaussians based on occurrence counts |
| `num_leaves` | 1000 | Number of states in the decision tree |
| `cluster_threshold` | -1 | Threshold for clustering leaves in decision tree |
| `lda_dimension` | 40 | Dimension of resulting LDA features |
| `random_prune` | 4.0 | Ratio of random pruning to speed up MLLT |
LDA estimation will be performed every other iteration for the first quarter of iterations, and then one final LDA estimation will be performed halfway through the training iterations.
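An LDA block sketch that includes the two LDA-specific options from the table (all values are the documented defaults):

```yaml
training:
  - lda:
      subset: 0
      num_iterations: 35
      max_gaussians: 10000
      num_leaves: 1000
      lda_dimension: 40   # dimension of the resulting LDA features
      random_prune: 4.0   # random pruning ratio to speed up MLLT
```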
## Speaker-adapted training (SAT) options

For the Kaldi recipe that SAT training is based on, see `train_sat.sh`.
| Parameter | Default value | Notes |
|---|---|---|
| `subset` | 0 | Number of utterances to use (0 uses the full corpus) |
| `num_iterations` | 35 | Number of training iterations |
| `max_gaussians` | 10000 | Total number of Gaussians |
| `power` | 0.25 | Exponent for number of Gaussians based on occurrence counts |
| `num_leaves` | 1000 | Number of states in the decision tree |
| `cluster_threshold` | -1 | Threshold for clustering leaves in decision tree |
| `silence_weight` | 0.0 | Weight on silence in fMLLR estimation |
| `fmllr_update_type` | full | Type of fMLLR estimation |
| `optional` | False | Flag for skipping a training block if the corpus is smaller than the subset |
| `quick` | False | Based on `train_quick.sh`; performs fewer rounds of fMLLR estimation |
fMLLR estimation will be performed every other iteration for the first quarter of iterations, and then one final fMLLR estimation will be performed halfway through the training iterations.
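A SAT block sketch showing the fMLLR-related options, with `quick` and `optional` spelled out for illustration (all values are the documented defaults):

```yaml
training:
  - sat:
      subset: 0
      num_iterations: 35
      num_leaves: 1000
      max_gaussians: 10000
      silence_weight: 0.0
      fmllr_update_type: "full"
      quick: false     # true would perform fewer rounds of fMLLR estimation
      optional: false  # true would skip the block if the corpus is smaller than the subset
```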
## Pronunciation probability modeling options

For the Kaldi recipe that pronunciation probability training is based on, see `get_prons.sh`. Dictionaries can also be trained on new datasets using pretrained models. The current default training regime performs two rounds of pronunciation probability modeling, after the second and third SAT blocks.
| Parameter | Default value | Notes |
|---|---|---|
| `silence_probabilities` | True | Flag for whether to compute probabilities of silence before and after each word's pronunciation, in addition to the pronunciation probability |
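A minimal block sketch; setting `silence_probabilities: false` restricts modeling to pronunciation probabilities alone:

```yaml
training:
  - pronunciation_probabilities:
      silence_probabilities: false  # skip pre- and post-word silence probabilities
```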
## Default training config file

The configuration file below is equivalent to the current 2.0 training regime, shown mostly as an example of which configuration options are available and how they progress through the overall training.
```yaml
beam: 10
retry_beam: 40

features:
  type: "mfcc"
  use_energy: false
  use_pitch: true
  frame_shift: 10

training:
  - monophone:
      subset: 10000
      num_iterations: 40
      max_gaussians: 1000
      boost_silence: 1.25

  - triphone:
      subset: 20000
      num_iterations: 35
      num_leaves: 2000
      max_gaussians: 10000
      cluster_threshold: -1
      boost_silence: 1.25
      power: 0.25

  - lda:
      subset: 20000
      num_leaves: 2500
      max_gaussians: 15000
      num_iterations: 35

  - sat:
      subset: 20000
      num_leaves: 2500
      max_gaussians: 15000
      power: 0.2
      silence_weight: 0.0
      fmllr_update_type: "full"

  - sat:
      subset: 50000
      num_leaves: 4200
      max_gaussians: 40000
      power: 0.2
      silence_weight: 0.0
      fmllr_update_type: "full"

  - pronunciation_probabilities:
      subset: 50000
      silence_probabilities: true

  - sat:
      subset: 150000
      num_leaves: 5000
      max_gaussians: 100000
      power: 0.2
      silence_weight: 0.0
      fmllr_update_type: "full"

  - pronunciation_probabilities:
      subset: 150000
      silence_probabilities: true
      optional: true  # skipped if the corpus is smaller than the subset

  - sat:
      subset: 0
      quick: true  # performs fewer rounds of fMLLR estimation
      num_iterations: 20
      num_leaves: 7000
      max_gaussians: 150000
      power: 0.2
      silence_weight: 0.0
      fmllr_update_type: "full"
      optional: true  # skipped if the corpus is smaller than the previous subset
```
## Training configuration for 1.0

The configuration below matches the training procedure used for models trained with version 1.0. Note the lack of an LDA block, the single SAT training block, and the absence of subsets in the initial training blocks.
```yaml
beam: 10
retry_beam: 40

features:
  type: "mfcc"
  use_energy: false
  frame_shift: 10

training:
  - monophone:
      num_iterations: 40
      max_gaussians: 1000
      boost_silence: 1.0

  - triphone:
      num_iterations: 35
      num_leaves: 3100
      max_gaussians: 50000
      cluster_threshold: 100
      boost_silence: 1.0
      power: 0.25

  - sat:
      num_leaves: 3100
      max_gaussians: 50000
      power: 0.2
      silence_weight: 0.0
      cluster_threshold: 100
      fmllr_update_type: "full"
```