Configuration¶
Global options¶
These options are used for aligning the full dataset (and as part of training). Increasing the values of them will allow for more relaxed restrictions on alignment. Relaxing these restrictions can be particularly helpful for certain kinds of files that are quite different from the training dataset (i.e., single word production data from experiments, or longer stretches of audio).
Parameter | Default value | Notes |
---|---|---|
beam | 10 | Initial beam width to use for alignment |
retry_beam | 40 | Beam width to use if initial alignment fails |
transition_scale | 1.0 | Multiplier to scale transition costs |
acoustic_scale | 0.1 | Multiplier to scale acoustic costs |
self_loop_scale | 0.1 | Multiplier to scale self loop costs |
boost_silence | 1.0 | 1.0 is the value that does not affect probabilities |
Feature Configuration¶
This section is only relevant for training, as the trained model will contain extractors and feature specification for what it requires.
Parameter | Default value | Notes |
---|---|---|
type | mfcc | Currently only MFCCs are supported |
use_energy | False | Use energy in place of first MFCC |
frame_shift | 10 | In milliseconds, determines time resolution |
Training configuration¶
Global alignment options can be overwritten for each trainer (i.e., different beam settings at different stages of training).
Note
Subsets are created by sorting the utterances by length, taking a larger subset (10 times the specified subset amount) and then randomly sampling the specified subset amount from this larger subset. Utterances with transcriptions that are only one word long are ignored.
Monophone Configuration¶
For the Kaldi recipe that monophone training is based on, see https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/train_mono.sh
Parameter | Default value | Notes |
---|---|---|
subset | 0 | Number of utterances to use (0 uses the full corpus) |
num_iterations | 40 | Number of training iterations |
max_gaussians | 40 | Total number of gaussians |
power | 0.25 | Exponent for gaussians based on occurrence counts |
Realignment iterations for training are calculated based on splitting the number of iterations into quarters. The first quarter of training will perform realignment every iteration, the second quarter will perform realignment every other iteration, and the final two quarters will perform realignment every third iteration.
Triphone Configuration¶
For the Kaldi recipe that triphone training is based on, see https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/train_deltas.sh
Parameter | Default value | Notes |
---|---|---|
subset | 0 | Number of utterances to use (0 uses the full corpus) |
num_iterations | 40 | Number of training iterations |
max_gaussians | 40 | Total number of gaussians |
power | 0.25 | Exponent for gaussians based on occurrence counts |
num_leaves | 1000 | Number of states in the decision tree |
cluster_threshold | -1 | Threshold for clustering leaves in decision tree |
LDA Configuration¶
For the Kaldi recipe that LDA training is based on, see https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/train_lda_mllt.sh
Parameter | Default value | Notes |
---|---|---|
subset | 0 | Number of utterances to use (0 uses the full corpus) |
num_iterations | 40 | Number of training iterations |
max_gaussians | 40 | Total number of gaussians |
power | 0.25 | Exponent for gaussians based on occurrence counts |
num_leaves | 1000 | Number of states in the decision tree |
cluster_threshold | -1 | Threshold for clustering leaves in decision tree |
lda_dimension | 40 | Dimension of resulting LDA features |
random_prune | 4.0 | Ratio of random pruning to speed up MLLT |
LDA estimation will be performed every other iteration for the first quarter of iterations, and then one final LDA estimation will be performed halfway through the training iterations.
Speaker-adapted training (SAT) configuration¶
For the Kaldi recipe that SAT training is based on, see https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/train_sat.sh
Parameter | Default value | Notes |
---|---|---|
subset | 0 | Number of utterances to use (0 uses the full corpus) |
num_iterations | 40 | Number of training iterations |
max_gaussians | 1000 | Total number of gaussians |
power | 0.25 | Exponent for gaussians based on occurrence counts |
num_leaves | 1000 | Number of states in the decision tree |
cluster_threshold | -1 | Threshold for clustering leaves in decision tree |
silence_weight | 0.0 | Weight on silence in fMLLR estimation |
fmllr_update_type | full | Type of fMLLR estimation |
fMLLR estimation will be performed every other iteration for the first quarter of iterations, and then one final fMLLR estimation will be performed halfway through the training iterations.
Default training config file¶
beam: 10
retry_beam: 40
features:
type: "mfcc"
use_energy: false
frame_shift: 10
training:
- monophone:
num_iterations: 40
max_gaussians: 1000
subset: 2000
boost_silence: 1.25
- triphone:
num_iterations: 35
num_leaves: 2000
max_gaussians: 10000
cluster_threshold: -1
subset: 5000
boost_silence: 1.25
power: 0.25
- lda:
num_leaves: 2500
max_gaussians: 15000
subset: 10000
num_iterations: 35
features:
splice_left_context: 3
splice_right_context: 3
- sat:
num_leaves: 2500
max_gaussians: 15000
fmllr_power: 0.2
silence_weight: 0.0
fmllr_update_type: "diag"
subset: 10000
features:
lda: true
- sat:
num_leaves: 4200
max_gaussians: 40000
fmllr_power: 0.2
silence_weight: 0.0
fmllr_update_type: "diag"
subset: 30000
features:
lda: true
fmllr: true
Align configuration¶
beam: 10
retry_beam: 40
Transcriber configuration¶
Parameter | Default value | Notes |
---|---|---|
beam | 13 | Beam for decoding |
max_active | 7000 | Max active for decoding |
lattice_beam | 6 | Beam width for decoding lattices |
acoustic_scale | 0.083333 | Multiplier to scale acoustic costs |
silence_weight | 0.01 | Weight on silence in fMLLR estimation |
fmllr | true | Flag for whether to perform speaker adaptation |
first_beam | 10.0 | Beam for decoding in initial speaker-independent pass, only used if fmllr is true |
first_max_active | 2000 | Max active for decoding in initial speaker-independent pass, only used if fmllr is true |
fmllr_update_type | full | Type of fMLLR estimation |
Default transcriber config¶
beam: 13
max_active: 7000
lattice_beam: 6
acoustic_scale: 0.083333
silence_weight: 0.01
fmllr: true
first_beam: 10.0 # Beam used in initial, speaker-indep. pass
first_max_active: 2000 # max-active used in initial pass.
fmllr_update_type: full
Language model configuration¶
Parameter | Default value | Notes |
---|---|---|
order | 3 | Order of language model |
method | kneser_ney | Method for smoothing |
prune | false | Flag for whether to output pruned models as well |
prune_thresh_small | 0.0000003 | Threshold for pruning a small model, only used if prune is true |
prune_thresh_medium | 0.0000001 | Threshold for pruning a medium model, only used if prune is true |
Default language model config¶
order: 3
method: kneser_ney
prune: false
prune_thresh_small: 0.0000003
prune_thresh_medium: 0.0000001