# Acoustic model training options

> **Note**
>
> See Global Options for options relating to the alignment steps. Global alignment options can be overridden for each trainer (i.e., different beam settings at different stages of training).

> **Note**
>
> Subsets are created by sorting the utterances by length, taking a larger pool (10 times the specified subset size), and then randomly sampling the specified number of utterances from that pool. Utterances whose transcriptions are only one word long are ignored.
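The subset selection described above can be sketched as follows. This is a minimal illustration of the described procedure, not MFA's actual implementation; the length measure and random seed are assumptions:

```python
import random

def select_subset(utterances, subset_size, seed=1234):
    """Pick `subset_size` utterances: keep multi-word utterances, sort by
    length, take a pool of 10x the requested size, then sample from it."""
    # Ignore utterances whose transcription is only one word long
    candidates = [u for u in utterances if len(u.split()) > 1]
    # Sort by length and take a larger pool (10x the requested subset size)
    candidates.sort(key=len)
    pool = candidates[: subset_size * 10]
    # Randomly sample the requested amount from the pool
    if len(pool) <= subset_size:
        return pool
    return random.Random(seed).sample(pool, subset_size)
```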

## Monophone training options

For the Kaldi recipe that monophone training is based on, see `train_mono.sh`.

| Parameter | Default value | Notes |
| --- | --- | --- |
| `subset` | 10000 | Number of utterances to use (0 uses the full corpus) |
| `initial_beam` | 6 | Initial beam width for the first alignment iteration |
| `beam` | 10 | Beam width for alignment iterations after the first |
| `num_iterations` | 40 | Number of training iterations |
| `max_gaussians` | 1000 | Total number of Gaussians |
| `boost_silence` | 1.25 | Factor for boosting silence probabilities |
| `power` | 0.25 | Exponent for the number of Gaussians per state, based on occurrence counts |
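The `power` parameter governs how the Gaussian budget is split across states: each state's share grows with its occupancy count raised to `power`, so values below 1 flatten the allocation toward rarely seen states. A hedged sketch of this proportional-allocation idea (an illustration, not Kaldi's exact algorithm):

```python
def allocate_gaussians(counts, total, power=0.25):
    """Split `total` Gaussians across states proportionally to
    count ** power, giving every state at least one Gaussian."""
    weights = [c ** power for c in counts]
    total_weight = sum(weights)
    return [max(1, round(total * w / total_weight)) for w in weights]
```

With `power` well below 1, a state seen 10000 times gets far fewer than 10000x the Gaussians of a state seen once, keeping rare states reasonably modeled.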

Realignment iterations are calculated by splitting training into quarters: the first quarter of training realigns on every iteration, the second quarter on every other iteration, and the final two quarters on every third iteration.
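The schedule above can be sketched as follows (illustrative only; the exact boundary handling in MFA/Kaldi may differ, and 1-based iteration numbers are assumed):

```python
def realignment_iterations(num_iterations):
    """Return the iterations on which realignment is performed:
    every iteration in the first quarter, every other iteration in the
    second quarter, and every third iteration in the second half."""
    quarter = num_iterations // 4
    iters = []
    for i in range(1, num_iterations):
        if i <= quarter:
            iters.append(i)            # first quarter: every iteration
        elif i <= 2 * quarter:
            if (i - quarter) % 2 == 0:
                iters.append(i)        # second quarter: every other iteration
        elif (i - 2 * quarter) % 3 == 0:
            iters.append(i)            # final two quarters: every third
    return iters
```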

## Triphone training options

For the Kaldi recipe that triphone training is based on, see `train_deltas.sh`.

| Parameter | Default value | Notes |
| --- | --- | --- |
| `subset` | 0 | Number of utterances to use (0 uses the full corpus) |
| `num_iterations` | 35 | Number of training iterations |
| `max_gaussians` | 10000 | Total number of Gaussians |
| `power` | 0.25 | Exponent for the number of Gaussians per state, based on occurrence counts |
| `num_leaves` | 1000 | Number of states (leaves) in the decision tree |
| `cluster_threshold` | -1 | Threshold for clustering leaves in the decision tree |

## LDA training options

For the Kaldi recipe that LDA training is based on, see `train_lda_mllt.sh`.

| Parameter | Default value | Notes |
| --- | --- | --- |
| `subset` | 0 | Number of utterances to use (0 uses the full corpus) |
| `num_iterations` | 35 | Number of training iterations |
| `max_gaussians` | 10000 | Total number of Gaussians |
| `power` | 0.25 | Exponent for the number of Gaussians per state, based on occurrence counts |
| `num_leaves` | 1000 | Number of states (leaves) in the decision tree |
| `cluster_threshold` | -1 | Threshold for clustering leaves in the decision tree |
| `lda_dimension` | 40 | Dimension of the resulting LDA features |
| `random_prune` | 4.0 | Ratio of random pruning to speed up MLLT |

LDA estimation will be performed every other iteration for the first quarter of iterations, and then one final LDA estimation will be performed halfway through the training iterations.
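A sketch of that estimation schedule (illustrative only; 1-based iteration numbers are assumed):

```python
def lda_iterations(num_iterations):
    """Iterations on which LDA+MLLT estimation runs: every other iteration
    during the first quarter, plus one final estimation at the halfway point."""
    quarter = num_iterations // 4
    iters = list(range(2, quarter + 1, 2))
    iters.append(num_iterations // 2)
    return iters
```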

## Speaker-adapted training (SAT) options

For the Kaldi recipe that SAT training is based on, see `train_sat.sh`.

| Parameter | Default value | Notes |
| --- | --- | --- |
| `subset` | 0 | Number of utterances to use (0 uses the full corpus) |
| `num_iterations` | 35 | Number of training iterations |
| `max_gaussians` | 10000 | Total number of Gaussians |
| `power` | 0.25 | Exponent for the number of Gaussians per state, based on occurrence counts |
| `num_leaves` | 1000 | Number of states (leaves) in the decision tree |
| `cluster_threshold` | -1 | Threshold for clustering leaves in the decision tree |
| `silence_weight` | 0.0 | Weight on silence in fMLLR estimation |
| `fmllr_update_type` | full | Type of fMLLR estimation |
| `optional` | False | If true, the training block is skipped when the corpus is smaller than the subset |
| `quick` | False | Based on `train_quick.sh`; performs fewer rounds of fMLLR estimation |

fMLLR estimation will be performed every other iteration for the first quarter of iterations, and then one final fMLLR estimation will be performed halfway through the training iterations.

## Pronunciation probability modeling options

For the Kaldi recipe that pronunciation probability training is based on, see `get_prons.sh`. Dictionaries can also be trained on new datasets using pretrained models. The current default training regime performs two rounds of pronunciation probability modeling, after the second and third SAT blocks.

| Parameter | Default value | Notes |
| --- | --- | --- |
| `silence_probabilities` | True | Flag for whether to compute probabilities of silence before and after each word's pronunciation, in addition to the pronunciation probability |

## Default training config file

The configuration below is equivalent to the current 2.0 training regime, shown mostly as an example of the available options and how they progress over the course of training.

```yaml
beam: 10
retry_beam: 40

features:
  type: "mfcc"
  use_energy: false
  use_pitch: true
  frame_shift: 10

training:
  - monophone:
      subset: 10000
      num_iterations: 40
      max_gaussians: 1000
      boost_silence: 1.25

  - triphone:
      subset: 20000
      num_iterations: 35
      num_leaves: 2000
      max_gaussians: 10000
      cluster_threshold: -1
      boost_silence: 1.25
      power: 0.25

  - lda:
      subset: 20000
      num_leaves: 2500
      max_gaussians: 15000
      num_iterations: 35

  - sat:
      subset: 20000
      num_leaves: 2500
      max_gaussians: 15000
      power: 0.2
      silence_weight: 0.0
      fmllr_update_type: "full"

  - sat:
      subset: 50000
      num_leaves: 4200
      max_gaussians: 40000
      power: 0.2
      silence_weight: 0.0
      fmllr_update_type: "full"

  - pronunciation_probabilities:
      subset: 50000
      silence_probabilities: true

  - sat:
      subset: 150000
      num_leaves: 5000
      max_gaussians: 100000
      power: 0.2
      silence_weight: 0.0
      fmllr_update_type: "full"

  - pronunciation_probabilities:
      subset: 150000
      silence_probabilities: true
      optional: true # Skipped if the corpus is smaller than the subset

  - sat:
      subset: 0
      quick: true # Performs fewer rounds of fMLLR estimation
      num_iterations: 20
      num_leaves: 7000
      max_gaussians: 150000
      power: 0.2
      silence_weight: 0.0
      fmllr_update_type: "full"
      optional: true # Skipped if the corpus is smaller than the previous subset
```
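When writing a custom config, a quick sanity check of the training sequence can catch misordered blocks; note how in the default regime above the subsets and model sizes grow monotonically from block to block. A minimal sketch of such a check over a plain Python structure mirroring the YAML layout (the `check_progression` helper is hypothetical, not part of MFA):

```python
# Training blocks as (name, options) pairs, mirroring the YAML layout above
training = [
    ("monophone", {"subset": 10000, "max_gaussians": 1000}),
    ("triphone", {"subset": 20000, "max_gaussians": 10000}),
    ("lda", {"subset": 20000, "max_gaussians": 15000}),
    ("sat", {"subset": 20000, "max_gaussians": 15000}),
    ("sat", {"subset": 50000, "max_gaussians": 40000}),
]

def check_progression(blocks):
    """Verify that model size never shrinks from one block to the next;
    a shrinking max_gaussians usually indicates a misordered config."""
    sizes = [opts["max_gaussians"] for _, opts in blocks]
    return all(a <= b for a, b in zip(sizes, sizes[1:]))
```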

## Training configuration for 1.0

The configuration below matches the training procedure used for models trained with version 1.0. Note the absence of an LDA block, the single SAT training block, and the lack of subsets in the initial training blocks.

```yaml
beam: 10
retry_beam: 40

features:
  type: "mfcc"
  use_energy: false
  frame_shift: 10

training:
  - monophone:
      num_iterations: 40
      max_gaussians: 1000
      boost_silence: 1.0

  - triphone:
      num_iterations: 35
      num_leaves: 3100
      max_gaussians: 50000
      cluster_threshold: 100
      boost_silence: 1.0
      power: 0.25

  - sat:
      num_leaves: 3100
      max_gaussians: 50000
      power: 0.2
      silence_weight: 0.0
      cluster_threshold: 100
      fmllr_update_type: "full"
```