Configuration

Global options

These options are used when aligning the full dataset (and as part of training). Increasing their values relaxes the restrictions on alignment. Relaxing these restrictions can be particularly helpful for files that differ substantially from the training dataset (e.g., single-word production data from experiments, or longer stretches of audio).

Parameter         Default value  Notes
beam              10             Initial beam width to use for alignment
retry_beam        40             Beam width to use if initial alignment fails
transition_scale  1.0            Multiplier to scale transition costs
acoustic_scale    0.1            Multiplier to scale acoustic costs
self_loop_scale   0.1            Multiplier to scale self-loop costs
boost_silence     1.0            Factor to boost silence probabilities; 1.0 leaves them unchanged
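
The relationship between beam and retry_beam can be sketched as follows. This is a minimal illustration rather than the aligner's actual code: align_with_retry, try_align, and fake_aligner are hypothetical names, and the assumption is that a failed alignment is signalled by returning None.

```python
from typing import Callable, Optional


def align_with_retry(
    utterance: str,
    try_align: Callable[[str, int], Optional[list]],
    beam: int = 10,
    retry_beam: int = 40,
) -> Optional[list]:
    """Try the narrow beam first, then retry once with the wider beam."""
    result = try_align(utterance, beam)
    if result is None:
        # Assumption: None means no alignment was found within the beam.
        result = try_align(utterance, retry_beam)
    return result


# Hypothetical stand-in aligner: succeeds only when the beam is at least 20.
def fake_aligner(utterance: str, beam_width: int) -> Optional[list]:
    return [utterance, beam_width] if beam_width >= 20 else None


aligned = align_with_retry("utt1", fake_aligner)
```

In this sketch the initial pass with a beam of 10 fails and the retry with a beam of 40 succeeds. If both beams are too narrow the utterance stays unaligned, which is the case where raising retry_beam may help.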

Feature Configuration

This section is only relevant for training, since a trained model stores the feature specification and extractors that it requires.

Parameter    Default value  Notes
type         mfcc           Currently only MFCCs are supported
use_energy   False          Use energy in place of first MFCC
frame_shift  10             In milliseconds, determines time resolution

Training configuration

Global alignment options can be overridden for each trainer (e.g., different beam settings at different stages of training).
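
One way to picture the override behaviour is a dictionary merge in which trainer-level keys take precedence over the global ones. This is only a sketch of the idea, not the aligner's own implementation; resolve_trainer_options is a hypothetical helper, and the values are the defaults listed in this document.

```python
# Global defaults as listed under "Global options".
GLOBAL_OPTIONS = {
    "beam": 10,
    "retry_beam": 40,
    "transition_scale": 1.0,
    "acoustic_scale": 0.1,
    "self_loop_scale": 0.1,
    "boost_silence": 1.0,
}


def resolve_trainer_options(global_options: dict, trainer_options: dict) -> dict:
    """Return the effective options for one trainer (hypothetical helper)."""
    merged = dict(global_options)   # copy so the global defaults stay untouched
    merged.update(trainer_options)  # trainer-specific values win
    return merged


# The monophone block of the default training config sets boost_silence to 1.25.
monophone = resolve_trainer_options(
    GLOBAL_OPTIONS, {"num_iterations": 40, "boost_silence": 1.25}
)
```

Here the monophone trainer ends up with boost_silence at 1.25 while still inheriting beam and retry_beam from the global options.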

Note

Subsets are created by sorting the utterances by length, taking a larger subset (10 times the specified subset amount) and then randomly sampling the specified subset amount from this larger subset. Utterances with transcriptions that are only one word long are ignored.
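
The subset procedure described in this note can be sketched as below. Several details are assumptions made for illustration: length is taken to be duration in seconds, the larger subset is taken from the shortest utterances, a subset of 0 returns the whole corpus unfiltered, and select_subset and the corpus layout are hypothetical.

```python
import random


def select_subset(utterances: dict, subset: int, seed: int = 0) -> list:
    """utterances maps an utterance id to (duration_seconds, transcription)."""
    if subset == 0:
        # 0 means the full corpus is used.
        return sorted(utterances)
    # Ignore utterances whose transcription is only one word long.
    candidates = [
        (duration, utt_id)
        for utt_id, (duration, text) in utterances.items()
        if len(text.split()) > 1
    ]
    if subset >= len(candidates):
        return sorted(utt_id for _, utt_id in candidates)
    # Sort by length (assumed: shortest first) and keep 10 times the subset size.
    candidates.sort()
    larger = [utt_id for _, utt_id in candidates[: subset * 10]]
    # Randomly sample the requested amount from the larger subset.
    rng = random.Random(seed)
    return sorted(rng.sample(larger, subset))


corpus = {f"utt{i:02d}": (float(i), "hello world") for i in range(1, 51)}
corpus["short"] = (0.5, "hello")  # one-word utterance, never selected
chosen = select_subset(corpus, subset=3)
```

With a subset of 3, the sample is drawn from the 30 shortest multi-word utterances, so the one-word utterance and the 20 longest utterances can never be chosen.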

Monophone Configuration

For the Kaldi recipe that monophone training is based on, see https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/train_mono.sh

Parameter       Default value  Notes
subset          0              Number of utterances to use (0 uses the full corpus)
num_iterations  40             Number of training iterations
max_gaussians   40             Total number of gaussians
power           0.25           Exponent for gaussians based on occurrence counts

Realignment iterations for training are calculated based on splitting the number of iterations into quarters. The first quarter of training will perform realignment every iteration, the second quarter will perform realignment every other iteration, and the final two quarters will perform realignment every third iteration.
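
The quarter-based schedule above can be worked through in a few lines. This is a sketch under stated assumptions: iterations are numbered from 1, the final iteration is not followed by a realignment, and the exact boundary handling may differ from the real trainer. The function name realignment_iterations is hypothetical.

```python
def realignment_iterations(num_iterations: int) -> list:
    """List the iterations on which realignment happens (illustrative only)."""
    quarter = num_iterations // 4
    iterations = []
    last = 0
    for i in range(1, num_iterations):
        if i <= quarter:
            gap = 1  # first quarter: every iteration
        elif i <= 2 * quarter:
            gap = 2  # second quarter: every other iteration
        else:
            gap = 3  # final two quarters: every third iteration
        if i - last >= gap:
            iterations.append(i)
            last = i
    return iterations


schedule = realignment_iterations(40)
```

For the default of 40 iterations, this sketch realigns on iterations 1 through 10, then on 12, 14, 16, 18, and 20, and finally on 23, 26, 29, 32, 35, and 38.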

Triphone Configuration

For the Kaldi recipe that triphone training is based on, see https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/train_deltas.sh

Parameter          Default value  Notes
subset             0              Number of utterances to use (0 uses the full corpus)
num_iterations     40             Number of training iterations
max_gaussians      40             Total number of gaussians
power              0.25           Exponent for gaussians based on occurrence counts
num_leaves         1000           Number of states in the decision tree
cluster_threshold  -1             Threshold for clustering leaves in decision tree

LDA Configuration

For the Kaldi recipe that LDA training is based on, see https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/train_lda_mllt.sh

Parameter          Default value  Notes
subset             0              Number of utterances to use (0 uses the full corpus)
num_iterations     40             Number of training iterations
max_gaussians      40             Total number of gaussians
power              0.25           Exponent for gaussians based on occurrence counts
num_leaves         1000           Number of states in the decision tree
cluster_threshold  -1             Threshold for clustering leaves in decision tree
lda_dimension      40             Dimension of resulting LDA features
random_prune       4.0            Ratio of random pruning to speed up MLLT

LDA estimation will be performed every other iteration for the first quarter of iterations, and then one final LDA estimation will be performed halfway through the training iterations.
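
As a rough illustration of this schedule, the sketch below lists the iterations on which estimation would run. It assumes 1-based iteration numbers and that "every other iteration" means the even-numbered ones; estimation_iterations is a hypothetical name and the real trainer may place the boundaries differently.

```python
def estimation_iterations(num_iterations: int) -> list:
    """Iterations on which estimation is performed (illustrative only)."""
    quarter = num_iterations // 4
    # Every other iteration within the first quarter (assumed: even iterations).
    schedule = list(range(2, quarter + 1, 2))
    # One final estimation halfway through training.
    halfway = num_iterations // 2
    if halfway not in schedule:
        schedule.append(halfway)
    return schedule


lda_schedule = estimation_iterations(40)
```

With 40 iterations the sketch yields iterations 2, 4, 6, 8, 10, and 20. The SAT section describes the same pattern for fMLLR estimation.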

Speaker-adapted training (SAT) configuration

For the Kaldi recipe that SAT training is based on, see https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/train_sat.sh

Parameter          Default value  Notes
subset             0              Number of utterances to use (0 uses the full corpus)
num_iterations     40             Number of training iterations
max_gaussians      1000           Total number of gaussians
power              0.25           Exponent for gaussians based on occurrence counts
num_leaves         1000           Number of states in the decision tree
cluster_threshold  -1             Threshold for clustering leaves in decision tree
silence_weight     0.0            Weight on silence in fMLLR estimation
fmllr_update_type  full           Type of fMLLR estimation

fMLLR estimation will be performed every other iteration for the first quarter of iterations, and then one final fMLLR estimation will be performed halfway through the training iterations.

Default training config file

beam: 10
retry_beam: 40

features:
  type: "mfcc"
  use_energy: false
  frame_shift: 10

training:
  - monophone:
      num_iterations: 40
      max_gaussians: 1000
      subset: 2000
      boost_silence: 1.25

  - triphone:
      num_iterations: 35
      num_leaves: 2000
      max_gaussians: 10000
      cluster_threshold: -1
      subset: 5000
      boost_silence: 1.25
      power: 0.25

  - lda:
      num_leaves: 2500
      max_gaussians: 15000
      subset: 10000
      num_iterations: 35
      features:
          splice_left_context: 3
          splice_right_context: 3

  - sat:
      num_leaves: 2500
      max_gaussians: 15000
      fmllr_power: 0.2
      silence_weight: 0.0
      fmllr_update_type: "diag"
      subset: 10000
      features:
          lda: true

  - sat:
      num_leaves: 4200
      max_gaussians: 40000
      fmllr_power: 0.2
      silence_weight: 0.0
      fmllr_update_type: "diag"
      subset: 30000
      features:
          lda: true
          fmllr: true

Align configuration

beam: 10
retry_beam: 40

Transcriber configuration

Parameter          Default value  Notes
beam               13             Beam for decoding
max_active         7000           Maximum number of active states during decoding
lattice_beam       6              Beam width for decoding lattices
acoustic_scale     0.083333       Multiplier to scale acoustic costs
silence_weight     0.01           Weight on silence in fMLLR estimation
fmllr              true           Flag for whether to perform speaker adaptation
first_beam         10.0           Beam for decoding in the initial speaker-independent pass, only used if fmllr is true
first_max_active   2000           Maximum number of active states in the initial speaker-independent pass, only used if fmllr is true
fmllr_update_type  full           Type of fMLLR estimation

Default transcriber config

beam: 13
max_active: 7000
lattice_beam: 6
acoustic_scale: 0.083333
silence_weight: 0.01
fmllr: true
first_beam: 10.0 # Beam used in the initial speaker-independent pass
first_max_active: 2000 # max_active used in the initial speaker-independent pass
fmllr_update_type: full

Language model configuration

Parameter            Default value  Notes
order                3              Order of language model
method               kneser_ney     Method for smoothing
prune                false          Flag for whether to output pruned models as well
prune_thresh_small   0.0000003      Threshold for pruning a small model, only used if prune is true
prune_thresh_medium  0.0000001      Threshold for pruning a medium model, only used if prune is true

Default language model config

order: 3
method: kneser_ney
prune: false
prune_thresh_small: 0.0000003
prune_thresh_medium: 0.0000001