.. _configuration_acoustic_modeling:

*******************************
Acoustic model training options
*******************************

.. note::

   See :ref:`configuration_global` for options relating to the alignment steps.

Global alignment options can be overridden for each trainer (e.g., different beam settings at different stages of training).

.. note::

   Subsets are created by sorting the utterances by length, taking a larger subset (10 times the specified subset amount) and then randomly sampling the specified subset amount from this larger subset.  Utterances with transcriptions that are only one word long are ignored.
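
The subset procedure described in the note above can be sketched roughly as follows.  This is an illustration only, not the aligner's actual implementation: the function name ``sample_subset``, the tuple layout, the ``seed`` argument, sorting shortest-first, and the handling of corpora smaller than the subset are all assumptions made for the example.

.. code-block:: python

   import random


   def sample_subset(utterances, subset, seed=0):
       """Sketch of subset creation (hypothetical helper).

       ``utterances`` is a list of (utterance_id, duration, transcription) tuples.
       """
       # A subset of 0 means the full corpus is used.
       if subset == 0:
           return list(utterances)
       # Utterances with one-word transcriptions are ignored.
       candidates = [u for u in utterances if len(u[2].split()) > 1]
       # Assumption: if there are not enough candidates, use all of them.
       if len(candidates) <= subset:
           return candidates
       # Sort by length (assumed here: shortest first) and keep a pool
       # that is 10 times the requested subset size.
       candidates.sort(key=lambda u: u[1])
       pool = candidates[: subset * 10]
       # Randomly sample the requested amount from the larger pool.
       return random.Random(seed).sample(pool, subset)

Fixing the seed in the sketch only serves to make the example reproducible; it says nothing about how the aligner seeds its own sampling.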

Monophone training options
--------------------------

For the Kaldi recipe that monophone training is based on, see :kaldi_steps:`train_mono`.


.. csv-table::
   :widths: 20, 20, 60
   :header: "Parameter", "Default value", "Notes"

   "subset", 10000, "Number of utterances to use (0 uses the full corpus)"
   "initial_beam", 6, "Initial beam width for first alignment iteration"
   "beam", 10, "Beam width for alignment iterations other than the first"
   "num_iterations", 40, "Number of training iterations"
   "max_gaussians", 1000, "Total number of gaussians"
   "boost_silence", 1.25, "Factor to boost silence probabilities"
   "power", 0.25, "Exponent for gaussians based on occurrence counts"


Realignment iterations for training are calculated based on splitting the number of iterations into quarters.  The first
quarter of training will perform realignment every iteration, the second quarter will perform realignment every other iteration,
and the final two quarters will perform realignment every third iteration.
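
As a rough sketch of the schedule described above, the function below computes which iterations perform realignment.  The function name ``realignment_iterations`` is hypothetical, and the sketch assumes 1-based iteration numbers, integer division for the quarter boundaries, and no realignment on the final iteration.

.. code-block:: python

   def realignment_iterations(num_iterations):
       """Sketch of the quarter-based realignment schedule (hypothetical helper)."""
       iterations = []
       # Assumption: iterations are numbered from 1 and the last one is not realigned.
       for i in range(1, num_iterations):
           if i <= num_iterations // 4:
               gap = 1  # first quarter: every iteration
           elif i <= num_iterations // 2:
               gap = 2  # second quarter: every other iteration
           else:
               gap = 3  # final two quarters: every third iteration
           # Realign once enough iterations have passed since the last realignment.
           if not iterations or i - iterations[-1] >= gap:
               iterations.append(i)
       return iterations

With the monophone default of 40 iterations, this sketch yields iterations 1 through 10, then 12, 14, 16, 18, and 20, then 23, 26, 29, 32, 35, and 38.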


Triphone training options
-------------------------

For the Kaldi recipe that triphone training is based on, see :kaldi_steps:`train_deltas`.

.. csv-table::
   :widths: 20, 20, 60
   :header: "Parameter", "Default value", "Notes"

   "subset", 0, "Number of utterances to use (0 uses the full corpus)"
   "num_iterations", 35, "Number of training iterations"
   "max_gaussians", 10000, "Total number of gaussians"
   "power", 0.25, "Exponent for gaussians based on occurrence counts"
   "num_leaves", 1000, "Number of states in the decision tree"
   "cluster_threshold", -1, "Threshold for clustering leaves in decision tree"


LDA training options
--------------------

For the Kaldi recipe that LDA training is based on, see :kaldi_steps:`train_lda_mllt`.

.. csv-table::
   :widths: 20, 20, 60
   :header: "Parameter", "Default value", "Notes"

   "subset", 0, "Number of utterances to use (0 uses the full corpus)"
   "num_iterations", 35, "Number of training iterations"
   "max_gaussians", 10000, "Total number of gaussians"
   "power", 0.25, "Exponent for gaussians based on occurrence counts"
   "num_leaves", 1000, "Number of states in the decision tree"
   "cluster_threshold", -1, "Threshold for clustering leaves in decision tree"
   "lda_dimension", 40, "Dimension of resulting LDA features"
   "random_prune", 4.0, "Ratio of random pruning to speed up MLLT"


LDA estimation will be performed every other iteration for the first quarter of iterations, and then one final LDA estimation
will be performed halfway through the training iterations.
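
One possible reading of this schedule is sketched below.  The function name ``estimation_iterations`` is hypothetical, and the boundary handling (integer division, 1-based iteration numbers) is an assumption; the trainer itself may place the boundaries slightly differently.

.. code-block:: python

   def estimation_iterations(num_iterations):
       """Sketch of the estimation schedule described above (hypothetical helper)."""
       quarter = num_iterations // 4
       halfway = num_iterations // 2
       # Every other iteration during the first quarter of training.
       iterations = list(range(2, quarter + 1, 2))
       # One final estimation halfway through training.
       if halfway not in iterations:
           iterations.append(halfway)
       return iterations

Under these assumptions, the default of 35 iterations gives estimation on iterations 2, 4, 6, and 8, with the final estimation on iteration 17.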

Speaker-adapted training (SAT) options
--------------------------------------

For the Kaldi recipe that SAT training is based on, see :kaldi_steps:`train_sat`.

.. csv-table::
   :widths: 20, 20, 60
   :header: "Parameter", "Default value", "Notes"

   "subset", 0, "Number of utterances to use (0 uses the full corpus)"
   "num_iterations", 35, "Number of training iterations"
   "max_gaussians", 10000, "Total number of gaussians"
   "power", 0.25, "Exponent for gaussians based on occurrence counts"
   "num_leaves", 1000, "Number of states in the decision tree"
   "cluster_threshold", -1, "Threshold for clustering leaves in decision tree"
   "silence_weight", 0.0, "Weight on silence in fMLLR estimation"
   "fmllr_update_type", "full", "Type of fMLLR estimation"
   "optional", "False", "Flag for whether a training block will be skipped if the size of the corpus is smaller than the subset"
   "quick", "False", "Based on :kaldi_steps:`train_quick`, performs fewer rounds of fMLLR estimation"


fMLLR estimation will be performed every other iteration for the first quarter of iterations, and then one final fMLLR estimation
will be performed halfway through the training iterations.

Pronunciation probability modeling options
-------------------------------------------

For the Kaldi recipe that pronunciation probability training is based on, see :kaldi_steps:`get_prons`.  Dictionaries can be trained on new datasets using pretrained models as well.  The current default training regime does two rounds of pronunciation probability modeling, after the second and third SAT blocks.

.. csv-table::
   :widths: 20, 20, 60
   :header: "Parameter", "Default value", "Notes"

   "silence_probabilities", "True", "Flag for whether to compute probabilities of silence before and after each word's pronunciation, in addition to the pronunciation probability"


.. _default_training_config:

Default training config file
----------------------------

The configuration file below shows the equivalent of the current 2.0 training regime, mostly as an example of which configuration options are available and how they progress through the overall training.

.. code-block:: yaml

   beam: 10
   retry_beam: 40

   features:
     type: "mfcc"
     use_energy: false
     use_pitch: true
     frame_shift: 10

   training:
     - monophone:
         subset: 10000
         num_iterations: 40
         max_gaussians: 1000
         boost_silence: 1.25

     - triphone:
         subset: 20000
         num_iterations: 35
         num_leaves: 2000
         max_gaussians: 10000
         cluster_threshold: -1
         boost_silence: 1.25
         power: 0.25

     - lda:
         subset: 20000
         num_leaves: 2500
         max_gaussians: 15000
         num_iterations: 35

     - sat:
         subset: 20000
         num_leaves: 2500
         max_gaussians: 15000
         power: 0.2
         silence_weight: 0.0
         fmllr_update_type: "full"

     - sat:
         subset: 50000
         num_leaves: 4200
         max_gaussians: 40000
         power: 0.2
         silence_weight: 0.0
         fmllr_update_type: "full"

     - pronunciation_probabilities:
         subset: 50000
         silence_probabilities: true

     - sat:
         subset: 150000
         num_leaves: 5000
         max_gaussians: 100000
         power: 0.2
         silence_weight: 0.0
         fmllr_update_type: "full"

     - pronunciation_probabilities:
         subset: 150000
         silence_probabilities: true
         optional: true # Skipped if the corpus is smaller than the subset

     - sat:
         subset: 0
         quick: true # Performs fewer rounds of fMLLR estimation
         num_iterations: 20
         num_leaves: 7000
         max_gaussians: 150000
         power: 0.2
         silence_weight: 0.0
         fmllr_update_type: "full"
         optional: true # Skipped if the corpus is smaller than the previous subset
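
To illustrate how the ``optional`` flags in the configuration above might play out for a given corpus size, here is a minimal sketch.  It is an interpretation of the comments in the configuration, not the aligner's code: the function name ``blocks_to_run`` and the list-of-tuples representation of the training blocks are assumptions made for the example.

.. code-block:: python

   def blocks_to_run(blocks, num_utterances):
       """Sketch of skipping optional training blocks (hypothetical helper).

       ``blocks`` is a list of (block_name, options_dict) pairs in training order.
       """
       kept = []
       previous_subset = 0
       for name, options in blocks:
           subset = options.get("subset", 0)
           # Assumption: a subset of 0 is compared against the previous subset size.
           threshold = subset if subset > 0 else previous_subset
           skip = options.get("optional", False) and num_utterances < threshold
           if not skip:
               kept.append(name)
           if subset > 0:
               previous_subset = subset
       return kept


   # Abbreviated tail of the default configuration, for illustration.
   tail = [
       ("sat", {"subset": 150000}),
       ("pronunciation_probabilities", {"subset": 150000, "optional": True}),
       ("sat", {"subset": 0, "quick": True, "optional": True}),
   ]
   small_corpus = blocks_to_run(tail, 100000)
   large_corpus = blocks_to_run(tail, 200000)

In this sketch, a corpus of 100,000 utterances keeps only the first SAT block from the abbreviated list, while a corpus of 200,000 utterances runs all three blocks.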

.. _1.0_training_config:

Training configuration for 1.0
------------------------------

The configuration below matches the training procedure used for models trained in version 1.0.  Note the lack of an LDA block, the single SAT training block, and the lack of subsets in the initial training blocks.

.. code-block:: yaml

   beam: 10
   retry_beam: 40

   features:
     type: "mfcc"
     use_energy: false
     frame_shift: 10

   training:
     - monophone:
         num_iterations: 40
         max_gaussians: 1000
         boost_silence: 1.0

     - triphone:
         num_iterations: 35
         num_leaves: 3100
         max_gaussians: 50000
         cluster_threshold: 100
         boost_silence: 1.0
         power: 0.25

     - sat:
         num_leaves: 3100
         max_gaussians: 50000
         power: 0.2
         silence_weight: 0.0
         cluster_threshold: 100
         fmllr_update_type: "full"