.. _train_acoustic_model:

Train a new acoustic model ``(mfa train)``
******************************************

You can train new :term:`acoustic models` from scratch using MFA, and export the final alignments as :term:`TextGrids` at the end. You don't need a ton of data to generate decent alignments (see the blog post comparing alignments trained on various corpus sizes). At the end of the day, it comes down to trial and error, so I would recommend trying different workflows (using a pretrained model, training your own, or adapting a model to your data) to see what performs best.

Phone set
=========

.. note::

   See :doc:`phone groups <../implementations/phone_groups>` for how to customize phone groups to your specific needs rather than using the preset phone groups of the defined phone sets in this section.

The type of phone set can be specified through ``--phone_set``. Currently only ``IPA``, ``ARPA``, and ``PINYIN`` are supported, but I plan to make it more customizable in the future. The primary benefit of specifying the phone set is to create phone topologies that are more sensible than the defaults.

The default phone model uses 3 HMM states to represent phones, as that generally does a decent job of capturing the dynamic nature of phones. Something like an aspirated stop typically has three clear states: a closure, a burst, and an aspiration period. However, other phones like a tap, glottal stop, or unstressed schwa are so short that they can cause misalignment errors. For these, a single HMM state is more sensible, so they have a shorter minimum duration (each HMM state has a minimum 10ms duration). For vowels, 3 states generally make sense for monophthongs, where one state corresponds to the onset, one to the "steady state", and one to the offset. For diphthongs and triphthongs, three states don't map as clearly onto the phases of the vowel, as you'll have an onset, a first target, a transition, a second target, and an offset (and a third target for triphthongs).
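As a rough illustration of the duration argument above, the following sketch computes the shortest interval a phone can occupy given its number of HMM states, assuming every state must last at least 10 ms and no state can be skipped. The class names, the ``STATES_PER_CLASS`` mapping, and both helper functions are hypothetical and are not part of MFA's API.

.. code-block:: python

   FRAME_MS = 10  # assumed minimum duration of a single HMM state, in ms

   # Hypothetical lookup for illustration only (not MFA's internal data structure)
   STATES_PER_CLASS = {
       "default": 3,      # e.g. an aspirated stop: closure, burst, aspiration
       "extra_short": 1,  # e.g. a tap, glottal stop, or unstressed schwa
   }


   def min_duration_ms(phone_class: str) -> int:
       """Return the shortest interval (ms) a phone of this class can be aligned to."""
       return STATES_PER_CLASS[phone_class] * FRAME_MS


   def fits(phone_class: str, actual_ms: float) -> bool:
       """Check whether a segment of actual_ms can hold a phone of this class."""
       return actual_ms >= min_duration_ms(phone_class)


   if __name__ == "__main__":
       for phone_class in STATES_PER_CLASS:
           print(phone_class, min_duration_ms(phone_class), "ms")
       # A 20 ms tap is too short for the default topology,
       # but fits the single-state topology.
       print(fits("default", 20), fits("extra_short", 20))

Under these assumptions, a phone with the default topology cannot be shorter than 30 ms, so a 20 ms tap would force the aligner to borrow time from neighboring phones, whereas a single-state topology accommodates it.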
Specifying phone sets will use preset stops, affricates, diphthongs, triphthongs and extra short segments. Certain diacritics (``ʱʼʰʲʷⁿˠ``) will result in one more state being added, as they represent quite different acoustics from the base phone.

An additional benefit is in guiding the decision tree clustering for triphone modeling, where using phone sets will add extra questions for allophonic variation, as well as for general classes of sounds (sibilant sounds, places of articulation, rhotics, groups of vowels, etc). These questions should be more appropriate than the default setting.

.. tab-set::

   .. tab-item:: IPA
      :sync: ipa

      The IPA phone set generates base phone classes for extra short phones, stop phones, affricate phones, diphthongs, and triphthongs. Any phones below that are not used in the dictionary will be ignored.

      .. list-table:: Non-default IPA Topologies
         :header-rows: 1

         * - Phone class
           - HMM states
           - Phones
         * - Extra short phones
           - 1
           - ``ʔ ə ɚ ɾ p̚ t̚ k̚``
         * - Stop phones
           - 2
           - ``p b t d ʈ ɖ c ɟ k ɡ q ɢ``
         * - Affricate phones
           - 4
           - ``pf ts dz tʃ dʒ tɕ dʑ tʂ ʈʂ dʐ ɖʐ cç ɟʝ kx ɡɣ tç dʝ``
         * - Diphthongs
           - 5
           - Two of: ``i u e ə a o y ɔ j w ɪ ʊ w ʏ ɯ ɤ ɑ æ ɐ ɚ ɵ ɘ ɛ ɜ ɝ ɛ ɞ ɑ ɨ ɪ̈ œ ɒ ɶ ø ʉ ʌ``
         * - Triphthongs
           - 6
           - Three of: ``i u e ə a o y ɔ j w ɪ ʊ w ʏ ɯ ɤ ɑ æ ɐ ɚ ɵ ɘ ɛ ɜ ɝ ɛ ɞ ɑ ɨ ɪ̈ œ ɒ ɶ ø ʉ ʌ``

   .. tab-item:: ARPA
      :sync: arpa

      For ARPA, we use the following topology calculation. Additionally, stress-marked vowels are collected under a single base phone (i.e., ``AA0 AA1 AA2`` are collected under ``AA``), so they will share states during training.

      .. list-table:: Non-default ARPA Topologies
         :header-rows: 1

         * - Phone class
           - HMM states
           - Phones
         * - Extra short phones
           - 1
           - ``AH0 IH0 ER0 UH0``
         * - Stop phones
           - 2
           - ``B D G`` (``P T K`` not included because they include aspiration)
         * - Affricate phones
           - 4
           - ``CH JH``
         * - Diphthongs
           - 5
           - ``AY0 AY1 AY2 AW0 AW1 AW2 OY0 OY1 OY2 EY0 EY1 EY2 OW0 OW1 OW2``

      .. list-table:: ARPA Extra Questions
         :header-rows: 1

         * - Question Group
           - Phones
           - Notes
         * - Bilabial stops
           - ``B P``
           -
         * - Dentals
           - ``D DH``
           - ``/ð/`` is often realized as ``/d/`` for high frequency words in many dialects of American English
         * - Flapping
           - ``D T``
           -
         * - Nasals
           - ``M N NG``
           -
         * - Voiceless sibilants
           - ``CH SH S``
           -
         * - Voiced sibilants
           - ``JH ZH Z``
           -
         * - Voiceless fricatives
           - ``F TH HH K``
           - ``K`` is included for reductions to a more fricative realization ``/x/`` in casual speech
         * - Voiced fricatives
           - ``V DH HH G``
           - ``G`` is included for the same reason as ``K`` above
         * - Dorsals
           - ``K G HH``
           -
         * - Rhotics
           - ``ER0 ER1 ER2 R``
           - ``ER`` vowels are really just ``/ɹ̩/``
         * - Low back vowels
           - ``AO0 AO1 AO2 AA0 AA1 AA2``
           - Cot-caught merger
         * - Central vowels
           - ``ER0 ER1 ER2 AH0 AH1 AH2 UH0 UH1 UH2 IH0 IH1 IH2``
           -
         * - High back vowels
           - ``UW1 UW2 UW0 UH1 UH2 UH0``
           -
         * - High front vowels
           - ``IY1 IY2 IY0 IH0 IH1 IH2``
           -
         * - Mid front vowels
           - ``EY1 EY2 EY0 EH0 EH1 EH2``
           -
         * - Primary stressed vowels
           - ``AA1 AE1 AH1 AO1 AW1 AY1 EH1 ER1 EY1 IH1 IY1 OW1 OY1 UH1 UW1``
           - Following the Kaldi LibriSpeech recipe
         * - Secondary stressed vowels
           - ``AA2 AE2 AH2 AO2 AW2 AY2 EH2 ER2 EY2 IH2 IY2 OW2 OY2 UH2 UW2``
           -
         * - Unstressed vowels
           - ``AA0 AE0 AH0 AO0 AW0 AY0 EH0 ER0 EY0 IH0 IY0 OW0 OY0 UH0 UW0``
           -

   .. tab-item:: PINYIN
      :sync: pinyin

      .. list-table:: Non-default Pinyin Topologies
         :header-rows: 1

         * - Phone class
           - HMM states
           - Phones
         * - Stop phones
           - 2
           - ``b d g`` (``p t k`` not included because they're aspirated)
         * - Affricate phones
           - 4
           - ``z zh j``
         * - Aspirated affricate phones
           - 5
           - ``c ch q``
         * - Diphthongs
           - 5
           - Two of: ``i u y e w a o e ü``
         * - Triphthongs
           - 6
           - Three of: ``i u y e w a o e ü``

      .. list-table:: Pinyin Extra Questions
         :header-rows: 1

         * - Question Group
           - Phones
           - Notes
         * - Bilabial stops
           - ``b p``
           -
         * - Alveolar stops
           - ``d t``
           -
         * - Nasals
           - ``m n ng``
           -
         * - Voiceless sibilants
           - ``z zh j c ch q s sh x``
           -
         * - Dorsals
           - ``k g h``
           - Pinyin ``h`` is a velar fricative ``/x/``
         * - Rhotics
           - ``r sh e``
           - ``e`` is included to capture instances of ``ɚ``
         * - Approximants
           - ``l r y w``
           -
         * - Tone 1
           - All monophthongs, diphthongs, and triphthongs with tone 1
           -
         * - Tone 2
           - All monophthongs, diphthongs, and triphthongs with tone 2
           -
         * - Tone 3
           - All monophthongs, diphthongs, and triphthongs with tone 3
           -
         * - Tone 4
           - All monophthongs, diphthongs, and triphthongs with tone 4
           -
         * - Tone 5
           - All monophthongs, diphthongs, and triphthongs with tone 5
           -

Pronunciation modeling
======================

For the default configuration, pronunciation probabilities are estimated following the second and third SAT blocks. See :ref:`training_dictionary` for more details.

A recent experimental feature for training acoustic models is the ``--train_g2p`` flag, which changes the pronunciation probability estimation from a lexicon-based estimation to one that uses a G2P model as the lexicon. The idea here is that we have pronunciations generated by the initial blocks, much like for the standard lexicon-based approach, but instead of estimating probabilities for individual word/pronunciation pairs and the likelihood of surrounding silence, the G2P model learns a mapping between the graphemes of the input texts and the phones.

.. note::

   See :doc:`phonological rules <../implementations/phonological_rules>` for how to specify regular expression-like phonological rules so you don't have to code every form for a regular rule.

Command reference
=================

.. click:: montreal_forced_aligner.command_line.train_acoustic_model:train_acoustic_model_cli
   :prog: mfa train
   :nested: full

Configuration reference
=======================

- :ref:`configuration_acoustic_modeling`

API reference
-------------

- :ref:`acoustic_modeling_api`

- :ref:`acoustic_model_training_api`