Train a new acoustic model (mfa train)#

You can train new acoustic models from scratch using MFA, and export the final alignments as TextGrids at the end. You don’t need a ton of data to generate decent alignments (see the blog post comparing alignments trained on various corpus sizes). Ultimately it comes down to trial and error, so I recommend comparing workflows (using a pretrained model, training your own, or adapting a model to your data) to see which performs best.

Phone set#

Note

See phone groups for how to customize phone groups to your specific needs rather than using the preset phone groups of the defined phone sets in this section.

The type of phone set can be specified through --phone_set. Currently only IPA, ARPA, and PINYIN are supported, but I plan to make it more customizable in the future. The primary benefit of specifying the phone set is to create phone topologies that are more sensible than the defaults.

The default phone model uses 3 HMM states to represent phones, as that generally does a decent job of capturing the dynamic nature of phones. Something like an aspirated stop typically has three clear states: a closure, a burst, and an aspiration period. However, other phones like a tap, glottal stop, or unstressed schwa are so short that they can cause misalignment errors. For these, a single HMM state is more sensible, as it gives them a shorter minimum duration (each HMM state has a minimum duration of 10 ms). For vowels, 3 states generally make sense for monophthongs, where one state corresponds to the onset, one to the “steady state”, and one to the offset. For diphthongs and triphthongs, three states don’t map as clearly onto the phone, as you’ll have an onset, a first target, a transition, a second target, and an offset (plus a third target for triphthongs). Specifying a phone set will use preset classes of stops, affricates, diphthongs, triphthongs, and extra short segments. Certain diacritics (ʱʼʰʲʷⁿˠ) will result in one more state being added, as they represent quite different acoustics from the base phone.
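As a rough illustration of the arithmetic in the paragraph above, the sketch below computes the minimum duration implied by a topology, assuming 10 ms per HMM state as stated. The function and constant names are illustrative and are not part of MFA's API.

```python
# Illustrative only: these names are not part of MFA's API.
STATE_MIN_MS = 10  # minimum duration of one HMM state, per the text above


def min_duration_ms(num_states: int) -> int:
    """Shortest segment that a phone with `num_states` HMM states can span."""
    if num_states < 1:
        raise ValueError("a phone needs at least one HMM state")
    return num_states * STATE_MIN_MS


for label, states in [("default phone", 3), ("extra short phone (e.g. a tap)", 1)]:
    print(f"{label}: {states} state(s), at least {min_duration_ms(states)} ms")
```

Under this assumption a default 3-state phone can never be shorter than 30 ms, which is why very short segments such as taps are better served by a single state.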

An additional benefit is in guiding the decision tree clustering for triphone modeling, where using phone sets will add extra questions for allophonic variation, as well as for general classes of sounds (sibilant sounds, places of articulation, rhotics, groups of vowels, etc). These questions should be more appropriate than the default setting.

The IPA phone set generates topologies for the base phone classes of extra short phones, stop phones, affricate phones, diphthongs, and triphthongs. Any phones listed below that are not used in the dictionary will be ignored.

Non-default IPA Topologies#

| Phone class | HMM states | Phones |
| --- | --- | --- |
| Extra short phones | 1 | ʔ ə ɚ ɾ |
| Stop phones | 2 | p b t d ʈ ɖ c ɟ k ɡ q ɢ |
| Affricate phones | 4 | pf ts dz ʈʂ ɖʐ ɟʝ kx ɡɣ |
| Diphthongs | 5 | Two of: i u e ə a o y ɔ j w ɪ ʊ w ʏ ɯ ɤ ɑ æ ɐ ɚ ɵ ɘ ɛ ɜ ɝ ɛ ɞ ɑ ɨ ɪ̈ œ ɒ ɶ ø ʉ ʌ |
| Triphthongs | 6 | Three of: i u e ə a o y ɔ j w ɪ ʊ w ʏ ɯ ɤ ɑ æ ɐ ɚ ɵ ɘ ɛ ɜ ɝ ɛ ɞ ɑ ɨ ɪ̈ œ ɒ ɶ ø ʉ ʌ |
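To make the IPA table concrete, here is a minimal sketch of how a state count could be looked up for a phone. It is not MFA's implementation: the function names are hypothetical, the diacritic is assumed to add exactly one state regardless of phone class, and symbols outside the table (length marks, tone letters, and so on) are not handled.

```python
import unicodedata

# Phone classes copied from the IPA table; everything else is an assumption.
EXTRA_SHORT = set("ʔ ə ɚ ɾ".split())
STOPS = set("p b t d ʈ ɖ c ɟ k ɡ q ɢ".split())
AFFRICATES = set("pf ts dz ʈʂ ɖʐ ɟʝ kx ɡɣ".split())
VOWELS = set(
    "i u e ə a o y ɔ j w ɪ ʊ ʏ ɯ ɤ ɑ æ ɐ ɚ ɵ ɘ ɛ ɜ ɝ ɞ ɨ ɪ̈ œ ɒ ɶ ø ʉ ʌ".split()
)
EXTRA_STATE_DIACRITICS = set("ʱʼʰʲʷⁿˠ")
DEFAULT_STATES = 3


def split_segments(base: str) -> list[str]:
    """Split a phone into symbols, keeping combining marks with their base."""
    segments: list[str] = []
    for ch in base:
        if unicodedata.combining(ch) and segments:
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments


def num_states(phone: str) -> int:
    """Hypothetical lookup of the HMM state count for an IPA phone."""
    has_diacritic = any(ch in EXTRA_STATE_DIACRITICS for ch in phone)
    base = "".join(ch for ch in phone if ch not in EXTRA_STATE_DIACRITICS)
    if base in EXTRA_SHORT:
        states = 1
    elif base in STOPS:
        states = 2
    elif base in AFFRICATES:
        states = 4
    else:
        segments = split_segments(base)
        all_vowels = all(seg in VOWELS for seg in segments)
        if all_vowels and len(segments) == 2:
            states = 5  # diphthong
        elif all_vowels and len(segments) == 3:
            states = 6  # triphthong
        else:
            states = DEFAULT_STATES
    # Assumption: one extra state in total when any listed diacritic is present.
    return states + (1 if has_diacritic else 0)


for phone in ["ɾ", "t", "tʰ", "ts", "aɪ", "s"]:
    print(phone, num_states(phone))
```

Classes are checked from most to least specific so that an affricate such as ts is never mistaken for a two-symbol sequence of something else.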

For ARPA, we use the following topology calculation. Additionally, stress-marked vowels are collected under a single base phone (i.e., AA0 AA1 AA2 are collected under AA), so they will share states during training.

Non-default ARPA Topologies#

| Phone class | HMM states | Phones |
| --- | --- | --- |
| Extra short phones | 1 | AH0 IH0 ER0 UH0 |
| Stop phones | 2 | B D G (P T K not included because they include aspiration) |
| Affricate phones | 4 | CH JH |
| Diphthongs | 5 | AY0 AY1 AY2 AW0 AW1 AW2 OY0 OY1 OY2 EY0 EY1 EY2 OW0 OW1 OW2 |
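The collapsing of stress-marked ARPA vowels onto a single base phone, described before the table, can be sketched as follows. This is an illustration of the idea rather than MFA's code, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

# An ARPA vowel is letters followed by a stress digit (0, 1, or 2).
STRESS_PATTERN = re.compile(r"^([A-Z]+)([012])$")


def base_phone(phone: str) -> str:
    """Strip the stress digit, if any, so AA0/AA1/AA2 all map to AA."""
    match = STRESS_PATTERN.match(phone)
    return match.group(1) if match else phone


def group_by_base(phones: list[str]) -> dict[str, list[str]]:
    """Collect phones that would share states under one base phone."""
    groups: dict[str, list[str]] = defaultdict(list)
    for phone in phones:
        groups[base_phone(phone)].append(phone)
    return dict(groups)


print(group_by_base(["AA0", "AA1", "AA2", "K", "AY1"]))
# {'AA': ['AA0', 'AA1', 'AA2'], 'K': ['K'], 'AY': ['AY1']}
```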

ARPA Extra Questions#

| Question Group | Phones | Notes |
| --- | --- | --- |
| Bilabial stops | B P | |
| Dentals | D DH | /ð/ often is realized as /d/ for high frequency words in many dialects of American English |
| Flapping | D T | |
| Nasals | M N NG | |
| Voiceless sibilants | CH SH S | |
| Voiced sibilants | JH ZH Z | |
| Voiceless fricatives | F TH HH K | K is included for reductions to a more fricative realization /x/ in casual speech |
| Voiced fricatives | V DH HH G | G included for the same reason as above |
| Dorsals | K G HH | |
| Rhotics | ER0 ER1 ER2 R | ER vowels are really just /ɹ̩/ |
| Low back vowels | AO0 AO1 AO2 AA0 AA1 AA2 | Cot-caught merger |
| Central vowels | ER0 ER1 ER2 AH0 AH1 AH2 UH0 UH1 UH2 IH0 IH1 IH2 | |
| High back vowels | UW1 UW2 UW0 UH1 UH2 UH0 | |
| High front vowels | IY1 IY2 IY0 IH0 IH1 IH2 | |
| Mid front vowels | EY1 EY2 EY0 EH0 EH1 EH2 | |
| Primary stressed vowels | AA1 AE1 AH1 AO1 AW1 AY1 EH1 ER1 EY1 IH1 IY1 OW1 OY1 UH1 UW1 | Following the Kaldi LibriSpeech recipe |
| Secondary stressed vowels | AA2 AE2 AH2 AO2 AW2 AY2 EH2 ER2 EY2 IH2 IY2 OW2 OY2 UH2 UW2 | |
| Unstressed vowels | AA0 AE0 AH0 AO0 AW0 AY0 EH0 ER0 EY0 IH0 IY0 OW0 OY0 UH0 UW0 | |
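Because a phone can belong to several question groups at once, it can help to see the overlap explicitly. The sketch below takes a few rows from the ARPA extra questions table and lists every group a phone falls into. The dictionary layout is purely illustrative; it is not MFA's internal representation, nor the format expected by --phone_groups_path.

```python
# A subset of the ARPA extra questions table (illustrative layout only).
QUESTION_GROUPS = {
    "Bilabial stops": {"B", "P"},
    "Dentals": {"D", "DH"},
    "Flapping": {"D", "T"},
    "Voiceless fricatives": {"F", "TH", "HH", "K"},
    "Voiced fricatives": {"V", "DH", "HH", "G"},
    "Dorsals": {"K", "G", "HH"},
}


def groups_for(phone: str) -> list[str]:
    """Return the names of all question groups that contain `phone`."""
    return [name for name, members in QUESTION_GROUPS.items() if phone in members]


for phone in ("D", "HH", "K"):
    print(phone, "->", groups_for(phone))
# D -> ['Dentals', 'Flapping']
# HH -> ['Voiceless fricatives', 'Voiced fricatives', 'Dorsals']
# K -> ['Voiceless fricatives', 'Dorsals']
```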

Non-default Pinyin Topologies#

| Phone class | HMM states | Phones |
| --- | --- | --- |
| Stop phones | 2 | b d g (p t k not included because they’re aspirated) |
| Affricate phones | 4 | z zh j |
| Aspirated affricate phones | 5 | c ch q |
| Diphthongs | 5 | Two of: i u y e w a o e ü |
| Triphthongs | 6 | Three of: i u y e w a o e ü |

Pinyin Extra Questions#

| Question Group | Phones | Notes |
| --- | --- | --- |
| Bilabial stops | b p | |
| Alveolar stops | d t | |
| Nasals | m n ng | |
| Voiceless sibilants | z zh j c ch q s sh x | |
| Dorsals | k g h | Pinyin h is a velar fricative /x/ |
| Rhotics | r sh e | e is included to capture instances of ɚ |
| Approximants | l r y w | |
| Tone 1 | All monophthongs, diphthongs, triphthongs with tone 1 | |
| Tone 2 | All monophthongs, diphthongs, triphthongs with tone 2 | |
| Tone 3 | All monophthongs, diphthongs, triphthongs with tone 3 | |
| Tone 4 | All monophthongs, diphthongs, triphthongs with tone 4 | |
| Tone 5 | All monophthongs, diphthongs, triphthongs with tone 5 | |
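The tone question groups can be pictured as a simple bucketing of vowel phones by tone. The sketch below assumes that tone is written as a trailing digit from 1 to 5 on the phone label (for example a1 or ai4); that labelling convention is an assumption made for illustration, and the function name is hypothetical.

```python
from collections import defaultdict


def tone_groups(phones: list[str]) -> dict[str, list[str]]:
    """Bucket phones by a trailing tone digit (assumed convention: 1 to 5)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for phone in phones:
        # Phones without a trailing tone digit (consonants) are skipped.
        if phone and phone[-1] in "12345":
            groups[f"Tone {phone[-1]}"].append(phone)
    return dict(groups)


print(tone_groups(["a1", "ai4", "b", "uai4"]))
# {'Tone 1': ['a1'], 'Tone 4': ['ai4', 'uai4']}
```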

Pronunciation modeling#

For the default configuration, pronunciation probabilities are estimated following the second and third SAT blocks. See Add probabilities to a dictionary (mfa train_dictionary) for more details.

A recent experimental feature for training acoustic models is the --train_g2p flag, which switches pronunciation probability estimation from a lexicon-based approach to using a G2P model as the lexicon. As in the standard lexicon-based approach, pronunciations are generated by the initial blocks, but instead of estimating probabilities for individual word/pronunciation pairs and the likelihood of surrounding silence, the G2P model learns a mapping between the graphemes of the input texts and the phones.

Note

See phonological rules for how to specify regular expression-like phonological rules so you don’t have to code every form for a regular rule.

Command reference#

mfa train#

Train a new acoustic model on a corpus and optionally export alignments

mfa train [OPTIONS] CORPUS_DIRECTORY DICTIONARY_PATH OUTPUT_MODEL_PATH

Options

--output_directory <output_directory>#

Path to save alignments.

-c, --config_path <config_path>#

Path to config file to use for training.

-s, --speaker_characters <speaker_characters>#

Number of characters of file names to use for determining speaker, default is to use directory names.

-a, --audio_directory <audio_directory>#

Audio directory root to use for finding audio files.

--phone_set <phone_set_type>#

Enable extra decision tree modeling based on the phone set.

Options:

UNKNOWN | AUTO | MFA | IPA | ARPA | PINYIN

--phone_groups_path <phone_groups_path>#

Path to yaml file defining phone groups.

--rules_path <rules_path>#

Path to yaml file defining phonological rules.

--output_format <output_format>#

Format for aligned output files (default is long_textgrid).

Options:

long_textgrid | short_textgrid | json | csv

--include_original_text#

Flag to include original utterance text in the output.

-p, --profile <profile>#

Configuration profile to use, defaults to “global”

-t, --temporary_directory <temporary_directory>#

Set the default temporary directory, default is Documents/MFA under your home directory (shown here as /home/docs/Documents/MFA)

-j, --num_jobs <num_jobs>#

Set the number of processes to use by default, defaults to 3

--clean, --no_clean#

Remove files from previous runs, default is False

-v, --verbose, -nv, --no_verbose#

Output debug messages, default is False

-q, --quiet, -nq, --no_quiet#

Suppress all output messages (overrides verbose), default is False

--overwrite, --no_overwrite#

Overwrite output files when they exist, default is False

--use_mp, --no_use_mp#

Turn on/off multiprocessing. Multiprocessing is recommended, as it allows for faster execution.

-d, --debug, -nd, --no_debug#

Run extra steps for debugging issues, default is False

--use_postgres, --no_use_postgres#

Use postgres instead of sqlite for extra functionality, default is False

--single_speaker#

Single speaker mode creates multiprocessing splits based on utterances rather than speakers.

--textgrid_cleanup, --no_textgrid_cleanup#

Turn on/off post-processing of TextGrids that cleans up silences and recombines compound words and clitics.

-h, --help#

Show this message and exit.

Arguments

CORPUS_DIRECTORY#

Required argument

DICTIONARY_PATH#

Required argument

OUTPUT_MODEL_PATH#

Required argument

Configuration reference#

API reference#