Train a new acoustic model (mfa train)
You can train new acoustic models from scratch using MFA and export the final alignments as TextGrids at the end. You don’t need a ton of data to generate decent alignments (see the blog post comparing alignments trained on various corpus sizes). Ultimately it comes down to trial and error, so I recommend trying different workflows (using a pretrained model, training your own, or adapting a model to your data) to see what performs best.
Phone set
Note
See phone groups for how to customize phone groups to your specific needs rather than using the preset phone groups of the defined phone sets in this section.
The type of phone set can be specified through --phone_set. Currently only IPA, ARPA, and PINYIN are supported, but I plan to make it more customizable in the future. The primary benefit of specifying the phone set is to create phone topologies that are more sensible than the defaults.
The default phone model uses 3 HMM states to represent phones, as that generally does a decent job of capturing the dynamic nature of phones. Something like an aspirated stop typically has three clear states: a closure, a burst, and an aspiration period. However, other phones like a tap, glottal stop, or unstressed schwa are so short that a three-state model can cause misalignment errors. For these, a single HMM state is more sensible, as it gives them a shorter minimum duration (each HMM state has a minimum duration of 10 ms).
For vowels, 3 states generally make sense for monophthongs, where one state corresponds to the onset, one to the “steady state”, and one to the offset. For diphthongs and triphthongs, three states don’t map as clearly onto the phases of the vowel, as you’ll have an onset, a first target, a transition, a second target, and an offset (plus a third target for triphthongs). Specifying a phone set will use preset classes for stops, affricates, diphthongs, triphthongs, and extra short segments. Certain diacritics (ʱʼʰʲʷⁿˠ) will result in one more state being added, as they represent quite different acoustics from the base phone.
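The state arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not MFA's implementation: the function and constant names are hypothetical, and it assumes a phone gets exactly one extra state if it carries any of the listed diacritics, no matter how many.

```python
# Illustrative sketch only: none of these names are part of MFA's API.
STATE_MIN_MS = 10       # minimum duration of one HMM state, per the text
DEFAULT_STATES = 3      # default phone topology
# Diacritics listed in the text as adding one more state.
EXTRA_STATE_DIACRITICS = set("ʱʼʰʲʷⁿˠ")


def num_states(phone: str, base_states: int = DEFAULT_STATES) -> int:
    """Return the state count for a phone.

    Assumption: one extra state is added if the phone carries any of the
    listed diacritics, regardless of how many it carries.
    """
    has_diacritic = any(ch in EXTRA_STATE_DIACRITICS for ch in phone)
    return base_states + (1 if has_diacritic else 0)


def min_duration_ms(phone: str, base_states: int = DEFAULT_STATES) -> int:
    """Minimum duration implied by the state count (10 ms per state)."""
    return num_states(phone, base_states) * STATE_MIN_MS


print(min_duration_ms("a"))      # default 3 states -> 30
print(min_duration_ms("ɾ", 1))   # tap as an extra short phone -> 10
print(min_duration_ms("tʰ", 2))  # 2-state stop plus aspiration -> 30
```

The point of the single-state topology falls out of the arithmetic: a tap modeled with the default topology could never be shorter than 30 ms, while a one-state model allows 10 ms.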
An additional benefit is in guiding the decision tree clustering for triphone modeling, where using phone sets will add extra questions for allophonic variation, as well as for general classes of sounds (sibilant sounds, places of articulation, rhotics, groups of vowels, etc). These questions should be more appropriate than the default setting.
The topology generated by the IPA phone set defines base phone classes for extra short phones, stop phones, affricate phones, diphthongs, and triphthongs. Any phones listed below that are not used in the dictionary will be ignored.
| Phone class | HMM states | Phones |
|---|---|---|
| Extra short phones | 1 | |
| Stop phones | 2 | |
| Affricate phones | 4 | |
| Diphthongs | 5 | Two of: |
| Triphthongs | 6 | Three of: |
For ARPA, we use the following topology calculation. Additionally, stress-marked vowels are collected under a single base phone (i.e., AA0 AA1 AA2 are collected under AA), so they will share states during training.
| Phone class | HMM states | Phones |
|---|---|---|
| Extra short phones | 1 | |
| Stop phones | 2 | |
| Affricate phones | 4 | |
| Diphthongs | 5 | |
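The stress collapsing described for ARPA (AA0, AA1, and AA2 collected under AA) can be sketched as follows. This is a minimal illustration under the assumption that stress is marked by a single trailing digit 0, 1, or 2; the helper names are hypothetical and not part of MFA.

```python
import re
from collections import defaultdict

# Assumption: an ARPA stress-marked vowel is uppercase letters followed by
# a single stress digit (0, 1, or 2).
_STRESSED = re.compile(r"^([A-Z]+)[0-2]$")


def base_phone(phone: str) -> str:
    """Strip the stress digit, e.g. 'AA1' -> 'AA'; other phones pass through."""
    match = _STRESSED.match(phone)
    return match.group(1) if match else phone


def group_by_base(phones):
    """Collect phones under their shared base phone, preserving input order."""
    groups = defaultdict(list)
    for phone in phones:
        groups[base_phone(phone)].append(phone)
    return dict(groups)


print(group_by_base(["AA0", "AA1", "AA2", "K", "IY1"]))
# {'AA': ['AA0', 'AA1', 'AA2'], 'K': ['K'], 'IY': ['IY1']}
```

Phones that end up in the same group are the ones that would share states during training.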

| Question Group | Phones | Notes |
|---|---|---|
| Bilabial stops | | |
| Dentals | | |
| Flapping | | |
| Nasals | | |
| Voiceless sibilants | | |
| Voiced sibilants | | |
| Voiceless fricatives | | |
| Voiced fricatives | | G included for the same reason as above |
| Dorsals | | |
| Rhotics | | |
| Low back vowels | | Cot-caught merger |
| Central vowels | | |
| High back vowels | | |
| High front vowels | | |
| Mid front vowels | | |
| Primary stressed vowels | | Following the Kaldi LibriSpeech recipe |
| Secondary stressed vowels | | |
| Unstressed vowels | | |

| Phone class | HMM states | Phones |
|---|---|---|
| Stop phones | 2 | |
| Affricate phones | 4 | |
| Aspirated affricate phones | 5 | |
| Diphthongs | 5 | Two of: |
| Triphthongs | 6 | Three of: |

| Question Group | Phones | Notes |
|---|---|---|
| Bilabial stops | | |
| Alveolar stops | | |
| Nasals | | |
| Voiceless sibilants | | |
| Dorsals | | Pinyin |
| Rhotics | | |
| Approximants | | |
| Tone 1 | All monophthongs, diphthongs, and triphthongs with tone 1 | |
| Tone 2 | All monophthongs, diphthongs, and triphthongs with tone 2 | |
| Tone 3 | All monophthongs, diphthongs, and triphthongs with tone 3 | |
| Tone 4 | All monophthongs, diphthongs, and triphthongs with tone 4 | |
| Tone 5 | All monophthongs, diphthongs, and triphthongs with tone 5 | |
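The tone question groups above amount to bucketing vowel phones by their tone. A rough sketch of that grouping follows, assuming (this is not stated in the table) that each vowel phone label ends in a tone digit from 1 to 5. The phone labels and the function name are hypothetical examples, not MFA's data or API.

```python
from collections import defaultdict


def tone_question_groups(vowel_phones):
    """Bucket vowel phones by trailing tone digit.

    Assumption: the tone is encoded as the final character, '1' to '5'.
    Phones without such a suffix are left out of every tone group.
    """
    groups = defaultdict(list)
    for phone in vowel_phones:
        if phone and phone[-1] in "12345":
            groups[f"Tone {phone[-1]}"].append(phone)
    # Sort by group name so the result reads Tone 1 ... Tone 5.
    return dict(sorted(groups.items()))


# Hypothetical phone labels, for illustration only.
print(tone_question_groups(["a1", "ai4", "iao3", "a4", "e5"]))
# {'Tone 1': ['a1'], 'Tone 3': ['iao3'], 'Tone 4': ['ai4', 'a4'], 'Tone 5': ['e5']}
```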
Pronunciation modeling
For the default configuration, pronunciation probabilities are estimated following the second and third SAT blocks. See Add probabilities to a dictionary (mfa train_dictionary) for more details.
A recent experimental feature for training acoustic models is the --train_g2p flag, which changes pronunciation probability estimation from a lexicon-based estimate to one that uses a G2P model as the lexicon. As in the standard lexicon-based approach, pronunciations are generated by the initial blocks. However, instead of estimating probabilities for individual word/pronunciation pairs and the likelihood of surrounding silence, training learns a mapping between the graphemes of the input texts and the phones.
Note
See phonological rules for how to specify regular expression-like phonological rules so you don’t have to code every form for a regular rule.
Command reference
mfa train
Train a new acoustic model on a corpus and optionally export alignments
mfa train [OPTIONS] CORPUS_DIRECTORY DICTIONARY_PATH OUTPUT_MODEL_PATH
Options
- --output_directory <output_directory>
Path to save alignments.
- -c, --config_path <config_path>
Path to config file to use for training.
- -s, --speaker_characters <speaker_characters>
Number of characters of file names to use for determining speaker, default is to use directory names.
- -a, --audio_directory <audio_directory>
Audio directory root to use for finding audio files.
- --phone_set <phone_set_type>
Enable extra decision tree modeling based on the phone set.
- Options:
UNKNOWN | AUTO | MFA | IPA | ARPA | PINYIN
- --phone_groups_path <phone_groups_path>
Path to yaml file defining phone groups.
- --rules_path <rules_path>
Path to yaml file defining phonological rules.
- --output_format <output_format>
Format for aligned output files (default is long_textgrid).
- Options:
long_textgrid | short_textgrid | json | csv
- --include_original_text
Flag to include original utterance text in the output.
- -p, --profile <profile>
Configuration profile to use, defaults to “global”
- -t, --temporary_directory <temporary_directory>
Set the default temporary directory, default is /home/docs/Documents/MFA
- -j, --num_jobs <num_jobs>
Set the number of processes to use by default, defaults to 3
- --clean, --no_clean
Remove files from previous runs, default is False
- -v, --verbose, -nv, --no_verbose
Output debug messages, default is False
- -q, --quiet, -nq, --no_quiet
Suppress all output messages (overrides verbose), default is False
- --overwrite, --no_overwrite
Overwrite output files when they exist, default is False
- --use_mp, --no_use_mp
Turn on/off multiprocessing. Multiprocessing is recommended, as it allows for faster execution.
- -d, --debug, -nd, --no_debug
Run extra steps for debugging issues, default is False
- --use_postgres, --no_use_postgres
Use postgres instead of sqlite for extra functionality, default is False
- --single_speaker
Single speaker mode creates multiprocessing splits based on utterances rather than speakers.
- --textgrid_cleanup, --no_textgrid_cleanup
Turn on/off post-processing of TextGrids that cleans up silences and recombines compound words and clitics.
- -h, --help
Show this message and exit.
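To make the --speaker_characters behavior more concrete, here is a small sketch of the rule as the option text describes it: by default the speaker comes from the directory name, otherwise from the first N characters of the file name. The function name and the example paths are hypothetical, and MFA's actual corpus loading may handle edge cases differently.

```python
from pathlib import PurePosixPath
from typing import Optional


def speaker_for(path: str, speaker_characters: Optional[int] = None) -> str:
    """Guess the speaker label for a sound file.

    - No speaker_characters: use the name of the containing directory.
    - speaker_characters = N: use the first N characters of the file name.
    """
    p = PurePosixPath(path)
    if speaker_characters:
        return p.name[:speaker_characters]
    return p.parent.name


print(speaker_for("corpus/speaker_a/rec_001.wav"))  # speaker_a
print(speaker_for("corpus/s01_rec_001.wav", 3))     # s01
```

The second form is useful for flat corpora where every file sits in one directory and the speaker ID is a fixed-width prefix of the file name.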
Arguments
- CORPUS_DIRECTORY
Required argument
- DICTIONARY_PATH
Required argument
- OUTPUT_MODEL_PATH
Required argument
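As a usage sketch, the following Python helper assembles an `mfa train` command line from the three required arguments and a few of the options listed above. The helper name and all paths are hypothetical placeholders. Only flags documented in this section are used, and the command is printed rather than executed.

```python
import shlex


def build_train_command(corpus_directory, dictionary_path, output_model_path,
                        *, output_directory=None, phone_set=None,
                        num_jobs=None, clean=False):
    """Build the argv list for `mfa train [OPTIONS] CORPUS DICT MODEL`."""
    argv = ["mfa", "train"]
    if output_directory:
        argv += ["--output_directory", output_directory]
    if phone_set:
        argv += ["--phone_set", phone_set]
    if num_jobs is not None:
        argv += ["--num_jobs", str(num_jobs)]
    if clean:
        argv.append("--clean")
    # Positional arguments come last, in the order given in the synopsis.
    argv += [corpus_directory, dictionary_path, output_model_path]
    return argv


# Hypothetical paths, for illustration only.
argv = build_train_command(
    "my_corpus", "my_dictionary.txt", "my_model.zip",
    output_directory="aligned", phone_set="ARPA", num_jobs=4, clean=True,
)
print(shlex.join(argv))
# mfa train --output_directory aligned --phone_set ARPA --num_jobs 4 --clean my_corpus my_dictionary.txt my_model.zip
```

With MFA installed, the same list could be passed to `subprocess.run(argv, check=True)` to run the training.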