Train a new acoustic model (mfa train)#

You can train new acoustic models from scratch using MFA and export the final alignments as TextGrids at the end. Decent alignments do not require a large amount of data (see the blog post comparing alignments trained on various corpus sizes). In practice, finding the best setup takes trial and error, so try several workflows (using a pretrained model, training your own, or adapting a model to your data) and compare which performs best.

Phone topology#

The phone topology that MFA uses differs from the standard 3-state HMM. Each phone can have up to 5 states, but the topology allows early exiting, so a phone has a minimum duration of 10 ms (one MFCC frame) rather than the 30 ms (three MFCC frames) required by a 3-state HMM.
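The minimum durations quoted above follow from the frame shift: every state a phone must pass through consumes at least one MFCC frame. A minimal sketch of that arithmetic, assuming the 10 ms frame shift implied by the text (the function name is illustrative and not part of MFA):

```python
def min_phone_duration_ms(mandatory_states: int, frame_shift_ms: float = 10.0) -> float:
    """Shortest duration a phone can occupy.

    Each mandatory HMM state consumes at least one MFCC frame, so the
    minimum duration is the number of mandatory states times the frame shift.
    """
    if mandatory_states < 1:
        raise ValueError("a phone must pass through at least one state")
    return mandatory_states * frame_shift_ms


# Standard 3-state topology: all three states must be visited.
standard_min = min_phone_duration_ms(3)

# Topology with early exit: only one state must be visited,
# even though up to five are available.
early_exit_min = min_phone_duration_ms(1)

print(standard_min, early_exit_min)
```

The maximum of 5 states does not enter the calculation; only the number of states that cannot be skipped sets the lower bound.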

Phone groups#

By default, each phone is treated independently of the others, which can lead to data sparsity or worse contextual modeling for clearly related phones when modeling triphones (e.g., long/short vowels ɑ/ɑː, or stress variants OY1/OY2/OY0). Phone groups can be specified via the --phone_groups_path flag. See phone groups for more information.

Deprecated since version 3.0.0: Using the --phone_set flag to generate phone groups is deprecated as of MFA 3.0. Use the --phone_groups_path flag to specify a phone groups configuration file instead.
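As a rough illustration of what a phone groups file might contain, the sketch below writes the two groupings mentioned above to a YAML file. It assumes the file is a YAML list of lists with one inner list per group; that layout is an assumption here, so check the examples in MontrealCorpusTools/mfa-models before relying on it. The helper name write_phone_groups is hypothetical.

```python
import tempfile
from pathlib import Path


def write_phone_groups(groups, path):
    """Write phone groups as a YAML list of lists (one inner list per group).

    ASSUMPTION: the list-of-lists layout is not confirmed by this page;
    verify it against the mfa-models example files.
    Phones are written unquoted, so symbols that are special in YAML
    (for example ':' or '#') would need quoting, which this sketch omits.
    """
    lines = []
    for group in groups:
        if not group:
            raise ValueError("phone groups must not be empty")
        # The first member opens the inner list ("- - X");
        # the remaining members continue it ("  - Y").
        lines.append(f"- - {group[0]}")
        lines.extend(f"  - {phone}" for phone in group[1:])
    path = Path(path)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


# Group long/short vowels together, and the stress variants of OY together.
groups = [["ɑ", "ɑː"], ["OY0", "OY1", "OY2"]]
out_path = write_phone_groups(groups, Path(tempfile.mkdtemp()) / "phone_groups.yaml")
print(out_path.read_text(encoding="utf-8"))
```

The resulting path would then be passed to training via --phone_groups_path.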

Pronunciation modeling#

For the default configuration, pronunciation probabilities are estimated following the second and third SAT blocks. See Add probabilities to a dictionary (mfa train_dictionary) for more details.

A recent experimental feature for training acoustic models is the --train_g2p flag, which changes pronunciation probability estimation from a lexicon-based approach to one that uses a G2P model as the lexicon. As in the standard lexicon-based approach, pronunciations are generated by the initial blocks. However, instead of estimating probabilities for individual word/pronunciation pairs and the likelihood of surrounding silence, training learns a mapping between the graphemes of the input texts and the phones.


See phonological rules for how to specify regular expression-like phonological rules so you don’t have to code every form for a regular rule.

Language tokenization#

By specifying a language via the --language flag, tokenization will occur as part of text normalization. This functionality is primarily useful for languages that do not rely on spaces to delimit words, such as Japanese, Thai, or Chinese. If you’re also using --g2p_model_path to generate pronunciations during training, note that the language setting requires G2P models trained on specific orthographies (e.g., use mfa model download g2p korean_jamo_mfa instead of mfa model download g2p korean_mfa).


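Japanese
    Example output: コレ ワ ニホンゴ デス
    G2P model: Katakana G2P

Korean
    Example input: 이건 한국어야
    Example output: 이건 한국어 야
    Dependencies: python-mecab-ko, jamo
    G2P model: Jamo G2P

Chinese
    Example output: zhèshì zhōngwén
    Dependencies: spacy-pkuseg, hanziconv, dragonmapper
    G2P model: Pinyin G2P

Thai
    Pronunciation orthography: Thai script
    Example output: นี่ คือ ภาษาไทย
    G2P model: Thai G2P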
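The Korean case mentioned above can be sketched as a small helper that picks the G2P model to download depending on whether --language tokenization is in use. The two model names come from the text; the function names are hypothetical, and the sketch only builds the command rather than running it.

```python
import shlex


def korean_g2p_model(uses_language_flag: bool) -> str:
    """Pick the Korean G2P model name.

    With --language korean, text is normalized to jamo, so the jamo-based
    G2P model is needed; otherwise the standard Korean model applies.
    """
    return "korean_jamo_mfa" if uses_language_flag else "korean_mfa"


def g2p_download_command(model_name: str) -> list:
    """Build the argv list for downloading a pretrained G2P model."""
    return ["mfa", "model", "download", "g2p", model_name]


command = g2p_download_command(korean_g2p_model(uses_language_flag=True))
print(shlex.join(command))
```

To actually run the download, the list could be passed to subprocess.run(command, check=True) on a machine where MFA is installed.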

Command reference#

mfa train#

Train a new acoustic model on a corpus and optionally export alignments



--output_directory <output_directory>#

Path to save alignments.

-c, --config_path <config_path>#

Path to config file to use for training. See MontrealCorpusTools/mfa-models for examples.

-s, --speaker_characters <speaker_characters>#

Number of characters of file names to use for determining speaker, default is to use directory names.

-a, --audio_directory <audio_directory>#

Audio directory root to use for finding audio files.

--phone_set <phone_set_type>#

DEPRECATED, please use --phone_groups_path to specify phone groups instead.



--phone_groups_path <phone_groups_path>#

Path to yaml file defining phone groups. See MontrealCorpusTools/mfa-models for examples.

--rules_path <rules_path>#

Path to yaml file defining phonological rules. See MontrealCorpusTools/mfa-models for examples.

--output_format <output_format>#

Format for aligned output files (default is long_textgrid).


Options: long_textgrid | short_textgrid | json | csv


Flag to include original utterance text in the output.

--language <language>#

Language to use for spacy tokenizers and other preprocessing of language data.


Options: unknown | catalan | chinese | croatian | danish | dutch | english | finnish | french | german | greek | italian | japanese | korean | lithuanian | macedonian | multilingual | norwegian | polish | portuguese | romanian | russian | slovenian | spanish | swedish | thai | ukrainian

--g2p_model_path <g2p_model_path>#

Path to G2P model to use for OOV items.

-p, --profile <profile>#

Configuration profile to use, defaults to “global”

-t, --temporary_directory <temporary_directory>#

Set the default temporary directory, default is /home/docs/Documents/MFA

-j, --num_jobs <num_jobs>#

Set the number of processes to use by default, defaults to 3

--clean, --no_clean#

Remove files from previous runs, default is False

--final_clean, --no_final_clean#

Remove temporary files at the end of run, default is False

-v, --verbose, -nv, --no_verbose#

Output debug messages, default is False

-q, --quiet, -nq, --no_quiet#

Suppress all output messages (overrides verbose), default is False

--overwrite, --no_overwrite#

Overwrite output files when they exist, default is False

--use_mp, --no_use_mp#

Turn on/off multiprocessing. Multiprocessing is recommended, as it allows for faster execution.

--use_threading, --no_use_threading#

Use the threading library rather than the multiprocessing library. Multiprocessing is recommended, as it allows for faster execution.

-d, --debug, -nd, --no_debug#

Run extra steps for debugging issues, default is False

--use_postgres, --no_use_postgres#

Use postgres instead of sqlite for extra functionality, default is False


Single speaker mode creates multiprocessing splits based on utterances rather than speakers. This mode also disables speaker adaptation, equivalent to --uses_speaker_adaptation false.

--textgrid_cleanup, --cleanup_textgrids, --no_textgrid_cleanup, --no_cleanup_textgrids#

Turn on/off post-processing of TextGrids that cleans up silences and recombines compound words and clitics.

-h, --help#

Show this message and exit.



Required argument


Required argument


Required argument
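The options listed above can be assembled into a command line programmatically. The following is a minimal sketch, assuming the three required arguments are the corpus directory, the dictionary path, and the output model path, in that order (an assumption; confirm with mfa train --help). The helper build_train_command is hypothetical. Boolean values map onto the paired --flag / --no_flag switches shown in the reference, and all other values are passed as --name value.

```python
import shlex


def build_train_command(corpus_directory, dictionary_path, output_model_path, **options):
    """Assemble an argv list for `mfa train`.

    ASSUMPTION: positional order is corpus, dictionary, output model.
    Booleans become `--name` or `--no_name`; everything else becomes
    `--name value`. Option names are not validated here.
    """
    argv = [
        "mfa",
        "train",
        str(corpus_directory),
        str(dictionary_path),
        str(output_model_path),
    ]
    for name, value in options.items():
        if isinstance(value, bool):
            # Paired switches such as --clean / --no_clean.
            argv.append(f"--{name}" if value else f"--no_{name}")
        else:
            argv.extend([f"--{name}", str(value)])
    return argv


command = build_train_command(
    "corpus",
    "dictionary.txt",
    "new_model.zip",
    output_directory="aligned",
    num_jobs=4,
    clean=True,
    use_mp=False,
)
print(shlex.join(command))
```

Building a list rather than a single string avoids shell quoting problems with paths that contain spaces; the list can be handed to subprocess.run(command, check=True) where MFA is installed.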

Configuration reference#

API reference#