Train a new G2P model (mfa train_g2p)#

Another tool included with MFA allows you to train a G2P model from a given pronunciation dictionary. Models of this type can then be used to generate pronunciations for unknown words (see Generate pronunciations for words (mfa g2p)). Training requires a pronunciation dictionary in which each line consists of an orthographic transcription followed by its phonetic transcription. The model is trained using the Pynini package, which produces FST (finite state transducer) files; the implementation is based on the Sigmorphon 2020 G2P task baseline. The trained G2P model is saved as a .zip file, like the acoustic models generated from alignment.
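As a hedged sketch of the input format described above, the following creates a toy dictionary (the words and phone symbols are invented placeholders) and then invokes training, guarded so the `mfa` call only runs where MFA is installed:

```shell
# Create a toy pronunciation dictionary: each line is an orthographic word
# followed by its whitespace-separated phonetic transcription.
# (Words and phone symbols here are invented placeholders.)
cat > toy_dictionary.txt <<'EOF'
cat	k ae t
cab	k ae b
bat	b ae t
EOF

# Train a G2P model from the dictionary (only runs if MFA is installed).
if command -v mfa >/dev/null; then
    mfa train_g2p toy_dictionary.txt toy_g2p.zip
fi
```

A real dictionary would contain many more entries; with only a handful of words, the trained model would have little to generalize from.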

See Example 3: Train Mandarin G2P model for an example of how to train a G2P model with a premade toy example.

Note

As of version 2.0.6, users on Windows can run this command natively without requiring Windows Subsystem for Linux, see Installation for more details.

Phonetisaurus style models#

As of MFA release 2.0, Phonetisaurus-style G2P models are trainable! The default Pynini implementation is based on a general pair ngram model; however, it assumes a reasonably one-to-one correspondence between graphemes and phones, with allowances for deletions/insertions covering some one-to-many correspondences. This works reasonably well for languages that use some form of alphabet, even for less transparent orthographies like English and French.

However, the basic pair ngram implementation struggles with languages that use syllabaries or logographic systems, like Japanese and Chinese. MFA 1.0 used Phonetisaurus as its G2P backend, which had better support for one-to-many mappings. The Pynini implementation in MFA 2.0 encodes input strings as paired linear FSAs, so each grapheme leads to the next one, and the model being optimized maps all graphemes to all phones and learns their weights. Phonetisaurus does not encode the input as separate linear FSAs; instead, it builds a single FST covering all alignments between graphemes and phonemes and optimizes from there. Phonetisaurus also has explicit support for windows of surrounding graphemes and phonemes, allowing for more efficient learning of patterns like [ɲ i].

The original Pynini implementation of pair ngram models for G2P was motivated primarily by ease of installation, as Phonetisaurus is not on Conda Forge and does not have easily installable binaries; the --phonetisaurus flag therefore implements the Phonetisaurus algorithm using Pynini.
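Concretely, selecting the Phonetisaurus-style algorithm is just an extra flag on the same command. A minimal sketch, where the dictionary and model file names are placeholders:

```shell
# Dictionary and output model paths (placeholder names for illustration).
dict=mandarin_dictionary.txt
model=mandarin_g2p.zip

# Default pair ngram model:   mfa train_g2p "$dict" "$model"
# Phonetisaurus-style model, better suited to one-to-many grapheme-phone
# mappings (e.g. syllabaries and logographic scripts):
if command -v mfa >/dev/null; then
    mfa train_g2p --phonetisaurus "$dict" "$model"
fi
```

The output in both cases is a .zip model usable with mfa g2p; only the internal alignment algorithm differs.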

Command reference#

mfa train_g2p#

Train a G2P model from a pronunciation dictionary.

mfa train_g2p [OPTIONS] DICTIONARY_PATH OUTPUT_MODEL_PATH

Options

-c, --config_path <config_path>#

Path to config file to use for training.

--phonetisaurus#

Flag for using Phonetisaurus-style models.

--evaluate, --validate#

Perform an accuracy analysis by training on most of the data and validating on a held-out subset.

-p, --profile <profile>#

Configuration profile to use, defaults to “global”

-t, --temporary_directory <temporary_directory>#

Set the default temporary directory, default is /home/docs/Documents/MFA

-j, --num_jobs <num_jobs>#

Set the number of processes to use by default, defaults to 3

--clean, --no_clean#

Remove files from previous runs, default is False

-v, --verbose, -nv, --no_verbose#

Output debug messages, default is False

-q, --quiet, -nq, --no_quiet#

Suppress all output messages (overrides verbose), default is False

--overwrite, --no_overwrite#

Overwrite output files when they exist, default is False

--use_mp, --no_use_mp#

Turn on/off multiprocessing. Multiprocessing is recommended and will allow for faster execution.

--use_threading, --no_use_threading#

Use the threading library rather than the multiprocessing library. Multiprocessing is recommended and will allow for faster execution.

-d, --debug, -nd, --no_debug#

Run extra steps for debugging issues, default is False

--use_postgres, --no_use_postgres#

Use postgres instead of sqlite for extra functionality, default is False

--single_speaker#

Single speaker mode creates multiprocessing splits based on utterances rather than speakers. This mode also disables speaker adaptation, equivalent to --uses_speaker_adaptation false.

--textgrid_cleanup, --cleanup_textgrids, --no_textgrid_cleanup, --no_cleanup_textgrids#

Turn on/off post-processing of TextGrids that cleans up silences and recombines compound words and clitics.

-h, --help#

Show this message and exit.

Arguments

DICTIONARY_PATH#

Required argument. Path to the pronunciation dictionary to train from.

OUTPUT_MODEL_PATH#

Required argument. Path where the trained G2P model (.zip) will be saved.

Configuration reference#

API reference#