Generating a dictionary

We have trained several G2P models that are available for download (Pretrained G2P models).

Warning

Please note that G2P models trained prior to 2.0 cannot be used with MFA 2.0. If you would like to use these models, please use the 1.0.1 or 1.1 g2p utilities, or train a new G2P model following Training a new G2P model.

To construct a pronunciation dictionary from your .lab or .TextGrid files, simply input:

mfa g2p g2p_model_path input_path output_path

The g2p_model_path argument can be either a fully specified path to a G2P model you’ve trained previously or a model that you’ve downloaded via the mfa download g2p command (see Pretrained G2P models). The input_path argument can be either a text file of words to generate transcriptions for or a corpus directory. For a corpus directory, MFA inspects the text transcripts, compiles a word list from them, and generates pronunciations for that list. The output_path argument is the full path where the resulting pronunciation dictionary should be saved.
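As a rough illustration of the text-file form of input_path, the sketch below builds a word list from a few toy .lab transcripts. It assumes a plain one-word-per-line layout and whitespace-only tokenization, neither of which is confirmed above; the file names are hypothetical. When you pass a corpus directory, MFA compiles the word list itself, so this step is only needed if you want to control the list by hand.

```shell
# Hypothetical scratch directory and file names, used only for this illustration.
corpus_dir="$(mktemp -d)"
word_list="$corpus_dir/words.txt"

# Two toy transcripts standing in for a real corpus.
printf 'the cat sat\n' > "$corpus_dir/utt1.lab"
printf 'the dog sat down\n' > "$corpus_dir/utt2.lab"

# Split the transcripts on whitespace, drop empty lines, and keep each word once.
# Assumption: one word per line is an acceptable word-list layout.
cat "$corpus_dir"/*.lab | tr -s '[:space:]' '\n' | sed '/^$/d' | LC_ALL=C sort -u > "$word_list"

# The word list could then be passed as input_path, for example:
# mfa g2p g2p_model_path "$word_list" output_path
```

Real transcripts usually also need punctuation and case handling, which this sketch deliberately leaves out.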

Note

To generate pronunciations that supplement your existing pronunciation dictionary, run the validation utility (see Running the validation utility), and then use the path to the oovs_found.txt file that it generates as the input_path.
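Once pronunciations for the out-of-vocabulary words have been generated, they still have to be combined with the existing dictionary. A minimal sketch of that merge step follows. The file names and toy entries are hypothetical, and it assumes both files use the same one-entry-per-line layout (a word followed by its phones).

```shell
# Hypothetical scratch directory, file names, and toy entries for illustration only.
work_dir="$(mktemp -d)"
printf 'cat\tk ae t\ndog\td ao g\n' > "$work_dir/existing.dict"

# Stand-in for the dictionary that a command such as
#   mfa g2p g2p_model_path path/to/oovs_found.txt oov_generated.dict
# would write; here the content is invented for the example.
printf 'zebra\tz iy b r ah\n' > "$work_dir/oov_generated.dict"

# Concatenate both dictionaries and drop exact duplicate lines.
cat "$work_dir/existing.dict" "$work_dir/oov_generated.dict" | LC_ALL=C sort -u > "$work_dir/combined.dict"
```

Since the generated pronunciations come from a model, it is worth reviewing them before relying on the combined dictionary.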

Pronunciation dictionaries can also be generated from the orthographies of the words themselves, rather than relying on a trained G2P model. This functionality should be reserved for languages with transparent orthographies, that is, orthographies with a close to 1-to-1 grapheme-to-phoneme mapping. To use it, omit the G2P model path:

mfa g2p input_path output_path
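To make the idea of a 1-to-1 grapheme-to-phoneme mapping concrete, the sketch below treats every character of a word as its own phone. This is only an illustration of the concept, not a description of how MFA derives pronunciations internally; the example word and the tab-separated entry layout are assumptions.

```shell
# Hypothetical example word; ASCII only, since sed's handling of
# multi-byte characters depends on the locale.
word="gato"

# Insert a space after every character, then trim the trailing space.
phones="$(printf '%s' "$word" | sed 's/./& /g; s/ $//')"

# Assemble a dictionary-style entry: the word, a tab, then its phones.
entry="$(printf '%s\t%s' "$word" "$phones")"
printf '%s\n' "$entry"
```

A mapping this naive breaks down as soon as a language uses digraphs or silent letters, which is why the approach is limited to transparent orthographies.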

Extra options (see G2P Configuration for full configuration details):

-t DIRECTORY
--temp_directory DIRECTORY

Temporary directory root to use when generating the dictionary; the default is ~/Documents/MFA

-j NUMBER
--num_jobs NUMBER

Number of jobs to use; defaults to 3. Set this higher if you have more processors available and would like to generate pronunciations faster.

-c
--clean

Forces removal of temporary files under ~/Documents/MFA or the specified temporary directory prior to generating the dictionary.

--config_path

Path to a configuration YAML file for G2P generation (see G2P generation configuration file for an example YAML file)

-n NUMBER
--num_pronunciations NUMBER

Number of pronunciation variants to generate per word; the default is 1

--include_bracketed

Flag for whether to generate pronunciations for words that are enclosed in brackets (i.e., […], (…), <…>)
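The options above can be combined in a single call. The sketch below picks a job count from the number of online processors and assembles a command using only flags listed above. The paths are the same placeholders used earlier, the choice of two variants per word is arbitrary, and placing the options after the positional arguments is an assumption. The command is printed rather than executed so it can be checked first.

```shell
# Use the number of online processors if it can be detected,
# otherwise fall back to the documented default of 3.
num_jobs="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 3)"

# Placeholder paths; the command is printed for inspection, not run.
g2p_command="mfa g2p g2p_model_path input_path output_path --num_jobs $num_jobs --num_pronunciations 2 --clean"
printf '%s\n' "$g2p_command"
```

Using every processor is not always the best choice on a shared machine, so treat the detected value as an upper bound rather than a recommendation.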

See Example 2: Generate Mandarin dictionary for an example of how to use G2P functionality with a premade example.