Generating a dictionary

We have trained several grapheme-to-phoneme (G2P) models that are available for download (see Pretrained G2P models).


To reconstruct a pronunciation dictionary from your .lab or .TextGrid files, pass the G2P model, the corpus directory, and the output file:

bin/mfa_generate_dict /path/to/model/ /path/to/corpus /path/to/output_dict.txt


In examples/example_labs you will find several sample .lab files (orthographic transcriptions) from the THCHS-30 corpus, organized much as they would be for any alignment task. The dictionary reconstructor compiles a list of all the orthographic word forms in the files, generates a phonetic transcription for each word, and writes the resulting pronunciation dictionary to a file. Let’s start by running the reconstructor, in the same way as above:

bin/mfa_generate_dict examples/CH_models examples/CH chinese_dict.txt

Generating the dictionary should take no more than a few seconds. Open the output file and check that every word from the transcriptions has an entry. The accuracy of the transcriptions should be near 100%. You can now use this dictionary to align your mini corpus:

bin/mfa_train_and_align examples/CH chinese_dict.txt examples/aligned_output

Because the example contains very few files (i.e., a small training set), the alignment will be suboptimal. The example is meant to give a sense of the pipeline for generating a dictionary and using it for alignment, not to produce high-quality alignments.
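
The word-list step and the manual check of the output file described above can be approximated with a short script. This is only a sketch under stated assumptions, not the aligner's own code: it assumes the .lab files are UTF-8 text with whitespace-separated words, and that each dictionary line is a word followed by its phones, also separated by whitespace. The function names (collect_word_forms, dictionary_words, missing_words) are hypothetical helpers introduced for illustration, and .TextGrid files are not handled.

```python
from pathlib import Path


def collect_word_forms(corpus_dir):
    """Collect the unique whitespace-separated tokens from every .lab file."""
    words = set()
    # rglob also descends into subdirectories (e.g. one folder per speaker).
    for lab_file in Path(corpus_dir).rglob("*.lab"):
        # Assumption: transcripts are UTF-8 and words are separated by whitespace.
        words.update(lab_file.read_text(encoding="utf-8").split())
    return words


def dictionary_words(dict_path):
    """Return the words that have at least one phone listed in the dictionary."""
    covered = set()
    with open(dict_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            # Assumption: each entry is a word followed by its phones.
            # Blank lines and words without any phones are not counted.
            if len(fields) >= 2:
                covered.add(fields[0])
    return covered


def missing_words(corpus_dir, dict_path):
    """List corpus word forms that have no dictionary entry, sorted."""
    return sorted(collect_word_forms(corpus_dir) - dictionary_words(dict_path))
```

For the example in this section, calling missing_words("examples/CH", "chinese_dict.txt") should return an empty list if every word form received a pronunciation. The sketch does no normalization (case folding, punctuation stripping), so the tool itself may tokenize the transcriptions differently.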