First steps#

The mfa command line utility has grown over the years to encompass a number of utility functions. This section aims to provide a path for first-time users to figure out the workflow that works best for them.

Also check out External tutorials for community-written tutorials and blog posts on specific topics.

Use cases#

There are several broad use cases that you might want to use MFA for. Take a look below; if any of them closely matches your situation, you should be able to apply the linked instructions to your data.

  1. Use case 1: You have a speech corpus, the language has a pretrained acoustic model and pretrained dictionary.

    1. Follow Aligning a speech corpus with existing pronunciation dictionary and acoustic model to generate aligned TextGrids

  2. Use case 2: You have a speech corpus and the language has a pretrained acoustic model and pretrained dictionary, but the dictionary’s coverage of your corpus is not great and the language has a pretrained G2P model.

    1. Follow Generating pronunciations for OOV items in a corpus to generate pronunciations for OOV words in the corpus

    2. Use the generated dictionary in Aligning a speech corpus with existing pronunciation dictionary and acoustic model to generate aligned TextGrids

  3. Use case 3: You have a speech corpus, the language has a pretrained acoustic model and pretrained G2P model, but it doesn’t have a pretrained dictionary.

    1. Follow Generating a pronunciation dictionary with a pretrained G2P model to generate a dictionary

    2. Use the generated dictionary in Aligning a speech corpus with existing pronunciation dictionary and acoustic model to generate aligned TextGrids

  4. Use case 4: You have a speech corpus and your own pronunciation dictionary, but there is no pretrained acoustic model for the language (or none that have the same phones as the pronunciation dictionary).

    1. Follow Training a new acoustic model on a corpus to generate aligned TextGrids

  5. Use case 5: You have a speech corpus and your own pronunciation dictionary, but it does not have great coverage of the words in the corpus.

    1. Follow Training a G2P model from a pronunciation dictionary to train a G2P model

    2. Use the trained G2P model in Generating a pronunciation dictionary with a pretrained G2P model to generate a pronunciation dictionary

    3. Use the generated pronunciation dictionary in Training a new acoustic model on a corpus to generate aligned TextGrids

  6. Use case 6: You have a speech corpus and the language has a pretrained acoustic model, but the language does not mark word boundaries in its orthography (and the language has a pretrained tokenizer model).

    1. Follow Tokenize a corpus to add word boundaries to tokenize the corpus

    2. Use the tokenized transcripts and follow Aligning a speech corpus with existing pronunciation dictionary and acoustic model

Aligning a speech corpus with existing pronunciation dictionary and acoustic model#

For the purposes of this example, we’ll use the “english_us_arpa” model, but the instructions will be applicable to any pretrained acoustic model/pronunciation dictionary pairing. We’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. Finally, we’ll assume that your speech corpus is stored in the folder ~/mfa_data/my_corpus, so when working with your data, this will be the main thing to update.

First we’ll need the pretrained acoustic model and dictionary. These are downloaded via the mfa model download command:

mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa

You should be able to run mfa model inspect acoustic english_us_arpa and it will output information about the english_us_arpa acoustic model.

Next, we want to make sure that the dataset is in the proper format for MFA, which is what the mfa validate command does:

mfa validate ~/mfa_data/my_corpus english_us_arpa english_us_arpa

This command will look through the corpus and make sure that MFA is parsing everything correctly. There are a couple of different types of Corpus formats and structure that MFA supports, but in general the core requirement is that you have pairs of sound files and transcription files with the same name (except for the extension). Look over the validator output and make sure that the number of speakers, files, and utterances matches your expectations, and that the number of Out of Vocabulary (OOV) items is not too high. If you want to generate pronunciations for these words so that they can be aligned, see Generating a pronunciation dictionary with a pretrained G2P model to make a new dictionary. The validator will also attempt to run feature generation and train a simple monophone model to make sure that everything works within Kaldi.
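For reference, here is a minimal sketch of one supported layout; the speaker and file names are hypothetical, and transcripts can be plain-text .lab/.txt files or .TextGrid files:

my_corpus/
    speaker_one/
        recording1.wav
        recording1.lab
        recording2.wav
        recording2.lab
    speaker_two/
        recording3.wav
        recording3.lab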

Once we’ve validated the data, we can align it via the mfa align command:

mfa align ~/mfa_data/my_corpus english_us_arpa english_us_arpa ~/mfa_data/my_corpus_aligned

If alignment is successful, you’ll see TextGrid files containing the aligned words and phones in the output directory (here ~/mfa_data/my_corpus_aligned). If there were issues in exporting the TextGrids, you’ll see them listed in the output directory. If your corpus is large, you’ll likely want to increase the number of jobs that MFA uses. For that and more advanced configuration, see Align with an acoustic model (mfa align).
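For example, a hypothetical run spreading the work over eight parallel jobs (the job count here is a placeholder to tune for your machine):

mfa align ~/mfa_data/my_corpus english_us_arpa english_us_arpa ~/mfa_data/my_corpus_aligned --num_jobs 8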

Note

Please see Example 1: Aligning LibriSpeech (English) for an example using toy data.

Generating a pronunciation dictionary with a pretrained G2P model#

For the purposes of this example, we’ll use the “english_us_arpa” model, but the instructions will be applicable to any pretrained G2P model. We’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. Finally, we’ll assume that your corpus is stored in the folder ~/mfa_data/my_corpus, so when working with your data, this will be the main thing to update.

First we’ll need the pretrained G2P model, which is downloaded via the mfa model download command:

mfa model download g2p english_us_arpa

You should be able to run mfa model inspect g2p english_us_arpa and it will output information about the english_us_arpa G2P model.

Depending on your use case, you might have a list of words to run G2P over, or just a corpus of sound and transcription files. The mfa g2p command can process either:

mfa g2p ~/mfa_data/my_corpus english_us_arpa ~/mfa_data/new_dictionary.txt  # If using a corpus
mfa g2p ~/mfa_data/my_word_list.txt english_us_arpa ~/mfa_data/new_dictionary.txt  # If using a word list

Running one of the above will output a text file pronunciation dictionary in the MFA dictionary format. I recommend looking over the generated pronunciations and making sure that they look sensible. For languages where the orthography is not transparent, it may be helpful to include --num_pronunciations 3 so that more pronunciations are generated than just the most likely one. For more details on running G2P, see Generate pronunciations for words (mfa g2p).
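For example, a hypothetical run requesting three candidate pronunciations per word:

mfa g2p ~/mfa_data/my_word_list.txt english_us_arpa ~/mfa_data/new_dictionary.txt --num_pronunciations 3

Each line of the output pairs a word with a whitespace-separated phone sequence, along the lines of the (illustrative) ARPA entry hello HH AH0 L OW1.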

From here you can use this dictionary file as input to any MFA command that takes a dictionary, e.g.

mfa align ~/mfa_data/my_corpus ~/mfa_data/new_dictionary.txt english_us_arpa ~/mfa_data/my_corpus_aligned

Note

Please see Example 2: Generate Mandarin dictionary for an example using toy data.

Generating pronunciations for OOV items in a corpus#

For the purposes of this example, we’ll use the “english_us_arpa” model, but the instructions will be applicable to any pretrained G2P model. We’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. Finally, we’ll assume that your corpus is stored in the folder ~/mfa_data/my_corpus, so when working with your data, this will be the main thing to update.

First we’ll need the pretrained G2P model, which is downloaded via the mfa model download command:

mfa model download g2p english_us_arpa

You should be able to run mfa model inspect g2p english_us_arpa and it will output information about the english_us_arpa G2P model.

Depending on your use case, you might have a list of words to run G2P over, or just a corpus of sound and transcription files. The mfa g2p command can process either:

mfa g2p ~/mfa_data/my_corpus english_us_arpa ~/mfa_data/g2pped_oovs.txt --dictionary_path english_us_arpa

Running the above will output a text file in the format that MFA uses (Pronunciation dictionary format) with all the OOV words (ignoring bracketed words like <cutoff>). I recommend looking over the generated pronunciations and making sure that they look sensible. For languages where the orthography is not transparent, it may be helpful to include --num_pronunciations 3 so that more pronunciations are generated than just the most likely one. For more details on running G2P, see Generate pronunciations for words (mfa g2p).
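As in the dictionary-generation workflow above, a hypothetical run requesting three candidate pronunciations per OOV word would simply add the flag:

mfa g2p ~/mfa_data/my_corpus english_us_arpa ~/mfa_data/g2pped_oovs.txt --dictionary_path english_us_arpa --num_pronunciations 3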

Once you have looked over the dictionary, you can save the new pronunciations via:

mfa model add_words english_us_arpa ~/mfa_data/g2pped_oovs.txt

The new pronunciations will be available when you use english_us_arpa as the dictionary path in an MFA command, e.g. re-running the command from Aligning a speech corpus with existing pronunciation dictionary and acoustic model:

mfa align ~/mfa_data/my_corpus english_us_arpa english_us_arpa ~/mfa_data/my_corpus_aligned

Warning

Please do look over the G2P results before adding them to the dictionary, at the very least to spot check. Especially for non-transparent orthography systems, words with unseen graphemes, homographs, etc., G2P can generate phonotactically illegal forms, so I do not recommend piping G2P output to alignment without human spot checking.

Training a new acoustic model on a corpus#

For the purposes of this example, we’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. We’ll assume that your speech corpus is stored in the folder ~/mfa_data/my_corpus and that you have a pronunciation dictionary at ~/mfa_data/my_dictionary.txt, so when working with your data, these paths will be the main thing to update.

The first thing we want to do is to make sure that the dataset is in the proper format for MFA, which is what the mfa validate command does:

mfa validate ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt

This command will look through the corpus and make sure that MFA is parsing everything correctly. There are a couple of different types of Corpus formats and structure that MFA supports, but in general the core requirement is that you have pairs of sound files and transcription files with the same name (except for the extension). Look over the validator output and make sure that the number of speakers, files, and utterances matches your expectations, and that the number of Out of Vocabulary (OOV) items is not too high. If you want to generate pronunciations for these words so that they can be aligned, see Training a G2P model from a pronunciation dictionary and Generating a pronunciation dictionary with a pretrained G2P model to make a new dictionary. The validator will also attempt to run feature generation and train a simple monophone model to make sure that everything works within Kaldi.

Once we’ve validated the data, we can train an acoustic model (and optionally export the aligned TextGrids) via the mfa train command:

mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/new_acoustic_model.zip  # Export just the trained acoustic model
mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/my_corpus_aligned  # Export just the training alignments
mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/new_acoustic_model.zip --output_directory ~/mfa_data/my_corpus_aligned  # Export both trained model and alignments

As with other commands, if your data is large, you’ll likely want to increase the number of jobs that MFA uses. For that and more advanced configuration of the training command, see Train a new acoustic model (mfa train).

If training was successful and you chose to export alignments, you’ll now see the TextGrids in the output directory. The TextGrid export is identical to what you would get by running mfa align with the trained acoustic model.

If you chose to export the acoustic model, you can now use it for other utilities and use cases, such as refining your pronunciation dictionary through Add probabilities to a dictionary (mfa train_dictionary) or transcribing new data with Transcribe audio files (mfa transcribe). If you would like to store the exported acoustic model for easy reference like the downloaded pretrained models, you can save it via mfa model save:

mfa model save acoustic ~/mfa_data/new_acoustic_model.zip

You can then run mfa model inspect on it:

mfa model inspect acoustic new_acoustic_model

Or use it as a reference in other MFA commands.

Training a G2P model from a pronunciation dictionary#

For the purposes of this example, we’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. Finally, we’ll assume that your pronunciation dictionary is stored as ~/mfa_data/my_dictionary.txt and that it fits the Pronunciation dictionary format.

To train the G2P model, we use the mfa train_g2p command:

mfa train_g2p ~/mfa_data/my_dictionary.txt ~/mfa_data/my_g2p_model.zip

As with other commands, if your dictionary is large, you’ll likely want to increase the number of jobs that MFA uses. For that and more advanced configuration of the training command, see Train a new G2P model (mfa train_g2p).

Once the G2P model is trained, you should see the exported archive at the specified path. From here, we can save it for future use, or use the full path directly when generating pronunciations for new words.

mfa model save g2p ~/mfa_data/my_g2p_model.zip

mfa g2p ~/mfa_data/my_new_word_list.txt my_g2p_model ~/mfa_data/my_new_dictionary.txt

# Or

mfa g2p ~/mfa_data/my_new_word_list.txt ~/mfa_data/my_g2p_model.zip ~/mfa_data/my_new_dictionary.txt

Take a look at Generating a pronunciation dictionary with a pretrained G2P model with this new model for a more detailed walk-through of generating a dictionary.

Note

Please see Example 3: Train Mandarin G2P model for an example using toy data.

Tokenize a corpus to add word boundaries#

For the purposes of this example, we’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. Finally, we’ll assume that your corpus is in Japanese and is stored in the folder ~/mfa_data/my_corpus, so when working with your data, this will be the main thing to update.

To tokenize the Japanese text to add spaces, first download the Japanese tokenizer model via:

mfa model download tokenizer japanese_mfa

Once you have the model downloaded, you can tokenize your corpus via:

mfa tokenize ~/mfa_data/my_corpus japanese_mfa ~/mfa_data/tokenized_version

You can check the tokenized text in ~/mfa_data/tokenized_version, verify that it looks good, and copy the files to replace the untokenized files in ~/mfa_data/my_corpus for use in alignment.
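One way to do that copy, assuming the tokenized output mirrors the corpus layout and contains only transcript files (both assumptions about your particular setup), is to back up the originals and then overwrite them:

cp -r ~/mfa_data/my_corpus ~/mfa_data/my_corpus_backup  # keep a backup of the untokenized transcripts
cp -r ~/mfa_data/tokenized_version/. ~/mfa_data/my_corpus/  # overwrite transcripts with the tokenized versions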

Warning

MFA’s tokenizer models are nowhere near state of the art, and I recommend using dedicated external tokenizers where they make sense (e.g. nagisa for Japanese).

Such external tokenizers were used in the initial construction of the training corpora for MFA, though the training segmentations for Japanese have begun to diverge from nagisa, since it breaks phonological words into morphological parses where, for the purposes of acoustic model training and alignment, it makes more sense not to split (nagisa: 使っ て [ts ɨ k a Q t e] vs mfa: 使って [ts ɨ k a tː e]). The MFA tokenizer models are provided as an easy start-up path, since the external tokenizers may have extra dependencies and platform restrictions.