First steps#

The mfa command line utility has grown over the years to encompass a number of utility functions. This section aims to provide a path for first-time users to figure out the workflow that works best for them.

Also check out External tutorials for community-written tutorials and blog posts on specific topics.

Use cases#

There are several broad use cases that you might want to use MFA for. Take a look below; if any of them closely matches your situation, you should be able to apply the linked instructions to your data.

  1. Use case 1: You have a speech corpus, the language has a pretrained acoustic model and pretrained dictionary.

    1. Follow Aligning a speech corpus with existing pronunciation dictionary and acoustic model to generate aligned TextGrids

  2. Use case 2: You have a speech corpus and the language has a pretrained acoustic model and pretrained dictionary, but the dictionary’s coverage of your corpus is not great and the language has a pretrained G2P model.

    1. Follow Generating pronunciations for OOV items in a corpus to generate pronunciations for OOV words in the corpus

    2. Use the generated dictionary in Aligning a speech corpus with existing pronunciation dictionary and acoustic model to generate aligned TextGrids

  3. Use case 3: You have a speech corpus, the language has a pretrained acoustic model and pretrained G2P model, but it doesn’t have a pretrained dictionary.

    1. Follow Generating a pronunciation dictionary with a pretrained G2P model to generate a dictionary

    2. Use the generated dictionary in Aligning a speech corpus with existing pronunciation dictionary and acoustic model to generate aligned TextGrids

  4. Use case 4: You have a speech corpus and your own pronunciation dictionary, but there is no pretrained acoustic model for the language (or none that have the same phones as the pronunciation dictionary).

    1. Follow Training a new acoustic model on a corpus to generate aligned TextGrids

  5. Use case 5: You have a speech corpus and your own pronunciation dictionary, but it does not have great coverage of the words in the corpus.

    1. Follow Training a G2P model from a pronunciation dictionary to train a G2P model

    2. Use the trained G2P model in Generating a pronunciation dictionary with a pretrained G2P model to generate a pronunciation dictionary

    3. Use the generated pronunciation dictionary in Training a new acoustic model on a corpus to generate aligned TextGrids

  6. Use case 6: You have a speech corpus and the language has a pretrained acoustic model, but the language does not mark word boundaries in its orthography (and the language has a pretrained tokenizer model).

    1. Follow Tokenize a corpus to add word boundaries to tokenize the corpus

    2. Use the tokenized transcripts and follow Aligning a speech corpus with existing pronunciation dictionary and acoustic model

Aligning a speech corpus with existing pronunciation dictionary and acoustic model#

For the purposes of this example, we’ll use the “english_us_arpa” model, but the instructions will be applicable to any pretrained acoustic model/pronunciation dictionary pairing. We’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. Finally, we’ll assume that your speech corpus is stored in the folder ~/mfa_data/my_corpus, so when working with your data, this will be the main thing to update.

First we’ll need the pretrained acoustic model and dictionary. These are downloaded via the mfa model download command:

mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa

You should be able to run mfa model inspect acoustic english_us_arpa and it will output information about the english_us_arpa acoustic model.

Next, we want to make sure that the dataset is in the proper format for MFA, which is what the mfa validate command does:

mfa validate ~/mfa_data/my_corpus english_us_arpa english_us_arpa

This command will look through the corpus and make sure that MFA is parsing everything correctly. There are a couple of different types of Corpus formats and structure that MFA supports, but in general the core requirement is that you have pairs of sound files and transcription files with the same name (except for the extension). Look over the validator output and make sure that the number of speakers, files, and utterances matches your expectations, and that the number of Out of Vocabulary (OOV) items is not too high. If you want to generate pronunciations for these words so that they can be aligned, see Generating a pronunciation dictionary with a pretrained G2P model to make a new dictionary. The validator will also attempt to run feature generation and train a simple monophone model to make sure that everything works within Kaldi.
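For reference, here is a minimal sketch of one supported layout; the speaker and file names are hypothetical, and transcripts can be plain-text .lab/.txt files or .TextGrid files:

my_corpus/
    speaker_one/
        recording1.wav
        recording1.lab
        recording2.wav
        recording2.lab
    speaker_two/
        recording3.wav
        recording3.lab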

Once we’ve validated the data, we can align it via the mfa align command:

mfa align ~/mfa_data/my_corpus english_us_arpa english_us_arpa ~/mfa_data/my_corpus_aligned

If alignment is successful, you’ll see TextGrid files containing the aligned words and phones in the output directory (here ~/mfa_data/my_corpus_aligned). If there were issues in exporting the TextGrids, you’ll see them listed in the output directory. If your corpus is large, you’ll likely want to increase the number of jobs that MFA uses. For that and more advanced configuration, see Align with an acoustic model (mfa align).
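For example, a hypothetical run spreading the work over eight parallel jobs (the job count here is a placeholder to tune for your machine):

mfa align ~/mfa_data/my_corpus english_us_arpa english_us_arpa ~/mfa_data/my_corpus_aligned --num_jobs 8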

Note

Please see Example 1: Aligning LibriSpeech (English) for an example using toy data.

Generating a pronunciation dictionary with a pretrained G2P model#

For the purposes of this example, we’ll use the “english_us_arpa” model, but the instructions will be applicable to any pretrained G2P model. We’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. Finally, we’ll assume that your corpus is stored in the folder ~/mfa_data/my_corpus, so when working with your data, this will be the main thing to update.

First we’ll need the pretrained G2P model, which is downloaded via the mfa model download command:

mfa model download g2p english_us_arpa

You should be able to run mfa model inspect g2p english_us_arpa and it will output information about the english_us_arpa G2P model.

Depending on your use case, you might have a list of words to run G2P over, or just a corpus of sound and transcription files. The mfa g2p command can process either:

mfa g2p ~/mfa_data/my_corpus english_us_arpa ~/mfa_data/new_dictionary.txt  # If using a corpus
mfa g2p ~/mfa_data/my_word_list.txt english_us_arpa ~/mfa_data/new_dictionary.txt  # If using a word list

Running one of the above will output a text file pronunciation dictionary in the MFA dictionary format. I recommend looking over the generated pronunciations and making sure that they look sensible. For languages where the orthography is not transparent, it may be helpful to include --num_pronunciations 3 so that more pronunciations are generated than just the most likely one. For more details on running G2P, see Generate pronunciations for words (mfa g2p).
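For example, a hypothetical run requesting three candidate pronunciations per word:

mfa g2p ~/mfa_data/my_word_list.txt english_us_arpa ~/mfa_data/new_dictionary.txt --num_pronunciations 3

Each line of the output pairs a word with a whitespace-separated phone sequence, along the lines of the (illustrative) ARPA entry hello HH AH0 L OW1.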

From here you can use this dictionary file as input to any MFA command that takes a dictionary, e.g.

mfa align ~/mfa_data/my_corpus ~/mfa_data/new_dictionary.txt english_us_arpa ~/mfa_data/my_corpus_aligned

Note

Please see Example 2: Generate Mandarin dictionary for an example using toy data.

Generating pronunciations for OOV items in a corpus#

For the purposes of this example, we’ll use the “english_us_arpa” model, but the instructions will be applicable to any pretrained G2P model. We’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. Finally, we’ll assume that your corpus is stored in the folder ~/mfa_data/my_corpus, so when working with your data, this will be the main thing to update.

First we’ll need the pretrained G2P model, which is downloaded via the mfa model download command:

mfa model download g2p english_us_arpa

You should be able to run mfa model inspect g2p english_us_arpa and it will output information about the english_us_arpa G2P model.

Depending on your use case, you might have a list of words to run G2P over, or just a corpus of sound and transcription files. The mfa g2p command can process either:

mfa g2p ~/mfa_data/my_corpus english_us_arpa ~/mfa_data/g2pped_oovs.txt --dictionary_path english_us_arpa

Running the above will output a text file in the format that MFA uses (Pronunciation dictionary format) with all the OOV words (ignoring bracketed words like <cutoff>). I recommend looking over the generated pronunciations and making sure that they look sensible. For languages where the orthography is not transparent, it may be helpful to include --num_pronunciations 3 so that more pronunciations are generated than just the most likely one. For more details on running G2P, see Generate pronunciations for words (mfa g2p).
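As in the dictionary-generation workflow above, a hypothetical run requesting three candidate pronunciations per OOV word would simply add the flag:

mfa g2p ~/mfa_data/my_corpus english_us_arpa ~/mfa_data/g2pped_oovs.txt --dictionary_path english_us_arpa --num_pronunciations 3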

Once you have looked over the dictionary, you can save the new pronunciations via:

mfa model add_words english_us_arpa ~/mfa_data/g2pped_oovs.txt

The new pronunciations will be available when you use english_us_arpa as the dictionary path in an MFA command, e.g. re-running the command from Aligning a speech corpus with existing pronunciation dictionary and acoustic model:

mfa align ~/mfa_data/my_corpus english_us_arpa english_us_arpa ~/mfa_data/my_corpus_aligned

Warning

Please do look over the G2P results before adding them to the dictionary, at the very least to spot check. Especially for non-transparent orthography systems, words with unseen graphemes, homographs, etc., G2P can generate phonotactically illegal forms, so I do not recommend piping G2P output to alignment without human spot checking.

Training a new acoustic model on a corpus#

For the purposes of this example, we’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. We’ll assume that your speech corpus is stored in the folder ~/mfa_data/my_corpus and that you have a pronunciation dictionary at ~/mfa_data/my_dictionary.txt, so when working with your data, these paths will be the main thing to update.

The first thing we want to do is to make sure that the dataset is in the proper format for MFA, which is what the mfa validate command does:

mfa validate ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt

This command will look through the corpus and make sure that MFA is parsing everything correctly. There are a couple of different types of Corpus formats and structure that MFA supports, but in general the core requirement is that you have pairs of sound files and transcription files with the same name (except for the extension). Look over the validator output and make sure that the number of speakers, files, and utterances matches your expectations, and that the number of Out of Vocabulary (OOV) items is not too high. If you want to generate pronunciations for these words so that they can be aligned, see Training a G2P model from a pronunciation dictionary and Generating a pronunciation dictionary with a pretrained G2P model to make a new dictionary. The validator will also attempt to run feature generation and train a simple monophone model to make sure that everything works within Kaldi.

Once we’ve validated the data, we can train an acoustic model (and optionally export the aligned TextGrids) via the mfa train command:

mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/new_acoustic_model.zip  # Export just the trained acoustic model
mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/my_corpus_aligned  # Export just the training alignments
mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/new_acoustic_model.zip --output_directory ~/mfa_data/my_corpus_aligned  # Export both trained model and alignments

As with other commands, if your data is large, you’ll likely want to increase the number of jobs that MFA uses. For that and more advanced configuration of the training command, see Train a new acoustic model (mfa train).

If training was successful and you chose to export alignments, you’ll now see the TextGrids in the output directory. The TextGrid export is identical to what you would get by running mfa align with the trained acoustic model.

If you chose to export the acoustic model, you can now use it for other utilities and use cases, such as refining your pronunciation dictionary through Add probabilities to a dictionary (mfa train_dictionary) or transcribing new data with Transcribe audio files (mfa transcribe). If you would like to store the exported acoustic model for easy reference like the downloaded pretrained models, you can save it via mfa model save:

mfa model save acoustic ~/mfa_data/new_acoustic_model.zip

You can then run mfa model inspect on it:

mfa model inspect acoustic new_acoustic_model

Or use it as a reference in other MFA commands.

Training a G2P model from a pronunciation dictionary#

For the purposes of this example, we’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. Finally, we’ll assume that your pronunciation dictionary is stored as ~/mfa_data/my_dictionary.txt and that it fits the Pronunciation dictionary format.

To train the G2P model, we use the mfa train_g2p command:

mfa train_g2p ~/mfa_data/my_dictionary.txt ~/mfa_data/my_g2p_model.zip

As with other commands, if your dictionary is large, you’ll likely want to increase the number of jobs that MFA uses. For that and more advanced configuration of the training command, see Train a new G2P model (mfa train_g2p).

Once the G2P model is trained, you should see the exported archive at the specified path. From here, we can save it for future use, or use the full path directly when generating pronunciations for new words.

mfa model save g2p ~/mfa_data/my_g2p_model.zip

mfa g2p ~/mfa_data/my_new_word_list.txt my_g2p_model ~/mfa_data/my_new_dictionary.txt

# Or

mfa g2p ~/mfa_data/my_new_word_list.txt ~/mfa_data/my_g2p_model.zip ~/mfa_data/my_new_dictionary.txt

Take a look at Generating a pronunciation dictionary with a pretrained G2P model with this new model for a more detailed walk-through of generating a dictionary.

Note

Please see Example 3: Train Mandarin G2P model for an example using toy data.

Tokenize a corpus to add word boundaries#

For the purposes of this example, we’ll also assume that you have done nothing else with MFA other than follow the Installation instructions and you have the mfa command working. Finally, we’ll assume that your corpus is in Japanese and is stored in the folder ~/mfa_data/my_corpus, so when working with your data, this will be the main thing to update.

To tokenize the Japanese text to add spaces, first download the Japanese tokenizer model via:

mfa model download tokenizer japanese_mfa

Once you have the model downloaded, you can tokenize your corpus via:

mfa tokenize ~/mfa_data/my_corpus japanese_mfa ~/mfa_data/tokenized_version

You can check the tokenized text in ~/mfa_data/tokenized_version, verify that it looks good, and copy the files to replace the untokenized files in ~/mfa_data/my_corpus for use in alignment.
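One way to do that copy, assuming the tokenized output mirrors the corpus layout and contains only transcript files (both assumptions about your particular setup), is to back up the originals and then overwrite them:

cp -r ~/mfa_data/my_corpus ~/mfa_data/my_corpus_backup  # keep a backup of the untokenized transcripts
cp -r ~/mfa_data/tokenized_version/. ~/mfa_data/my_corpus/  # overwrite transcripts with the tokenized versions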

Warning

MFA’s tokenizer models are nowhere near state of the art, and I recommend using dedicated external tokenizers where they make sense (e.g. nagisa for Japanese).

Such external tokenizers were used in the initial construction of the training corpora for MFA, though the training segmentations for Japanese have begun to diverge from nagisa, since it breaks phonological words into morphological parses where, for the purposes of acoustic model training and alignment, it makes more sense not to split (nagisa: 使っ て [ts ɨ k a Q t e] vs mfa: 使って [ts ɨ k a tː e]). The MFA tokenizer models are provided as an easy start-up path, since the external tokenizers may have extra dependencies and platform restrictions.