Corpus creation utilities

MFA now contains several command-line utilities to help create corpora from scratch. The main workflow is as follows:

  1. If the corpus is made up of long sound files that need segmenting, segment the audio files using voice activity detection (VAD)

  2. If the corpus does not contain transcriptions, transcribe utterances using existing acoustic models, language models, and dictionaries

  3. Use the Anchor annotator tool to manually correct errors in the transcriptions

  4. As necessary, bootstrap better transcriptions:

    1. Train a language model with the updated transcriptions

    2. Add pronunciation and silence probabilities to the dictionary
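
The conditional structure of the workflow above can be sketched as a small planning function. This is only an illustration of the decision logic, not part of MFA: the function name `plan_corpus_steps`, its parameters, and the step labels are hypothetical, and it does not call any MFA command.

```python
from typing import List


def plan_corpus_steps(
    has_long_recordings: bool,
    has_transcriptions: bool,
    bootstrap_rounds: int = 0,
) -> List[str]:
    """Return the ordered workflow steps for a corpus.

    Hypothetical helper: the step labels are plain descriptions of the
    numbered list above, not MFA command names.
    """
    steps: List[str] = []

    # Step 1: only needed when the corpus consists of long sound files.
    if has_long_recordings:
        steps.append("segment audio with VAD")

    # Step 2: only needed when the corpus has no transcriptions yet.
    if not has_transcriptions:
        steps.append("transcribe utterances")

    # Step 3: manual correction in Anchor.
    steps.append("correct transcriptions in Anchor")

    # Step 4: optional bootstrapping, repeated as often as necessary.
    for _ in range(bootstrap_rounds):
        steps.append("train language model")
        steps.append("add pronunciation and silence probabilities")

    return steps


if __name__ == "__main__":
    # Long, untranscribed recordings with one bootstrapping round.
    for number, step in enumerate(plan_corpus_steps(True, False, 1), start=1):
        print(f"{number}. {step}")
```

A corpus that is already segmented and transcribed skips the first two steps and starts at manual correction; the number of bootstrapping rounds is left to the user, matching the "as necessary" wording above.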