Corpus creation utilities

MFA now contains several command-line utilities to help create corpora from scratch. The main workflow is as follows:

  1. If the corpus is made up of long sound files that need segmenting, segment them (Create segments)
  2. If the corpus does not contain transcriptions, transcribe utterances using existing acoustic models, language models, and dictionaries (Running the transcriber)
  3. Use the annotator tool to fix up any errors (Annotator)
  4. As necessary, bootstrap better transcriptions:
    1. Retrain the language model on the newly corrected transcriptions (Training language models)
    2. Train dictionary pronunciation probabilities (Modeling pronunciation probabilities)
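
The branching in the workflow above can be sketched as a small planning function. This is only an illustration of the decision logic, not part of MFA: the function name `plan_corpus_workflow` and its parameters are hypothetical, and the returned strings are the section titles referenced in the list, not command names.

```python
def plan_corpus_workflow(needs_segmenting, has_transcriptions, bootstrap=False):
    """Return the documentation sections to follow, in order.

    Hypothetical helper for illustration only; it does not call MFA.
    """
    steps = []
    # Step 1: only long sound files need to be split into segments.
    if needs_segmenting:
        steps.append("Create segments")
    # Step 2: transcribe only when the corpus has no transcriptions yet.
    if not has_transcriptions:
        steps.append("Running the transcriber")
    # Step 3: errors are always reviewed in the annotator.
    steps.append("Annotator")
    # Step 4: optional bootstrapping from the corrected transcriptions.
    if bootstrap:
        steps.append("Training language models")
        steps.append("Modeling pronunciation probabilities")
    return steps


if __name__ == "__main__":
    for step in plan_corpus_workflow(True, False, bootstrap=True):
        print(step)
```

A corpus that is already segmented and transcribed would go straight to the annotator step, and bootstrapping is applied only when better transcriptions are needed.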