Corpus creation utilities
MFA now contains several command-line utilities to help create corpora from scratch. The main workflow is as follows:
If the corpus is made up of long sound files that need segmenting, segment the audio files using voice activity detection (VAD)
If the corpus does not contain transcriptions, transcribe utterances using existing acoustic models, language models, and dictionaries
Use the Anchor annotator tool to manually correct errors in the transcriptions
As necessary, bootstrap better transcriptions:
Train a language model on the updated transcriptions
Add pronunciation and silence probabilities to the dictionary
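
The conditions in the workflow above can be sketched as a small decision helper. This is only an illustration of which steps apply to a given corpus. `CorpusState` and `plan_workflow` are hypothetical names invented for this sketch. They are not part of MFA, and the code does not invoke any MFA command.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorpusState:
    """Hypothetical description of a corpus (not an MFA data structure)."""
    has_long_sound_files: bool                 # sound files still need segmenting
    has_transcriptions: bool                   # transcriptions already exist
    needs_better_transcriptions: bool = False  # a bootstrapping pass is wanted


def plan_workflow(state: CorpusState) -> List[str]:
    """Return, in order, the workflow steps that apply to this corpus."""
    steps = []
    if state.has_long_sound_files:
        steps.append("segment audio files using VAD")
    if not state.has_transcriptions:
        steps.append("transcribe utterances with existing models and dictionaries")
    # Manual correction in Anchor is listed unconditionally in the workflow.
    steps.append("correct transcriptions in Anchor")
    if state.needs_better_transcriptions:
        steps.append("train language model with updated transcriptions")
        steps.append("add pronunciation and silence probabilities to the dictionary")
    return steps


# Example: long recordings with no transcriptions, no bootstrapping requested.
for step in plan_workflow(CorpusState(has_long_sound_files=True, has_transcriptions=False)):
    print(step)
```

A corpus that is already segmented and transcribed reduces to the single manual-correction step, and the two bootstrapping steps are added only when a further pass is requested.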