.. _corpus_creation:

*************************
Corpus creation utilities
*************************

MFA now contains several command line utilities to help create corpora from scratch. The main workflow is as follows:

1. If the corpus is made up of long sound files that need segmenting, :ref:`segment the audio files using VAD `
2. If the corpus does not contain transcriptions, :ref:`transcribe utterances using existing acoustic models, language models, and dictionaries `
3. Use the :ref:`Anchor annotator tool ` to manually correct errors in the transcriptions
4. As necessary, bootstrap better transcriptions:

   1. :ref:`Train language model ` with updated transcriptions
   2. :ref:`Add pronunciation and silence probabilities to the dictionary `

.. toctree::
   :hidden:

   create_segments
   train_ivector
   diarize_speakers
   transcribing
   training_lm
   training_dictionary
   tokenize
   train_tokenizer
   anchor