Transcribe audio files (mfa transcribe)#

MFA has some limited ability to use its acoustic and language models for performing transcription. The intent of this functionality is largely to aid in offline corpus construction, and not as an online capability like most ASR systems.

See also

See Train a new acoustic model (mfa train) and Train a new language model (mfa train_lm) for details on training MFA models to use in transcription.

Unlike alignment, transcription does not require transcribed audio files (except when running in evaluation mode). Instead, it uses the combination of acoustic model, language model, and pronunciation dictionary to create a decoding lattice and find the best path through it. When training a language model for transcription, it is recommended to train on text or speech transcripts from the same domain as the target corpus, to minimize errors.

Warning

The technology that MFA uses is several years out of date, so if you have other options available, such as Coqui or other production STT systems, we recommend using those. The transcription capability is included mainly for completeness.

Evaluation mode#

Transcriptions can be compared to gold-standard references by transcribing a corpus in the same format as for alignment (i.e., each sound file has a corresponding TextGrid or lab file). Transcription will proceed as above, and the resulting transcripts will then be aligned with the gold transcriptions using the Bio.pairwise2 alignment algorithm. From the aligned transcripts, word error rate (WER) and character error rate (CER) will be calculated for each utterance as follows:

\[Error \: rate = \frac{insertions + deletions + (2 * substitutions)} {length_{ref}}\]
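The formula above can be sketched as an edit-distance dynamic program in which a substitution costs twice as much as an insertion or deletion. The function below is illustrative, not MFA's actual implementation; passing word tokens gives a word error rate, and passing character lists gives a character error rate.

```python
def weighted_error_rate(ref, hyp):
    """Edit distance with substitutions weighted 2x (insertions + deletions
    + 2 * substitutions), normalized by the reference length."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimum weighted cost of aligning ref[:i] with hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions
    for j in range(n + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 2
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + sub_cost,  # match or substitution
            )
    return dp[m][n] / m if m else float(n > 0)

# WER on word tokens: one substitution out of three reference words
wer = weighted_error_rate("a b c".split(), "a x c".split())  # 2/3

# CER on characters: one deletion out of three reference characters
cer = weighted_error_rate(list("abc"), list("ac"))  # 1/3
```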

Command reference#

mfa transcribe#

Transcribe utterances using an acoustic model, language model, and pronunciation dictionary.

mfa transcribe [OPTIONS] CORPUS_DIRECTORY DICTIONARY_PATH ACOUSTIC_MODEL_PATH
               LANGUAGE_MODEL_PATH OUTPUT_DIRECTORY
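A hedged example invocation follows; all paths below are hypothetical placeholders, to be replaced with your own corpus directory and model files.

```shell
# Hypothetical paths; substitute your own corpus and models.
mfa transcribe ~/corpus ~/models/english.dict \
    ~/models/english_acoustic.zip ~/models/english_lm.zip \
    ~/transcribed --output_type transcription
```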

Options

-c, --config_path <config_path>#

Path to config file to use for transcription.

-s, --speaker_characters <speaker_characters>#

Number of characters of file names to use for determining speaker, default is to use directory names.
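A minimal sketch of this option's behavior (the helper below is hypothetical and assumes POSIX-style paths): without a value, the speaker is taken from the containing directory name; with a value N, from the first N characters of the file name.

```python
import os

def speaker_for_file(path, speaker_characters=None):
    """Illustrative only: derive a speaker label the way the
    --speaker_characters option is described as working."""
    if speaker_characters:
        # First N characters of the file name
        return os.path.basename(path)[:speaker_characters]
    # Default: use the containing directory name
    return os.path.basename(os.path.dirname(path))

speaker_for_file("corpus/speaker1/utt01.wav")      # "speaker1"
speaker_for_file("corpus/s01_utt01.wav", 3)        # "s01"
```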

-a, --audio_directory <audio_directory>#

Audio directory root to use for finding audio files.

--output_type <output_type>#

Flag for outputting transcription text or alignments.

Options:

transcription | alignment

--output_format <output_format>#

Format for aligned output files (default is long_textgrid).

Options:

long_textgrid | short_textgrid | json | csv

--evaluate#

Evaluate the transcription against golden texts.

--include_original_text#

Flag to include original utterance text in the output.

--language_model_weight <language_model_weight>#

Specific language model weight to use in evaluating transcriptions, defaults to 16.

--word_insertion_penalty <word_insertion_penalty>#

Specific word insertion penalty between 0.0 and 1.0 to use in evaluating transcription, defaults to 1.0.

-p, --profile <profile>#

Configuration profile to use, defaults to “global”

-t, --temporary_directory <temporary_directory>#

Set the default temporary directory, default is /home/docs/Documents/MFA

-j, --num_jobs <num_jobs>#

Set the number of processes to use by default, defaults to 3

--clean, --no_clean#

Remove files from previous runs, default is False

-v, --verbose, -nv, --no_verbose#

Output debug messages, default is False

-q, --quiet, -nq, --no_quiet#

Suppress all output messages (overrides verbose), default is False

--overwrite, --no_overwrite#

Overwrite output files when they exist, default is False

--use_mp, --no_use_mp#

Turn on/off multiprocessing. Multiprocessing is recommended and will allow for faster execution.

--use_threading, --no_use_threading#

Use the threading library rather than the multiprocessing library. Multiprocessing is recommended and will allow for faster execution.

-d, --debug, -nd, --no_debug#

Run extra steps for debugging issues, default is False

--use_postgres, --no_use_postgres#

Use postgres instead of sqlite for extra functionality, default is False

--single_speaker#

Single speaker mode creates multiprocessing splits based on utterances rather than speakers. This mode also disables speaker adaptation, equivalent to --uses_speaker_adaptation false.

--textgrid_cleanup, --cleanup_textgrids, --no_textgrid_cleanup, --no_cleanup_textgrids#

Turn on/off post-processing of TextGrids that cleans up silences and recombines compound words and clitics.

-h, --help#

Show this message and exit.

Arguments

CORPUS_DIRECTORY#

Required argument

DICTIONARY_PATH#

Required argument

ACOUSTIC_MODEL_PATH#

Required argument

LANGUAGE_MODEL_PATH#

Required argument

OUTPUT_DIRECTORY#

Required argument

Configuration reference#

API reference#