Find OOVs in a corpus (mfa find_oovs)

The mfa find_oovs command is a utility that lists the out-of-vocabulary items (OOVs) for a given corpus and pronunciation dictionary, that is, the words in the corpus that have no entry in the dictionary. For each OOV it reports how often it occurs in the corpus and which utterances it appears in.
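To make the reported information concrete, here is a minimal Python sketch of the idea of an OOV report. It is not MFA's implementation: the function name `find_oovs`, the in-memory inputs, and the lowercase-and-split-on-whitespace tokenization are all assumptions made for illustration, and MFA's own text normalization is more involved.

```python
from collections import Counter, defaultdict


def find_oovs(utterances, vocabulary):
    """Return (counts, locations) for words that are missing from `vocabulary`.

    utterances: dict mapping an utterance id to its transcript text
    vocabulary: set of words that have a pronunciation
    Hypothetical helper; tokenization here is a plain lowercase whitespace split.
    """
    counts = Counter()
    locations = defaultdict(list)
    for utt_id, text in utterances.items():
        for word in text.lower().split():
            if word not in vocabulary:
                counts[word] += 1
                # Record each utterance only once per word
                if utt_id not in locations[word]:
                    locations[word].append(utt_id)
    return counts, dict(locations)


# Toy data, invented for this example
utterances = {
    "utt1": "the cat sat on the zorp",
    "utt2": "a zorp and a blick",
}
vocabulary = {"the", "cat", "sat", "on", "a", "and"}

counts, locations = find_oovs(utterances, vocabulary)
for word, n in counts.most_common():
    print(word, n, locations[word])
```

In this toy corpus the two words missing from the vocabulary are reported with their counts and the utterances that contain them, which mirrors the kind of information the command writes out.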

Note

This command is functionally the same as using the corpus validator, but it reports the OOV information more directly.

Command reference

mfa find_oovs

Check for OOVs in a corpus

Usage

mfa find_oovs [OPTIONS] CORPUS_DIRECTORY DICTIONARY_PATH OUTPUT_DIRECTORY

Options

-c, --config_path <config_path>

Path to the config file to use.

-s, --speaker_characters <speaker_characters>

Number of characters of file names to use for determining the speaker. The default is to use directory names.

-a, --audio_directory <audio_directory>

Audio directory root to use for finding audio files.

-p, --profile <profile>

Configuration profile to use; defaults to “global”.

-t, --temporary_directory <temporary_directory>

Set the default temporary directory, default is /home/docs/Documents/MFA

-j, --num_jobs <num_jobs>

Set the number of processes to use by default, defaults to 3

--clean, --no_clean

Remove files from previous runs, default is True

--final_clean, --no_final_clean

Remove temporary files at the end of run, default is False

-v, --verbose, -nv, --no_verbose

Output debug messages, default is False

-q, --quiet, -nq, --no_quiet

Suppress all output messages (overrides verbose), default is False

--overwrite, --no_overwrite

Overwrite output files when they exist, default is False

--use_mp, --no_use_mp

Turn multiprocessing on or off. Multiprocessing is recommended and allows for faster execution.

--use_threading, --no_use_threading

Use the threading library rather than the multiprocessing library. Multiprocessing is recommended and allows for faster execution.

-d, --debug, -nd, --no_debug

Run extra steps for debugging issues, default is False

--use_postgres, --no_use_postgres

Use postgres instead of sqlite for extra functionality, default is False

--single_speaker

Single speaker mode creates multiprocessing splits based on utterances rather than speakers. This mode also disables speaker adaptation, equivalent to --uses_speaker_adaptation false.

--textgrid_cleanup, --cleanup_textgrids, --no_textgrid_cleanup, --no_cleanup_textgrids

Turn on/off post-processing of TextGrids that cleans up silences and recombines compound words and clitics.

-h, --help

Show this message and exit.

Arguments

CORPUS_DIRECTORY

Required argument

DICTIONARY_PATH

Required argument

OUTPUT_DIRECTORY

Required argument
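
As an illustration of how the usage line and options fit together, the following Python sketch assembles an argument list for the command. The helper name `build_find_oovs_command` and the paths `my_corpus`, `my_dictionary.dict`, and `oov_output` are hypothetical; only the flags and the argument order are taken from the reference above. The sketch builds and prints the command without running it.

```python
import shlex


def build_find_oovs_command(corpus_dir, dictionary_path, output_dir,
                            num_jobs=None, clean=None, single_speaker=False):
    """Assemble an argv list for `mfa find_oovs` (hypothetical helper).

    Options come first, then the three required positional arguments in the
    order given by the usage line: corpus, dictionary, output directory.
    """
    cmd = ["mfa", "find_oovs"]
    if num_jobs is not None:
        cmd += ["--num_jobs", str(num_jobs)]
    if clean is not None:
        # --clean and --no_clean are the paired on/off forms of the flag
        cmd.append("--clean" if clean else "--no_clean")
    if single_speaker:
        cmd.append("--single_speaker")
    cmd += [str(corpus_dir), str(dictionary_path), str(output_dir)]
    return cmd


# Placeholder paths, not real files
cmd = build_find_oovs_command(
    "my_corpus", "my_dictionary.dict", "oov_output", num_jobs=4, clean=True
)
print(shlex.join(cmd))
```

If MFA is installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.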