Validating data

The validation utility performs the same basic setup that alignment would perform, but instead of aligning it analyzes the corpus and reports any issues that the user may want to fix.

First, the utility parses the corpus and dictionary, prints out summary information about the corpus, and logs any of the following issues:

  • If there are any words in transcriptions that are not in the dictionary, these are logged as out-of-vocabulary items (OOVs). Lists of these OOVs and of the utterances they appear in are saved to text files.

  • Any issues reading sound files

  • Any issues generating features, skipped if --ignore_acoustics is flagged

  • Mismatches between sound files and transcriptions

  • Any issues reading transcription files

  • Any unaligned files from the trial alignment run, skipped if --ignore_acoustics is flagged. If no acoustic model is specified, a monophone model is trained for testing alignment.

  • Any files whose decoded transcriptions deviate from their original transcriptions, checked with a simple language model when --test_transcriptions is supplied. Ngram language models for each speaker are generated and merged with models for each utterance for use in decoding utterances, which may help you find transcription or data inconsistency issues in the corpus.
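The OOV check in the first item of the list can be pictured with a minimal Python sketch. This is not MFA's implementation: the function name `find_oovs`, the in-memory data layout, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import defaultdict


def find_oovs(transcriptions, dictionary_words):
    """Map each out-of-vocabulary word to the utterances it appears in.

    transcriptions: dict of utterance id -> transcription text (hypothetical layout)
    dictionary_words: set of words present in the pronunciation dictionary
    """
    oovs = defaultdict(list)
    for utterance_id, text in transcriptions.items():
        # Assumption: words are whitespace-separated and compared in lowercase
        for word in text.lower().split():
            if word not in dictionary_words and utterance_id not in oovs[word]:
                oovs[word].append(utterance_id)
    return dict(oovs)


# Made-up example data
transcriptions = {
    "utt1": "the cat sat",
    "utt2": "the blorp sat on the blorp",
    "utt3": "a blorp and a zib",
}
dictionary_words = {"the", "cat", "sat", "on", "a", "and"}
oovs = find_oovs(transcriptions, dictionary_words)
print(oovs)  # {'blorp': ['utt2', 'utt3'], 'zib': ['utt3']}
```

A mapping like this holds the same two pieces of information the utility is described as saving: the OOV words and the utterances that contain them.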

Phone confidence

The phone confidence functionality of the validation utility is similar to Phone model alignments in that both try to represent the “goodness” of the phone label for the given interval. Where phone models use the acoustic model in combination with a phone language model, phone confidence simply calculates the likelihood of each phone for each frame.
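As a rough illustration of scoring a label from per-frame likelihoods, the sketch below averages the log-likelihood of the labeled phone over an interval and reports which phone scores best over the same frames. The input layout (one dict of phone to log-likelihood per frame), the function name, and the use of a simple mean are assumptions; how MFA actually aggregates frame scores is not described here.

```python
def interval_confidence(frame_loglikes, label, start_frame, end_frame):
    """Score a phone label over the frames [start_frame, end_frame).

    frame_loglikes: list with one dict per frame, mapping phone -> log-likelihood
    (a hypothetical layout; real scores would come from the acoustic model).
    Returns the mean log-likelihood of `label` and the phone with the highest
    mean log-likelihood over the same frames.
    """
    frames = frame_loglikes[start_frame:end_frame]
    if not frames:
        raise ValueError("interval contains no frames")
    phones = frames[0].keys()
    means = {p: sum(f[p] for f in frames) / len(frames) for p in phones}
    best_phone = max(means, key=means.get)
    return means[label], best_phone


# Made-up scores for three frames and two phones
frames = [
    {"AA": -1.0, "IY": -3.0},
    {"AA": -2.0, "IY": -2.5},
    {"AA": -1.5, "IY": -4.0},
]
score, best = interval_confidence(frames, "IY", 0, 3)
print(score, best)
```

In this toy input the interval is labeled IY but AA scores higher on every frame, which is the kind of disagreement a confidence measure is meant to surface.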

Running the corpus validation utility

Command reference

mfa validate

Validate a corpus for use in MFA.

mfa validate [OPTIONS] CORPUS_DIRECTORY DICTIONARY_PATH
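For scripted use, the synopsis above can be assembled from Python. The sketch below only builds and prints the command line. The helper name `build_validate_command` and both paths are hypothetical placeholders, and the flags are taken from the option list on this page.

```python
import shlex


def build_validate_command(corpus_directory, dictionary_path, *options):
    """Return the argument list for `mfa validate`, with options before the arguments."""
    return ["mfa", "validate", *options, corpus_directory, dictionary_path]


# Hypothetical paths; replace with a real corpus directory and dictionary
command = build_validate_command(
    "/path/to/corpus",
    "/path/to/dictionary.dict",
    "--ignore_acoustics",
    "--num_jobs", "4",
    "--clean",
)
print(shlex.join(command))
# mfa validate --ignore_acoustics --num_jobs 4 --clean /path/to/corpus /path/to/dictionary.dict
```

Passing the resulting list to `subprocess.run(command, check=True)` would run the validation, assuming `mfa` is installed and on the PATH.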

Options

--acoustic_model_path <acoustic_model_path>

Acoustic model to use in testing alignments.

-c, --config_path <config_path>

Path to config file to use for training.

-s, --speaker_characters <speaker_characters>

Number of characters of file names to use for determining speaker, default is to use directory names.
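The two ways of determining a speaker described here can be sketched as follows. This is an illustration only: the function name `speaker_for_file` and the example paths are made up, and MFA's own handling of file names may differ in detail.

```python
from pathlib import Path


def speaker_for_file(path, speaker_characters=0):
    """Guess the speaker for a sound file.

    With speaker_characters > 0, use that many leading characters of the file
    name; otherwise fall back to the name of the containing directory.
    """
    path = Path(path)
    if speaker_characters > 0:
        return path.stem[:speaker_characters]
    return path.parent.name


# Hypothetical corpus layouts
print(speaker_for_file("corpus/session1/spk01_utt003.wav", 5))  # spk01
print(speaker_for_file("corpus/spk02/utt003.wav"))              # spk02
```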

-a, --audio_directory <audio_directory>

Audio directory root to use for finding audio files.

--phone_set <phone_set_type>

DEPRECATED, please use --phone_groups_path to specify phone groups instead.

Options:

UNKNOWN | AUTO | MFA | IPA | ARPA | PINYIN

--phone_groups_path <phone_groups_path>

Path to yaml file defining phone groups. See MontrealCorpusTools/mfa-models for examples.

--rules_path <rules_path>

Path to yaml file defining phonological rules. See MontrealCorpusTools/mfa-models for examples.

--ignore_acoustics, --skip_acoustics

Skip acoustic feature generation and associated validation.

--test_transcriptions

Use per-speaker language models to test accuracy of transcriptions.
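One common way to quantify how far a decoded transcription deviates from the original is a word error rate based on word-level edit distance. The sketch below shows that general technique; it is an assumption that a comparable measure is useful here, and it does not describe the metric MFA itself reports.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / max(len(ref), 1)


# Made-up original and decoded transcriptions
original = "the cat sat on the mat"
decoded = "the cat sat on a mat"
print(word_error_rate(original, decoded))  # one substitution over six words
```

Utterances with a high rate are natural candidates for a manual check of the transcript or the audio.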

-p, --profile <profile>

Configuration profile to use, defaults to “global”

-t, --temporary_directory <temporary_directory>

Set the default temporary directory, default is ~/Documents/MFA

-j, --num_jobs <num_jobs>

Set the number of processes to use by default, defaults to 3

--clean, --no_clean

Remove files from previous runs, default is False

-v, --verbose, -nv, --no_verbose

Output debug messages, default is False

-q, --quiet, -nq, --no_quiet

Suppress all output messages (overrides verbose), default is False

--overwrite, --no_overwrite

Overwrite output files when they exist, default is False

--use_mp, --no_use_mp

Turn on/off multiprocessing. Multiprocessing is recommended, as it allows for faster execution.

--use_threading, --no_use_threading

Use threading library rather than multiprocessing library. Multiprocessing is recommended, as it allows for faster execution.

-d, --debug, -nd, --no_debug

Run extra steps for debugging issues, default is False

--use_postgres, --no_use_postgres

Use postgres instead of sqlite for extra functionality, default is False

--single_speaker

Single speaker mode creates multiprocessing splits based on utterances rather than speakers. This mode also disables speaker adaptation, equivalent to --uses_speaker_adaptation false.

--textgrid_cleanup, --cleanup_textgrids, --no_textgrid_cleanup, --no_cleanup_textgrids

Turn on/off post-processing of TextGrids that cleans up silences and recombines compound words and clitics.

-h, --help

Show this message and exit.

Arguments

CORPUS_DIRECTORY

Required argument

DICTIONARY_PATH

Required argument