Train a word tokenizer (mfa train_tokenizer)
Training a tokenizer uses a simplified sequence-to-sequence model, similar to the one used for G2P, with the following differences:
- Both the input and output symbols are graphemes
- Each symbol can only output itself
- The only insertion allowed is the space character
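The constraints above imply that a tokenized output can differ from its input only by added spaces. The following is a minimal sketch of that property in Python. It is not MFA's implementation, and the function name `is_valid_tokenization` is hypothetical, introduced only for illustration.

```python
def is_valid_tokenization(source: str, tokenized: str) -> bool:
    """Return True if `tokenized` equals `source` with only spaces inserted.

    Hypothetical helper illustrating the constraints; not part of MFA.
    """
    i = 0  # position of the next unmatched symbol in `source`
    for ch in tokenized:
        if i < len(source) and ch == source[i]:
            i += 1  # the symbol outputs itself
        elif ch == " ":
            continue  # an inserted space, the only insertion allowed
        else:
            return False  # a substituted or otherwise inserted symbol
    # Every source symbol must have been emitted (no deletions).
    return i == len(source)


print(is_valid_tokenization("thisisatest", "this is a test"))  # True
print(is_valid_tokenization("thisisatest", "this is a text"))  # False
```

A check like this can serve as a sanity test on tokenizer output: any result that substitutes or drops a grapheme falls outside what the model described above is allowed to produce.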
Command reference
mfa train_tokenizer
Train a tokenizer model from a tokenized corpus.
mfa train_tokenizer [OPTIONS] CORPUS_DIRECTORY OUTPUT_MODEL_PATH
Options
- -c, --config_path <config_path>
Path to config file to use for training.
- --evaluate, --validate
Evaluate accuracy by training on most of the data and validating on an unseen subset.
- --phonetisaurus
Flag for using Phonetisaurus-style models.
- -p, --profile <profile>
Configuration profile to use, defaults to “global”
- -t, --temporary_directory <temporary_directory>
Set the default temporary directory, default is /home/docs/Documents/MFA
- -j, --num_jobs <num_jobs>
Set the number of processes to use by default, defaults to 3
- --clean, --no_clean
Remove files from previous runs, default is False
- -v, --verbose, -nv, --no_verbose
Output debug messages, default is False
- -q, --quiet, -nq, --no_quiet
Suppress all output messages (overrides verbose), default is False
- --overwrite, --no_overwrite
Overwrite output files when they exist, default is False
- --use_mp, --no_use_mp
Turn on/off multiprocessing. Multiprocessing is recommended, as it allows for faster execution.
- --use_threading, --no_use_threading
Use the threading library rather than the multiprocessing library. Multiprocessing is recommended, as it allows for faster execution.
- -d, --debug, -nd, --no_debug
Run extra steps for debugging issues, default is False
- --use_postgres, --no_use_postgres
Use postgres instead of sqlite for extra functionality, default is False
- --single_speaker
Single speaker mode creates multiprocessing splits based on utterances rather than speakers. This mode also disables speaker adaptation, equivalent to --uses_speaker_adaptation false.
- --textgrid_cleanup, --cleanup_textgrids, --no_textgrid_cleanup, --no_cleanup_textgrids
Turn on/off post-processing of TextGrids that cleans up silences and recombines compound words and clitics.
- -h, --help
Show this message and exit.
Arguments
- CORPUS_DIRECTORY
Required argument
- OUTPUT_MODEL_PATH
Required argument
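As an illustration of how the options and arguments above combine, here is a small Python sketch that assembles an invocation as an argument list without running it. The helper name `build_train_tokenizer_command` and both paths are hypothetical placeholders (including the `.zip` extension on the output path). Only flags listed in this reference are used.

```python
import shlex


def build_train_tokenizer_command(
    corpus_directory,
    output_model_path,
    *,
    evaluate=False,
    num_jobs=None,
    clean=False,
):
    """Assemble an `mfa train_tokenizer` command line (hypothetical helper)."""
    command = ["mfa", "train_tokenizer"]
    if evaluate:
        command.append("--evaluate")  # validate on an unseen subset
    if num_jobs is not None:
        command.extend(["--num_jobs", str(num_jobs)])
    if clean:
        command.append("--clean")  # remove files from previous runs
    # The two required positional arguments come last, in the documented order.
    command.extend([corpus_directory, output_model_path])
    return command


# Placeholder paths; substitute a real tokenized corpus and output location.
cmd = build_train_tokenizer_command(
    "/path/to/tokenized_corpus",
    "/path/to/tokenizer_model.zip",
    evaluate=True,
    num_jobs=4,
)
print(shlex.join(cmd))
# mfa train_tokenizer --evaluate --num_jobs 4 /path/to/tokenized_corpus /path/to/tokenizer_model.zip
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where MFA is installed. Building the list first keeps paths containing spaces intact, with no manual quoting.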