Tokenize utterances (`mfa tokenize`)

Deprecated since version 3.4: The functionality for training tokenizers in MFA is deprecated and slated to be removed in MFA 4.0. For better solutions for tokenizing a given language, see Language tokenization for how to use dedicated packages and models for various languages.

Use a model trained with Train a word tokenizer (`mfa train_tokenizer`) to tokenize a corpus, i.e., to insert spaces as word boundaries for orthographic systems that do not require them.

Command reference

mfa tokenize

Tokenize utterances with a trained tokenizer model

Usage

mfa tokenize [OPTIONS] INPUT_PATH TOKENIZER_MODEL_PATH OUTPUT_DIRECTORY

Options

-c, --config_path <config_path>#: Path to config file to use for training.

-p, --profile <profile>#: Configuration profile to use, defaults to “global”

-t, --temporary_directory <temporary_directory>#: Set the default temporary directory, default is /home/docs/Documents/MFA

-j, --num_jobs <num_jobs>#: Set the number of processes to use by default, defaults to 3

--clean, --no_clean#: Remove files from previous runs, default is True

--final_clean, --no_final_clean#: Remove temporary files at the end of run, default is False

-v, --verbose, -nv, --no_verbose#: Output debug messages, default is False

-q, --quiet, -nq, --no_quiet#: Suppress all output messages (overrides verbose), default is False

--overwrite, --no_overwrite#: Overwrite output files when they exist, default is False

--use_mp, --no_use_mp#: Turn on/off multiprocessing. Multiprocessing is recommended and will allow for faster execution.

--use_threading, --no_use_threading#: Use threading library rather than multiprocessing library. Multiprocessing is recommended and will allow for faster execution.

-d, --debug, -nd, --no_debug#: Run extra steps for debugging issues, default is False

--use_postgres, --no_use_postgres#: Use postgres instead of sqlite for extra functionality, default is False

--single_speaker#: Single speaker mode creates multiprocessing splits based on utterances rather than speakers. This mode also disables speaker adaptation, equivalent to `--uses_speaker_adaptation false`.

--textgrid_cleanup, --cleanup_textgrids, --no_textgrid_cleanup, --no_cleanup_textgrids#: Turn on/off post-processing of TextGrids that cleans up silences and recombines compound words and clitics.

-h, --help#: Show this message and exit.

Arguments

INPUT_PATH#: Required argument

TOKENIZER_MODEL_PATH#: Required argument

OUTPUT_DIRECTORY#: Required argument
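As an illustration of the usage pattern and options above, the following is a minimal sketch of assembling an `mfa tokenize` invocation from Python. The helper name `build_tokenize_command` and the paths `corpus`, `tokenizer.zip`, and `tokenized` are hypothetical placeholders, not part of MFA; only flags listed on this page are used.

```python
import shlex


def build_tokenize_command(input_path, tokenizer_model_path, output_directory,
                           num_jobs=None, overwrite=False, clean=True):
    """Assemble an argument list for `mfa tokenize` (hypothetical helper, not an MFA API)."""
    command = ["mfa", "tokenize"]
    if num_jobs is not None:
        # -j / --num_jobs: number of processes to use
        command += ["--num_jobs", str(num_jobs)]
    # --clean / --no_clean: whether to remove files from previous runs
    command.append("--clean" if clean else "--no_clean")
    if overwrite:
        # --overwrite: overwrite output files when they exist
        command.append("--overwrite")
    # Positional arguments follow the documented order:
    # INPUT_PATH TOKENIZER_MODEL_PATH OUTPUT_DIRECTORY
    command += [str(input_path), str(tokenizer_model_path), str(output_directory)]
    return command


if __name__ == "__main__":
    # Placeholder paths; substitute a real corpus, model, and output directory.
    cmd = build_tokenize_command("corpus", "tokenizer.zip", "tokenized",
                                 num_jobs=4, overwrite=True)
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where MFA is installed and available on the PATH. Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.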

API reference

Tokenizers
