Speaker diarization (mfa diarize_speakers)#

The Montreal Forced Aligner can use trained ivector models (see Train an ivector extractor (mfa train_ivector) for more information about training these models) to classify or cluster utterances by speaker.
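
For example, a typical clustering run might look like the following, where the corpus, model, and output paths are placeholders; the ivector extractor can be one trained with mfa train_ivector or a pretrained model downloaded via mfa model download:

mfa diarize_speakers /path/to/corpus /path/to/ivector_extractor.zip /path/to/output --cluster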

Following ivector extraction, MFA stores utterance and speaker ivectors in PLDA-transformed space. Storing the PLDA transformation ensures that it is performed only once, when ivectors are initially extracted, rather than repeated each time scoring occurs. The dimensionality of the PLDA-transformed ivectors is 50 by default, but this can be changed through the Global configuration command.

See also

The PLDA transformation and scoring generally follows Probabilistic Linear Discriminant Analysis (PLDA) Explained by Prachi Singh and the associated code.

A number of clustering algorithms from scikit-learn are available for use, along with the default hdbscan. Specifying --use_plda will use PLDA scoring, as opposed to Euclidean distance in PLDA-transformed space. PLDA scoring is likely to give better results, but has the drawback of requiring the full pairwise distance matrix to be computed for hdbscan, affinity, agglomerative, spectral, dbscan, and optics; see the examples below.
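
For instance, with placeholder paths, a run where the number of speakers is known might use kmeans (since kmeans requires a fixed number of clusters, the expected speaker count is supplied via -s/--expected_num_speakers), while a run with an unknown number of speakers might rely on the default hdbscan with the PLDA scoring described above:

mfa diarize_speakers /path/to/corpus /path/to/ivector_extractor.zip /path/to/output --cluster --cluster_type kmeans --expected_num_speakers 4

mfa diarize_speakers /path/to/corpus /path/to/ivector_extractor.zip /path/to/output --cluster --use_plda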

Warning

Some experimentation with clustering is likely necessary, and in general it should be run in a supervised manner. Different recording conditions and noise in particular utterances can affect the ivectors. Please see the speaker diarization functionality of Anchor Annotator for a way to run MFA’s diarization in a supervised manner.

Also, note that much of the speaker diarization functionality in MFA is implemented primarily for Anchor, as speaker diarization is not as constrained a problem as forced alignment. As such, please consider speaker diarization from the command line alpha functionality; there are likely to be issues.

Command reference#

mfa diarize_speakers#

Use an ivector extractor to cluster utterances into speakers

If you would like to use SpeechBrain’s speaker recognition model, specify speechbrain as the IVECTOR_EXTRACTOR_PATH. When using SpeechBrain’s speaker recognition model, the --cuda flag is available to perform computations on the GPU, and the --num_jobs parameter will be used as the batch size for any parallelized computation.
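
A hypothetical invocation using SpeechBrain’s model on the GPU (corpus and output paths are placeholders) would be:

mfa diarize_speakers /path/to/corpus speechbrain /path/to/output --cluster --cuda --num_jobs 16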

mfa diarize_speakers [OPTIONS] CORPUS_DIRECTORY IVECTOR_EXTRACTOR_PATH
                     OUTPUT_DIRECTORY

Options

-c, --config_path <config_path>#

Path to config file to use for diarization.

-s, --expected_num_speakers <expected_num_speakers>#

Number of speakers if known.

--output_format <output_format>#

Format for output files (default is long_textgrid).

Options:

long_textgrid | short_textgrid | json | csv

--classify, --cluster#

Specify whether to classify speakers into pretrained IDs or cluster speakers without a classification model, default is cluster

--cluster_type <cluster_type>#

Type of clustering algorithm to use

Options:

mfa | affinity | agglomerative | spectral | dbscan | hdbscan | optics | kmeans | meanshift

--cuda, --no_cuda#

Flag for using CUDA for SpeechBrain’s model

--use_pca, --no_use_pca#

Flag for using PCA representations of ivectors

--evaluate, --validate#

Flag for whether to evaluate clustering/classification against existing speakers.

-p, --profile <profile>#

Configuration profile to use, defaults to “global”

-t, --temporary_directory <temporary_directory>#

Set the default temporary directory, default is ~/Documents/MFA

-j, --num_jobs <num_jobs>#

Set the number of processes to use by default, defaults to 3

--clean, --no_clean#

Remove files from previous runs, default is False

-v, --verbose, -nv, --no_verbose#

Output debug messages, default is False

-q, --quiet, -nq, --no_quiet#

Suppress all output messages (overrides verbose), default is False

--overwrite, --no_overwrite#

Overwrite output files when they exist, default is False

--use_mp, --no_use_mp#

Turn on/off multiprocessing. Multiprocessing is recommended and will allow for faster executions.

--use_threading, --no_use_threading#

Use the threading library rather than the multiprocessing library. Multiprocessing is recommended and will allow for faster executions.

-d, --debug, -nd, --no_debug#

Run extra steps for debugging issues, default is False

--use_postgres, --no_use_postgres#

Use postgres instead of sqlite for extra functionality, default is False

--single_speaker#

Single speaker mode creates multiprocessing splits based on utterances rather than speakers. This mode also disables speaker adaptation, equivalent to passing --uses_speaker_adaptation false.

--textgrid_cleanup, --cleanup_textgrids, --no_textgrid_cleanup, --no_cleanup_textgrids#

Turn on/off post-processing of TextGrids that cleans up silences and recombines compound words and clitics.

-h, --help#

Show this message and exit.

Arguments

CORPUS_DIRECTORY#

Required argument

IVECTOR_EXTRACTOR_PATH#

Required argument

OUTPUT_DIRECTORY#

Required argument

Configuration reference#

API reference#