VadSegmenter

class montreal_forced_aligner.vad.segmenter.VadSegmenter(**kwargs)

Bases: VadConfigMixin, AcousticCorpusMixin, FileExporterMixin, SpeechbrainSegmenterMixin, TopLevelMfaWorker

Class for performing voice activity detection (VAD) and segmenting audio into utterances. Parameters are passed to speechbrain.pretrained.interfaces.VAD.get_speech_segments.

Parameters:

segment_padding (float) – Size of padding on both ends of a segment
large_chunk_size (float) – Size (in seconds) of the large chunks that are read sequentially from the input audio file.
small_chunk_size (float) – Size (in seconds) of the small chunks extracted from the large ones. The audio signal is processed in parallel within the small chunks. Note that large_chunk_size/small_chunk_size must be an integer.
overlap_small_chunk (bool) – If True, overlapping small chunks are created (with 50% overlap). The probabilities of the overlapping chunks are combined using Hamming windows.
apply_energy_VAD (bool) – If True, an energy-based VAD is applied to the detected speech segments. The neural network VAD often creates longer segments and tends to merge close segments together, so the energy VAD post-processing can be useful for obtaining finer-grained voice activity detection. The energy thresholds are controlled by en_activation_th and en_deactivation_th (see below).
double_check (bool) – If True, double checks (using the neural VAD) that the candidate speech segments actually contain speech. A threshold on the mean posterior probabilities provided by the neural network is applied based on the speech_th parameter (see below).
activation_th (float) – Threshold on the neural posteriors above which a speech segment is started.
deactivation_th (float) – Threshold on the neural posteriors below which a speech segment is ended.
en_activation_th (float) – A new speech segment is started if the energy is above this threshold. This is active only if apply_energy_VAD is True.
en_deactivation_th (float) – The segment is considered ended when the energy is less than or equal to this threshold. This is active only if apply_energy_VAD is True.
speech_th (float) – Threshold on the mean posterior probability within the candidate speech segment. Below that threshold, the segment is re-assigned to a non-speech region. This is active only if double_check is True.
close_th (float) – If the distance between boundaries is smaller than close_th, the segments will be merged.
len_th (float) – If the length of a segment is smaller than len_th, the segment is removed.
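
The following is a minimal, self-contained sketch of how some of these parameters interact. It is not the SpeechBrain or MFA implementation. The function names (chunk_ratio, threshold_segments, merge_close_segments), the frame_shift argument, and all numeric values in the usage lines are assumptions made for illustration. It shows the integer-ratio constraint on the chunk sizes, the two-threshold (hysteresis) behaviour of activation_th and deactivation_th, and the merging of segments whose boundaries are closer than close_th.

```python
import math
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def chunk_ratio(large_chunk_size: float, small_chunk_size: float) -> int:
    """Return large_chunk_size / small_chunk_size, which must be an integer."""
    ratio = large_chunk_size / small_chunk_size
    if not math.isclose(ratio, round(ratio)):
        raise ValueError("large_chunk_size/small_chunk_size must be an integer")
    return round(ratio)


def threshold_segments(
    posteriors: List[float],
    frame_shift: float,  # hypothetical: seconds between posterior frames
    activation_th: float,
    deactivation_th: float,
) -> List[Segment]:
    """Start a segment when a posterior exceeds activation_th and end it
    when a later posterior falls below deactivation_th."""
    segments: List[Segment] = []
    start: Optional[float] = None
    for i, p in enumerate(posteriors):
        if start is None and p > activation_th:
            start = i * frame_shift
        elif start is not None and p < deactivation_th:
            segments.append((start, i * frame_shift))
            start = None
    if start is not None:  # the audio ended while still inside speech
        segments.append((start, len(posteriors) * frame_shift))
    return segments


def merge_close_segments(segments: List[Segment], close_th: float) -> List[Segment]:
    """Merge consecutive segments separated by a gap smaller than close_th."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and start - merged[-1][1] < close_th:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Illustrative values only; these are not the library defaults.
posteriors = [0.1, 0.9, 0.8, 0.3, 0.2, 0.1, 0.7, 0.9, 0.1]
raw = threshold_segments(posteriors, frame_shift=0.5,
                         activation_th=0.5, deactivation_th=0.25)
merged = merge_close_segments(raw, close_th=1.5)
```

Note that the frame with posterior 0.3 does not end the first segment: it is below activation_th but not below deactivation_th, so the segment stays open until the next frame. Using two thresholds in this way avoids rapid toggling when the posterior hovers around a single cut-off.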

export_files(output_directory, output_format=None)

Export the results of segmentation as TextGrids

Parameters:

output_directory (str) – Directory to save segmentation TextGrids
output_format (str, optional) – Format to force output files into

classmethod parse_parameters(config_path=None, args=None, unknown_args=None)

Parse parameters for segmentation from a config path or command-line arguments

Parameters:

config_path (Path) – Config path
args (dict[str, Any]) – Parsed arguments
unknown_args (list[str]) – Optional list of arguments that were not parsed

Returns:

Configuration parameters

Return type:

dict[str, Any]
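
As a rough illustration of what combining a config file with command-line arguments can look like, the sketch below merges three sources into one dictionary. It is not MFA's implementation: merge_parameters is a hypothetical helper, it takes an already-loaded config dictionary instead of a path, and it assumes that command-line values take precedence over config values and that unparsed arguments arrive as "--name value" pairs.

```python
from typing import Any, Dict, List, Optional


def merge_parameters(
    config: Optional[Dict[str, Any]] = None,
    args: Optional[Dict[str, Any]] = None,
    unknown_args: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: combine config values with command-line values."""
    params: Dict[str, Any] = dict(config or {})
    # Assumption: parsed arguments override config values, and a value of
    # None means the option was not supplied on the command line.
    for name, value in (args or {}).items():
        if value is not None:
            params[name] = value
    # Assumption: unparsed arguments come as "--name value" pairs.
    # Their values are kept as strings; no type conversion is attempted.
    extras = unknown_args or []
    for flag, value in zip(extras[::2], extras[1::2]):
        if flag.startswith("--"):
            params[flag[2:]] = value
    return params


# Illustrative values only.
params = merge_parameters(
    config={"speech_th": 0.5, "close_th": 0.25},
    args={"speech_th": 0.6, "len_th": None},
    unknown_args=["--segment_padding", "0.03"],
)
```

In this sketch, speech_th ends up as 0.6 because the command-line value wins, close_th keeps its config value, len_th is absent because it was never supplied, and segment_padding is added as the string "0.03".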

segment()

Performs VAD and segmentation into utterances

Raises:

KaldiProcessingError – If there were any errors in running Kaldi binaries

segment_vad_arguments()

Generate Job arguments for SegmentVadFunction

Returns:

Arguments for processing

Return type:

list[SegmentVadArguments]

segment_vad_mfa()

Run segmentation based on VAD.