
class montreal_forced_aligner.vad.segmenter.VadSegmenter(**kwargs)[source]#

Bases: VadConfigMixin, AcousticCorpusMixin, FileExporterMixin, SegmenterMixin, TopLevelMfaWorker

Class for performing voice activity detection (VAD) segmentation. Parameters are passed to speechbrain.pretrained.interfaces.VAD.get_speech_segments

  • segment_padding (float) – Size of padding on both ends of a segment

  • large_chunk_size (float) – Size (in seconds) of the large chunks that are read sequentially from the input audio file.

  • small_chunk_size (float) – Size (in seconds) of the small chunks extracted from the large ones. The audio signal is processed in parallel within the small chunks. Note that large_chunk_size/small_chunk_size must be an integer.

  • overlap_small_chunk (bool) – If True, creates overlapping small chunks (with 50% overlap). The probabilities of the overlapping chunks are combined using Hamming windows.

  • apply_energy_VAD (bool) – If True, an energy-based VAD is applied to the detected speech segments. The neural network VAD often creates longer segments and tends to merge close segments together. Energy-based post-processing can be useful for obtaining finer-grained voice activity detection. The energy thresholds are controlled by en_activation_th and en_deactivation_th (see below).

  • double_check (bool) – If True, double checks (using the neural VAD) that the candidate speech segments actually contain speech. A threshold on the mean posterior probabilities provided by the neural network is applied based on the speech_th parameter (see below).

  • activation_th (float) – Threshold on the neural posteriors above which a speech segment is started.

  • deactivation_th (float) – Threshold on the neural posteriors below which a speech segment is ended.

  • en_activation_th (float) – A new speech segment is started if the energy is above en_activation_th. This is active only if apply_energy_VAD is True.

  • en_deactivation_th (float) – The segment is considered ended when the energy is <= en_deactivation_th. This is active only if apply_energy_VAD is True.

  • speech_th (float) – Threshold on the mean posterior probability within the candidate speech segment. Below that threshold, the segment is re-assigned to a non-speech region. This is active only if double_check is True.

  • close_th (float) – If the distance between boundaries is smaller than close_th, the segments will be merged.

  • len_th (float) – If the length of a segment is smaller than len_th, the segment is removed.
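The interaction between the threshold parameters can be illustrated with a minimal sketch. This is not the SpeechBrain or MFA implementation: the function names, the `frame_shift` argument, and the choice to discard short segments are assumptions made for illustration only. It shows how a pair of activation/deactivation thresholds produces segments with hysteresis, and how `close_th` and `len_th` could then be applied to the resulting boundaries.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def threshold_segments(
    probs: Sequence[float],
    frame_shift: float,
    activation_th: float,
    deactivation_th: float,
) -> List[Segment]:
    """Convert per-frame speech probabilities into segments (hypothetical helper).

    A segment opens when a probability rises above activation_th and closes
    only when it falls below deactivation_th, so values in between keep the
    current state.
    """
    segments: List[Segment] = []
    start = None
    for i, p in enumerate(probs):
        if start is None and p > activation_th:
            start = i * frame_shift
        elif start is not None and p < deactivation_th:
            segments.append((start, i * frame_shift))
            start = None
    if start is not None:
        # Close a segment that is still open at the end of the signal.
        segments.append((start, len(probs) * frame_shift))
    return segments


def merge_close(segments: Sequence[Segment], close_th: float) -> List[Segment]:
    """Merge consecutive segments separated by a gap smaller than close_th."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and start - merged[-1][1] < close_th:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def drop_short(segments: Sequence[Segment], len_th: float) -> List[Segment]:
    """Discard segments shorter than len_th (an assumption about how len_th is used)."""
    return [(s, e) for s, e in segments if e - s >= len_th]


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.6, 0.3, 0.1, 0.8, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9]
    raw = threshold_segments(probs, frame_shift=0.25, activation_th=0.5, deactivation_th=0.25)
    print(drop_short(merge_close(raw, close_th=0.5), len_th=0.75))
```

In this sketch the frame with probability 0.3 does not end the first segment, because it lies between the two thresholds; only the following 0.1 frame does.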

export_files(output_directory, output_format=None)[source]#

Export the results of segmentation as TextGrids

  • output_directory (str) – Directory to save segmentation TextGrids

  • output_format (str, optional) – Format to force output files into

classmethod parse_parameters(config_path=None, args=None, unknown_args=None)[source]#

Parse parameters for segmentation from a config path or command-line arguments

  • config_path (Path) – Config path

  • args (dict[str, Any]) – Parsed arguments

  • unknown_args (list[str]) – Optional list of arguments that were not parsed


Returns:

Configuration parameters

Return type:

dict[str, Any]
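As a rough illustration of how a configuration file, parsed arguments, and unparsed arguments might be combined into a single dict[str, Any], the sketch below uses a hypothetical `merge_parameters` helper. It is not MFA's `parse_parameters` implementation. The precedence order, the skipping of `None` values, and the `--key value` handling of unknown arguments are all assumptions.

```python
from typing import Any, Dict, List, Optional


def merge_parameters(
    config: Dict[str, Any],
    args: Optional[Dict[str, Any]] = None,
    unknown_args: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: overlay command-line values on config values."""
    params = dict(config)  # copy so the caller's dict is not mutated
    if args:
        # Assumption: a value of None means "not supplied on the command line".
        params.update({k: v for k, v in args.items() if v is not None})
    if unknown_args:
        # Assumption: unknown arguments arrive as "--key value" pairs;
        # values are kept as strings because their types are not known here.
        tokens = iter(unknown_args)
        for token in tokens:
            if token.startswith("--"):
                value = next(tokens, None)
                if value is not None:
                    params[token[2:]] = value
    return params


if __name__ == "__main__":
    config = {"close_th": 0.25, "len_th": 0.25}
    args = {"close_th": 0.5, "len_th": None}
    print(merge_parameters(config, args, ["--speech_th", "0.6"]))
```

In practice, string values taken from unparsed arguments would still need to be converted to the types listed in the parameter descriptions.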


Performs VAD and segmentation into utterances


Raises:

KaldiProcessingError – If there were any errors in running Kaldi binaries


Run segmentation based on VAD.

See also


Multiprocessing helper function for each job


Job method for generating arguments for helper function


Generate Job arguments for SegmentVadFunction


Returns:

Arguments for processing

Return type:



Setup segmentation