VadSegmenter
- class montreal_forced_aligner.vad.segmenter.VadSegmenter(**kwargs)
Bases: VadConfigMixin, AcousticCorpusMixin, FileExporterMixin, SegmenterMixin, TopLevelMfaWorker
Class for segmenting audio into utterances with voice activity detection (VAD). Parameters are passed to speechbrain.pretrained.interfaces.VAD.get_speech_segments.
- Parameters:
segment_padding (float) – Size of padding on both ends of a segment
large_chunk_size (float) – Size (in seconds) of the large chunks that are read sequentially from the input audio file.
small_chunk_size (float) – Size (in seconds) of the small chunks extracted from the large ones. The audio signal is processed in parallel within the small chunks. Note that large_chunk_size/small_chunk_size must be an integer.
overlap_small_chunk (bool) – If True, creates small chunks that overlap by 50%. The probabilities of the overlapping chunks are combined using Hamming windows.
apply_energy_VAD (bool) – If True, an energy-based VAD is applied to the detected speech segments. The neural VAD often produces long segments and tends to merge nearby segments together, so energy-based post-processing can give finer-grained voice activity detection. The energy thresholds are set by en_activation_th and en_deactivation_th (see below).
double_check (bool) – If True, double checks (using the neural VAD) that the candidate speech segments actually contain speech. A threshold on the mean posterior probabilities provided by the neural network is applied based on the speech_th parameter (see below).
activation_th (float) – Threshold on the neural posteriors above which a speech segment starts.
deactivation_th (float) – Threshold on the neural posteriors below which a speech segment ends.
en_activation_th (float) – A new speech segment starts if the energy is above en_activation_th. This is active only if apply_energy_VAD is True.
en_deactivation_th (float) – A segment ends when the energy is <= en_deactivation_th. This is active only if apply_energy_VAD is True.
speech_th (float) – Threshold on the mean posterior probability within the candidate speech segment. Below that threshold, the segment is re-assigned to a non-speech region. This is active only if double_check is True.
close_th (float) – If the gap between the end of one segment and the start of the next is smaller than close_th, the two segments are merged.
len_th (float) – Segments shorter than len_th are removed.
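The integer constraint on large_chunk_size / small_chunk_size, and the effect of overlap_small_chunk, come down to simple arithmetic. The sketch below is illustrative only: plan_chunks is a hypothetical helper, not part of MFA or SpeechBrain, and it assumes every small chunk must lie entirely inside its large chunk, ignoring any padding the real implementation may apply.

```python
import math
from typing import Tuple


def plan_chunks(duration: float, large_chunk_size: float, small_chunk_size: float,
                overlap_small_chunk: bool = False) -> Tuple[int, int]:
    """Hypothetical helper: return (number of large chunks, small chunks per large chunk).

    All sizes are in seconds.
    """
    ratio = large_chunk_size / small_chunk_size
    n_small = round(ratio)
    # Tolerate floating-point noise in the division, but reject genuinely fractional ratios.
    if not math.isclose(ratio, n_small):
        raise ValueError("large_chunk_size / small_chunk_size must be an integer")
    # With 50% overlap a new small chunk starts every half chunk, so
    # 2 * n_small - 1 chunks fit inside one large chunk instead of n_small.
    small_per_large = 2 * n_small - 1 if overlap_small_chunk else n_small
    # The last large chunk may be only partially filled by the audio.
    n_large = math.ceil(duration / large_chunk_size)
    return n_large, small_per_large
```

For example, plan_chunks(65.0, 30.0, 0.05) gives 3 large chunks of 600 small chunks each, or 1199 small chunks with overlap_small_chunk=True, while a small_chunk_size of 0.07 with a 30-second large chunk raises ValueError.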
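The threshold parameters describe a hysteresis rule on per-frame speech posteriors, followed by clean-up passes over the resulting boundaries. The following is a minimal pure-Python sketch of that logic, not the SpeechBrain or MFA implementation. It assumes one posterior per frame at a fixed frame_shift in seconds, merges close segments before dropping short ones (the order assumed here for SpeechBrain's merge_close_segments and remove_short_segments steps), and applies segment_padding last. The function names and the frame_shift and duration arguments are hypothetical.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def hysteresis_segments(posteriors: Sequence[float], frame_shift: float,
                        activation_th: float, deactivation_th: float) -> List[Segment]:
    """Open a segment when a posterior exceeds activation_th and close it
    when a posterior drops below deactivation_th."""
    segments: List[Segment] = []
    start = None
    for i, p in enumerate(posteriors):
        if start is None and p > activation_th:
            start = i
        elif start is not None and p < deactivation_th:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:  # speech runs to the end of the signal
        segments.append((start * frame_shift, len(posteriors) * frame_shift))
    return segments


def merge_close_segments(segments: List[Segment], close_th: float) -> List[Segment]:
    """Merge neighbours whose gap (next start minus previous end) is below close_th."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and start - merged[-1][1] < close_th:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def remove_short_segments(segments: List[Segment], len_th: float) -> List[Segment]:
    """Drop segments shorter than len_th."""
    return [(s, e) for s, e in segments if e - s >= len_th]


def pad_segments(segments: List[Segment], padding: float, duration: float) -> List[Segment]:
    """Extend both ends of each segment by padding, clipped to [0, duration]."""
    return [(max(0.0, s - padding), min(duration, e + padding)) for s, e in segments]
```

With activation_th=0.5 and deactivation_th=0.25, a posterior that dips to 0.4 inside a segment does not end it; only a drop below 0.25 does. The band between the two thresholds keeps brief dips from fragmenting speech into many small segments.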
- export_files(output_directory, output_format=None)
Export the results of segmentation as TextGrids
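For readers unfamiliar with the output format, the sketch below shows how a list of (start, end) speech segments could be serialized as a single-tier Praat TextGrid in the long text format, using only the standard library. It is not MFA's exporter: the function name, the tier name and the "speech" label are hypothetical, and it assumes the segments are non-overlapping and lie within [0, duration]. Gaps become empty intervals because the intervals of an IntervalTier must cover the whole tier.

```python
from typing import List, Tuple


def segments_to_textgrid(segments: List[Tuple[float, float]], duration: float,
                         tier_name: str = "speech", label: str = "speech") -> str:
    """Hypothetical helper: render segments as a one-tier long-format TextGrid."""
    # Build intervals that tile [0, duration]; gaps get empty text.
    intervals = []
    cursor = 0.0
    for start, end in sorted(segments):
        if start > cursor:
            intervals.append((cursor, start, ""))
        intervals.append((start, end, label))
        cursor = end
    if cursor < duration:
        intervals.append((cursor, duration, ""))

    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {duration}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        "        xmin = 0",
        f"        xmax = {duration}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (xmin, xmax, text) in enumerate(intervals, start=1):
        escaped = text.replace('"', '""')  # TextGrid escapes quotes by doubling them
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {xmin}",
            f"            xmax = {xmax}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"
```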
- classmethod parse_parameters(config_path=None, args=None, unknown_args=None)
Parse parameters for segmentation from a config path or command-line arguments
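The sketch below illustrates one common way to layer parameter sources: built-in defaults are overridden by config-file values, which are in turn overridden by --name value pairs left over from argument parsing. The precedence order, the merge_parameters name, the token format and the example values are all assumptions for illustration, not taken from MFA's implementation; the config file is represented as an already-loaded dict to keep the example dependency-free.

```python
from typing import Any, Dict, List, Optional


def merge_parameters(defaults: Dict[str, Any],
                     config: Optional[Dict[str, Any]] = None,
                     unknown_args: Optional[List[str]] = None) -> Dict[str, Any]:
    """Hypothetical: defaults < config-file values < '--name value' overrides."""
    params = dict(defaults)
    params.update(config or {})
    tokens = list(unknown_args or [])
    i = 0
    while i < len(tokens):
        name = tokens[i].lstrip("-")
        if name not in params:
            raise ValueError(f"Unknown parameter: {tokens[i]}")
        if i + 1 >= len(tokens):
            raise ValueError(f"Missing value for {tokens[i]}")
        raw = tokens[i + 1]
        current = params[name]
        # Coerce the string to the type of the existing value; check bool
        # before int/float because bool is a subclass of int.
        if isinstance(current, bool):
            params[name] = raw.lower() in ("true", "1", "yes")
        elif isinstance(current, (int, float)):
            params[name] = type(current)(raw)
        else:
            params[name] = raw
        i += 2
    return params
```

For instance, with illustrative defaults {"close_th": 0.25, "len_th": 0.25, "double_check": False}, a config of {"close_th": 0.5} and arguments ["--len_th", "1.0", "--double_check", "true"], every key ends up overridden by exactly one later source.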
- segment()
Performs VAD and segmentation into utterances
- Raises:
KaldiProcessingError – If there were any errors in running Kaldi binaries
- segment_vad()
Run segmentation based on VAD.
See also
SegmentVadFunction – Multiprocessing helper function for each job
segment_vad_arguments – Job method for generating arguments for the helper function
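The See also entries describe a fan-out pattern: one method builds an argument object per job, and a helper function processes each job independently. The sketch below reproduces that shape with hypothetical names (VadJobArguments, make_job_arguments, segment_job, run_jobs) and a deliberately simplified single-threshold detector standing in for the neural VAD. It uses a thread pool so it runs anywhere as a plain script; swapping in concurrent.futures.ProcessPoolExecutor, with the call placed under an if __name__ == "__main__": guard, gives a multiprocessing version, since the dataclass and functions are defined at module level and can be pickled.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class VadJobArguments:
    """Hypothetical per-job arguments; not MFA's actual class."""
    job_id: int
    posteriors: Dict[str, List[float]]  # file name -> per-frame speech probability
    threshold: float
    frame_shift: float  # seconds per frame


def make_job_arguments(posteriors: Dict[str, List[float]], num_jobs: int,
                       threshold: float, frame_shift: float) -> List[VadJobArguments]:
    """Split files round-robin so each job gets its own argument object."""
    names = sorted(posteriors)
    return [
        VadJobArguments(j, {n: posteriors[n] for n in names[j::num_jobs]}, threshold, frame_shift)
        for j in range(num_jobs)
    ]


def segment_job(args: VadJobArguments) -> Dict[str, List[Tuple[float, float]]]:
    """Helper run once per job: mark runs of frames above the threshold as speech."""
    results = {}
    for name, probs in args.posteriors.items():
        segments, start = [], None
        for i, p in enumerate(probs):
            if p > args.threshold and start is None:
                start = i
            elif p <= args.threshold and start is not None:
                segments.append((start * args.frame_shift, i * args.frame_shift))
                start = None
        if start is not None:
            segments.append((start * args.frame_shift, len(probs) * args.frame_shift))
        results[name] = segments
    return results


def run_jobs(job_args: List[VadJobArguments]) -> Dict[str, List[Tuple[float, float]]]:
    """Run every job in parallel and merge the per-file results."""
    merged: Dict[str, List[Tuple[float, float]]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(job_args))) as pool:
        for partial in pool.map(segment_job, job_args):
            merged.update(partial)
    return merged
```

Keeping argument generation separate from the per-job helper means each worker receives everything it needs in one picklable object and shares no state with the others.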