Corpus

class aligner.corpus.Corpus(directory, output_directory, use_speaker_information=True, speaker_characters=0, num_jobs=3, debug=False, ignore_exceptions=False)

Class that stores information about the dataset to align.

Corpus objects have a number of mappings from either utterances or speakers to various properties, and mappings between utterances and speakers.

See http://kaldi-asr.org/doc/data_prep.html for more information about the files that are created by this class.
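
As an illustration of the utterance-to-speaker mappings mentioned above, the sketch below inverts a `utt2spk`-style mapping into a `spk2utt`-style one and renders it in the one-line-per-speaker layout described on the Kaldi data preparation page. The function names `invert_utt2spk` and `format_spk2utt` and the sample IDs are hypothetical; they are not part of `aligner.corpus`, and this is not necessarily how the class builds its own mappings.

```python
from collections import defaultdict


def invert_utt2spk(utt2spk):
    """Build a speaker -> sorted list of utterances mapping."""
    spk2utt = defaultdict(list)
    for utt, spk in utt2spk.items():
        spk2utt[spk].append(utt)
    return {spk: sorted(utts) for spk, utts in spk2utt.items()}


def format_spk2utt(spk2utt):
    """Render one 'speaker utt1 utt2 ...' line per speaker, sorted by speaker."""
    return "\n".join(
        "{} {}".format(spk, " ".join(utts))
        for spk, utts in sorted(spk2utt.items())
    )


# Illustrative data only; these IDs are made up.
utt2spk = {"spk1_utt2": "spk1", "spk1_utt1": "spk1", "spk2_utt1": "spk2"}
spk2utt = invert_utt2spk(utt2spk)
print(format_spk2utt(spk2utt))
```

Sorting both speakers and utterances keeps the output deterministic, which matters because Kaldi tools generally expect these files to be sorted.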

Parameters:

directory (str): Directory of the dataset to align
output_directory (str): Directory to store generated data for the Kaldi binaries
mfcc_config (MfccConfig): Configuration object for how to calculate MFCCs
speaker_characters (int, optional): Number of characters in each filename to count as the speaker ID. If not specified, speaker IDs are generated from directory names.
num_jobs (int, optional): Number of processes to use; defaults to 3
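
The two ways of assigning speaker IDs described for `speaker_characters` could look roughly like the sketch below. The helper `speaker_id_for` is hypothetical and not part of the library. It assumes that the default value of 0 means "not specified" and that the characters are counted from the start of the filename; neither assumption is confirmed by the documentation above.

```python
import os


def speaker_id_for(wav_path, speaker_characters=0):
    """Hypothetical helper: derive a speaker ID for one sound file."""
    if speaker_characters > 0:
        # Assumption: the ID is the first N characters of the filename.
        return os.path.basename(wav_path)[:speaker_characters]
    # Assumption: 0 means "not specified", so fall back to the name of
    # the directory that contains the file.
    return os.path.basename(os.path.dirname(wav_path))


# Made-up paths, for illustration only.
by_directory = speaker_id_for("corpus/alice/rec01.wav")
by_prefix = speaker_id_for("corpus/all/s01_rec01.wav", speaker_characters=3)
```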

Raises:

CorpusError: Raised if the specified corpus directory does not exist
SampleRateError: Raised if the wav files in the dataset do not share a consistent sample rate
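
A consistency check of the kind that leads to `SampleRateError` can be sketched with the standard-library `wave` module. The function `check_sample_rates` and the exception `InconsistentSampleRate` are hypothetical stand-ins defined here so the example is self-contained; the library's own check may work differently.

```python
import wave


class InconsistentSampleRate(Exception):
    """Hypothetical stand-in for the library's SampleRateError."""


def check_sample_rates(wav_paths):
    """Return the shared sample rate, or raise if the files disagree.

    Returns None when no paths are given.
    """
    rates = set()
    for path in wav_paths:
        # wave.open in read mode exposes the header's frame rate.
        with wave.open(path, "rb") as wav_file:
            rates.add(wav_file.getframerate())
    if len(rates) > 1:
        raise InconsistentSampleRate(
            "Found multiple sample rates: {}".format(sorted(rates))
        )
    return next(iter(rates), None)
```

Collecting every rate before raising lets the error message list all the rates found, which is more useful for fixing a dataset than stopping at the first mismatch.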

Attributes

`grouped_cmvn`
`grouped_feat`
`grouped_segments`
`grouped_spk2utt`
`grouped_utt2spk`
`grouped_wav`
`mfcc_directory`
`mfcc_log_directory`
`num_utterances`
`split_directory`
`word_set`

Methods

`create_mfccs`()
`find_best_groupings`()
`get_feat_dim`()
`get_wav_duration`(utt)
`get_word_frquency`(dictionary)
`grouped_text`([dictionary])
`grouped_text_int`(dictionary)
`grouped_utt2fst`(dictionary[, num_frequent_words])
`initialize_corpus`(dictionary[, skip_input])
`parse_mfcc_logs`()
`speaker_utterance_info`()
`write`()