Corpus

class aligner.corpus.Corpus(directory, output_directory, use_speaker_information=True, speaker_characters=0, num_jobs=3, debug=False, ignore_exceptions=False)

Class that stores information about the dataset to align.

Corpus objects have a number of mappings from either utterances or speakers to various properties, and mappings between utterances and speakers.

See http://kaldi-asr.org/doc/data_prep.html for more information about the files that are created by this class.
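As a rough illustration of the utterance and speaker mappings, the sketch below builds Kaldi-style utt2spk and spk2utt tables from a plain dictionary. The function names and the sample IDs are hypothetical and are not part of the Corpus API; only the line layouts ("<utterance-id> <speaker-id>" and "<speaker-id> <utterance-id> ...") follow the Kaldi data preparation page.

```python
from collections import defaultdict


def invert_utt2spk(utt2spk):
    """Build a speaker -> sorted utterance list mapping from utterance -> speaker."""
    spk2utt = defaultdict(list)
    for utt, spk in utt2spk.items():
        spk2utt[spk].append(utt)
    return {spk: sorted(utts) for spk, utts in spk2utt.items()}


def format_utt2spk(utt2spk):
    """One '<utterance-id> <speaker-id>' line per utterance, sorted by utterance."""
    return [f"{utt} {spk}" for utt, spk in sorted(utt2spk.items())]


def format_spk2utt(spk2utt):
    """One '<speaker-id> <utt1> <utt2> ...' line per speaker, sorted by speaker."""
    return [f"{spk} {' '.join(utts)}" for spk, utts in sorted(spk2utt.items())]


# Hypothetical utterance IDs, used only for illustration.
utt2spk = {"spk1_utt2": "spk1", "spk1_utt1": "spk1", "spk2_utt1": "spk2"}
spk2utt = invert_utt2spk(utt2spk)

print("\n".join(format_utt2spk(utt2spk)))
print("\n".join(format_spk2utt(spk2utt)))
```

Prefixing each utterance ID with its speaker ID, as in the sample data, keeps both tables in the same sort order, which the Kaldi page recommends.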

Parameters:

directory : str

Directory of the dataset to align

output_directory : str

Directory to store generated data for the Kaldi binaries

mfcc_config : MfccConfig

Configuration object for how to calculate MFCCs

speaker_characters : int, optional

Number of characters in the filenames to count as the speaker ID. If not specified (the default, 0), speaker IDs are generated from directory names.

num_jobs : int, optional

Number of processes to use, defaults to 3
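The effect of speaker_characters can be pictured with the minimal sketch below. It is not the library's implementation: guess_speaker_id is a hypothetical helper, and it assumes the speaker ID is taken from the start of the file name, which the description above does not state explicitly.

```python
import os


def guess_speaker_id(wav_path, speaker_characters=0):
    """Hypothetical helper: pick a speaker ID for one sound file.

    Assumption: the ID is the first `speaker_characters` characters of the
    file name. With 0, fall back to the name of the containing directory.
    """
    file_name = os.path.basename(wav_path)
    if speaker_characters > 0:
        return file_name[:speaker_characters]
    return os.path.basename(os.path.dirname(wav_path))


# Hypothetical corpus layout, used only for illustration.
path = os.path.join("corpus", "speaker_a", "s01_recording1.wav")

print(guess_speaker_id(path, speaker_characters=3))  # ID from the file name
print(guess_speaker_id(path))                        # ID from the directory name
```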

Raises:

CorpusError

Raised if the specified corpus directory does not exist

SampleRateError

Raised if the wav files in the dataset do not share a consistent sample rate
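The two error conditions can be checked ahead of time with a sketch like the following, which uses only the standard library. It is an assumption-laden stand-in, not the class's own logic: check_corpus and collect_sample_rates are hypothetical names, and built-in exceptions are raised in place of CorpusError and SampleRateError.

```python
import os
import wave


def collect_sample_rates(directory):
    """Map each sample rate found under `directory` to the wav files using it."""
    rates = {}
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if not name.lower().endswith(".wav"):
                continue
            path = os.path.join(root, name)
            with wave.open(path, "rb") as wav_file:
                rate = wav_file.getframerate()
            rates.setdefault(rate, []).append(path)
    return rates


def check_corpus(directory):
    """Hypothetical pre-flight check mirroring the two documented failures."""
    if not os.path.isdir(directory):
        # Stand-in for CorpusError: the corpus directory does not exist.
        raise FileNotFoundError(f"Corpus directory not found: {directory}")
    rates = collect_sample_rates(directory)
    if len(rates) > 1:
        # Stand-in for SampleRateError: more than one sample rate was found.
        raise ValueError(f"Inconsistent sample rates: {sorted(rates)}")
    return rates
```

Returning the rate-to-files mapping makes it easy to report which files need resampling when the check fails.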

Attributes

grouped_cmvn
grouped_feat
grouped_segments
grouped_spk2utt
grouped_utt2spk
grouped_wav
mfcc_directory
mfcc_log_directory
num_utterances
split_directory
word_set

Methods

create_mfccs()
find_best_groupings()
get_feat_dim()
get_wav_duration(utt)
get_word_frquency(dictionary)
grouped_text([dictionary])
grouped_text_int(dictionary)
grouped_utt2fst(dictionary[, num_frequent_words])
initialize_corpus(dictionary)
parse_mfcc_logs()
speaker_utterance_info()
write()