Corpus

class aligner.corpus.Corpus(directory, output_directory, use_speaker_information=True, speaker_characters=0, num_jobs=3, debug=False, ignore_exceptions=False)

Class that stores information about the dataset to align.

Corpus objects have a number of mappings from either utterances or speakers to various properties, and mappings between utterances and speakers.

See http://kaldi-asr.org/doc/data_prep.html for more information about the files that are created by this class.
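As a rough illustration of the utterance/speaker mappings mentioned above, the sketch below inverts a Kaldi-style utt2spk mapping into spk2utt. The helper name `invert_utt2spk` and the sample IDs are hypothetical and are not part of the `aligner` API; the class itself may build these mappings differently.

```python
from collections import defaultdict


def invert_utt2spk(utt2spk):
    """Build a speaker -> sorted utterance list mapping from utterance -> speaker."""
    spk2utt = defaultdict(list)
    for utt, spk in utt2spk.items():
        spk2utt[spk].append(utt)
    # Kaldi data files are expected to be sorted by ID
    return {spk: sorted(utts) for spk, utts in sorted(spk2utt.items())}


# Hypothetical utterance and speaker IDs, for illustration only
utt2spk = {"spk1_utt2": "spk1", "spk1_utt1": "spk1", "spk2_utt1": "spk2"}
spk2utt = invert_utt2spk(utt2spk)

# One line per speaker, in the layout of a Kaldi spk2utt file
spk2utt_lines = [f"{spk} {' '.join(utts)}" for spk, utts in spk2utt.items()]
```

Prefixing utterance IDs with the speaker ID, as in the sample data, follows the recommendation in the Kaldi data preparation guide so that both files sort consistently.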

Parameters:
directory : str

Directory of the dataset to align

output_directory : str

Directory to store generated data for the Kaldi binaries

mfcc_config : MfccConfig

Configuration object for how to calculate MFCCs

speaker_characters : int, optional

Number of characters in the filenames to count as the speaker ID. If not specified, speaker IDs are generated from directory names.

num_jobs : int, optional

Number of processes to use, defaults to 3
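The effect of speaker_characters can be pictured with the minimal sketch below. It is not the library's implementation: the helper `speaker_id_for` is hypothetical, and it assumes that the characters are taken from the start of the file name and that the default of 0 means "not specified".

```python
import os


def speaker_id_for(wav_path, speaker_characters=0):
    """Hypothetical helper: derive a speaker ID for one sound file."""
    if speaker_characters > 0:
        # Assumption: the ID is the first N characters of the file name
        return os.path.basename(wav_path)[:speaker_characters]
    # Otherwise fall back to the name of the containing directory
    return os.path.basename(os.path.dirname(wav_path))


# Hypothetical corpus layouts, for illustration only
by_directory = speaker_id_for("corpus/alice/rec_001.wav")
by_prefix = speaker_id_for("corpus/s01_rec_001.wav", speaker_characters=3)
```

Here `by_directory` is taken from the directory name and `by_prefix` from the first three characters of the file name.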

Raises:
CorpusError

Raised if the specified corpus directory does not exist

SampleRateError

Raised if the wav files in the dataset do not share a consistent sample rate
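The kind of check behind SampleRateError can be sketched with the standard-library wave module. This is an assumption about the general approach, not the class's actual code: the function `check_sample_rates` is hypothetical, and the exception is redefined locally only so that the example runs on its own.

```python
import wave


class SampleRateError(Exception):
    """Local stand-in for the library's exception, so the sketch is self-contained."""


def check_sample_rates(wav_paths):
    """Return the shared sample rate, or raise if the files disagree."""
    rates = {}
    for path in wav_paths:
        with wave.open(path, "rb") as wav_file:
            rates[path] = wav_file.getframerate()
    if len(set(rates.values())) > 1:
        raise SampleRateError(f"Inconsistent sample rates: {rates}")
    # None when no files were given
    return next(iter(rates.values()), None)
```

Running such a check before feature extraction lets a mismatch surface as one clear error rather than as a failure inside a later processing step.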

Attributes

grouped_cmvn

grouped_feat

grouped_segments

grouped_spk2utt

grouped_utt2spk

grouped_wav

mfcc_directory

mfcc_log_directory

num_utterances

split_directory

word_set

Methods

create_mfccs()

find_best_groupings()

get_feat_dim()

get_wav_duration(utt)

get_word_frquency(dictionary)

grouped_text([dictionary])

grouped_text_int(dictionary)

grouped_utt2fst(dictionary[, num_frequent_words])

initialize_corpus(dictionary[, skip_input])

parse_mfcc_logs()

speaker_utterance_info()

write()