Corpus

class aligner.corpus.Corpus(directory, output_directory, speaker_characters=0, num_jobs=3, debug=False, ignore_exceptions=False)

Class that stores information about the dataset to align.

Corpus objects have a number of mappings from either utterances or speakers to various properties, and mappings between utterances and speakers.

See http://kaldi-asr.org/doc/data_prep.html for more information about the files that are created by this class.
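
The utterance and speaker mappings follow Kaldi's utt2spk and spk2utt conventions described at the link above. As a rough illustration only, the sketch below inverts an utterance-to-speaker mapping into a speaker-to-utterances mapping and renders it in the one-line-per-speaker layout Kaldi expects. The function names invert_utt2spk and format_spk2utt and the sample IDs are hypothetical; this is not the code Corpus itself runs.

```python
from collections import defaultdict


def invert_utt2spk(utt2spk):
    """Build a speaker -> sorted utterance list mapping (hypothetical helper)."""
    spk2utt = defaultdict(list)
    for utt, spk in utt2spk.items():
        spk2utt[spk].append(utt)
    # Sort speakers and utterances so the output order is deterministic.
    return {spk: sorted(utts) for spk, utts in sorted(spk2utt.items())}


def format_spk2utt(spk2utt):
    """Render one 'speaker utt1 utt2 ...' line per speaker."""
    return [" ".join([spk] + utts) for spk, utts in spk2utt.items()]


# Made-up utterance IDs, for illustration only.
utt2spk = {"spk1_utt2": "spk1", "spk1_utt1": "spk1", "spk2_utt1": "spk2"}
spk2utt = invert_utt2spk(utt2spk)
lines = format_spk2utt(spk2utt)
```

Prefixing each utterance ID with its speaker ID, as in the sample data, keeps both files in the same sort order, which the Kaldi data preparation guide recommends.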

Parameters:
directory : str

Directory of the dataset to align

output_directory : str

Directory to store generated data for the Kaldi binaries

speaker_characters : int, optional

Number of characters in each filename to count as the speaker ID. If not specified, speaker IDs are generated from directory names.

num_jobs : int, optional

Number of processes to use; defaults to 3
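
To make the speaker_characters behaviour concrete, here is a minimal sketch of one way a speaker ID could be derived from a file path. It assumes the characters are taken from the start of the filename and that the fallback is the name of the file's immediate parent directory; neither detail is confirmed by this page. The helper speaker_id_for and the example paths are hypothetical and not part of aligner.corpus.

```python
import os


def speaker_id_for(wav_path, speaker_characters=0):
    """Guess a speaker ID for a sound file (hypothetical helper)."""
    file_name = os.path.splitext(os.path.basename(wav_path))[0]
    if speaker_characters > 0:
        # Assumption: the speaker ID is a filename prefix of the given length.
        return file_name[:speaker_characters]
    # Assumption: otherwise fall back to the immediate parent directory name.
    return os.path.basename(os.path.dirname(wav_path))


from_directory = speaker_id_for("corpus/speaker_a/rec01.wav")
from_filename = speaker_id_for("corpus/s01_rec01.wav", speaker_characters=3)
```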

Raises:
CorpusError

Raised if the specified corpus directory does not exist

SampleRateError

Raised if the wav files in the dataset do not share a consistent sample rate
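
A sample rate consistency check of the kind that SampleRateError reports can be sketched with the standard library wave module. This is an illustration under assumptions, not the check Corpus performs: check_sample_rates is a hypothetical helper, and InconsistentSampleRate is a stand-in exception defined here rather than the library's SampleRateError.

```python
import wave


class InconsistentSampleRate(Exception):
    """Stand-in exception for this sketch (not the library's SampleRateError)."""


def check_sample_rates(wav_paths):
    """Return the shared sample rate, or raise if the files disagree."""
    rates = {}
    for path in wav_paths:
        # wave.open reads the header of an uncompressed PCM wav file.
        with wave.open(path, "rb") as handle:
            rates[path] = handle.getframerate()
    distinct = sorted(set(rates.values()))
    if len(distinct) > 1:
        raise InconsistentSampleRate(
            "Found multiple sample rates: {}".format(distinct)
        )
    # None when no files were given.
    return distinct[0] if distinct else None
```

Resampling all files to a single rate before building the corpus is the usual way to avoid this error.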

Attributes

features_directory
features_log_directory
grouped_cmvn
grouped_feat
grouped_segments
grouped_spk2utt
grouped_utt2spk
grouped_wav
ivector_directory
num_utterances
utterances
word_set

Methods

combine_feats()
create_subset(subset, feature_config)
figure_utterance_lengths()
find_best_groupings()
get_feat_dim(feature_config)
get_wav_duration(utt)
get_word_frquency(dictionary)
grouped_text([dictionary])
grouped_text_int(dictionary)
grouped_utt2fst(dictionary[, num_frequent_words])
initialize_corpus(dictionary)
parse_features_logs()
speaker_utterance_info()
split_directory()
subset_directory(subset, feature_config)
write()