Dictionary

class aligner.dictionary.Dictionary(input_path, output_directory, oov_code='<unk>', position_dependent_phones=True, num_sil_states=5, num_nonsil_states=3, shared_silence_phones=True, sil_prob=0.5, word_set=None, debug=False)

Class containing information about a pronunciation dictionary

Parameters:
input_path : str

Path to an input pronunciation dictionary

output_directory : str

Path to a directory to store files for Kaldi

oov_code : str, optional

What to label words not in the dictionary, defaults to '<unk>'

position_dependent_phones : bool, optional

Specifies whether phones should be represented as dependent on their position in the word (beginning, middle or end), defaults to True

num_sil_states : int, optional

Number of states to use for silence phones, defaults to 5

num_nonsil_states : int, optional

Number of states to use for non-silence phones, defaults to 3

shared_silence_phones : bool, optional

Specify whether to share states across all silence phones, defaults to True

pronunciation_probabilities : bool, optional

Specifies whether to model different pronunciation probabilities or to treat each entry as a separate word, defaults to True

sil_prob : float, optional

Probability of optional silences following words, defaults to 0.5
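
To illustrate what position_dependent_phones means, here is a minimal sketch that tags each phone of a pronunciation with its position in the word. It assumes the Kaldi-style suffixes _B, _I and _E for beginning, middle and end, plus _S for single-phone words. The helper add_positions is hypothetical and is not part of the aligner API.

```python
def add_positions(phones):
    """Tag each phone in one pronunciation with a word-position suffix.

    Assumes Kaldi-style suffixes: _B (beginning), _I (middle),
    _E (end) and _S (a word consisting of a single phone).
    """
    if len(phones) == 1:
        return [phones[0] + "_S"]
    tagged = []
    for i, phone in enumerate(phones):
        if i == 0:
            suffix = "_B"
        elif i == len(phones) - 1:
            suffix = "_E"
        else:
            suffix = "_I"
        tagged.append(phone + suffix)
    return tagged


print(add_positions(["K", "AE", "T"]))  # ['K_B', 'AE_I', 'T_E']
print(add_positions(["AH"]))            # ['AH_S']
```

With position_dependent_phones=False, the phones would be used without any such suffix.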

Attributes

clitic_markers
oov_int The integer id for out of vocabulary items
optional_silence_csl Phone id of the optional silence phone
phones The set of all phones (silence and non-silence)
phones_dir Directory to store information Kaldi needs about phones
positional_nonsil_phones List of non-silence phones with positions
positional_sil_phones List of silence phones with positions
positions
reversed_phone_mapping A mapping of integer ids to phones
reversed_word_mapping A mapping of integer ids to words
silence_csl A colon-separated list (as a string) of silence phone ids
topo_sil_template
topo_template
topo_transition_template

Methods

add_disambiguation()
cleanup() Clean up temporary files in the output directory
create_utterance_fst(text, frequent_words)
export_lexicon(path[, disambig, probability])
generate_mappings()
save_oovs_found(directory) Save all out of vocabulary items to a file in the specified directory
separate_clitics(item) Separates words with apostrophes or hyphens if the subparts are in the lexicon.
to_int(item) Convert a given word into its integer id
write() Write the files necessary for Kaldi

cleanup()

Clean up temporary files in the output directory

oov_int

The integer id for out of vocabulary items

optional_silence_csl

Phone id of the optional silence phone

phones

The set of all phones (silence and non-silence)

phones_dir

Directory to store information Kaldi needs about phones

positional_nonsil_phones

List of non-silence phones with positions

positional_sil_phones

List of silence phones with positions

reversed_phone_mapping

A mapping of integer ids to phones

reversed_word_mapping

A mapping of integer ids to words

save_oovs_found(directory)

Save all out of vocabulary items to a file in the specified directory

Parameters:
directory : str

Path to the directory in which to save oovs_found.txt
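
As a rough illustration of what saving out-of-vocabulary items involves, the sketch below writes a set of words to oovs_found.txt inside a given directory. The one-word-per-line, sorted layout is an assumption made for this example, and save_oovs_sketch is a hypothetical stand-in rather than the library's implementation.

```python
import os
import tempfile


def save_oovs_sketch(oovs, directory):
    """Write each out-of-vocabulary word on its own line (assumed format)."""
    path = os.path.join(directory, "oovs_found.txt")
    with open(path, "w", encoding="utf8") as f:
        for word in sorted(oovs):
            f.write(word + "\n")
    return path


# Usage: write two hypothetical OOV words to a temporary directory.
example_dir = tempfile.mkdtemp()
example_path = save_oovs_sketch({"zebra", "aardvark"}, example_dir)
with open(example_path, encoding="utf8") as f:
    example_contents = f.read()
```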

separate_clitics(item)

Separates words with apostrophes or hyphens if the subparts are in the lexicon.

Checks whether the text on either side of an apostrophe or hyphen is in the dictionary. If so, splits the word. If neither part is in the dictionary, returns the word without splitting it.

Parameters:
item : str

Lexical item

Returns:
vocab_items : list

List containing all words after any splits due to apostrophes or hyphens
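
The splitting rule described above can be sketched as follows. This is a simplified illustration, not the library's code: separate_clitics_sketch and its lexicon argument are hypothetical, the marker characters are dropped from the output, and a word that is itself in the lexicon is assumed to be returned whole.

```python
import re


def separate_clitics_sketch(item, lexicon, markers="'-"):
    """Split item at apostrophes or hyphens if a resulting part is known."""
    # Assumption: a word already listed in the lexicon is kept whole.
    if item in lexicon:
        return [item]
    # Split on any marker character and drop empty pieces.
    pattern = "[" + re.escape(markers) + "]"
    parts = [p for p in re.split(pattern, item) if p]
    # Split only when at least one piece is a known word.
    if len(parts) > 1 and any(p in lexicon for p in parts):
        return parts
    # Neither part is known: return the word unsplit.
    return [item]


lexicon = {"rock", "roll", "don't"}
print(separate_clitics_sketch("rock-roll", lexicon))  # ['rock', 'roll']
print(separate_clitics_sketch("don't", lexicon))      # ["don't"]
print(separate_clitics_sketch("xx-yy", lexicon))      # ['xx-yy']
```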

silence_csl

A colon-separated list (as a string) of silence phone ids
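
For example, a colon-separated list of this kind can be built from integer ids as shown below. The ids and the helper name to_csl are made up for the illustration.

```python
def to_csl(phone_ids):
    """Join integer phone ids into a colon-separated string."""
    return ":".join(str(i) for i in phone_ids)


silence_phone_ids = [1, 2, 3, 4, 5]  # hypothetical ids
print(to_csl(silence_phone_ids))  # 1:2:3:4:5
```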

to_int(item)

Convert a given word into its integer id
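
A minimal sketch of such a lookup is given below. The word table and ids are hypothetical, and the fallback to the id of the out-of-vocabulary code for unknown words is an assumption based on the oov_code parameter and the oov_int attribute, not confirmed behaviour of to_int.

```python
# Hypothetical word-to-id table; real ids are assigned by the library.
word_mapping = {"<eps>": 0, "<unk>": 1, "cat": 2, "dog": 3}
OOV_CODE = "<unk>"


def to_int_sketch(word):
    """Return the integer id of word, or the OOV id if it is unknown."""
    # Assumption: unknown words map to the id of the OOV code.
    return word_mapping.get(word, word_mapping[OOV_CODE])


print(to_int_sketch("cat"))    # 2
print(to_int_sketch("zebra"))  # 1
```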

write()

Write the files necessary for Kaldi