SplitWordsFunction

class montreal_forced_aligner.tokenization.simple.SplitWordsFunction(word_table, clitic_marker, initial_clitic_regex, final_clitic_regex, compound_regex, cutoff_regex, non_speech_regexes, oov_word=None, grapheme_set=None)

Bases: object

Class for functions that split words containing compound and clitic markers

Parameters:
  • word_table (pywrapfst.SymbolTable) – Symbol table to look words up

  • clitic_marker (str) – Character that marks clitics

  • initial_clitic_regex (re.Pattern) – Regex for splitting off initial clitics

  • final_clitic_regex (re.Pattern) – Regex for splitting off final clitics

  • compound_regex (re.Pattern) – Regex for splitting compound words

  • non_speech_regexes (dict[str, re.Pattern]) – Regexes for detecting and sanitizing non-speech words

  • oov_word (str) – Label to assign to words not in the dictionary; defaults to None
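
As a rough illustration of the kinds of values the regex parameters above might take, the sketch below compiles a few patterns with the standard `re` module. Every pattern, the clitic marker, and the `"bracketed"` dictionary key are hypothetical choices made for this example; they are not MFA's defaults, and the class itself is not instantiated here.

```python
import re

# Hypothetical values for illustration only; not MFA defaults.
clitic_marker = "'"

# Initial clitics such as French "l'" or "d'" at the start of a word.
initial_clitic_regex = re.compile(r"^(?:l|d|qu)'(?=\w)")

# Final clitics such as English "'s" or "'ll" at the end of a word.
final_clitic_regex = re.compile(r"(?<=\w)'(?:s|ll|re)$")

# Compound boundary: a hyphen between two word characters.
compound_regex = re.compile(r"(?<=\w)-(?=\w)")

# Non-speech patterns keyed by a label (the key name is an assumption).
non_speech_regexes = {"bracketed": re.compile(r"^\[.*\]$")}

# Apply each pattern to a sample word.
initial_match = initial_clitic_regex.match("l'homme")
final_match = final_clitic_regex.search("dog's")
compound_parts = compound_regex.split("well-known")
is_non_speech = bool(non_speech_regexes["bracketed"].match("[laughter]"))
```

The lookarounds keep the clitic marker attached to the clitic itself and prevent a bare marker or a leading or trailing hyphen from being treated as a boundary.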

split_clitics(item)

Split a word into subwords based on dictionary information

Parameters:

item (str) – Word to split

Returns:

List of subwords

Return type:

list[str]
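
The general idea of dictionary-guided splitting can be sketched without MFA at all. The function below is a simplified stand-in, not the library's implementation: `split_word_sketch`, the `known_words` set (used here in place of a `pywrapfst.SymbolTable`), and the regex patterns are all assumptions made for this example.

```python
import re

# Hypothetical patterns for illustration; not MFA defaults.
INITIAL_CLITIC = re.compile(r"^(?:l|d|qu)'(?=\w)")
FINAL_CLITIC = re.compile(r"(?<=\w)'(?:s|ll|re)$")
COMPOUND = re.compile(r"(?<=\w)-(?=\w)")


def split_word_sketch(item, known_words):
    """Split a word into subwords; a simplified sketch, not MFA's algorithm."""
    # A word already in the dictionary is kept whole.
    if item in known_words:
        return [item]
    subwords = []
    for part in COMPOUND.split(item):
        # Compound pieces found in the dictionary need no further splitting.
        if part in known_words:
            subwords.append(part)
            continue
        prefix, suffix = [], []
        # Peel off one initial clitic, if present.
        match = INITIAL_CLITIC.match(part)
        if match:
            prefix.append(match.group())
            part = part[match.end():]
        # Peel off one final clitic, if present.
        match = FINAL_CLITIC.search(part)
        if match:
            suffix.append(match.group())
            part = part[:match.start()]
        subwords.extend(prefix + [part] + suffix)
    return subwords


known_words = {"well-known", "homme", "dog"}
whole = split_word_sketch("well-known", known_words)
french = split_word_sketch("l'homme", known_words)
english = split_word_sketch("dog's", known_words)
```

Checking the dictionary before splitting means a listed form such as "well-known" is never broken apart, while unlisted forms fall back to their component pieces.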

to_str(normalized_text)

Convert normalized text to a normalized string

Parameters:

normalized_text (str) – Word to convert

Returns:

Normalized string

Return type:

str