SplitWordsFunction

class montreal_forced_aligner.tokenization.simple.SplitWordsFunction(word_table, clitic_marker, initial_clitic_regex, final_clitic_regex, compound_regex, non_speech_regexes, oov_word=None, grapheme_set=None)

Bases: object

Class for functions that split words containing compound and clitic markers

Parameters:

word_table (pywrapfst.SymbolTable) – Symbol table to look words up
clitic_marker (str) – Character that marks clitics
initial_clitic_regex (re.Pattern) – Regex for splitting off initial clitics
final_clitic_regex (re.Pattern) – Regex for splitting off final clitics
compound_regex (re.Pattern) – Regex for splitting compound words
non_speech_regexes (dict[str, re.Pattern]) – Regexes for detecting and sanitizing non-speech words
oov_word (str, optional) – Label for words not in the dictionary; defaults to None

split_clitics(item)

Split a word into subwords based on dictionary information

Parameters: item (str) – Word to split
Returns: List of subwords
Return type: list[str]
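
The following is a minimal sketch of dictionary-guided splitting, offered only as an illustration and not as MFA's actual implementation. The function name `split_clitics_sketch`, the `known_words` set (a stand-in for looking words up in the symbol table), the default markers, and the three patterns are all hypothetical assumptions.

```python
import re


def split_clitics_sketch(item, known_words, clitic_marker="'", compound_marker="-"):
    """Hypothetical illustration of dictionary-guided splitting (not MFA's code)."""
    # A word that is already in the dictionary is kept whole.
    if item in known_words:
        return [item]

    marker = re.escape(clitic_marker)
    # Assumed patterns: an initial clitic ends with the marker,
    # a final clitic begins with it, and compounds are joined by a hyphen.
    initial_clitic_regex = re.compile(rf"^(\w+{marker})(?=\w)")
    final_clitic_regex = re.compile(rf"(?<=\w)({marker}\w+)$")
    compound_regex = re.compile(rf"(?<=\w){re.escape(compound_marker)}(?=\w)")

    subwords = []
    for part in compound_regex.split(item):
        if part in known_words:
            subwords.append(part)
            continue
        prefix, suffix = [], []
        # Split off an initial clitic only if the dictionary knows it.
        match = initial_clitic_regex.match(part)
        if match and match.group(1) in known_words:
            prefix.append(match.group(1))
            part = part[match.end():]
        # Likewise for a final clitic.
        match = final_clitic_regex.search(part)
        if match and match.group(1) in known_words:
            suffix.append(match.group(1))
            part = part[:match.start()]
        subwords.extend(prefix + [part] + suffix)
    return subwords


known = {"c'", "est", "'s", "dog", "vous"}
print(split_clitics_sketch("c'est", known))
print(split_clitics_sketch("dog's", known))
```

In this sketch a clitic is only split off when the resulting piece is itself a dictionary entry, so a word such as `dog's` loses its final `'s` but is not split at `dog'`.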

to_str(normalized_text)

Convert a normalized word to its string form

Parameters: normalized_text – Word to convert
Returns: Normalized string
Return type: str