SplitWordsFunction
- class montreal_forced_aligner.tokenization.simple.SplitWordsFunction(word_table, clitic_marker, initial_clitic_regex, final_clitic_regex, compound_regex, cutoff_regex, non_speech_regexes, oov_word=None, grapheme_set=None, always_split_compounds=False)
Bases: object
Class for functions that split words containing compound and clitic markers
- Parameters:
word_table (pywrapfst.SymbolTable) – Symbol table to look words up
clitic_marker (str) – Character that marks clitics
initial_clitic_regex (re.Pattern) – Regex for splitting off initial clitics
final_clitic_regex (re.Pattern) – Regex for splitting off final clitics
compound_regex (re.Pattern) – Regex for splitting compound words
non_speech_regexes (dict[str, re.Pattern]) – Regex for detecting and sanitizing non-speech words
oov_word (str) – What to label words not in the dictionary, defaults to None
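
To illustrate how the regex parameters above might work together, the following is a minimal sketch of clitic and compound splitting. It is not MFA's implementation: `split_word` is a hypothetical helper, a plain Python `set` stands in for the `pywrapfst.SymbolTable`, and the example regexes and vocabulary are assumptions chosen only for the demonstration.

```python
import re

# Assumed stand-ins for illustration only; a real setup would use a
# pywrapfst.SymbolTable and regexes compiled from the dictionary.
KNOWN_WORDS = {"l'", "ami", "dog", "'s", "vingt", "quatre"}
CLITIC_MARKER = "'"
INITIAL_CLITIC_REGEX = re.compile(r"^(l'|d')")
FINAL_CLITIC_REGEX = re.compile(r"('s|'ll)$")
COMPOUND_REGEX = re.compile(r"-")


def split_word(word, known_words, clitic_marker, initial_clitic_regex,
               final_clitic_regex, compound_regex, oov_word=None):
    """Hypothetical helper: split one word into compound parts and clitics."""
    # A word already in the vocabulary is returned unsplit.
    if word in known_words:
        return [word]

    pieces = []
    # Split on the compound marker first, then handle clitics per part.
    for part in compound_regex.split(word):
        if not part:
            continue
        if part in known_words or clitic_marker not in part:
            pieces.append(part)
            continue

        initial, final = [], []
        # Peel off at most one initial clitic.
        match = initial_clitic_regex.match(part)
        if match and match.end() > 0:
            initial.append(match.group(0))
            part = part[match.end():]
        # Peel off at most one final clitic if the remainder is still unknown.
        if part not in known_words:
            match = final_clitic_regex.search(part)
            if match and match.start() < len(part):
                final.append(match.group(0))
                part = part[:match.start()]

        pieces.extend(initial)
        if part:
            pieces.append(part)
        pieces.extend(final)

    # Optionally relabel pieces that are still out of vocabulary.
    if oov_word is not None:
        pieces = [p if p in known_words else oov_word for p in pieces]
    return pieces


def demo(word, oov_word=None):
    """Run split_word with the assumed example settings above."""
    return split_word(word, KNOWN_WORDS, CLITIC_MARKER, INITIAL_CLITIC_REGEX,
                      FINAL_CLITIC_REGEX, COMPOUND_REGEX, oov_word=oov_word)
```

With these assumed settings, `demo("l'ami")` separates the initial clitic from its stem, `demo("vingt-quatre")` splits on the compound marker, and `demo("xyz", oov_word="<unk>")` relabels the unknown word. The sketch ignores `cutoff_regex`, `non_speech_regexes`, `grapheme_set`, and `always_split_compounds`, whose behavior is not described in this section.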