SplitWordsFunction

class montreal_forced_aligner.dictionary.mixins.SplitWordsFunction(clitic_marker, initial_clitic_regex, final_clitic_regex, compound_regex, non_speech_regexes, oov_word=None, word_mapping=None, grapheme_mapping=None)

Bases: object

Class for functions that split words containing compound and clitic markers

Parameters:
  • clitic_markers (list[str]) – Characters that mark clitics

  • compound_markers (list[str]) – Characters that mark compound words

  • clitic_set (set[str]) – Set of clitic words

  • brackets (list[tuple[str, str]], optional) – Character tuples to treat as full brackets around words

  • word_mapping (dict[str, int], optional) – Mapping of words to integer IDs

  • specials_set (set[str]) – Set of special words

  • oov_word (str, optional) – Label for words not in the dictionary; defaults to None

split_clitics(item)

Split a word into subwords based on dictionary information

Parameters:

item (str) – Word to split

Returns:

List of subwords

Return type:

list[str]
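
The splitting idea can be illustrated with a minimal, self-contained sketch. It is not MFA's implementation: the function name `split_clitics_sketch`, the `known_words` set, and the default marker characters are all hypothetical, and the real class works from precompiled regexes rather than plain string operations. The sketch assumes a word found in the dictionary is returned whole, and that clitics are stored in the dictionary with their marker attached (for example `l'` or `'s`).

```python
def split_clitics_sketch(item, known_words, clitic_marker="'", compound_marker="-"):
    """Hypothetical illustration of dictionary-driven splitting (not MFA's code)."""
    # A word that is already in the dictionary is never split.
    if item in known_words:
        return [item]
    subwords = []
    # Split on the compound marker first, then look for clitics in each part.
    for part in item.split(compound_marker):
        if not part:
            continue
        if part in known_words or clitic_marker not in part:
            subwords.append(part)
            continue
        head, _, tail = part.partition(clitic_marker)
        initial = head + clitic_marker   # candidate initial clitic, e.g. "l'"
        final = clitic_marker + tail     # candidate final clitic, e.g. "'s"
        if initial in known_words and tail:
            subwords.extend([initial, tail])
        elif final in known_words and head:
            subwords.extend([head, final])
        else:
            # No known clitic found: keep the part intact.
            subwords.append(part)
    return subwords


known = {"l'", "ami", "'s", "dog", "porte", "monnaie"}
print(split_clitics_sketch("l'ami", known))          # ["l'", 'ami']
print(split_clitics_sketch("dog's", known))          # ['dog', "'s"]
print(split_clitics_sketch("porte-monnaie", known))  # ['porte', 'monnaie']
```

Checking the dictionary before splitting keeps lexicalized forms (a word such as `o'clock`, if it were listed) from being broken apart unnecessarily.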

to_str(normalized_text)

Convert normalized text to its string representation

Parameters:

normalized_text – Word to convert

Returns:

Normalized string

Return type:

str
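
The entry for to_str does not spell out how the returned string is chosen. One plausible reading, given that oov_word is the label for words not in the dictionary, is that known words pass through unchanged and unknown words are replaced by that label. The sketch below illustrates this reading only. The function name `to_str_sketch`, the `"<unk>"` label, and the lookup against `word_mapping` are assumptions, not MFA's confirmed behavior.

```python
def to_str_sketch(normalized_text, word_mapping, oov_word="<unk>"):
    """Hypothetical: return the word if it is known, otherwise the OOV label."""
    if normalized_text in word_mapping:
        return normalized_text
    return oov_word


word_mapping = {"hello": 1, "world": 2}  # assumed word-to-ID mapping
print(to_str_sketch("hello", word_mapping))  # hello
print(to_str_sketch("zzz", word_mapping))    # <unk>
```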