G2P Configuration

Global options

| Parameter | Default value | Notes |
| --- | --- | --- |
| `punctuation` | `、。।,@<>”(),.:;¿?¡!\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+=` | Characters to treat as punctuation and strip from around words |
| `clitic_markers` | `‘’` | Characters to treat as clitic markers; they will be collapsed to the first character in the string |
| `compound_markers` | `-` | Characters to treat as markers in compound words (i.e., they do not need to be preserved the way clitic markers are) |
| `num_pronunciations` | `1` | Number of pronunciations to generate |

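These global options can be overridden in a user-supplied YAML configuration file. The sketch below uses only the keys documented above; the values are illustrative rather than recommendations, and it assumes (as with the default files further down) that any keys left out keep their default values.

```yaml
# Illustrative overrides of the global options (example values only)
num_pronunciations: 3    # ask for the top three pronunciation candidates
compound_markers: "-"    # hyphens mark compounds and need not be preserved
clitic_markers: "'’"     # both clitic markers collapse to the first character, "'"
```
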
G2P training options

In addition to the parameters above, the following parameters are used when training a G2P model.

| Parameter | Default value | Notes |
| --- | --- | --- |
| `order` | 7 | Ngram order of the G2P model |
| `random_starts` | 25 | Number of random starts for aligning orthography to phones |
| `seed` | 1917 | Seed for randomization |
| `delta` | 1/1024 | Comparison/quantization delta for Baum-Welch training |
| `lr` | 1.0 | Learning rate for Baum-Welch training |
| `batch_size` | 200 | Batch size for Baum-Welch training |
| `max_iterations` | 10 | Maximum number of iterations to use in Baum-Welch training |
| `smoothing_method` | kneser_ney | Smoothing method for the ngram model |
| `pruning_method` | relative_entropy | Pruning method for pruning the ngram model |
| `model_size` | 1000000 | Target number of ngrams for pruning |

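As a sketch, a custom training configuration might override only the parameters that shape the resulting ngram model. The values below are examples for a hypothetically smaller, faster training run, not tuned recommendations, and it is again assumed that omitted keys fall back to the defaults above.

```yaml
# Illustrative training overrides (example values only)
order: 5              # lower ngram order than the default of 7
random_starts: 10     # fewer random starts for orthography-to-phone alignment
max_iterations: 15    # allow a few more Baum-Welch iterations
model_size: 500000    # prune the ngram model to a smaller target size
```
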
Example G2P configuration files

Default G2P training config file

punctuation: "、。।,@<>\"(),.:;¿?¡!\\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+="
clitic_markers: "'’"
compound_markers: "-"
num_pronunciations: 1  # Used if running in validation mode
order: 7
random_starts: 25
seed: 1917
delta: 0.0009765  # approximately 1/1024, as listed in the table above
lr: 1.0
batch_size: 200
max_iterations: 10
smoothing_method: "kneser_ney"
pruning_method: "relative_entropy"
model_size: 1000000

Default dictionary generation config file

punctuation: "、。।,@<>\"(),.:;¿?¡!\\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+="
clitic_markers: "'’"
compound_markers: "-"
num_pronunciations: 1