G2P Configuration

Global options

punctuation
    Default: 、。।,@<>”(),.:;¿?¡!\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+=
    Characters to treat as punctuation and strip from around words

clitic_markers
    Default: '’
    Characters to treat as clitic markers; all clitic markers are collapsed to the first character in the string

compound_markers
    Default: -
    Characters to treat as markers in compound words (i.e., they do not need to be preserved the way clitic markers are)

num_pronunciations
    Default: 1
    Number of pronunciations to generate

use_mp
    Default: True
    Flag for whether to use multiprocessing
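To make the text-normalization options concrete, here is a minimal Python sketch of how punctuation stripping and clitic-marker collapsing behave. This is illustrative only, not MFA's actual implementation, and the punctuation set is trimmed for readability:

```python
# Illustrative sketch of the normalization implied by the options above.
# The character sets are abbreviated; not MFA's actual code.
PUNCTUATION = "、。,@<>\"(),.:;¿?¡!&%#*~"
CLITIC_MARKERS = "'’"  # every marker is collapsed to the first character


def normalize_word(word: str) -> str:
    # Strip punctuation from around the word (not from inside it).
    word = word.strip(PUNCTUATION)
    # Collapse all clitic markers to the first character in the string.
    for marker in CLITIC_MARKERS[1:]:
        word = word.replace(marker, CLITIC_MARKERS[0])
    return word


print(normalize_word('"don’t!"'))   # prints don't
print(normalize_word("(l'amico),"))  # prints l'amico
```

Note that surrounding punctuation is removed while word-internal apostrophes survive, which is why clitic markers are a separate option from punctuation.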

Train G2P Configuration

In addition to the parameters above, the following parameters are used as part of training a G2P model.

order
    Default: 7
    N-gram order of the G2P model

random_starts
    Default: 25
    Number of random starts for aligning orthography to phones

seed
    Default: 1917
    Seed for randomization

delta
    Default: 1/1024 (= 0.0009765625)
    Comparison/quantization delta for Baum-Welch training

lr
    Default: 1.0
    Learning rate for Baum-Welch training

batch_size
    Default: 200
    Batch size for Baum-Welch training

max_iterations
    Default: 10
    Maximum number of iterations to use in Baum-Welch training

smoothing_method
    Default: kneser_ney
    Smoothing method for the n-gram model

pruning_method
    Default: relative_entropy
    Pruning method for pruning the n-gram model

model_size
    Default: 1000000
    Target number of n-grams after pruning
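The delta parameter is a tolerance: during Baum-Welch training, two weights that differ by less than delta are treated as equal, which is how convergence is detected. A hedged sketch of that comparison (illustrative only, not the actual implementation behind MFA's G2P training):

```python
# Illustrative: how a comparison/quantization delta is typically used.
DELTA = 1 / 1024  # 0.0009765625, the default shown above


def weights_equal(a: float, b: float, delta: float = DELTA) -> bool:
    """Treat two weights as equal if they differ by less than delta."""
    return abs(a - b) < delta


def converged(old_weights, new_weights, delta=DELTA):
    """Training can stop once every weight change falls below delta."""
    return all(weights_equal(a, b, delta) for a, b in zip(old_weights, new_weights))
```

A smaller delta makes training run longer (and quantize weights more finely) before declaring convergence; the default of 1/1024 is a common choice in FST libraries.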

Default G2P training config file

punctuation: "、。।,@<>\"(),.:;¿?¡!\\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+="
clitic_markers: "'’"
compound_markers: "-"
num_pronunciations: 1  # Used if running in validation mode
use_mp: True
order: 7
random_starts: 25
seed: 1917
delta: 0.0009765625  # 1/1024
lr: 1.0
batch_size: 200
max_iterations: 10
smoothing_method: "kneser_ney"
pruning_method: "relative_entropy"
model_size: 1000000
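A config like the one above only needs to include the parameters you want to override; the rest keep their defaults. The sketch below writes the training-specific defaults to a file and shows, in a comment, how such a file would typically be passed to training (file names and paths are placeholders, and the `mfa train_g2p` argument order may differ between MFA versions):

```python
from pathlib import Path

# Write the training-specific defaults shown above to a config file
# (the file name is a placeholder).
config = """\
order: 7
random_starts: 25
seed: 1917
delta: 0.0009765625
lr: 1.0
batch_size: 200
max_iterations: 10
smoothing_method: "kneser_ney"
pruning_method: "relative_entropy"
model_size: 1000000
"""
Path("g2p_train_config.yaml").write_text(config, encoding="utf-8")

# The file can then be supplied when training a G2P model from a
# pronunciation dictionary, e.g.:
#   mfa train_g2p dictionary.txt output_model.zip --config_path g2p_train_config.yaml
```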

Default G2P generation config file

punctuation: "、。।,@<>\"(),.:;¿?¡!\\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+="
clitic_markers: "'’"
compound_markers: "-"
num_pronunciations: 1
use_mp: True
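As with training, a generation config only needs the options you want to change. The sketch below writes a config that raises num_pronunciations to get several candidate pronunciations per word, and shows the typical invocation in a comment (file names and paths are placeholders, and the `mfa g2p` argument order may differ between MFA versions):

```python
from pathlib import Path

# Override num_pronunciations to get three candidates per word
# (the file name is a placeholder).
config = """\
num_pronunciations: 3
use_mp: True
"""
Path("g2p_generate_config.yaml").write_text(config, encoding="utf-8")

# The file can then be supplied when generating pronunciations with a
# trained G2P model, e.g.:
#   mfa g2p word_list.txt g2p_model.zip out_dictionary.txt --config_path g2p_generate_config.yaml
```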