G2P Configuration

Global options

| Parameter | Default value | Notes |
| --- | --- | --- |
| `punctuation` | `、。।,@<>”(),.:;¿?¡!\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+=` | Characters to treat as punctuation and strip from around words |
| `clitic_markers` | `‘’` | Characters to treat as clitic markers; they will be collapsed to the first character in the string |
| `compound_markers` | `-` | Characters to treat as markers in compound words (i.e., they do not need to be preserved the way clitic markers are) |
| `num_pronunciations` | `1` | Number of pronunciations to generate |

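These global options can be overridden in a user-supplied YAML configuration file. The sketch below uses only the keys documented above; the values are illustrative rather than recommendations, and it assumes (as with the default files further down) that any keys left out keep their default values.

```yaml
# Illustrative overrides of the global options (example values only)
num_pronunciations: 3    # ask for the top three pronunciation candidates
compound_markers: "-"    # hyphens mark compounds and need not be preserved
clitic_markers: "'’"     # both clitic markers collapse to the first character, "'"
```
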
G2P training options

In addition to the parameters above, the following parameters are used when training a G2P model.

| Parameter | Default value | Notes |
| --- | --- | --- |
| `order` | 7 | Ngram order of the G2P model |
| `random_starts` | 25 | Number of random starts for aligning orthography to phones |
| `seed` | 1917 | Seed for randomization |
| `delta` | 1/1024 | Comparison/quantization delta for Baum-Welch training |
| `lr` | 1.0 | Learning rate for Baum-Welch training |
| `batch_size` | 200 | Batch size for Baum-Welch training |
| `max_iterations` | 10 | Maximum number of iterations to use in Baum-Welch training |
| `smoothing_method` | kneser_ney | Smoothing method for the ngram model |
| `pruning_method` | relative_entropy | Pruning method for pruning the ngram model |
| `model_size` | 1000000 | Target number of ngrams for pruning |

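As a sketch, a custom training configuration might override only the parameters that shape the resulting ngram model. The values below are examples for a hypothetically smaller, faster training run, not tuned recommendations, and it is again assumed that omitted keys fall back to the defaults above.

```yaml
# Illustrative training overrides (example values only)
order: 5              # lower ngram order than the default of 7
random_starts: 10     # fewer random starts for orthography-to-phone alignment
max_iterations: 15    # allow a few more Baum-Welch iterations
model_size: 500000    # prune the ngram model to a smaller target size
```
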
Example G2P configuration files

Default G2P training config file

punctuation: "、。।,@<>\"(),.:;¿?¡!\\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+="
clitic_markers: "'’"
compound_markers: "-"
num_pronunciations: 1  # Used if running in validation mode
order: 7
random_starts: 25
seed: 1917
delta: 0.0009765  # approximately 1/1024, as listed in the table above
lr: 1.0
batch_size: 200
max_iterations: 10
smoothing_method: "kneser_ney"
pruning_method: "relative_entropy"
model_size: 1000000

Default dictionary generation config file

punctuation: "、。।,@<>\"(),.:;¿?¡!\\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+="
clitic_markers: "'’"
compound_markers: "-"
num_pronunciations: 1