# G2P Configuration

## Global options
| Parameter | Default value | Notes |
|---|---|---|
| punctuation | `、。।,@<>"(),.:;¿?¡!\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+=` | Characters to treat as punctuation and strip from around words |
| clitic_markers | `'’` | Characters to treat as clitic markers; all of them are collapsed to the first character in the string |
| compound_markers | `-` | Characters to treat as markers in compound words (i.e., they do not need to be preserved the way clitic markers are) |
| num_pronunciations | 1 | Number of pronunciations to generate |
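If you only need to change one or two of these values, a config file can redefine just those keys. A minimal sketch, assuming you want the curly apostrophe handled as the sole clitic marker and underscores split like hyphens (the filename and the specific values here are illustrative choices, not defaults):

```yaml
# custom_g2p.yaml (hypothetical filename)
# Override only the keys that should differ from the defaults above.
clitic_markers: "’"      # assumed customization: treat only the curly apostrophe as a clitic marker
compound_markers: "-_"   # assumed customization: split compounds on hyphens and underscores
```

Keys left out of the file are expected to fall back to the defaults listed in the table above.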
## G2P training options
In addition to the global options above, the following parameters are used when training a G2P model.
| Parameter | Default value | Notes |
|---|---|---|
| order | 7 | N-gram order of the G2P model |
| random_starts | 25 | Number of random starts for aligning orthography to phones |
| seed | 1917 | Seed for randomization |
| delta | 1/1024 | Comparison/quantization delta for Baum-Welch training |
| lr | 1.0 | Learning rate for Baum-Welch training |
| batch_size | 200 | Batch size for Baum-Welch training |
| max_iterations | 10 | Maximum number of iterations of Baum-Welch training |
| smoothing_method | kneser_ney | Smoothing method for the n-gram model |
| pruning_method | relative_entropy | Method used to prune the n-gram model |
| model_size | 1000000 | Target number of n-grams after pruning |
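When training runs too slowly or the resulting model is too large, the parameters above are the usual levers. A sketch of an override file, with illustrative values rather than recommendations:

```yaml
# custom_train_g2p.yaml (hypothetical filename)
order: 5            # lower n-gram order: faster training and a smaller model, possibly less accurate
max_iterations: 20  # allow Baum-Welch training more iterations to converge
model_size: 500000  # prune to roughly half the default n-gram count
seed: 42            # fix the seed so the random starts are reproducible across runs
```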
## Example G2P configuration files

### Default G2P training config file
```yaml
punctuation: "、。।,@<>\"(),.:;¿?¡!\\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+="
clitic_markers: "'’"
compound_markers: "-"
num_pronunciations: 1 # Used if running in validation mode
order: 7
random_starts: 25
seed: 1917
delta: 0.0009765
lr: 1.0
batch_size: 200
max_iterations: 10
smoothing_method: "kneser_ney"
pruning_method: "relative_entropy"
model_size: 1000000
```
### Default dictionary generation config file
```yaml
punctuation: "、。।,@<>\"(),.:;¿?¡!\\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+="
clitic_markers: "'’"
compound_markers: "-"
num_pronunciations: 1
```
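For comparison, a variant of the dictionary generation config that asks the model for several candidate pronunciations per word; the value 3 is an illustrative choice, not a default:

```yaml
punctuation: "、。।,@<>\"(),.:;¿?¡!\\&%#*~【】,…‥「」『』〝〟″⟨⟩♪・‹›«»~′$+="
clitic_markers: "'’"
compound_markers: "-"
num_pronunciations: 3  # emit the three best-scoring pronunciations for each word
```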