Dictionaries
Dictionaries should be specified in the following format:
WORDA PHONEA PHONEB
WORDB PHONEB PHONEC
where each line contains a word followed by its transcription, separated by whitespace. The phones in the transcription should be separated from one another by whitespace as well.
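As an illustration of the format above, the following is a minimal sketch of reading such a dictionary in Python. The function name parse_dictionary is hypothetical and is not part of the aligner. Storing several pronunciations per word and skipping lines without a transcription are assumptions made for the example.

```python
from collections import defaultdict


def parse_dictionary(lines):
    """Map each word to a list of pronunciations (each a list of phones)."""
    pronunciations = defaultdict(list)
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if len(fields) < 2:
            continue  # assumption: skip blank lines and words with no phones
        word, phones = fields[0], fields[1:]
        # Assumption: a word listed on several lines has several pronunciations.
        pronunciations[word].append(phones)
    return dict(pronunciations)


example = ["WORDA PHONEA PHONEB", "WORDB PHONEB PHONEC"]
lexicon = parse_dictionary(example)
print(lexicon["WORDA"])  # [['PHONEA', 'PHONEB']]
```

Splitting on arbitrary whitespace means tabs and repeated spaces are handled the same way, which matches the description of the format.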
A dictionary for English that has good coverage is the lexicon derived from the LibriSpeech corpus (LibriSpeech lexicon). This lexicon uses the Arpabet transcription format (like the CMU Pronouncing Dictionary).
The aligner can also be run without a dictionary (--no_dict). In this mode, the aligner constructs pronunciations for the words in the corpus from their orthographies. For example, a dataset with the transcription
WORDA WORDB
for a sound file would have the following dictionary generated:
WORDA W O R D A
WORDB W O R D B
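The orthography-based construction shown above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the aligner's own code: the function names are hypothetical, and it assumes that each character of a word becomes one phone. How the aligner treats case, punctuation, or multi-character graphemes is not covered here.

```python
def orthographic_dictionary(transcriptions):
    """Build word -> phones, using each character of the word as a phone."""
    entries = {}
    for transcription in transcriptions:
        for word in transcription.split():
            # Assumption: one character of the orthography is one phone.
            entries.setdefault(word, list(word))
    return entries


def format_dictionary(entries):
    """Render entries in the 'WORD PHONE PHONE ...' dictionary format."""
    return "\n".join(
        f"{word} {' '.join(phones)}" for word, phones in entries.items()
    )


print(format_dictionary(orthographic_dictionary(["WORDA WORDB"])))
# WORDA W O R D A
# WORDB W O R D B
```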
The Prosodylab-aligner has two preconstructed dictionaries as well, one for English (Prosodylab-aligner English dictionary) and one for Quebec French (Prosodylab-aligner French dictionary).
Note
See the page on generating dictionaries for how to generate a dictionary using our pretrained G2P models.
Non-speech annotations
There are two special phones that can be used for annotations that are not speech, sil and spn. The sil phone is used to model silence, and the spn phone is used to model unknown words. If you have annotations for non-speech vocalizations that are similar to silence, like breathing or exhalation, you can use the sil phone to align those. You can use the spn phone to align annotations like laughter, coughing, etc.
{LG} spn
{SL} sil
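If a corpus uses many such labels, entries like the ones above could be generated rather than typed by hand. The sketch below is one hypothetical way to do that in Python. The function name is an assumption, and which labels count as silence-like or noise-like is a decision for the annotator, not something the aligner determines.

```python
SILENCE_PHONE = "sil"  # models silence and silence-like vocalizations
UNKNOWN_PHONE = "spn"  # models unknown words and other non-speech noise


def non_speech_entries(silence_like, noise_like):
    """Return dictionary lines mapping annotation labels to sil or spn."""
    entries = [f"{label} {SILENCE_PHONE}" for label in silence_like]
    entries += [f"{label} {UNKNOWN_PHONE}" for label in noise_like]
    return entries


# Labels taken from the example entries above.
print("\n".join(non_speech_entries(["{SL}"], ["{LG}"])))
# {SL} sil
# {LG} spn
```

The resulting lines can be appended to a dictionary file in the same word-plus-phones format as any other entry.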