Dictionaries

Dictionaries should be specified in the following format:

WORDA PHONEA PHONEB
WORDB PHONEB PHONEC

where each line contains a word followed by its transcription, separated by whitespace. The phones within the transcription must also be separated from each other by whitespace.
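As a rough illustration of the format, the sketch below reads dictionary text into a Python mapping. The function name `parse_dictionary` is hypothetical and not part of the aligner. Storing a list of pronunciations per word assumes that a word may appear on more than one line, which the format description above does not state.

```python
from collections import defaultdict


def parse_dictionary(text):
    """Map each word to a list of pronunciations (each a list of phones).

    Hypothetical helper for illustration only; not the aligner's own parser.
    """
    lexicon = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()  # split on any run of whitespace
        if len(fields) < 2:
            continue  # skip blank lines and words without any phones
        word, phones = fields[0], fields[1:]
        # Assumption: repeated words are kept as alternative pronunciations.
        lexicon[word].append(phones)
    return dict(lexicon)


example = "WORDA PHONEA PHONEB\nWORDB PHONEB PHONEC\n"
lexicon = parse_dictionary(example)
```

Here `lexicon["WORDA"]` holds a single pronunciation consisting of the phones `PHONEA` and `PHONEB`.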

A dictionary for English that has good coverage is the lexicon derived from the LibriSpeech corpus (LibriSpeech lexicon). This lexicon uses the Arpabet transcription format (like the CMU Pronouncing Dictionary).

The aligner can also be run without a dictionary by passing the --no_dict option. In this mode, the aligner constructs pronunciations for the words in the corpus from their orthographies. For example, a sound file with the transcription

WORDA WORDB

would have the following dictionary generated:

WORDA W O R D A
WORDB W O R D B
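The orthography-based construction can be sketched as follows. This is not the aligner's actual implementation. It simply assumes that every character of a word becomes one phone, and the function names are hypothetical. How the aligner itself treats case, punctuation, or multi-character graphemes is not covered here.

```python
def orthographic_dictionary(transcriptions):
    """Build word -> phones entries by splitting each word into characters.

    Illustrative sketch only: assumes one character equals one phone.
    """
    entries = {}
    for transcription in transcriptions:
        for word in transcription.split():
            entries[word] = list(word)
    return entries


def format_dictionary(entries):
    """Render the entries in the whitespace-separated dictionary format."""
    return "\n".join(
        f"{word} {' '.join(phones)}" for word, phones in sorted(entries.items())
    )


generated = format_dictionary(orthographic_dictionary(["WORDA WORDB"]))
print(generated)
# WORDA W O R D A
# WORDB W O R D B
```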

The Prosodylab-aligner also has two preconstructed dictionaries, one for English (Prosodylab-aligner English dictionary) and one for Quebec French (Prosodylab-aligner French dictionary).

Note

See the page on generating dictionaries for how to use our pretrained G2P models to generate a dictionary.

Non-speech annotations

Two special phones, sil and spn, can be used for annotations that are not speech. The sil phone models silence, and the spn phone models unknown words. If you have annotations for non-speech vocalizations that resemble silence, such as breathing or exhalation, you can align them with the sil phone. Annotations such as laughter or coughing can be aligned with the spn phone. In the dictionary, each annotation label is listed like a word, with sil or spn as its transcription:

{LG} spn
{SL} sil
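Entries like these could be generated with a small helper such as the one below. It is a sketch under stated assumptions: the function name `nonspeech_entries` is hypothetical, and the check that only sil and spn are accepted is a convenience added for this example, not something the aligner is known to enforce.

```python
# The two special non-speech phones described above.
NON_SPEECH_PHONES = {"sil", "spn"}


def nonspeech_entries(annotations):
    """Return dictionary lines mapping annotation labels to sil or spn.

    Hypothetical helper: rejects any phone other than sil or spn.
    """
    lines = []
    for label, phone in annotations.items():
        if phone not in NON_SPEECH_PHONES:
            raise ValueError(f"{phone!r} is not a non-speech phone")
        lines.append(f"{label} {phone}")
    return lines


# Labels taken from the example entries above.
entries = nonspeech_entries({"{LG}": "spn", "{SL}": "sil"})
```

The resulting lines can be appended to an existing dictionary file so that the annotation labels are aligned like any other word.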