Dictionaries should be specified in the following format:

WORDA PHONEA PHONEB
WORDA PHONEC
WORDB PHONEB PHONEC

where each line is a word followed by its transcription, separated by white space. Each phone in the transcription should be separated by white space as well.
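
As a rough illustration of the format described above, the following Python sketch reads dictionary lines into a mapping from words to pronunciations. The function name `parse_dictionary` and the sample entries are hypothetical, chosen only for this example. The sketch also assumes that a word appearing on several lines has several alternative pronunciations. It is not the aligner's own loading code.

```python
from collections import defaultdict


def parse_dictionary(text):
    """Map each word to a list of pronunciations (each a list of phones).

    Hypothetical helper for illustration; not part of the aligner.
    """
    lexicon = defaultdict(list)
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # split on any run of white space
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(
                f"line {line_number}: no transcription for {fields[0]!r}"
            )
        word, phones = fields[0], fields[1:]
        # Assumption: repeated words are alternative pronunciations.
        lexicon[word].append(phones)
    return dict(lexicon)


# Illustrative Arpabet-style entries, not taken from any specific lexicon file.
sample = "HELLO HH AH0 L OW1\nHELLO HH EH0 L OW1\nWORLD W ER1 L D\n"
lexicon = parse_dictionary(sample)
```

Splitting on arbitrary white space means tabs and runs of spaces are treated alike, which matches the description of the format given here.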

For English, the lexicon derived from the LibriSpeech corpus (LibriSpeech lexicon) has good coverage. This lexicon uses the Arpabet transcription format, as does the CMU Pronouncing Dictionary.

The Prosodylab-aligner has two preconstructed dictionaries as well, one for English (Prosodylab-aligner English dictionary) and one for Quebec French (Prosodylab-aligner French dictionary). See also dictionaries for a list of supported dictionaries.


See the page on generating dictionaries for how to generate a dictionary with our pretrained G2P models, or how to generate pronunciation dictionaries directly from orthographies.

Non-speech annotations

There are two special phones that can be used for annotations that are not speech: sil and spn. The sil phone models silence, and the spn phone models unknown words. If you have annotations for non-speech vocalizations that are similar to silence, such as breathing or exhalation, you can use the sil phone to align them. You can use the spn phone to align annotations such as laughter or coughing. For example, the following dictionary entries map the annotations {LG} and {SL} to spn and sil, respectively:

{LG} spn
{SL} sil
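
To check which annotations in a dictionary are mapped to the two special phones, a small script along these lines may help. The helper name `non_speech_entries` is hypothetical, and the sketch assumes that a non-speech entry consists of exactly one annotation followed by a single sil or spn phone.

```python
NON_SPEECH_PHONES = {"sil", "spn"}


def non_speech_entries(lines):
    """Return {annotation: phone} for entries transcribed as sil or spn.

    Hypothetical helper for illustration; not part of the aligner.
    """
    entries = {}
    for line in lines:
        fields = line.split()
        # Assumption: a non-speech entry has exactly one phone.
        if len(fields) == 2 and fields[1] in NON_SPEECH_PHONES:
            entries[fields[0]] = fields[1]
    return entries


entries = non_speech_entries(["{LG} spn", "{SL} sil", "WORLD W ER1 L D"])
print(entries)  # {'{LG}': 'spn', '{SL}': 'sil'}
```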