Phone groups

When training an acoustic model, MFA begins by training a monophone model, where each phone is context-independent. Consider an English ARPABET model as an example. A [T] is modeled the same regardless of:

  • Whether it’s word initial

  • Whether it follows an [S]

  • Whether it’s the onset of a stressed syllable

  • Whether it’s the onset of an unstressed syllable

  • Whether it’s word final

  • Whether it’s followed by [R]

In each of these cases, the acoustic model proceeds through the same HMM states with the same GMM PDFs (probability density functions).
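To make "the same HMM states with the same GMM PDFs" concrete, here is a minimal sketch of what a context-independent lookup amounts to; the names are placeholders for illustration and have nothing to do with MFA's or Kaldi's actual internals. The utterance and figures below then show how much phonetic ground those shared states have to cover.

```python
# In a monophone system there is exactly one HMM per phone; every occurrence
# of [T] in every word is scored against the same states and the same GMMs.
# All names below are illustrative placeholders, not MFA/Kaldi structures.
monophone_models = {
    "T": ("gmm_T_state1", "gmm_T_state2", "gmm_T_state3"),
    "S": ("gmm_S_state1", "gmm_S_state2", "gmm_S_state3"),
    # ... one entry per phone in the inventory
}

def pdfs_for(phone):
    """Look up the shared state PDFs for a phone, ignoring all context."""
    return monophone_models[phone]

# The [T] in "truck", "stop", and "it" all map to the exact same three PDFs.
assert pdfs_for("T") is monophone_models["T"]
```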

The truck righted itself just before it tipped over onto it’s top and came to a full stop.

../../_images/english_t.svg

Waveform, spectrogram, and aligned labels for the full reading of the English text

../../_images/english_t_truck.svg

Waveform, spectrogram, and aligned labels for the word “truck”, realized as [tʃ]

../../_images/english_t_righted.svg

Waveform, spectrogram, and aligned labels for the word “righted”, realized as [ɾ]

../../_images/english_t_itself.svg

Waveform, spectrogram, and aligned labels for the word “itself”, realized as [t̚]

../../_images/english_t_just.svg

Waveform, spectrogram, and aligned labels for the word “just”

../../_images/english_t_it.svg

Waveform, spectrogram, and aligned labels for the word “it”, realized as [t̚]

../../_images/english_t_tipped.svg

Waveform, spectrogram, and aligned labels for the word “tipped”, realized as [tʰ]

../../_images/english_t_onto.svg

Waveform, spectrogram, and aligned labels for the word “onto”, realized as [tʰ]

../../_images/english_t_it%27s.svg

Waveform, spectrogram, and aligned labels for the word “it’s”, realized as [t]

../../_images/english_t_top.svg

Waveform, spectrogram, and aligned labels for the word “top”, realized as [tʰ]

../../_images/english_t_to.svg

Waveform, spectrogram, and aligned labels for the word “to”, realized as [tʰ]

../../_images/english_t_stop.svg

Waveform, spectrogram, and aligned labels for the word “stop”, realized as [t]

Given the range of acoustic realizations of [T] in the utterance above, modeling every occurrence with the same sequence of three HMM states doesn’t make a ton of sense. One aspect of the MFA ARPA model that accounts for some of this variation is the use of position dependent phones: rather than a single [T], you actually have [T_B] (at the beginning of a word), [T_E] (at the end of a word), [T_I] (in the middle of a word), and [T_S] (the word consists of just that one phone, which doesn’t really apply for [T] but is more relevant for vowels like [AY1_S]). Final realizations are therefore not modeled the same as initial or word-internal ones, since each position has its own HMM states and GMM PDFs. This carries its own drawback: sometimes a final or intermediate [T] is realized the same as an initial [T] (i.e. [tʰ]), but there is no pooling across positions, so the [T_E] and [T_I] HMM-GMMs do not contain any learned statistics from [T_B].
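As a rough sketch of how position-dependent labels can be derived from a pronunciation (the _B/_E/_I/_S convention described above), here is a small illustration; the function is mine, not MFA's internal code.

```python
def position_dependent_phones(pronunciation):
    """Attach position suffixes (_B, _E, _I, _S) to a pronunciation.

    A one-phone word gets _S; otherwise the first phone gets _B, the last
    gets _E, and everything in between gets _I.
    """
    if len(pronunciation) == 1:
        return [f"{pronunciation[0]}_S"]
    return [
        f"{phone}_{'B' if i == 0 else 'E' if i == len(pronunciation) - 1 else 'I'}"
        for i, phone in enumerate(pronunciation)
    ]

print(position_dependent_phones(["S", "T", "AA1", "P"]))  # ['S_B', 'T_I', 'AA1_I', 'P_E']
print(position_dependent_phones(["AY1"]))                 # ['AY1_S']
```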

Moving on from monophones, which by definition cannot account well for coarticulation and contextual variability, the next stage of MFA training uses triphones. A triphone represents a phone along with its immediately preceding and following phones. For a word like stop, the monophone string would be [S T AA1 P], but the corresponding triphone string would be [_/S/T S/T/AA1 T/AA1/P AA1/P/_], where the original [T] is no longer the same as all other instances of [T], but instead is only the same as a [T] preceded by [S] and followed by [AA1]. As a result of taking the preceding and following context into account, you now have a ton of different phone symbols that are each modeled differently and have different amounts of data. A triphone like [S/T/AA1] might be decently common, but one like [S/T/AA2] would not have much data, given the rarity of [AA2] in transcriptions. However, we’d really like to pool the data across these and other triphones, since the key aspect for modeling the [T] in this case is that it is preceded by [S] and followed by a vowel, not so much what quality that vowel has.
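To make the expansion concrete, here is a small sketch that turns a monophone string into triphone labels in the left/phone/right notation used above, with _ standing in for the word boundary; it is purely illustrative and not how Kaldi or MFA represent triphones internally.

```python
def to_triphones(phones, boundary="_"):
    """Expand a monophone string into left/center/right triphone labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}/{padded[i]}/{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

print(to_triphones(["S", "T", "AA1", "P"]))
# ['_/S/T', 'S/T/AA1', 'T/AA1/P', 'AA1/P/_']
```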

So instead of taking each triphone as a separate phone, these triphones are clustered to build a decision tree based on the preceding and following contexts. These decision trees should learn that if a [T] is preceded by [S], it should use PDFs related to the unaspirated [t] realization; if it’s at the beginning of a word and followed by a vowel, it should use PDFs related to the [tʰ] realization; and so on. By clustering PDFs into similar ones and making decision trees based on context, we can sidestep the sparsity issue that comes from blowing up the inventory of sounds with triphones, and we can explicitly include groups of phones that should be modeled in the same way.
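The snippet below fakes that effect with hand-written context questions so the pooling is easy to see; a real triphone tree learns these splits from data, per HMM state, rather than from rules like these.

```python
# Hand-written stand-ins for the context questions a triphone decision tree
# would learn from data. The class names and rules are illustrative only.
VOWELS = {"AA1", "AA2", "AE1", "AH0", "EH1", "IH0", "IY1", "UW1"}  # abbreviated

def toy_t_cluster(left, right):
    """Assign a [T] triphone to a pooled realization class based on context."""
    if left == "S":
        return "unaspirated [t]"   # e.g. the [T] in "stop"
    if left == "_" and right in VOWELS:
        return "aspirated [tʰ]"    # e.g. word-initial [T] in "top", "to"
    if right == "_":
        return "unreleased [t̚]"    # e.g. word-final [T] in "it"
    return "other"

for triphone in ["S/T/AA1", "S/T/AA2", "_/T/AA1", "IH0/T/_"]:
    left, center, right = triphone.split("/")
    print(f"{triphone} -> {toy_t_cluster(left, right)}")
# S/T/AA1 and S/T/AA2 land in the same pool, which is exactly the sharing we want.
```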

Phone groups specify which phone symbols should use the same decision trees in modeling. For position dependent phone modeling, it naturally follows that we should put all positions under the same root, so that [T_B], [T_E], [T_I], and [T_S] can benefit from data associated with the other positions, while still having some bias towards particular realizations (as the decision tree takes the central symbol into account as well as the preceding and following context).

In MFA 2.1, you can now specify which phones should be grouped together, rather than specifying arbitrary phone sets like IPA or ARPA as in MFA 2.0. There are baseline versions of these phone groups available in mfa-models/config/acoustic/phone_groups. The English US ARPA phone groups file gives the same phone groups that were used in training the English (US) ARPA 2.0 models, while the MFA phone set ones are a bit more subject to change as I iterate on them.
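If you are putting together a custom phone groups file, a quick sanity check is to load the yaml and make sure no phone ends up in more than one group. The sketch below assumes the list-of-lists layout shown in the excerpt further down; the file name is hypothetical.

```python
import yaml  # PyYAML

# Hypothetical path to a custom phone groups file (a yaml list of groups,
# where each group is a list of phone symbols).
with open("my_phone_groups.yaml", encoding="utf-8") as f:
    phone_groups = yaml.safe_load(f)

seen = set()
for group in phone_groups:
    for phone in group:
        if phone in seen:
            raise ValueError(f"{phone!r} appears in more than one phone group")
        seen.add(phone)

print(f"{len(phone_groups)} phone groups covering {len(seen)} phone symbols")
```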

A general rule of thumb that I follow is to keep phonetically similar-ish phones in the same group. For the English MFA phone groups, I’ve added phonetic variants to the dictionary and specified phonological rules that add more variation to it, but most of these variants share a phone group with their root phone. So variants like [t tʰ tʲ tʷ] are grouped together, while less similar variants like [ɾ] and [ʔ] have their own phone groups (shown in the excerpt below). Similar dialectal variants like [ow əw o] are grouped together as well.

-
  - t
  - tʰ
  - tʲ
  - tʷ
-
  - d
  -
-
  - ɾ
  - ɾʲ
-
  - ʔ

The default, without any custom yaml file or phone set type specified, is to treat each phone as its own phone group. Regardless of how phone groups are set up, if position_dependent_phones is specified, then each phone group will also contain the positional variants of its phones.
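As an illustration of that last point, here is a small sketch of how a phone group could be expanded with positional variants when position_dependent_phones is enabled; the suffixes follow the _B/_E/_I/_S convention from earlier, but the function itself is mine, not MFA's.

```python
POSITION_SUFFIXES = ("_B", "_E", "_I", "_S")

def expand_phone_group(group):
    """Expand a phone group so it also covers the positional variants of its phones."""
    return [phone + suffix for phone in group for suffix in POSITION_SUFFIXES]

print(expand_phone_group(["t", "tʰ", "tʲ", "tʷ"]))
# ['t_B', 't_E', 't_I', 't_S', 'tʰ_B', 'tʰ_E', ...]
```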