Analyzing alignment quality

When exporting TextGrids following alignment, an additional file named alignment_analysis.csv will be exported. I am still working to refine the best measures for analyzing alignments, as it is not as straightforward as taking the overall alignment log-likelihood.
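As a starting point, here is a minimal sketch of inspecting this file with pandas. The column names used below (file, begin, end, speaker, alignment_log_likelihood) are assumptions, so check the header of your own export for the exact names:

```python
import pandas as pd

# Column names here are assumptions; check the header of your own
# alignment_analysis.csv export for the exact names.
df = pd.read_csv("alignment_analysis.csv")

# Utterances with the lowest alignment log-likelihoods are a
# reasonable first place to look for misalignments, bearing in mind
# the caveats discussed below.
suspect = df.sort_values("alignment_log_likelihood").head(20)
print(suspect[["file", "begin", "end", "speaker", "alignment_log_likelihood"]])
```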

Alignment log-likelihood

The first measure provided for each utterance is the alignment log-likelihood. This is the objective measure that was optimized during alignment. However, it is extremely important to note that this log-likelihood is a relative measure: it scores the best alignment path for a particular utterance relative to other possible alignments of that same utterance, not alignment quality in any absolute sense.

A primary reason for such heavy caveats on this metric is the use of speaker adaptation. MFA does two passes of alignment. The first uses a speaker-independent model to generate an initial alignment. This initial alignment is then used to estimate per-speaker feature transforms that try to map the observed features into a common space. Depending on the amount of data for a particular speaker, and the amount of variability they exhibit (i.e., do they yell, do they get excited, do they whisper, did they have a cold, etc.), speaker transforms have a variable effect on improving alignment. This variable improvement directly affects the log-likelihood for a given utterance.
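One way to partially control for these speaker-dependent offsets is to compare each utterance against other utterances from the same speaker rather than across the whole corpus. This is not something MFA does for you; it is just a sketch of one possible workaround, reusing the hypothetical DataFrame from above:

```python
# Z-score each utterance's log-likelihood within its speaker, so that
# a speaker whose transform happened to work poorly does not dominate
# the list of "worst" utterances.
by_speaker = df.groupby("speaker")["alignment_log_likelihood"]
df["ll_speaker_z"] = (
    df["alignment_log_likelihood"] - by_speaker.transform("mean")
) / by_speaker.transform("std")

# The most anomalous utterances relative to their own speaker:
print(df.sort_values("ll_speaker_z").head(20))
```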

Additionally, log-likelihood reflects differences between the training data and the alignment data. Is the variety of the language the same? Does it have a similar gender distribution? Does it have similar styles (conversational, scripted)? Does it have similar noise levels? All of these can affect the acoustics of phones and skew how “likely” a given phone at a given point in time is.

Speech log-likelihood

The overall alignment log-likelihood represents the best path including all sections of silence. In general, when we think about how good an alignment is, we don’t necessarily care how well the silence intervals in a given utterance match the trained silence model. The speech log-likelihood measure therefore excludes silence intervals and averages the per-phone log-likelihoods of the remaining phones in the utterance.
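To make the computation concrete, here is a minimal sketch under stated assumptions: a hypothetical per-phone table with utterance, phone, and log_likelihood columns, and an assumed set of silence/noise labels (the actual labels and column names in your alignments may differ):

```python
import pandas as pd

# Hypothetical per-phone alignment table; the real column names and
# silence labels in your own export may differ.
phones = pd.read_csv("phone_alignments.csv")

SILENCE_PHONES = {"sil", "sp", "spn"}  # assumed silence/noise labels

speech = phones[~phones["phone"].isin(SILENCE_PHONES)]

# Speech log-likelihood: mean per-phone log-likelihood over the
# non-silence phones of each utterance.
speech_ll = speech.groupby("utterance")["log_likelihood"].mean()
print(speech_ll.sort_values().head(20))
```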

Phone duration deviation

Stepping back from log-likelihoods generated by the model, we can look at statistics of phone durations in the aligned corpus. By calculating the mean and standard deviation of duration per phone, we can z-score an individual phone’s duration to see how unexpected it is relative to the corpus overall. The phone duration deviation measure is the average of the absolute z-scores of the phone durations in an utterance.

We use the absolute value of the z-score because excessive durations due to misalignment often come with excessively short durations on other phones. Averaging raw z-scores in these cases would trend towards zero, when we actually want these deviations to accumulate so that utterances where something clearly went wrong stand out.
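The computation looks roughly like the following sketch, reusing the hypothetical per-phone table from above with an assumed duration column:

```python
# Corpus-wide duration statistics per phone label (note: not
# normalized per speaker; see the caveat below).
by_phone = phones.groupby("phone")["duration"]
phones["abs_z"] = (
    (phones["duration"] - by_phone.transform("mean"))
    / by_phone.transform("std")
).abs()

# Phone duration deviation: mean absolute z-score per utterance.
# The absolute value keeps long and short outliers from cancelling.
deviation = phones.groupby("utterance")["abs_z"].mean()
print(deviation.sort_values(ascending=False).head(20))
```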

It is important to note that there are stylistic and speaker influences on duration, and the statistics are gathered over the whole corpus, not normalized per speaker, so false positives are likely to pop up when sorting by this metric. Normalizing per speaker, however, might shrink the magnitude of the duration deviation when a given speaker’s utterances are all poorly aligned. That would increase the likelihood of false negatives, and false positives are more acceptable here than false negatives.

Ideas for the future that need a lot more thinking before I implement them

  1. Use the alignment best path from the speaker-adapted pass, with a lattice and scores generated using the speaker-independent first-pass alignment model

    • This might help get around the variable, speaker-dependent gains from adaptation