News

Roadmap

Warning

Please bear in mind that all plans below are tentative and subject to change.

Version 2.1

  • Generalize phone group specification so that users can define their own phone groups

    • Phone groups capture allophonic variation, common neutralizations, etc.

      • For example, English [ɾ] is grouped with [t] and [d], whereas in Spanish [ɾ] is grouped with [r] and [ð] is grouped with [d]

  • Update to use PyKaldi for interfacing with Kaldi components rather than relying on piped CLI commands to Kaldi binaries

    • This change should also make more nnet3 functionality available (e.g., for the segmentation and speaker diarization updates below). The nnet3 scripts rely on Python code in the Kaldi egs/wsj/s5/steps folder that is not currently exported as part of the Kaldi feedstock on conda-forge.

  • Update segmentation functionality

    • At the moment, only a simple speech activity detection (SAD) algorithm is implemented, which thresholds the amplitude of the signal to separate speech from non-speech

    • For 2.1, I plan to implement a new SAD training capability and to release a pretrained SAD model trained on all the current training data for every language that has a pretrained acoustic model

  • Update speaker diarization functionality

    • Support x-vector models as well as i-vector models

    • Properly implement and train PLDA models for diarization

  • Update dictionary model format to move away from the current plain-text lexicons to a more robust compressed format

    • With extra metadata and capabilities in the form of phonological rules and phone groupings, it makes more sense to package those with the lexicon rather than the acoustic model

    • Another option would be to package up the lexicon (and maybe G2P models) with the acoustic model into a complete MFA model

    • As part of any update, I would expand the MFA model CLI with functionality for adding new pronunciations to internal lexicons

      • Something like mfa model update /path/to/g2pped_file.txt
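
As a rough illustration of the amplitude-threshold style of SAD mentioned in the segmentation item above, here is a minimal sketch in plain Python. It is not MFA's implementation: the function names, the RMS measure, the 25 ms / 10 ms framing, and the default threshold are all assumptions chosen for the example, and samples are assumed to be floats in the range [-1, 1].

```python
import math


def frame_rms(samples, frame_length, frame_shift):
    """Return the root-mean-square amplitude of each analysis frame."""
    if not samples:
        return []
    last_start = max(len(samples) - frame_length, 0)
    values = []
    for start in range(0, last_start + 1, frame_shift):
        frame = samples[start:start + frame_length]
        values.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return values


def detect_speech(samples, sample_rate, threshold=0.1, frame_ms=25, shift_ms=10):
    """Return (start, end) times in seconds of runs of frames whose RMS
    amplitude is at or above the threshold (hypothetical helper)."""
    frame_length = max(1, int(sample_rate * frame_ms / 1000))
    frame_shift = max(1, int(sample_rate * shift_ms / 1000))
    segments = []
    start_frame = None
    for i, value in enumerate(frame_rms(samples, frame_length, frame_shift)):
        is_speech = value >= threshold
        if is_speech and start_frame is None:
            # First frame above the threshold opens a segment.
            start_frame = i
        elif not is_speech and start_frame is not None:
            # First frame below the threshold closes the segment.
            segments.append((start_frame * frame_shift / sample_rate,
                             i * frame_shift / sample_rate))
            start_frame = None
    if start_frame is not None:
        # The signal ended while still above the threshold.
        segments.append((start_frame * frame_shift / sample_rate,
                         len(samples) / sample_rate))
    return segments


# Toy signal: silence, a loud alternating burst, then silence again.
silence = [0.0] * 100
burst = [0.5 if i % 2 == 0 else -0.5 for i in range(200)]
print(detect_speech(silence + burst + silence, sample_rate=1000))
```

A fixed threshold like this is sensitive to recording level and background noise, which is part of the motivation for a trained SAD model.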

Not tied to 2.1, but in the near-ish term I would like to:

  • Retrain existing acoustic models with new phone groups and rules features

  • Begin work on expanding to new languages

    • Japanese (in progress)

    • Arabic

    • Tamil

  • Localize documentation

    • I’ll initially do a pass at localizing the documentation to Japanese and see if I can crowdsource other languages (and fixes to my initial Japanese pass)

  • Finally release a version of Anchor that is compatible with the latest versions of MFA

  • Update pitch feature calculation to use speaker-adjusted min and max f0 ranges
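
One way the speaker-adjusted f0 range in the last item might be derived is sketched below. This is purely illustrative and not a description of MFA's planned method: it assumes a first-pass f0 track per speaker (with 0 marking unvoiced frames), and the function name, percentiles, margin, and floor/ceiling values are all hypothetical.

```python
def speaker_f0_range(f0_values, low_pct=5, high_pct=95, margin=0.2,
                     floor=50.0, ceiling=800.0):
    """Estimate a (min_f0, max_f0) search range in Hz for one speaker.

    f0_values: first-pass f0 estimates in Hz, with 0 for unvoiced frames.
    The range is built from percentiles of the voiced frames, widened by
    a relative margin, and clipped to [floor, ceiling].
    """
    voiced = sorted(v for v in f0_values if v > 0)
    if not voiced:
        # No voiced frames: fall back to the widest allowed range.
        return floor, ceiling

    def percentile(pct):
        # Nearest-rank percentile over the sorted voiced values.
        index = round(pct / 100 * (len(voiced) - 1))
        return voiced[index]

    min_f0 = max(floor, percentile(low_pct) * (1 - margin))
    max_f0 = min(ceiling, percentile(high_pct) * (1 + margin))
    return min_f0, max_f0


# Toy first-pass track: some unvoiced frames plus voiced values from 100 to 200 Hz.
track = [0.0] * 20 + [float(hz) for hz in range(100, 201)]
print(speaker_f0_range(track))
```

Using percentiles rather than the raw minimum and maximum keeps octave errors and other outliers in the first pass from inflating the range.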

Future

  • Moving away from Kaldi-based dependencies

    • Kaldi is not being actively developed and I don’t have much of a desire to depend on it long term

    • Most actively developed ASR toolkits and libraries are based around neural networks

      • I’m not the biggest fan of using these for alignment, as most of the research is geared towards improving end-to-end models that map the signal directly to orthographic text and have no intermediate representation of phones

      • That said, if alignment were the task that was being optimized for rather than some “word error rate” style metric, then alignment performance could improve significantly

        • One particular direction would be towards sample-based or waveform-based alignment rather than frame-based

          • Frame-based methods are time-smeared, so providing an exact time for voicing onset or stop closure is murky

          • Phoneticians use spectrograms for gross boundaries, but more accurate manual alignments are determined based on the waveform

        • Perhaps pairing a model that performs language-independent boundary insertion with per-language models that merge the resulting segments would perform better (e.g., [a] + [j] becomes [aj] in English, but not in languages like Japanese, Spanish, or Portuguese)

      • Additionally, neural networks might allow for better modeling of phone symbols, so an embedding of [pʲ] could be more compositional, along the lines of “voiceless bilabial stop plus palatalization”

    • Other toolkit options for supporting MFA are:

      • SpeechBrain

      • Custom PyTorch code

      • Custom TensorFlow code
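
To put rough numbers on the frame-based versus sample-based point above, the sketch below compares how precisely a boundary can be placed on each grid. It assumes a 10 ms frame shift and a 16 kHz sample rate, which are common settings but are assumptions here, and the helper names are hypothetical. It only models quantization to the grid, not the additional smearing from the analysis window itself.

```python
def frame_quantize(time_s, frame_shift_s=0.010):
    """Snap a time in seconds to the nearest frame boundary."""
    return round(time_s / frame_shift_s) * frame_shift_s


def sample_quantize(time_s, sample_rate=16000):
    """Snap a time in seconds to the nearest audio sample."""
    return round(time_s * sample_rate) / sample_rate


def worst_case_error(step_s):
    """Largest possible rounding error for a grid with the given step."""
    return step_s / 2


true_boundary = 0.1234  # hypothetical stop-closure time in seconds

frame_error = abs(frame_quantize(true_boundary) - true_boundary)
sample_error = abs(sample_quantize(true_boundary) - true_boundary)

print(f"frame grid error:  {frame_error * 1000:.3f} ms "
      f"(worst case {worst_case_error(0.010) * 1000:.3f} ms)")
print(f"sample grid error: {sample_error * 1000:.3f} ms "
      f"(worst case {worst_case_error(1 / 16000) * 1000:.5f} ms)")
```

Under these assumptions the frame grid can be off by up to 5 ms, while the sample grid is limited to about 0.03 ms, which is the gap a waveform-based aligner would aim to close.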