
News#

Roadmap#

Warning

Please bear in mind that all plans below are tentative and subject to change.

Version 2.1#

Generalize phone group specification so that users can define their own groups
- Phone groups capture allophonic variation, common neutralizations, etc.
  - For example, English [ɾ] is grouped with [t] and [d], whereas Spanish [ɾ] is grouped with [r], and Spanish [ð] is grouped with [d]
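
As a rough illustration of the language-specific groupings above, the sketch below stores phone groups as sets keyed by language. The `PHONE_GROUPS` mapping and the `group_of` and `same_group` helpers are hypothetical names chosen for this example; they are not MFA's configuration format or API.

```python
# Hypothetical per-language phone groups (illustration only, not MFA's format).
PHONE_GROUPS = {
    "english": [{"t", "d", "ɾ"}],
    "spanish": [{"r", "ɾ"}, {"d", "ð"}],
}


def group_of(language, phone):
    """Return the group containing `phone`, or a singleton group if none is defined."""
    for group in PHONE_GROUPS.get(language, []):
        if phone in group:
            return frozenset(group)
    return frozenset({phone})


def same_group(language, phone_a, phone_b):
    """True if both phones fall in the same group for the given language."""
    return phone_b in group_of(language, phone_a)


if __name__ == "__main__":
    print(same_group("english", "ɾ", "t"))
    print(same_group("spanish", "ɾ", "t"))
```

Keying the groups by language is the point of the example: the same symbol [ɾ] patterns with different phones depending on the language, so a single global grouping would not work.
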
Update to use PyKaldi for interfacing with Kaldi components rather than relying on piped CLI commands to Kaldi binaries
- This change should also make more nnet3 functionality available (i.e., for segmentation and speaker diarization below). The nnet3 scripts rely on Python code in the Kaldi egs/wsj/s5/steps folder that is not currently exported as part of the Kaldi feedstock on conda-forge.
Update segmentation functionality
- At the moment, the only speech activity detection (SAD) algorithm implemented is a simple one that thresholds the amplitude of the signal to separate speech from non-speech
- For 2.1, I plan to implement new SAD training capability as well as release a pretrained SAD model trained across all the current training data for every language with a pretrained acoustic model
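
For readers unfamiliar with amplitude-based SAD, a minimal sketch of the general idea follows. It is not MFA's implementation: the function name `detect_speech`, the 25 ms frame size, and the 0.02 RMS threshold are all assumptions made for illustration.

```python
import math


def detect_speech(samples, sample_rate, frame_ms=25, threshold=0.02):
    """Return (start, end) times in seconds for runs of frames whose RMS exceeds `threshold`.

    `samples` is a sequence of floats; a trailing partial frame is ignored.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    segments = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        frame_start = i * frame_len / sample_rate
        if rms > threshold and start is None:
            start = frame_start  # speech begins at this frame
        elif rms <= threshold and start is not None:
            segments.append((start, frame_start))  # speech ended before this frame
            start = None
    if start is not None:
        # Speech runs to the end of the last full frame.
        segments.append((start, n_frames * frame_len / sample_rate))
    return segments
```

A fixed threshold like this is sensitive to recording level and background noise, which is one reason a trained SAD model is attractive.
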
Update speaker diarization functionality
- Support x-vector models as well as i-vector models
- Properly implement and train PLDA models for diarization
Update dictionary model format to move away from the current plain-text lexicons to a more robust compressed format
- With extra metadata and capabilities in the form of phonological rules and phone groupings, it makes more sense to package those with the lexicon rather than the acoustic model
- Another option would be to package up the lexicon (and maybe G2P models) with the acoustic model into a complete MFA model
- As part of any update, I would expand the MFA model CLI with functionality for adding new pronunciations to internal lexicons
  - Something like mfa model update /path/to/g2pped_file.txt

Not tied to 2.1, but in the near-ish term I would like to:

Retrain existing acoustic models with new phone groups and rules features
Begin work on expanding to new languages
- Japanese (in progress)
- Arabic
- Tamil
Localize documentation
- I’ll do an initial pass at localizing the documentation into Japanese and see if I can crowdsource other languages (and fixes to my initial Japanese pass)
Finally release a version of Anchor that is compatible with the latest versions of MFA
Update pitch feature calculation to use speaker-adjusted min and max f0 ranges

Future#

Moving away from Kaldi-based dependencies
- Kaldi is not being actively developed and I don’t have much of a desire to depend on it long term
- Most actively developed ASR toolkits and libraries are based around neural networks
  - I’m not the biggest fan of using these for alignment, as most of the research is geared towards improving end-to-end models that map the signal directly to orthographic text, with no intermediate representation of phones
  - That said, if alignment were the task that was being optimized for rather than some “word error rate” style metric, then alignment performance could improve significantly
    - One particular direction would be towards sample-based or waveform-based alignment rather than frame-based
      - Frame-based methods are time-smeared, so providing an exact time for voicing onset or stop closure is murky
      - Phoneticians use spectrograms for gross boundaries, but more accurate manual alignments are determined based on the waveform
    - Perhaps a model that performs language-independent boundary insertion, combined with per-language models that merge the resulting segments, might perform better ([a] + [j] becomes [aj] in English, but not in languages like Japanese, Spanish, or Portuguese)
  - Additionally, neural networks might allow for better modeling of phone symbols, so embedding [pʲ] could result in a more compositional “voiceless bilabial stop plus palatalization”
- Other options for toolkits to support MFA are
  - SpeechBrain
  - Custom PyTorch code
  - Custom TensorFlow code
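
To put rough numbers on the frame-based versus sample-based distinction mentioned above, the sketch below compares how closely a boundary can be placed on a frame grid and on a sample grid. The 10 ms frame shift and 16 kHz sample rate are assumed values (typical for Kaldi-style features), and the onset time is invented. Quantization is only one part of the problem, since analysis windows also span more than one frame shift.

```python
def snap(time_s, step_s):
    """Snap a time in seconds to the nearest multiple of `step_s`."""
    return round(time_s / step_s) * step_s


FRAME_SHIFT_S = 0.010   # assumed 10 ms frame shift
SAMPLE_RATE = 16000     # assumed 16 kHz audio

true_onset = 0.1234     # made-up voicing onset time, in seconds

# Distance between the true onset and the closest representable boundary.
frame_error = abs(snap(true_onset, FRAME_SHIFT_S) - true_onset)
sample_error = abs(snap(true_onset, 1 / SAMPLE_RATE) - true_onset)

if __name__ == "__main__":
    print(f"frame-based error:  {frame_error * 1000:.3f} ms")
    print(f"sample-based error: {sample_error * 1000:.3f} ms")
```

Under these assumptions, a frame-based boundary can be off by up to half a frame shift (5 ms), while a sample-based boundary is off by at most half a sample period.
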