News
Roadmap
Warning
Please bear in mind that all plans below are tentative and subject to change.
Version 3.1
- Persistent server that audio/text files can be sent to for alignment
  - May not be necessary or prioritized given the single-file alignment command (mfa align_one)
- Update tokenization to use spaCy tokenizers instead of a custom specification
  - Should be more robust than MFA's custom rules
  - Some languages are more finely tokenized than others (e.g., Japanese and Korean tokens are largely morphemes, while the English tokenizer does no morphological analysis), but the ideal would be a morphologically aware G2P of phonological words; see the tokenizer sketch after this list
- Add an option for training a SpeechBrain ASR model on the phone strings of MFA models
  - Should allow for better single-pass alignment, with faster processing on GPUs
- Release a version of Anchor compatible with the latest versions of MFA
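To illustrate the tokenization item above, here is a minimal sketch of how spaCy's language-specific tokenizers differ; it assumes spaCy is installed with the Japanese extras (SudachiPy and a Sudachi dictionary), and the exact Japanese segmentation depends on the dictionary and split mode.

```python
# Minimal sketch: spaCy tokenizers are language-specific, so Japanese comes back
# in roughly morpheme-sized units while English stays at the word level.
import spacy

en = spacy.blank("en")  # rule-based tokenizer, no morphological analysis
ja = spacy.blank("ja")  # SudachiPy-backed tokenizer

print([t.text for t in en("the dogs barked loudly")])
# e.g. ['the', 'dogs', 'barked', 'loudly']

print([t.text for t in ja("犬が吠えた")])
# e.g. ['犬', 'が', '吠え', 'た'] -- particles and inflections become separate tokens
```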
Future
- Retrain existing acoustic models with the new phone groups and rules features
- Begin work on expanding to new languages
  - Japanese (in progress)
  - Arabic
  - Tamil
- Localize documentation
  - I'll initially do a pass at localizing the documentation into Japanese and see if I can crowdsource other languages (and fixes to my initial Japanese pass)
- Update pitch feature calculation to use speaker-adjusted min and max f0 ranges (see the f0-range sketch after this list)
- Move away from Kaldi-based dependencies
  - Kaldi is not being actively developed, and I don't have much of a desire to depend on it long term
  - Most actively developed ASR toolkits and libraries are based around neural networks
    - I'm not the biggest fan of using these for alignment, as most of the research is geared towards improving end-to-end signal-to-orthographic-text models that don't have intermediate representations of phones
    - That said, if alignment were the task being optimized for rather than some "word error rate" style metric, then alignment performance could improve significantly
    - One particular direction would be towards sample-based or waveform-based alignment rather than frame-based
      - Frame-based methods are time-smeared, so providing an exact time for voicing onset or stop closure is murky
      - Phoneticians use spectrograms for gross boundaries, but more accurate manual alignments are determined from the waveform
      - Perhaps a model that performs language-independent boundary insertion, combined with per-language models that merge the resulting segments, might perform better ([a] + [j] becomes [aj] in English, but not in languages like Japanese, Spanish, or Portuguese); see the segment-merging sketch after this list
    - Additionally, neural networks might allow for better modeling of phone symbols, so the embedding for [pʲ] could be composed as "voiceless bilabial stop plus palatalization" (see the phone-embedding sketch after this list)
  - Other options for toolkits to support MFA are
    - Custom PyTorch code
    - Custom TensorFlow code
- Update the dictionary model format to move away from the current plain-text lexicons to a more robust compressed format (a hypothetical layout is sketched after this list)
  - With extra metadata and capabilities in the form of phonological rules and phone groupings, it makes more sense to package those with the lexicon rather than with the acoustic model
  - Another option would be to package up the lexicon (and maybe G2P models) with the acoustic model into a complete MFA model
  - As part of any update, I would expand the MFA model CLI with functionality for adding new pronunciations to internal lexicons, something like:

    mfa model update /path/to/g2pped_file.txt
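To illustrate the speaker-adjusted pitch range item above, here is a minimal sketch (not MFA's implementation) that estimates a speaker's f0 range from a first pass with librosa's pYIN tracker and then narrows the range for a second pass; the percentile cutoffs and safety margins are arbitrary assumptions.

```python
# Sketch: estimate a per-speaker f0 range from a first pitch pass, then reuse
# the narrowed range (e.g. as min/max f0 for pitch feature extraction).
import librosa
import numpy as np

def speaker_f0_range(wav_path, default_fmin=65.0, default_fmax=600.0):
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=default_fmin, fmax=default_fmax, sr=sr)
    voiced_f0 = f0[voiced & ~np.isnan(f0)]
    if voiced_f0.size == 0:
        return default_fmin, default_fmax
    # Wide percentiles plus a margin so the per-speaker range isn't too tight.
    fmin = max(default_fmin, 0.8 * np.percentile(voiced_f0, 5))
    fmax = min(default_fmax, 1.2 * np.percentile(voiced_f0, 95))
    return float(fmin), float(fmax)

# fmin, fmax = speaker_f0_range("speaker_0001.wav")  # hypothetical file
```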
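The language-dependent segment combination idea ([a] + [j] becoming [aj] in English but not Japanese) could look something like the following; the rule table and segment representation are invented for illustration.

```python
# Sketch: merge adjacent segments produced by a language-independent boundary
# model using per-language combination rules (all names here are hypothetical).
MERGE_RULES = {
    "english": {("a", "j"): "aj", ("a", "w"): "aw"},
    "japanese": {},  # no diphthong merging
}

def merge_segments(segments, language):
    """segments: list of (phone, start, end) tuples in order."""
    rules = MERGE_RULES.get(language, {})
    merged = []
    for phone, start, end in segments:
        if merged and (merged[-1][0], phone) in rules:
            prev_phone, prev_start, _ = merged.pop()
            merged.append((rules[(prev_phone, phone)], prev_start, end))
        else:
            merged.append((phone, start, end))
    return merged

print(merge_segments([("a", 0.10, 0.18), ("j", 0.18, 0.25)], "english"))
# [('aj', 0.1, 0.25)]
print(merge_segments([("a", 0.10, 0.18), ("j", 0.18, 0.25)], "japanese"))
# [('a', 0.1, 0.18), ('j', 0.18, 0.25)]
```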
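The compositional phone embedding idea could likewise be sketched as summing learned embeddings for phonological features, so that [pʲ] shares parameters with plain [p] and with other palatalized phones; the feature inventory and module below are assumptions, not MFA code.

```python
# Sketch: build phone embeddings compositionally from phonological features,
# so [pʲ] = voiceless + bilabial + stop + palatalized (feature set is made up).
import torch
import torch.nn as nn

FEATURES = ["voiceless", "voiced", "bilabial", "alveolar", "stop", "fricative", "palatalized"]
FEATURE_INDEX = {f: i for i, f in enumerate(FEATURES)}

PHONE_FEATURES = {
    "p":  ["voiceless", "bilabial", "stop"],
    "pʲ": ["voiceless", "bilabial", "stop", "palatalized"],
    "b":  ["voiced", "bilabial", "stop"],
}

class CompositionalPhoneEmbedding(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.feature_embeddings = nn.Embedding(len(FEATURES), dim)

    def forward(self, phone):
        idx = torch.tensor([FEATURE_INDEX[f] for f in PHONE_FEATURES[phone]])
        # Sum of feature embeddings; a phone unseen in training but with known
        # features still gets a sensible vector.
        return self.feature_embeddings(idx).sum(dim=0)

emb = CompositionalPhoneEmbedding()
print(emb("pʲ").shape)  # torch.Size([64])
```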
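Finally, a packaged dictionary format like the one described above might look roughly like this; the archive layout, file names, and metadata fields are purely hypothetical placeholders rather than a committed design.

```python
# Sketch of a hypothetical compressed dictionary model: a zip archive bundling
# the lexicon with its metadata, phonological rules, and phone groups.
import json
import zipfile

def save_dictionary_model(path, lexicon_path, rules_path, phone_groups_path):
    meta = {
        "version": "3.1",  # hypothetical format version
        "phone_set": "mfa",
        "contents": ["lexicon.txt", "rules.yaml", "phone_groups.yaml"],
    }
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("meta.json", json.dumps(meta, indent=2))
        archive.write(lexicon_path, "lexicon.txt")
        archive.write(rules_path, "rules.yaml")
        archive.write(phone_groups_path, "phone_groups.yaml")

# save_dictionary_model("english_us_mfa.dict.zip", "lexicon.txt", "rules.yaml", "phone_groups.yaml")
```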