(news=) # News ## Roadmap ```{warning} Please bear in mind that all plans below are tentative and subject to change. ``` ### Version 2.1 * Generalize phone group specification for user specification * Phone groups are allophonic variation, common neutralization etc. * English {ipa_inline}`[ɾ]` is grouped with {ipa_inline}`[t]` and {ipa_inline}`[d]`, but Spanish {ipa_inline}`[ɾ]` is grouped with {ipa_inline}`[r]`, {ipa_inline}`[ð]` is grouped with {ipa_inline}`[d]` * Update to use PyKaldi for interfacing with Kaldi components rather than relying on piped CLI commands to Kaldi binaries * This change should also allow for more nnet3 functionality to be available (i.e., for segmentation and speaker diarization below). The {code}`nnet3` scripts rely on python code in the Kaldi {code}`egs/wsj/s5/steps` folder that is not currently exported as part of the Kaldi feedstock on conda forge. * Update segmentation functionality * At the moment, only a simple speech activity detection (SAD) algorithm is implemented that uses amplitude of the signal and thresholds for speech vs non-speech * For 2.1, I plan to implement new SAD training capability as well as release a pretrained SAD model trained across all the current training data for every language with a pretrained acoustic model * Update speaker diarization functionality * Support x-vector models as well as ivector models * Properly implement and train PLDA models for diarization * Update dictionary model format to move away from the current plain-text lexicons to a more robust compressed format * With extra meta data and capabilities in the form of phonological rules and phone groupings, it makes more sense to package those with the lexicon rather than the acoustic model * Another option would be to package up the lexicon (and maybe G2P models) with the acoustic model into a complete MFA model * As part of any update, I would expand the {ref}`MFA model CLI ` with functionality for adding new pronunciations to internal lexicons * Something like {code}`mfa model update /path/to/g2pped_file.txt` Not tied to 2.1, but in the near-ish term I would like to: * Retrain existing acoustic models with new phone groups and rules features * Begin work on expanding to new languages * Japanese (in progress) * Arabic * Tamil * Localize documentation * I'll initially do a pass at localizing the documentation to Japanese and see if I can crowd source other languages (and fixing my initial Japanese pass) * Finally release Anchor compatible with the latest versions of MFA * Update pitch feature calculation to use speaker-adjusted min and max f0 ranges ### Future * Moving away from Kaldi-based dependencies * Kaldi is not being actively developed and I don't have much of a desire to depend on it long term * Most actively developed ASR toolkits and libraries are based around neural networks * I'm not the biggest fan of using these for alignment, as most of the research is geared towards improving end-to-end signal to orthographic text models that don't have intermediate representations of phones * That said, if alignment were the task that was being optimized for rather than some "word error rate" style metric, then alignment performance could improve significantly * One particular direction would be towards sample-based or waveform-based alignment rather than frame-based * Frame-based methods are time-smeared, so providing an exact time for voicing onset or stop closure is murky * Phoneticians use spectrograms for gross boundaries, but more accurate manual alignments are determined based on the waveform * Perhaps combining a model that performs language-independent boundary insertion combined with per-language models to combine resulting segments might perform better ({ipa_inline}`[a]` + {ipa_inline}`[j]` becomes {ipa_inline}`[aj]` in English, but not in other languages like Japanese, Spanish, or Portuguese, etc) * Additionally, neural networks might allow for better modeling of phone symbols, so embedding {ipa_inline}`[pʲ]` could result in a more compositional "voiceless bilabial stop plus palatalization" * Other options for toolkits to support MFA are * [SpeechBrain](https://speechbrain.github.io/) * Custom PyTorch code * Custom tensorflow code ```{toctree} :hidden: :maxdepth: 1 changelog_2.2.rst news_2.1.rst changelog_2.1.rst news_2.0.rst changelog_2.0.rst changelog_2.0_pre_release.rst news_1.1.rst changelog_1.0.rst ```