Example: Adapting a model to a new language#

Set up#

Important

Ensure you have installed MFA via Installation. To compare alignments against reference alignments produced with native-language models, ensure you have completed the initial alignment for the demo corpus in Example: Aligning a demo corpus.

You can see a more fully worked example of this with scripts for analyzing German, Czech, and Mandarin applied to an English corpus in the mfa-adaptation GitHub repository.

English

For English, we will align the demo English corpus with the Mandarin pretrained acoustic model, and remap the English dictionary into the phone set that the Mandarin acoustic model uses.

Ensure you have downloaded the pretrained Mandarin model via mfa model download acoustic mandarin_mfa
Ensure you have downloaded the pretrained US English dictionary via mfa model download dictionary english_us_mfa
Download the English LibriSpeech demo corpus and extract it to somewhere on your computer

Japanese

For Japanese, we will align the demo Japanese corpus with the English pretrained acoustic model, and remap the Japanese dictionary into the phone set that the English acoustic model uses.

Ensure you have downloaded the pretrained English model via mfa model download acoustic english_mfa
Ensure you have downloaded the pretrained Japanese dictionary via mfa model download dictionary japanese_mfa
Download the Japanese JVS demo corpus and extract it to somewhere on your computer
Install Japanese-specific dependencies via conda install -c conda-forge spacy sudachipy sudachidict-core

Mandarin

For Mandarin, we will align the demo Mandarin corpus with the English pretrained acoustic model, and remap the Mandarin dictionary into the phone set that the English acoustic model uses.

Ensure you have downloaded the pretrained English model via mfa model download acoustic english_mfa
Ensure you have downloaded the pretrained China Mandarin dictionary via mfa model download dictionary mandarin_china_mfa
Download the Mandarin THCHS-30 demo corpus and extract it to somewhere on your computer
Install Mandarin-specific dependencies via pip install spacy-pkuseg dragonmapper hanziconv

Important

This example assumes you have a directory named mfa_data in your home directory in which the demo corpus was extracted.

Remapping the dictionary#

English

First, download and save the contents of english_to_mandarin_phone_mapping.yaml to ~/mfa_data/english_to_mandarin_phone_mapping.yaml. This is a file that maps phones in the English MFA phone set to phones in the Mandarin MFA phone set, which we can use to create a new dictionary of English words with Mandarin MFA pronunciations.

mfa remap dictionary english_us_mfa mandarin_mfa ~/mfa_data/english_to_mandarin_phone_mapping.yaml ~/mfa_data/english_mandarin.dict

If you open up ~/mfa_data/english_mandarin.dict in a text editor, you’ll now see pronunciations for English forms using Mandarin MFA phones. For example, any ʒ phones now have ʐ instead, as that’s the closest phone in the Mandarin MFA phone set.

Japanese

First, download and save the contents of japanese_to_english_phone_mapping.yaml to ~/mfa_data/japanese_to_english_phone_mapping.yaml. This is a file that maps phones in the Japanese MFA phone set to phones in the English MFA phone set, which we can use to create a new dictionary of Japanese words with English MFA pronunciations.

mfa remap dictionary japanese_mfa english_mfa ~/mfa_data/japanese_to_english_phone_mapping.yaml ~/mfa_data/japanese_english.dict

If you open up ~/mfa_data/japanese_english.dict in a text editor, you’ll now see pronunciations for Japanese forms using English MFA phones. For example, any tɕ phones now have tʃ instead, as that’s the closest phone in the English MFA phone set.

Mandarin

First, download and save the contents of mandarin_to_english_phone_mapping.yaml to ~/mfa_data/mandarin_to_english_phone_mapping.yaml. This is a file that maps phones in the Mandarin MFA phone set to phones in the English MFA phone set, which we can use to create a new dictionary of Mandarin words with English MFA pronunciations.

mfa remap dictionary mandarin_china_mfa english_mfa ~/mfa_data/mandarin_to_english_phone_mapping.yaml ~/mfa_data/mandarin_english.dict

If you open up ~/mfa_data/mandarin_english.dict in a text editor, you’ll now see pronunciations for Mandarin forms using English MFA phones. For example, any tɕ phones now have tʃ instead, as that’s the closest phone in the English MFA phone set.
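To make the idea of remapping concrete, here is a minimal Python sketch of replacing phones in one pronunciation. It is not MFA's implementation and does not read the real mapping YAML. The `PHONE_MAPPING` dict, the `remap_pronunciation` helper, and the sample pronunciation are all hypothetical. Only the tɕ to tʃ substitution comes from the text above, and keeping unmapped phones unchanged is an assumption made for the illustration.

```python
# Illustrative only: a hand-written stand-in for a phone mapping, not the
# contents of the real mapping file and not how MFA performs remapping.
PHONE_MAPPING = {
    "tɕ": "tʃ",  # substitution mentioned in the text above
}


def remap_pronunciation(phones, mapping):
    """Replace each phone that has a mapping entry; keep other phones as-is.

    Keeping unmapped phones unchanged is an assumption of this sketch.
    """
    return [mapping.get(phone, phone) for phone in phones]


# Hypothetical pronunciation written as space-separated phones.
source_pronunciation = "tɕ i"
remapped = " ".join(
    remap_pronunciation(source_pronunciation.split(), PHONE_MAPPING)
)
print(remapped)
```

Comparing a few entries of the remapped dictionary against a sketch like this is a quick way to confirm the mapping file did what you intended.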

Alignment#

Aligning using pre-trained models#

English

mfa align ~/mfa_data/librispeech-demo-1.0.0 ~/mfa_data/english_mandarin.dict mandarin_mfa ~/mfa_data/aligned_librispeech_demo --clean

Japanese

First, download and save the contents of english_to_japanese_phone_mapping.yaml to ~/mfa_data/english_to_japanese_phone_mapping.yaml. This file is similar to the previously downloaded ~/mfa_data/japanese_to_english_phone_mapping.yaml, except that it maps phones in the opposite direction. For every Japanese phone, this mapping specifies which phones count as a “matching phone”, allowing the overlap scoring algorithm to penalize alignment issues more accurately.

If you have not aligned the Japanese demo corpus as the first step in Example: Aligning a demo corpus, you will have to omit the --reference_directory and --custom_mapping_path options from the following command.

mfa align ~/mfa_data/japanese-jvs-demo-1.0.0 ~/mfa_data/japanese_english.dict english_mfa ~/mfa_data/english_adapted/english_japanese_remapped_aligned --clean --reference_directory ~/mfa_data/aligned_jvs_demo --custom_mapping_path ~/mfa_data/english_to_japanese_phone_mapping.yaml --language japanese

Note

The --language japanese flag must be included to ensure that the Japanese text is properly tokenized by the Japanese morphological parser. When aligning using the Japanese MFA model, the language is set to Japanese by default, but we must override it here when using the English MFA model.

The end output will give:

INFO     Evaluating alignments...
INFO     Exporting evaluation...
INFO     Average overlap score: 0.010834011956534382
INFO     Average phone error rate: 0.02820097244732577

This output reports a mean phone boundary error of 10.8 ms (Average overlap score) and an average phone error rate (PER) of 2.8% (the percentage of insertions, deletions, and substitutions).
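The conversion from the logged values to the figures quoted here is simple arithmetic, sketched below in Python. It assumes, as the text implies, that the overlap score is logged in seconds and the phone error rate as a fraction. The helper names are hypothetical and are not part of MFA.

```python
def to_milliseconds(seconds):
    """Convert a duration in seconds to milliseconds."""
    return seconds * 1000.0


def to_percent(fraction):
    """Convert a fraction in [0, 1] to a percentage."""
    return fraction * 100.0


# Values copied from the log output above.
average_overlap_score = 0.010834011956534382
average_phone_error_rate = 0.02820097244732577

print(f"Mean boundary error: {to_milliseconds(average_overlap_score):.1f} ms")
print(f"Phone error rate: {to_percent(average_phone_error_rate):.1f}%")
```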

Mandarin

First, download and save the contents of english_to_mandarin_phone_mapping.yaml to ~/mfa_data/english_to_mandarin_phone_mapping.yaml. This file is similar to the previously downloaded ~/mfa_data/mandarin_to_english_phone_mapping.yaml, except that it maps phones in the opposite direction. For every Mandarin phone, this mapping specifies which phones count as a “matching phone”, allowing the overlap scoring algorithm to penalize alignment issues more accurately.

If you have not aligned the Mandarin demo corpus as the first step in Example: Aligning a demo corpus, you will have to omit the --reference_directory and --custom_mapping_path options from the following command.

mfa align ~/mfa_data/mandarin-thchs-30-demo-1.0.0 ~/mfa_data/mandarin_english.dict english_mfa ~/mfa_data/english_adapted/english_mandarin_remapped_aligned --clean --reference_directory ~/mfa_data/aligned_thchs_30_demo --custom_mapping_path ~/mfa_data/english_to_mandarin_phone_mapping.yaml --language chinese

Once the files are aligned, we can take a look at the alignment_analysis.csv file in the output directory to see if there are any glaring issues in alignment. This file is sorted initially by the phone_duration_deviation column, which is the maximum z-scored duration for phones in the utterance. High values indicate phones that are much longer or shorter than we would expect for that phone, e.g., a [ɾ] lasting 100 ms is very unlikely given that its usual duration is around 10-20 ms.
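As a rough illustration of what a maximum z-scored duration means, the Python sketch below scores each phone against a per-phone mean and standard deviation and keeps the largest absolute value. The statistics, the sample utterance, and the helper names are invented for the example. Taking the absolute value is an assumption of this sketch, and MFA's actual computation may differ.

```python
def z_score(duration, mean, std):
    """Number of standard deviations a duration lies from the phone's mean."""
    return (duration - mean) / std


def max_duration_deviation(phones, stats):
    """Largest absolute z-scored duration among the phones of one utterance.

    `phones` is a list of (phone, duration_in_seconds) pairs and `stats`
    maps each phone to a (mean, std) pair, both in seconds.
    """
    return max(abs(z_score(duration, *stats[phone])) for phone, duration in phones)


# Made-up duration statistics in seconds, for illustration only.
hypothetical_stats = {
    "ɾ": (0.015, 0.005),
    "a": (0.080, 0.020),
}

# A made-up utterance in which the tap is implausibly long (100 ms).
utterance = [("a", 0.090), ("ɾ", 0.100)]

print(max_duration_deviation(utterance, hypothetical_stats))
```

With these invented numbers the tap dominates the score, which is the kind of utterance that would sort to the top of the file.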

Additionally, there are two files from the alignment evaluation that is triggered by specifying --reference_directory. As we’re comparing alignments to reference alignments, we can look at a confusion matrix in alignment_reference_confusions.csv and find utterances with high errors by looking at alignment_reference_evaluation.csv and sorting on the alignment_score column.
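Sorting the evaluation file can also be scripted. The following Python sketch ranks rows by the alignment_score column using only the standard library. It runs on a small in-memory sample so that it is self-contained. The `utterance` column name, the sample rows, and the `worst_utterances` helper are hypothetical, and it assumes that larger scores indicate larger errors.

```python
import csv
import io


def worst_utterances(csv_file, score_column="alignment_score", limit=5):
    """Return up to `limit` rows with the largest values in `score_column`."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return rows[:limit]


# Made-up sample data; the real file has its own columns and values.
sample = io.StringIO(
    "utterance,alignment_score\n"
    "utt_a,0.008\n"
    "utt_b,0.031\n"
    "utt_c,0.012\n"
)

for row in worst_utterances(sample, limit=2):
    print(row["utterance"], row["alignment_score"])
```

To use it on real output, pass an open file handle for alignment_reference_evaluation.csv instead of the sample, and adjust the printed column to one that exists in your file.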

Adapting the acoustic model#

In general, adapting a pretrained acoustic model to your specific data will improve alignments, and this is particularly true when using a pretrained model that was trained on a different language than the one you’re aligning.

We can adapt our pretrained model via the mfa adapt command:

English

Warning

Under construction

Japanese

mfa adapt ~/mfa_data/japanese-jvs-demo-1.0.0 ~/mfa_data/japanese_english.dict english_mfa ~/mfa_data/english_adapted/english_adapted.zip --clean --language japanese

We can now use the adapted model to align the japanese-jvs-demo corpus. Note the change from english_mfa to ~/mfa_data/english_adapted/english_adapted.zip below.

mfa align ~/mfa_data/japanese-jvs-demo-1.0.0 ~/mfa_data/japanese_english.dict ~/mfa_data/english_adapted/english_adapted.zip ~/mfa_data/english_adapted/english_remapped_aligned_adapted --clean --reference_directory ~/mfa_data/aligned_jvs_demo --custom_mapping_path ~/mfa_data/english_to_japanese_phone_mapping.yaml --language japanese

The end output will give:

INFO     Evaluating alignments...
INFO     Exporting evaluation...
INFO     Average overlap score: 0.010524732208295882
INFO     Average phone error rate: 0.026904376012965966

This output reports a mean phone boundary error of 10.5 ms, improving on the previous 10.8 ms error from aligning with the unadapted model, and an average PER of 2.7%, improving from 2.8%. So adaptation gives modest gains in making the alignments generated from English MFA more similar to those generated by the Japanese MFA model. The benefit of adaptation depends on the size of the dataset, and the demo corpus here is quite small, so only a small improvement is expected.
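To put the two runs side by side, the Python sketch below computes the relative change in each metric from the logged values. The `relative_improvement` helper is hypothetical and only illustrates the arithmetic. It does not reflect how MFA reports results.

```python
def relative_improvement(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# Values copied from the two log outputs above.
overlap_before = 0.010834011956534382
overlap_after = 0.010524732208295882
per_before = 0.02820097244732577
per_after = 0.026904376012965966

overlap_gain = relative_improvement(overlap_before, overlap_after)
per_gain = relative_improvement(per_before, per_after)

print(f"Boundary error reduced by {overlap_gain * 100:.1f}%")
print(f"Phone error rate reduced by {per_gain * 100:.1f}%")
```

Both reductions come out at a few percent, consistent with the modest gains described above for a small demo corpus.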

Mandarin

Warning

Under construction