(alignment_example)= # Example: Aligning a demo corpus ```{note} See also our [Google Colab notebook](https://colab.research.google.com/drive/1kqaSSyx-DEVAxrSmoWhJTNXtEsVI15yf?usp=sharing) for running this example without installing or downloading anything locally. There is also [NTT123's Jupyter notebook](https://gist.github.com/NTT123/12264d15afad861cb897f7a20a01762e) for running the alignment example with a custom LibriSpeech dataset, created by [NTT123](https://github.com/NTT123). ``` ## Set up ```{important} Ensure you have installed MFA via {ref}`installation`. ``` ::::{tab-set} :::{tab-item} English :sync: english 1. Ensure you have downloaded the pretrained model via {code}`mfa model download acoustic english_mfa` 2. Ensure you have downloaded the pretrained US English dictionary via {code}`mfa model download dictionary english_us_mfa` 3. Download the [English LibriSpeech demo corpus](https://github.com/MontrealCorpusTools/librispeech-demo/archive/refs/tags/v1.0.0.tar.gz) and extract it to somewhere on your computer ::: :::{tab-item} Japanese :sync: japanese 1. Ensure you have downloaded the pretrained model via {code}`mfa model download acoustic japanese_mfa` 2. Ensure you have downloaded the pretrained Japanese dictionary via {code}`mfa model download dictionary japanese_mfa` 3. Download the [Japanese JVS demo corpus](https://github.com/MontrealCorpusTools/japanaese-jvs-demo/archive/refs/tags/v1.0.0.tar.gz) and extract it to somewhere on your computer 4. Install Japanese-specific dependencies via {code}`conda install -c conda-forge spacy sudachipy sudachidict-core` ::: :::{tab-item} Mandarin :sync: mandarin 1. Ensure you have downloaded the pretrained model via {code}`mfa model download acoustic mandarin_mfa` 2. Ensure you have downloaded the pretrained China Mandarin dictionary via {code}`mfa model download dictionary mandarin_china_mfa` 3. Download the [Mandarin THCHS-30 demo corpus](https://github.com/MontrealCorpusTools/mandarin-thchs-30-demo/archive/refs/tags/v1.0.0.tar.gz) and extract it to somewhere on your computer 4. Install Mandarin-specific dependencies via {code}`pip install spacy-pkuseg dragonmapper hanziconv` ::: :::: ```{important} This example assumes you have a directory named ``mfa_data`` in your home directory in which the demo corpus was extracted. ``` ## Alignment ### Aligning using pre-trained models In the same environment that you've installed MFA, enter the following command into the terminal: :::::{tab-set} ::::{tab-item} English :sync: english :::{code-block} bash mfa align ~/mfa_data/librispeech-demo-1.0.0 english_us_mfa english_mfa ~/mfa_data/aligned_librispeech_demo --clean ::: :::: ::::{tab-item} Japanese :sync: japanese :::{code-block} bash mfa align ~/mfa_data/japanese-jvs-demo-1.0.0 japanese_mfa japanese_mfa ~/mfa_data/aligned_jvs_demo --clean ::: :::: ::::{tab-item} Mandarin :sync: mandarin :::{code-block} bash mfa align ~/mfa_data/mandarin-thchs-30-demo-1.0.0 mandarin_china_mfa mandarin_mfa ~/mfa_data/aligned_thchs_30_demo --clean ::: :::: ::::: ### Adding words to the dictionary First we'll need the pretrained G2P model. These are installed via the {code}`mfa model download` command: :::::{tab-set} ::::{tab-item} English :sync: english :::{code-block} bash mfa model download g2p english_us_mfa ::: You should be able to run {code}`mfa model inspect g2p english_us_mfa` and it will output information about the {code}`english_us_mfa` G2P model. :::: ::::{tab-item} Japanese :sync: japanese :::{code-block} bash mfa model download g2p japanese_mfa ::: You should be able to run {code}`mfa model inspect g2p japanese_mfa` and it will output information about the {code}`japanese_mfa` G2P model. :::: ::::{tab-item} Mandarin :sync: mandarin :::{code-block} bash mfa model download g2p mandarin_china_mfa ::: You should be able to run {code}`mfa model inspect g2p mandarin_china_mfa` and it will output information about the {code}`mandarin_china_mfa` G2P model. :::: ::::: Depending on your use case, you might have a list of words to run G2P over, or just a corpus of sound and transcription files. The {code}`mfa g2p` command can process either: :::::{tab-set} ::::{tab-item} English :sync: english :::{code-block} bash mfa g2p ~/mfa_data/librispeech-demo-1.0.0 english_us_mfa ~/mfa_data/g2pped_oovs.txt --dictionary_path english_us_mfa --clean ::: :::: ::::{tab-item} Japanese :sync: japanese For Japanese, G2P functionality is done as part of alignment by specifying {code}`--g2p_model_path`. :::: ::::{tab-item} Mandarin :sync: mandarin :::{code-block} bash mfa g2p ~/mfa_data/mandarin-thchs-30-demo-1.0.0 mandarin_china_mfa ~/mfa_data/g2pped_oovs.txt --dictionary_path mandarin_china_mfa --clean ::: :::: ::::: Running the above will output a text file in the format that MFA uses ({ref}`dictionary_format`) with all the OOV words (ignoring bracketed words like {ipa_inline}``). I recommend looking over the pronunciations generated and make sure that they look sensible. For languages where the orthography is not transparent, it may be helpful to include {code}`--num_pronunciations 3` so that more pronunciations are generated than just the most likely one. For more details on running G2P, see {ref}`g2p_dictionary_generating`. Once you have looked over the dictionary, you can save the new pronunciations via: :::::{tab-set} ::::{tab-item} English :sync: english :::{code-block} bash mfa model add_words english_us_mfa ~/mfa_data/g2pped_oovs.txt ::: :::: ::::{tab-item} Japanese :sync: japanese For Japanese, G2P functionality is done as part of alignment by specifying {code}`--g2p_model_path`. :::: ::::{tab-item} Mandarin :sync: mandarin :::{code-block} bash mfa model add_words mandarin_china_mfa ~/mfa_data/g2pped_oovs.txt ::: :::: ::::: The new pronunciations will be available when you use the dictionary identifier in an MFA command, i.e. the modified command from {ref}`first_steps_align_pretrained`: :::::{tab-set} ::::{tab-item} English :sync: english :::{code-block} bash mfa align ~/mfa_data/librispeech-demo-1.0.0 english_us_mfa english_mfa ~/mfa_data/aligned_librispeech_demo_no_oovs --clean ::: :::: ::::{tab-item} Japanese :sync: japanese :::{code-block} bash mfa align ~/mfa_data/japanese-jvs-demo-1.0.0 japanese_mfa japanese_mfa ~/mfa_data/aligned_jva_demo_no_oovs --g2p_model_path japanese_mfa --clean ::: :::: ::::{tab-item} Mandarin :sync: mandarin :::{code-block} bash mfa align ~/mfa_data/mandarin-thchs-30-demo-1.0.0 mandarin_china_mfa mandarin_mfa ~/mfa_data/aligned_mandarin_demo_no_oovs --clean ::: :::: ::::: ```{seealso} * {ref}`first_steps_g2p_oovs` ``` ### Adapting the acoustic model In general, adapting a pretrained acoustic model to your specific data will improve alignments. We can adapt our pretrained model via the {code}`mfa adapt` command: :::::{tab-set} ::::{tab-item} English :sync: english :::{code-block} bash mfa adapt ~/mfa_data/librispeech-demo-1.0.0 english_us_mfa english_mfa ~/mfa_data/english_mfa_adapted.zip --clean ::: We can now use the adapted model to align the librispeech-demo corpus. Note the change from ``english_mfa`` to ``~/mfa_data/english_mfa_adapted.zip`` below. :::{code-block} bash mfa align ~/mfa_data/librispeech-demo-1.0.0 english_us_mfa ~/mfa_data/english_mfa_adapted.zip ~/mfa_data/aligned_librispeech_demo_adapted --clean ::: :::: ::::{tab-item} Japanese :sync: japanese :::{code-block} bash mfa adapt ~/mfa_data/japanese-jvs-demo-1.0.0 japanese_mfa japanese_mfa ~/mfa_data/japanese_mfa_adapted.zip --g2p_model_path japanese_mfa --clean ::: We can now use the adapted model to align the japanese-jvs-demo corpus. Note the change from ``japanese_mfa`` to ``~/mfa_data/japanese_mfa_adapted.zip`` below. :::{code-block} bash mfa align ~/mfa_data/japanese-jvs-demo-1.0.0 japanese_mfa ~/mfa_data/japanese_mfa_adapted.zip ~/mfa_data/aligned_jvs_demo_adapted --g2p_model_path japanese_mfa --clean ::: :::: ::::{tab-item} Mandarin :sync: mandarin :::{code-block} bash mfa adapt ~/mfa_data/mandarin-thchs-30-demo-1.0.0 mandarin_china_mfa mandarin_mfa ~/mfa_data/mandarin_mfa_adapted.zip --clean ::: We can now use the adapted model to align the mandarin-thchs-30-demo corpus. Note the change from ``mandarin_mfa`` to ``~/mfa_data/mandarin_mfa_adapted.zip`` below. :::{code-block} bash mfa align ~/mfa_data/mandarin-thchs-30-demo-1.0.0 mandarin_china_mfa ~/mfa_data/mandarin_mfa_adapted.zip ~/mfa_data/aligned_thchs_30_demo_adapted --clean ::: :::: ::::: ```{seealso} * {ref}`first_steps_adapt_pretrained` ```