Troubleshooting performance#

There are a number of optimizations that you can do to your corpus to speed up MFA or make it more accurate.

Speed optimizations#

Convert to basic wav files#

In general, 16kHz, 16-bit wav files are lingua franca of audio, though they are uncompressed and can take up a lot of space. However, if space is less an issue than processing time, converting all your files to wav format before running MFA will result in faster processing times (both load and feature generation).

Script example#

Warning

This script modifies audio files in place and deletes the original file. Please back up your data before running it if you only have one copy.

import os
import subprocess
import sys

corpus_directory = '/path/to/corpus'

file_extensions = ['.flac', '.mp3', '.wav', '.aiff']

def wavify_sound_files():
    for speaker in os.listdir(corpus_directory):
        speaker_dir = os.path.join(corpus_directory, speaker)
        if not os.path.isdir(speaker_dir):
            continue
        for file in os.listdir(speaker_dir):
            for ext in file_extensions:
               if file.endswith(ext):
                   path = os.path.join(speaker_dir, file)
                   if ext == '.wav':
                      resampled_file = path.replace(ext, f'_fixed{ext}')
                   else:
                      resampled_file = path.replace(ext, f'.wav')
                   if sys.platform == 'win32' or ext in {'.opus', '.ogg'}:
                       command = ['ffmpeg', '-nostdin', '-hide_banner', '-loglevel', 'error', '-nostats', '-i', path '-acodec' 'pcm_s16le' '-f' 'wav', '-ar', '16000', resampled_file]
                   else:
                       command = ['sox', path, '-t', 'wav' '-r', '16000', '-b', '16', resampled_file]
                   subprocess.check_call(command)
                   os.remove(path)
                   os.rename(resampled_file, path)

if __name__ == '__main__':
    wavify_sound_files()

Note

This script assumes that the corpus is already adheres to MFA’s supported Corpus formats and structure (with speaker directories of their files under the corpus root), and that you are in the conda environment for MFA.

Downsample to 16kHz#

Both Kaldi and SpeechBrain operate on 16kHz as the primary sampling rate. If your files have a sampling rate greater than 16kHz, then every time they are processed (either as part of MFCC generation in Kaldi, or in running SpeechBrain’s VAD/Speaker classification models), there will be extra computation as they are downsampled to 16kHz.

Note

As always, I recommend having an immutable copy of the original corpus that is backed up and archived separate from the copy that is being processed.

Script example#

Warning

This script modifies the sample rate in place and deletes the original file. Please back up your data before running it if you only have one copy.

import os
import subprocess

corpus_directory = '/path/to/corpus'

file_extensions = ['.wav', '.flac']

def fix_sample_rate():

    for speaker in os.listdir(corpus_directory):
        speaker_dir = os.path.join(corpus_directory, speaker)
        if not os.path.isdir(speaker_dir):
            continue
        for file in os.listdir(speaker_dir):
            for ext in file_extensions:
               if file.endswith(ext):
                   path = os.path.join(speaker_dir, file)
                   resampled_file = path.replace(ext, f'_resampled{ext}')
                   subprocess.check_call(['sox', path, '-r', '16000', resampled_file])
                   os.remove(path)
                   os.rename(resampled_file, path)

if __name__ == '__main__':
    fix_sample_rate()

Note

This script assumes that the corpus is already adheres to MFA’s supported Corpus formats and structure (with speaker directories of their files under the corpus root), and that you are in the conda environment for MFA.

Change bit depth of wav files to 16bit#

Kaldi does not support .wav files that are not 16 bit, so any files that are 24 or 32 bit will be processed by sox. Changing the bit depth of processed wav files ahead of time will save this computation when MFA processes the corpus.

Script example#

Warning

This script modifies the bit depth in place and deletes the original file. Please back up your data before running it if you only have one copy.

import os
import subprocess

corpus_directory = '/path/to/corpus'


def fix_bit_depth():

    for speaker in os.listdir(corpus_directory):
        speaker_dir = os.path.join(corpus_directory, speaker)
        if not os.path.isdir(speaker_dir):
            continue
        for file in os.listdir(speaker_dir):
            if file.endswith('.wav'):
                path = os.path.join(speaker_dir, file)
                resampled_file = path.replace(ext, f'_resampled{ext}')
                subprocess.check_call(['sox', path, '-b', '16', resampled_file])
                os.remove(path)
                os.rename(resampled_file, path)

if __name__ == '__main__':
    fix_bit_depth()

Note

This script assumes that the corpus is already adheres to MFA’s supported Corpus formats and structure, and that you are in the conda environment for MFA.