CorpusMixin

class montreal_forced_aligner.corpus.base.CorpusMixin(corpus_directory, speaker_characters=0, ignore_speakers=False, oov_count_threshold=0, language=Language.unknown, **kwargs)

Bases: MfaWorker, DatabaseMixin

Mixin class for processing corpora

Notes

Using characters in file names to specify speakers is generally finicky and leads to errors, so I would not recommend it. Consider it deprecated; it could be removed in future versions.
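
To illustrate why prefix-based speaker parsing is fragile, here is a minimal sketch of the idea behind speaker_characters. The helper speaker_from_file_name is hypothetical and not part of the library; the directory-name fallback is likewise an assumption made for this example.

```python
from pathlib import Path


def speaker_from_file_name(path: str, speaker_characters: int = 0) -> str:
    """Hypothetical helper: derive a speaker name for one corpus file."""
    p = Path(path)
    if speaker_characters > 0:
        # Take the first N characters of the file name (extension removed).
        return p.stem[:speaker_characters]
    # Assumed fallback for this sketch: the name of the containing directory.
    return p.parent.name


print(speaker_from_file_name("corpus/session1/spk01_utt003.wav", 5))  # spk01
print(speaker_from_file_name("corpus/spk01/utt003.wav"))              # spk01
# A shorter speaker prefix silently yields a wrong speaker name:
print(speaker_from_file_name("corpus/session1/s1_utt003.wav", 5))     # s1_ut
```

The last call shows the failure mode: any file whose speaker prefix is not exactly N characters long is assigned to a speaker that does not exist.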

Parameters:
  • corpus_directory (str) – Path to corpus

  • speaker_characters (int or str, optional) – Number of characters in the file name to specify the speaker

  • ignore_speakers (bool) – Flag for whether to discard any parsed speaker information during top-level worker’s processing

  • oov_count_threshold (int) – Words in the corpus with counts less than or equal to the threshold will be treated as OOV items, defaults to 0
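
The effect of oov_count_threshold can be sketched with a plain word count. This is only an illustration of the "less than or equal to" rule described above; split_by_oov_threshold is a hypothetical function, and whitespace tokenization is an assumption of the example.

```python
from collections import Counter


def split_by_oov_threshold(texts, oov_count_threshold=0):
    """Hypothetical helper: return (kept_words, oov_words) for a list of transcripts."""
    # Count every whitespace-separated token across all transcripts.
    counts = Counter(word for text in texts for word in text.split())
    # Words seen at most `oov_count_threshold` times are treated as OOV.
    oovs = {word for word, count in counts.items() if count <= oov_count_threshold}
    kept = set(counts) - oovs
    return kept, oovs


texts = ["the cat sat", "the dog sat", "the end"]
kept, oovs = split_by_oov_threshold(texts, oov_count_threshold=1)
print(sorted(kept))  # ['sat', 'the']
print(sorted(oovs))  # ['cat', 'dog', 'end']
```

With the default threshold of 0, no word that occurs in the corpus is filtered out, since every observed word has a count of at least 1.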

See also

MfaWorker

For MFA processing parameters

TemporaryDirectoryMixin

For temporary directory parameters

Variables:
  • jobs (list[Job]) – List of jobs for processing the corpus and splitting speakers

  • stopped (Event) – Stop check for loading the corpus

  • decode_error_files (list[str]) – List of text files that could not be decoded as UTF-8

  • textgrid_read_errors (list[str]) – List of TextGrid files that had an error in loading

add_file(file, session=None)

Add a file to the corpus

Parameters:

file (FileData) – File to be added

add_speaker(name, session=None)

Add a speaker to the corpus

Parameters:
  • name (str) – Name of the speaker

  • session (sqlalchemy.orm.Session) – Database session, if not specified, will use a temporary session
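
The "temporary session if none is given" behaviour follows a common pattern, sketched below. This is an analogy only: it uses the standard-library sqlite3 module instead of SQLAlchemy, and the function, table, and column names are assumptions made for the example, not the library's schema.

```python
import os
import sqlite3
import tempfile


def add_speaker(db_path, name, session=None):
    """Hypothetical stand-in: insert a speaker, opening a temporary connection if needed."""
    owns_session = session is None
    conn = sqlite3.connect(db_path) if owns_session else session
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS speaker (id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
        )
        conn.execute("INSERT OR IGNORE INTO speaker (name) VALUES (?)", (name,))
        if owns_session:
            # A temporary connection commits its own work right away.
            conn.commit()
    finally:
        if owns_session:
            conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "corpus.db")
add_speaker(db_path, "spk01")  # no session given: temporary connection

shared = sqlite3.connect(db_path)
add_speaker(db_path, "spk02", session=shared)  # caller-owned session
shared.commit()  # the caller decides when to commit
names = [row[0] for row in shared.execute("SELECT name FROM speaker ORDER BY name")]
shared.close()
print(names)  # ['spk01', 'spk02']
```

Passing a session explicitly lets several additions share one transaction, while omitting it keeps one-off calls short.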

add_utterance(utterance, session=None)

Add an utterance to the corpus

Parameters:

utterance (UtteranceData) – Utterance to add

property base_data_directory

Corpus data directory

property corpus_meta

Corpus metadata

property corpus_word_set

Set of words used in the corpus

create_corpus_split()

Create split directory and output information from Jobs

create_subset(subset)

Create a subset of utterances to use for training

Parameters:

subset (int) – Number of utterances to include in subset

property data_directory

Corpus data directory

property data_source_identifier

Corpus name

delete_utterance(utterance_id, session=None)

Delete an utterance from the corpus

Parameters:

utterance_id (int) – Utterance to delete

property features_log_directory

Feature log directory

files(session=None)

Get all files in the corpus

Parameters:

session (sqlalchemy.orm.Session, optional) – Session to use in querying

Returns:

File query

Return type:

sqlalchemy.orm.Query

generate_import_objects(file)

Generate the import objects for a file to be added to the corpus

Parameters:

file (FileData) – File to be added

get_file(id=None, name=None, session=None)

Get a file from search parameters

Parameters:
  • id (int) – Integer ID to look up

  • name (str) – File name to look up

Returns:

File match

Return type:

File

get_latest_workflow_run(workflow, session)

Get the latest version of a workflow type

Parameters:
  • workflow – Workflow type to look up

  • session – Database session to use in querying

Returns:

Latest run of workflow type

Return type:

CorpusWorkflow or None

get_utterances(id=None, file=None, speaker=None, begin=None, end=None, session=None)

Get an utterance from search parameters

Parameters:
  • id (int) – Integer ID to look up

  • file (str or int) – File name or ID to look up

  • speaker (str or int) – Speaker name or ID to look up

  • begin (float) – Begin timestamp to look up

  • end (float) – Ending timestamp to look up

Returns:

Utterance match

Return type:

Utterance
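
A lookup with several optional criteria can be sketched as follows. This is not the library's implementation: UtteranceRecord and find_utterance are hypothetical names, the search runs over an in-memory list rather than the database, and the example assumes that criteria left as None are ignored while all others must match exactly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UtteranceRecord:
    """Hypothetical in-memory stand-in for a stored utterance."""
    id: int
    file: str
    speaker: str
    begin: float
    end: float


def find_utterance(records, id=None, file=None, speaker=None, begin=None, end=None):
    """Return the first record matching every criterion that was supplied, else None."""
    criteria = {"id": id, "file": file, "speaker": speaker, "begin": begin, "end": end}
    # Keep only the criteria the caller actually specified.
    active = {key: value for key, value in criteria.items() if value is not None}
    for record in records:
        if all(getattr(record, key) == value for key, value in active.items()):
            return record
    return None


records = [
    UtteranceRecord(1, "rec_a", "spk01", 0.0, 1.5),
    UtteranceRecord(2, "rec_a", "spk02", 1.5, 3.0),
    UtteranceRecord(3, "rec_b", "spk01", 0.0, 2.0),
]
match: Optional[UtteranceRecord] = find_utterance(records, file="rec_a", speaker="spk02")
print(match.id if match else None)  # 2
```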

initialize_jobs()

Initialize the corpus’s Jobs

inspect_database()

Check if a database file exists and create the necessary metadata

normalize_text()

Normalize the text of the corpus using a dictionary’s sanitization functions and word mappings

property num_files

Number of files in the corpus

property num_speakers

Number of speakers in the corpus

property num_utterances

Number of utterances in the corpus

speakers(session=None)

Get all speakers in the corpus

Parameters:

session (sqlalchemy.orm.Session, optional) – Session to use in querying

Returns:

Speaker query

Return type:

sqlalchemy.orm.Query

property split_directory

Directory used to store information split by job

subset_directory(subset)

Construct a subset directory for the corpus

Parameters:

subset (int, optional) – Number of utterances to include; if it is larger than the total number of utterances or not specified, the split_directory is returned

Returns:

Path to subset directory

Return type:

str
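
The fallback rule for subset can be sketched as a small standalone function. This is a sketch under stated assumptions, not the library's code: the free function takes split_directory and num_utterances as arguments instead of reading them from the corpus object, and the "subset_<n>" directory name and its location next to the split directory are invented for the example.

```python
import os


def subset_directory(split_directory, num_utterances, subset=None):
    """Hypothetical sketch of choosing the directory for a training subset."""
    # No subset requested, or more utterances requested than exist:
    # fall back to the full split directory.
    if subset is None or subset > num_utterances:
        return split_directory
    # Assumed layout: subset directories sit alongside the split directory.
    parent = os.path.dirname(split_directory)
    return os.path.join(parent, f"subset_{subset}")


split_dir = os.path.join("corpus_data", "split4")
print(subset_directory(split_dir, num_utterances=1000))               # full split directory
print(subset_directory(split_dir, num_utterances=1000, subset=5000))  # full split directory
print(subset_directory(split_dir, num_utterances=1000, subset=200))   # assumed subset_200 path
```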

utterances(session=None)

Get all utterances in the corpus

Parameters:

session (sqlalchemy.orm.Session, optional) – Session to use in querying

Returns:

Utterance query

Return type:

sqlalchemy.orm.Query