CorpusMixin

class montreal_forced_aligner.corpus.base.CorpusMixin(corpus_directory, speaker_characters=0, ignore_speakers=False, oov_count_threshold=0, language=Language.unknown, **kwargs)

Bases: MfaWorker, DatabaseMixin

Mixin class for processing corpora

Notes

Using characters in file names to specify speakers is finicky and error-prone, so it is not recommended. Consider this option deprecated; it could be removed in future versions.
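The fragility described in this note can be seen in a minimal sketch. It assumes that an integer speaker_characters means "take the first N characters of the file name as the speaker"; speaker_from_file_name is a hypothetical helper written for illustration and is not part of MFA.

```python
from pathlib import Path


def speaker_from_file_name(path: str, speaker_characters: int) -> str:
    """Hypothetical helper: use the first N characters of the file name as the speaker."""
    # Assumption: only the file stem (no directory, no extension) is inspected.
    return Path(path).stem[:speaker_characters]


# Works when every speaker code has the same width...
print(speaker_from_file_name("corpus/abc_utt01.wav", 3))  # abc
# ...but a shorter speaker code silently picks up part of the rest of the name.
print(speaker_from_file_name("corpus/ab_utt01.wav", 3))  # ab_
```

In the second call the separator ends up in the speaker name, so "ab" and "ab_" would be counted as different speakers.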

Parameters:

corpus_directory (str) – Path to corpus
speaker_characters (int or str, optional) – Number of characters in the file name to specify the speaker
ignore_speakers (bool) – Flag for whether to discard any parsed speaker information during top-level worker’s processing
oov_count_threshold (int) – Words in the corpus with counts less than or equal to the threshold will be treated as OOV items, defaults to 0
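The oov_count_threshold rule ("counts less than or equal to the threshold") can be sketched as follows. This is an illustration of the counting rule only, assuming whitespace tokenization; find_oov_words is a hypothetical name and does not reflect how MFA stores or computes word counts.

```python
from collections import Counter


def find_oov_words(utterance_texts, oov_count_threshold=0):
    """Illustrative only: return words whose corpus count is <= the threshold."""
    counts = Counter(word for text in utterance_texts for word in text.split())
    return {word for word, count in counts.items() if count <= oov_count_threshold}


texts = ["the cat sat", "the dog sat", "the bird flew"]
print(sorted(find_oov_words(texts, oov_count_threshold=1)))
# ['bird', 'cat', 'dog', 'flew']
print(find_oov_words(texts))  # set()
```

With the default threshold of 0, no word that actually occurs in the corpus is flagged by this rule, since every observed word has a count of at least 1.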

See also

MfaWorker: For MFA processing parameters
TemporaryDirectoryMixin: For temporary directory parameters

Variables:

jobs (list[Job]) – List of jobs for processing the corpus and splitting speakers
stopped (Event) – Stop check for loading the corpus
decode_error_files (list[str]) – List of text files that could not be loaded with utf8
textgrid_read_errors (list[str]) – List of TextGrid files that had an error in loading
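As a rough illustration of what ends up in decode_error_files, the sketch below collects text files that fail to decode as UTF-8. find_decode_errors and the .lab file names are hypothetical; the sketch uses only the standard library and does not show MFA's own loading code.

```python
import tempfile
from pathlib import Path


def find_decode_errors(text_paths):
    """Illustrative only: return the paths that cannot be read as UTF-8."""
    decode_error_files = []
    for path in text_paths:
        try:
            Path(path).read_text(encoding="utf8")
        except UnicodeDecodeError:
            decode_error_files.append(str(path))
    return decode_error_files


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.lab"
    bad = Path(tmp) / "bad.lab"
    good.write_text("héllo wörld", encoding="utf8")
    bad.write_bytes("héllo".encode("latin-1"))  # a lone 0xE9 byte is not valid UTF-8
    errors = find_decode_errors([good, bad])
    print([Path(p).name for p in errors])  # ['bad.lab']
```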

add_file(file, session=None)

Add a file to the corpus

Parameters: file (FileData) – File to be added

add_speaker(name, session=None)

Add a speaker to the corpus

Parameters:

name (str) – Name of the speaker
session (sqlalchemy.orm.Session) – Database session; if not specified, a temporary session is used
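The optional-session behaviour can be pictured with the sketch below. It is a stand-in under stated assumptions: a standard-library sqlite3 connection plays the role of the sqlalchemy.orm.Session, and the speaker table, the db_path argument, and this add_speaker function are hypothetical, not MFA's implementation.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def add_speaker(db_path, name, session=None):
    """Hypothetical stand-in: insert a speaker, opening a temporary connection if needed."""
    if session is not None:
        # Caller-supplied session: the caller decides when to commit and close.
        session.execute("INSERT INTO speaker (name) VALUES (?)", (name,))
        return
    # No session given: open a temporary one, commit, and close it right away.
    with closing(sqlite3.connect(db_path)) as temp_session:
        temp_session.execute("INSERT INTO speaker (name) VALUES (?)", (name,))
        temp_session.commit()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "corpus.db")
    with closing(sqlite3.connect(db_path)) as session:
        session.execute("CREATE TABLE speaker (id INTEGER PRIMARY KEY, name TEXT)")
        session.commit()
        add_speaker(db_path, "speaker_a", session=session)  # reuses our session
        session.commit()
    add_speaker(db_path, "speaker_b")  # falls back to a temporary session
    with closing(sqlite3.connect(db_path)) as session:
        names = [row[0] for row in session.execute("SELECT name FROM speaker ORDER BY id")]
    print(names)  # ['speaker_a', 'speaker_b']
```

Passing an existing session lets several additions share one transaction, while omitting it keeps one-off calls short.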

add_utterance(utterance, session=None)

Add an utterance to the corpus

Parameters: utterance (UtteranceData) – Utterance to add

property base_data_directory

Corpus data directory

property corpus_meta

Corpus metadata

property corpus_word_set

Set of words used in the corpus

create_corpus_split()

Create split directory and output information from Jobs

create_subset(subset)

Create a subset of utterances to use for training

Parameters: subset (int) – Number of utterances to include in subset

property data_directory

Corpus data directory

property data_source_identifier

Corpus name

delete_utterance(utterance_id, session=None)

Delete an utterance from the corpus

Parameters: utterance_id (int) – ID of the utterance to delete

property features_log_directory

Feature log directory

files(session=None)

Get all files in the corpus

Parameters: session (sqlalchemy.orm.Session, optional) – Session to use in querying
Returns: File query
Return type: sqlalchemy.orm.Query

generate_import_objects(file)

Generate database import objects for a file being added to the corpus

Parameters: file (FileData) – File to be added

get_file(id=None, name=None, session=None)

Get a file from search parameters

Parameters:

id (int) – Integer ID to look up
name (str) – File name to look up

Returns:

File match

Return type:

File

get_latest_workflow_run(workflow, session)

Get the latest run of a workflow type

Parameters:

workflow (WorkflowType) – Workflow type
session (sqlalchemy.orm.Session) – Database session

Returns:

Latest run of workflow type

Return type:

CorpusWorkflow or None

get_utterances(id=None, file=None, speaker=None, begin=None, end=None, session=None)

Get an utterance from search parameters

Parameters:

id (int) – Integer ID to look up
file (str or int) – File name or ID to look up
speaker (str or int) – Speaker name or ID to look up
begin (float) – Begin timestamp to look up
end (float) – Ending timestamp to look up

Returns:

Utterance match

Return type:

Utterance
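A lookup driven by optional search parameters can be sketched as below. This is an in-memory illustration, assuming that only the parameters actually supplied are used and that they are combined with AND; UtteranceRecord and find_utterances are hypothetical names, and the sketch matches files and speakers by name only, not by ID.

```python
from dataclasses import dataclass


@dataclass
class UtteranceRecord:
    """Hypothetical in-memory record; not MFA's Utterance class."""
    id: int
    file: str
    speaker: str
    begin: float
    end: float


def find_utterances(records, id=None, file=None, speaker=None, begin=None, end=None):
    """Illustrative only: keep records matching every criterion that is not None."""
    criteria = {"id": id, "file": file, "speaker": speaker, "begin": begin, "end": end}
    active = {key: value for key, value in criteria.items() if value is not None}
    return [r for r in records if all(getattr(r, k) == v for k, v in active.items())]


records = [
    UtteranceRecord(1, "rec1", "spk1", 0.0, 1.5),
    UtteranceRecord(2, "rec1", "spk2", 1.5, 3.0),
    UtteranceRecord(3, "rec2", "spk1", 0.0, 2.0),
]
print([r.id for r in find_utterances(records, file="rec1")])  # [1, 2]
print([r.id for r in find_utterances(records, speaker="spk1", begin=0.0)])  # [1, 3]
```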

initialize_jobs()

Initialize the corpus’s Jobs

inspect_database()

Check if a database file exists and create the necessary metadata

normalize_text()

Normalize the text of the corpus using a dictionary’s sanitization functions and word mappings

property num_files

Number of files in the corpus

property num_speakers

Number of speakers in the corpus

property num_utterances

Number of utterances in the corpus

speakers(session=None)

Get all speakers in the corpus

Parameters: session (sqlalchemy.orm.Session, optional) – Session to use in querying
Returns: Speaker query
Return type: sqlalchemy.orm.Query

property split_directory

Directory used to store information split by job

subset_directory(subset)

Construct a subset directory for the corpus

Parameters: subset (int, optional) – Number of utterances to include; if it is larger than the total number of utterances or not specified, the split_directory is returned
Returns: Path to subset directory
Return type: str
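The fallback rule for subset_directory can be sketched as a standalone function. The fallback condition follows the description above; the "subset_<N>" directory name, its location next to the split directory, and the explicit num_utterances argument are assumptions made for illustration.

```python
import os


def subset_directory(split_directory, num_utterances, subset=None):
    """Illustrative only: pick the directory to use for a requested subset size."""
    # Per the description: no subset, or one larger than the corpus, means "use everything".
    if subset is None or subset > num_utterances:
        return split_directory
    # Assumption: the "subset_<N>" naming and its location are hypothetical.
    return os.path.join(os.path.dirname(split_directory), f"subset_{subset}")


print(subset_directory("corpus_data/split4", 1000))               # corpus_data/split4
print(subset_directory("corpus_data/split4", 1000, subset=5000))  # corpus_data/split4
print(subset_directory("corpus_data/split4", 1000, subset=100))
```

The last call returns a separate "subset_100" path because the requested size is smaller than the corpus.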

utterances(session=None)

Get all utterances in the corpus

Parameters: session (sqlalchemy.orm.Session, optional) – Session to use in querying
Returns: Utterance query
Return type: sqlalchemy.orm.Query