CorpusMixin
- class montreal_forced_aligner.corpus.base.CorpusMixin(corpus_directory, speaker_characters=0, ignore_speakers=False, oov_count_threshold=0, language=Language.unknown, **kwargs)
Bases: MfaWorker, DatabaseMixin
Mixin class for processing corpora
Notes
Using characters in file names to specify speakers is generally finicky and leads to errors, so I would not recommend it. Additionally, consider this option deprecated; it could be removed in future versions.
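To make the note concrete, here is a minimal sketch of what taking a speaker name from file-name characters could look like. The helper `speaker_from_file_name` is hypothetical and is not MFA's implementation. It assumes the speaker is encoded in the leading characters of the file name, and it only handles the integer form of `speaker_characters`.

```python
from pathlib import Path


def speaker_from_file_name(file_name, speaker_characters):
    """Hypothetical helper: take the speaker from the start of the file name.

    Assumption: the speaker is the first `speaker_characters` characters of
    the file name without its extension. Returns None when no positive count
    is given; what MFA does in that case is outside this sketch.
    """
    stem = Path(file_name).stem
    if isinstance(speaker_characters, int) and speaker_characters > 0:
        return stem[:speaker_characters]
    return None


print(speaker_from_file_name("spk01_utt003.wav", 5))
```

The sketch also shows why the approach is fragile: a file whose name does not follow the expected prefix length silently yields the wrong speaker.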
- Parameters:
corpus_directory (str) – Path to corpus
speaker_characters (int or str, optional) – Number of characters in the file name to specify the speaker
ignore_speakers (bool) – Flag for whether to discard any parsed speaker information during top-level worker’s processing
oov_count_threshold (int) – Words in the corpus with counts less than or equal to the threshold will be treated as OOV items; defaults to 0
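The `oov_count_threshold` rule can be illustrated with a small standalone sketch. The function `split_by_oov_threshold` is hypothetical and is not part of MFA. It assumes whitespace tokenization and only demonstrates the "count less than or equal to the threshold" comparison described above.

```python
from collections import Counter


def split_by_oov_threshold(utterances, oov_count_threshold=0):
    """Hypothetical helper: partition words into kept and OOV sets by count."""
    # Count every whitespace-separated token across all utterance texts.
    counts = Counter(word for text in utterances for word in text.split())
    # A word is treated as OOV when its count is <= the threshold.
    oov = {word for word, count in counts.items() if count <= oov_count_threshold}
    kept = set(counts) - oov
    return kept, oov


texts = ["the cat sat", "the dog sat", "the end"]
kept, oov = split_by_oov_threshold(texts, oov_count_threshold=1)
print(sorted(kept), sorted(oov))
```

With the default threshold of 0, no word that occurs in the corpus falls at or below the threshold, so nothing is moved to the OOV set by this rule.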
See also
MfaWorker
For MFA processing parameters
TemporaryDirectoryMixin
For temporary directory parameters
- Variables:
jobs (list[Job]) – List of jobs for processing the corpus and splitting speakers
stopped (Event) – Stop check for loading the corpus
decode_error_files (list[str]) – List of text files that could not be loaded with utf8
textgrid_read_errors (list[str]) – List of TextGrid files that had an error in loading
- add_file(file, session=None)
Add a file to the corpus
- Parameters:
file (FileData) – File to be added
- add_speaker(name, session=None)
Add a speaker to the corpus
- Parameters:
name (str) – Name of the speaker
session (sqlalchemy.orm.Session) – Database session, if not specified, will use a temporary session
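The optional `session` argument follows a common pattern: use the caller's session when one is given, otherwise open a temporary one for the duration of the call. The sketch below illustrates that pattern only. `StandInSession`, `session_or_temporary`, and the free function `add_speaker` are hypothetical stand-ins rather than MFA or SQLAlchemy code, and committing the temporary session on success is an assumption.

```python
from contextlib import contextmanager


class StandInSession:
    """Hypothetical stand-in for a database session; not sqlalchemy.orm.Session."""

    def __init__(self):
        self.added = []
        self.committed = False
        self.closed = False

    def add(self, obj):
        self.added.append(obj)

    def commit(self):
        self.committed = True

    def close(self):
        self.closed = True


@contextmanager
def session_or_temporary(session=None):
    """Yield the caller's session, or a temporary one that is cleaned up afterwards."""
    if session is not None:
        # The caller owns this session, so committing and closing is left to them.
        yield session
        return
    temporary = StandInSession()
    try:
        yield temporary
        temporary.commit()  # assumption: a temporary session commits on success
    finally:
        temporary.close()


def add_speaker(name, session=None):
    """Hypothetical function mirroring the add_speaker(name, session=None) signature."""
    with session_or_temporary(session) as active:
        active.add({"speaker": name})
        return active
```

Passing an existing session lets several additions share one transaction, while omitting it keeps single calls self-contained.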
- add_utterance(utterance, session=None)
Add an utterance to the corpus
- Parameters:
utterance (UtteranceData) – Utterance to add
- property base_data_directory
Corpus data directory
- property corpus_meta
Corpus metadata
- property corpus_word_set
Set of words used in the corpus
- create_subset(subset)
Create a subset of utterances to use for training
- Parameters:
subset (int) – Number of utterances to include in subset
- property data_directory
Corpus data directory
- property data_source_identifier
Corpus name
- delete_utterance(utterance_id, session=None)
Delete an utterance from the corpus
- Parameters:
utterance_id (int) – ID of the utterance to delete
- property features_log_directory
Feature log directory
- files(session=None)
Get all files in the corpus
- Parameters:
session (sqlalchemy.orm.Session, optional) – Session to use in querying
- Returns:
File query
- Return type:
- generate_import_objects(file)
Add a file to the corpus
- Parameters:
file (FileData) – File to be added
- get_latest_workflow_run(workflow, session)
Get the latest version of a workflow type
- Parameters:
workflow (WorkflowType) – Workflow type
session (sqlalchemy.orm.Session) – Database session
- Returns:
Latest run of workflow type
- Return type:
CorpusWorkflow or None
- get_utterances(id=None, file=None, speaker=None, begin=None, end=None, session=None)
Get utterances matching the search parameters
- normalize_text()
Normalize the text of the corpus using a dictionary’s sanitization functions and word mappings
- property num_files
Number of files in the corpus
- property num_speakers
Number of speakers in the corpus
- property num_utterances
Number of utterances in the corpus
- speakers(session=None)
Get all speakers in the corpus
- Parameters:
session (sqlalchemy.orm.Session, optional) – Session to use in querying
- Returns:
Speaker query
- Return type:
- property split_directory
Directory used to store information split by job
- utterances(session=None)
Get all utterances in the corpus
- Parameters:
session (sqlalchemy.orm.Session, optional) – Session to use in querying
- Returns:
Utterance query
- Return type: