Buckets: mobiletechline/audio-languages

61.2 GB

156 files

Updated 4 days ago

Name	Size	Uploaded	Xet hash
data		4 days ago	50 items
.gitattributes	2.46 kB xet	4 days ago	19463de8
README.md	4.24 kB xet	4 days ago	6b68df58

README.md

Yoruba Speech-Text Parallel Dataset

Dataset Description

This dataset contains 1,647,022 parallel speech-text pairs for Yoruba, a language spoken primarily in Nigeria and other West African countries. Each audio recording is paired with its text transcription, which makes the dataset suitable for automatic speech recognition (ASR) and text-to-speech (TTS) tasks.

Dataset Summary

Language: Yoruba (ISO 639-1 code: yo)
Task: Speech Recognition, Text-to-Speech
Size: 1,647,022 audio files, each larger than 1 KB (smaller, likely corrupted files were filtered out)
Format: WAV audio files with corresponding text files
Modalities: Audio + Text

Supported Tasks

Automatic Speech Recognition (ASR): Train models to convert Yoruba speech to text
Text-to-Speech (TTS): Use parallel data for TTS model development
Keyword Spotting: Identify specific Yoruba words in audio
Phonetic Analysis: Study Yoruba pronunciation patterns

Dataset Structure

Data Fields

audio: Audio file in WAV format
text: Corresponding text transcription from paired text file
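The two fields above can be read from disk with the standard library alone. The following is a minimal sketch, not the dataset's official loader: the function name `load_example` and the extra `sample_rate` and `duration_s` keys are illustrative, and it assumes uncompressed PCM WAV files with the transcription stored next to the audio under the same file stem.

```python
import wave
from pathlib import Path


def load_example(wav_path):
    """Return one record with the documented fields plus basic WAV header info."""
    wav_path = Path(wav_path)
    # Assumption: the transcription sits beside the audio and shares its stem.
    text = wav_path.with_suffix(".txt").read_text(encoding="utf-8").strip()
    # The stdlib wave module reads uncompressed PCM WAV headers.
    with wave.open(str(wav_path), "rb") as wav:
        sample_rate = wav.getframerate()
        num_frames = wav.getnframes()
    return {
        "audio": str(wav_path),                  # path to the WAV file
        "text": text,                            # paired transcription
        "sample_rate": sample_rate,              # hypothetical extra key
        "duration_s": num_frames / sample_rate,  # hypothetical extra key
    }
```

Returning the path rather than decoded samples keeps the sketch lightweight; a training pipeline would typically decode the audio lazily.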

Data Splits

The dataset contains a single training split with 1,647,022 filtered audio-text pairs.
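Because only a training split is provided, users who need a validation set have to carve one out themselves. One possible approach, sketched here under assumptions, is to hash each file stem so that the assignment is reproducible across machines and runs. The function name `assign_split` and the 1% default are illustrative choices, not part of the dataset. Speaker identities are not published, so this splits by file and cannot rule out the same speaker appearing on both sides.

```python
import hashlib


def assign_split(file_stem, val_percent=1):
    """Deterministically map a file stem to 'train' or 'validation'."""
    digest = hashlib.sha256(file_stem.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a stable bucket in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "validation" if bucket < val_percent else "train"
```

Hashing the stem, rather than shuffling with a random seed, means a file keeps its split even if files are later added or removed.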

Dataset Creation

Source Data

The audio data has been sourced ethically from consenting contributors. To protect the privacy of the original authors and speakers, specific source information cannot be shared publicly.

Data Processing

Audio files and their corresponding text files were collected from an organized folder structure
Text content was read from separate .txt files with matching filenames
Audio files smaller than 1 KB were filtered out as likely truncated or corrupted
Empty text files were excluded from the dataset
Audio was processed using the MMS-300M-1130 Forced Aligner tool for alignment and quality assurance
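The pairing and filtering steps listed above can be approximated as follows. This is a sketch under stated assumptions rather than the script actually used to build the dataset: `collect_pairs` and `MIN_AUDIO_BYTES` are hypothetical names, "1KB" is taken to mean 1024 bytes, and the forced-alignment step is not reproduced.

```python
from pathlib import Path

MIN_AUDIO_BYTES = 1024  # assumption: "1KB" in the card means 1024 bytes


def collect_pairs(root):
    """Yield (wav_path, text) for every usable audio-text pair under root."""
    for wav_path in sorted(Path(root).rglob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if not txt_path.is_file():
            continue  # no text file with a matching filename
        if wav_path.stat().st_size < MIN_AUDIO_BYTES:
            continue  # too small to be a valid recording
        text = txt_path.read_text(encoding="utf-8").strip()
        if not text:
            continue  # empty transcription
        yield wav_path, text
```

For example, `pairs = list(collect_pairs("data"))` would gather every surviving pair under a local `data` directory, assuming the files have been downloaded there.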

Annotations

Text annotations are stored in separate text files whose filenames match those of the audio files. Each text file contains the content spoken in the corresponding recording.

Considerations for Using the Data

Social Impact of Dataset

This dataset contributes to the preservation and digital representation of Yoruba, supporting:

Language technology development for underrepresented languages
Educational resources for Yoruba language learning
Cultural preservation through digital archives

Discussion of Biases

The dataset may reflect the pronunciation patterns and dialects of specific regions or speakers
Audio quality and recording conditions may vary across samples
The vocabulary is limited to the words present in the collected samples

Other Known Limitations

Limited vocabulary scope (word-level rather than sentence-level)
Potential audio quality variations
Regional dialect representation may be uneven

Additional Information

Licensing Information

This dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Acknowledgments

Audio processing and alignment performed using MMS-300M-1130 Forced Aligner
The original audio was produced by Biblica

Citation Information

If you use this dataset in your research, please cite:

@dataset{yoruba_words_parallel_2025,
  title={Yoruba Words Speech-Text Parallel Dataset},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/[your-username]/yoruba-speech-text-parallel}}
}

Contact

For questions or concerns about this dataset, please open an issue in the dataset repository.

Last updated: Jun 28
