61.2 GB
156 files
Updated 4 days ago

Yoruba Speech-Text Parallel Dataset

Dataset Description

This dataset contains 1,647,022 parallel speech-text pairs for Yoruba, a language spoken primarily in Nigeria and other West African countries. Each audio recording is paired with its text transcription, which makes the dataset suitable for automatic speech recognition (ASR) and text-to-speech (TTS) tasks.

Dataset Summary

  • Language: Yoruba (yo)
  • Task: Speech Recognition, Text-to-Speech
  • Size: 1,647,022 audio files of at least 1 KB each (smaller, likely corrupted files were filtered out)
  • Format: WAV audio files with corresponding text files
  • Modalities: Audio + Text

Supported Tasks

  • Automatic Speech Recognition (ASR): Train models to convert Yoruba speech to text
  • Text-to-Speech (TTS): Use parallel data for TTS model development
  • Keyword Spotting: Identify specific Yoruba words in audio
  • Phonetic Analysis: Study Yoruba pronunciation patterns

Dataset Structure

Data Fields

  • audio: Audio file in WAV format
  • text: Corresponding text transcription from paired text file

Data Splits

The dataset contains a single training split with 1,647,022 audio-text pairs that passed filtering.
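
As a rough illustration of how the fields and the single split described above might be consumed, here is a minimal sketch using the Hugging Face `datasets` library. The repository ID is the unfilled placeholder from the citation below, not a real path, and the helper names `stream_train_split` and `clip_seconds` are hypothetical. The sketch also assumes the `audio` column decodes to a mapping that exposes `array` and `sampling_rate`, which may differ between library versions.

```python
from typing import Any, Iterable, Mapping

# Placeholder copied from the citation URL; replace with the actual repository ID.
REPO_ID = "[your-username]/yoruba-speech-text-parallel"


def stream_train_split(repo_id: str = REPO_ID) -> Iterable[Mapping[str, Any]]:
    """Stream the single `train` split instead of downloading all 61.2 GB first."""
    # Imported lazily so clip_seconds() works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=True)


def clip_seconds(example: Mapping[str, Any]) -> float:
    """Duration of one example, assuming `audio` exposes `array` and `sampling_rate`."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]
```

With a real repository ID, iterating over `stream_train_split()` and reading `example["text"]` alongside `clip_seconds(example)` is one way to spot-check a few pairs before committing to a full download.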

Dataset Creation

Source Data

The audio data has been sourced ethically from consenting contributors. To protect the privacy of the original authors and speakers, specific source information cannot be shared publicly.

Data Processing

  1. Audio files and their corresponding text files were collected from an organized folder structure
  2. Text content was read from separate .txt files with matching filenames
  3. Audio files smaller than 1 KB were filtered out to remove likely corrupted recordings
  4. Pairs with empty text files were excluded from the dataset
  5. Audio was processed using the MMS-300M-1130 Forced Aligner tool for alignment and quality assurance
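
Steps 1 to 4 above can be sketched roughly as follows. This is not the authors' actual script: the function name `iter_pairs`, the reading of "1 KB" as 1024 bytes, UTF-8 text encoding, and treating whitespace-only text as empty are all assumptions made for illustration. The forced-alignment step (5) is not covered.

```python
from pathlib import Path
from typing import Iterator, Tuple

MIN_AUDIO_BYTES = 1024  # assumption: "1 KB" interpreted as 1024 bytes


def iter_pairs(root: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (audio_path, text) for every WAV file that has a usable transcript."""
    for wav_path in sorted(root.rglob("*.wav")):
        # Step 2: the transcript shares the audio file's name, with a .txt suffix.
        txt_path = wav_path.with_suffix(".txt")
        if not txt_path.is_file():
            continue
        # Step 3: drop audio files below the size threshold.
        if wav_path.stat().st_size < MIN_AUDIO_BYTES:
            continue
        # Step 4: drop pairs whose text is empty (whitespace-only counts as empty here).
        text = txt_path.read_text(encoding="utf-8").strip()
        if not text:
            continue
        yield wav_path, text
```

Calling `list(iter_pairs(Path("data")))` on a local copy would return the surviving pairs in sorted path order; the `data` folder name is taken from the repository listing and its internal layout is not documented here.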

Annotations

Text annotations are stored in separate text files whose filenames match those of the audio files. Each text file contains the spoken content of its audio recording.

Considerations for Using the Data

Social Impact of Dataset

This dataset contributes to the preservation and digital representation of Yoruba, supporting:

  • Language technology development for underrepresented languages
  • Educational resources for Yoruba language learning
  • Cultural preservation through digital archives

Discussion of Biases

  • The dataset may reflect the pronunciation patterns and dialects of specific regions or speakers
  • Audio quality and recording conditions may vary across samples
  • The vocabulary is limited to the words present in the collected samples

Other Known Limitations

  • Limited vocabulary scope (word-level rather than sentence-level)
  • Potential audio quality variations
  • Regional dialect representation may be uneven

Additional Information

Licensing Information

This dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Citation Information

If you use this dataset in your research, please cite:

@dataset{yoruba_words_parallel_2025,
  title={Yoruba Words Speech-Text Parallel Dataset},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/[your-username]/yoruba-speech-text-parallel}}
}

Contact

For questions or concerns about this dataset, please open an issue in the dataset repository.

Last updated
Jun 28