Ukrainian Code-Mixed ASR Dataset (Thesis)
448 audio samples of Ukrainian technical speech across 16 speakers/agents.
Dataset Description
This dataset contains transcribed Ukrainian speech samples covering software engineering standups, technical discussions, and project management topics. Speakers ("agents") cover different technical domains (backend .NET, React frontend, DevOps, project management, game development, AI compliance) with varied speaking styles and recording setups.
Speakers
- DarkValera
- Valera
- ValeraDevops
- ValeraLongPart1
- ValeraPart10
- ValeraPart2
- ValeraPart3
- ValeraPart4
- ValeraPart5
- ValeraPart6
- ValeraPart7
- ValeraPart8
- ValeraPart9
- ValeraShort1
- ValeraShort2
- ValeraShort3
Format
- Audio: OGG Vorbis (448 files)
- Metadata: JSONL (one record per sample)
- Language: uk
- Provider: all real (non-synthetic) speech
Fields
Each line in metadata.jsonl contains:
- file_name: path to the audio file (relative to dataset root)
- transcription: Ukrainian text transcription
- speaker: agent/speaker name
- language: ISO language code (uk)
- voice: registered voice name (if in voices.json)
- provider: TTS provider (null = real speech)
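The record layout can be sketched with the Python standard library alone. This is a minimal illustration, assuming every line carries the six keys described for metadata.jsonl and that voice and provider may be null. The `Sample` class and the `parse_record` and `load_metadata` helpers are hypothetical names introduced here, not part of the dataset.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One metadata.jsonl record (field names taken from the dataset card)."""
    file_name: str            # audio path relative to the dataset root
    transcription: str        # Ukrainian text transcription
    speaker: str              # agent/speaker name
    language: str             # ISO language code, expected to be "uk"
    voice: Optional[str]      # registered voice name, if any
    provider: Optional[str]   # TTS provider; None means real speech


def parse_record(line: str) -> Sample:
    """Parse one JSONL line; absent optional keys are treated as None (assumption)."""
    raw = json.loads(line)
    return Sample(
        file_name=raw["file_name"],
        transcription=raw["transcription"],
        speaker=raw["speaker"],
        language=raw["language"],
        voice=raw.get("voice"),
        provider=raw.get("provider"),
    )


def load_metadata(path: str) -> List[Sample]:
    """Read every non-empty line of a metadata.jsonl file into Sample objects."""
    with open(path, encoding="utf-8") as fh:
        return [parse_record(line) for line in fh if line.strip()]
```

Since the card states that all speech is real, a loaded record would be expected to have `provider` equal to `None`.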
Usage
from datasets import load_dataset
dataset = load_dataset("audiofolder", data_dir="path/to/output")
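Before training, it can help to confirm that a local copy matches the figures stated in this card. The sketch below counts samples per speaker straight from the metadata file. It assumes metadata.jsonl sits inside the directory passed as `data_dir`; the helper names `speaker_counts` and `speaker_counts_from_file` are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def speaker_counts(lines: Iterable[str]) -> Counter:
    """Count records per speaker from an iterable of JSONL lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            counts[json.loads(line)["speaker"]] += 1
    return counts


def speaker_counts_from_file(path: str) -> Counter:
    """Same as speaker_counts, reading the lines from a metadata.jsonl file."""
    with open(path, encoding="utf-8") as fh:
        return speaker_counts(fh)
```

If the card's figures hold for your copy, `speaker_counts_from_file("path/to/output/metadata.jsonl")` should return 16 keys whose values sum to 448.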
License
This dataset is provided under a custom ASR-only license.
Permitted use
You may use this dataset for:
- Automatic Speech Recognition (ASR) training, evaluation, benchmarking, and research.
- Commercial and non-commercial ASR systems.
- Linguistic analysis related to ASR, transcription, pronunciation, or Ukrainian-English code-mixed speech.
Prohibited use
You may not use this dataset for:
- Text-to-Speech (TTS) training.
- Voice cloning.
- Speaker imitation.
- Voice conversion.
- Speaker identification or speaker verification.
- Biometric identification.
- Generating synthetic speech that imitates or resembles the speaker.
- Any non-ASR purpose without explicit written permission.
Attribution
If you use this dataset, you must provide attribution:
Dataset by lnxasd, available on Hugging Face under a custom ASR-only license.
Notification requirement
If you use this dataset, you must notify the author by email:
The notification should include:
- Your name or organization.
- The project or product using the dataset.
- Whether the use is commercial or non-commercial.
- A link to the project, paper, model, or product, if available.