Ukrainian Code-Mixed ASR Dataset (Thesis)

448 audio samples of Ukrainian technical speech across 16 speakers/agents.

Dataset Description

This dataset contains transcribed Ukrainian speech samples on software engineering standups, technical discussions, and project management topics. The speakers ("agents") span several technical domains (backend .NET, React frontend, DevOps, project management, game development, AI compliance) and vary in speaking style and recording setup.

Speakers

  • DarkValera
  • Valera
  • ValeraDevops
  • ValeraLongPart1
  • ValeraPart10
  • ValeraPart2
  • ValeraPart3
  • ValeraPart4
  • ValeraPart5
  • ValeraPart6
  • ValeraPart7
  • ValeraPart8
  • ValeraPart9
  • ValeraShort1
  • ValeraShort2
  • ValeraShort3

Format

  • Audio: OGG Vorbis (448 files)
  • Metadata: JSONL (one record per sample)
  • Language: uk
  • Provider: all real (non-synthetic) speech

Fields

Each line in metadata.jsonl contains:

  • file_name: path to the audio file (relative to dataset root)
  • transcription: Ukrainian text transcription
  • speaker: agent/speaker name
  • language: ISO language code (uk)
  • voice: registered voice name (if in voices.json)
  • provider: TTS provider (null = real speech, which applies to every sample in this dataset)
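
As an illustration of the record layout, the following is a minimal sketch of parsing and checking one metadata.jsonl line with the Python standard library. The parse_record helper and the sample values (path, transcription) are hypothetical, and treating voice as an optional key is an assumption based on the "if in voices.json" note above.

```python
import json

# Field names come from the list above. Treating "voice" as optional is an assumption.
REQUIRED_FIELDS = {"file_name", "transcription", "speaker", "language", "provider"}


def parse_record(line: str) -> dict:
    """Parse one metadata.jsonl line and check it against the documented fields."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["language"] != "uk":
        raise ValueError(f"unexpected language code: {record['language']!r}")
    return record


# Hypothetical record: the path and transcription are placeholders, not real dataset content.
sample_line = json.dumps(
    {
        "file_name": "audio/sample_0001.ogg",
        "transcription": "приклад транскрипції",
        "speaker": "Valera",
        "language": "uk",
        "voice": None,
        "provider": None,
    },
    ensure_ascii=False,
)

record = parse_record(sample_line)
# A null provider marks real (non-synthetic) speech.
is_real_speech = record["provider"] is None
```
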

Usage

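from datasets import load_dataset

# data_dir is the dataset root, i.e. the directory that holds metadata.jsonl;
# the file_name values in the metadata are resolved relative to it.
dataset = load_dataset("audiofolder", data_dir="path/to/output")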
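
To sanity-check the totals stated in the summary (448 samples, 16 speakers) without decoding any audio, the metadata file can be tallied directly. This is a minimal sketch using only the standard library; the count_by_speaker helper and the demo records are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_speaker(lines: Iterable[str]) -> Counter:
    """Count metadata records per speaker, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line)["speaker"]] += 1
    return counts


# Hypothetical in-memory records standing in for lines of metadata.jsonl.
demo_lines = [
    '{"file_name": "a.ogg", "speaker": "Valera"}',
    '{"file_name": "b.ogg", "speaker": "ValeraDevops"}',
    "",
    '{"file_name": "c.ogg", "speaker": "Valera"}',
]

counts = count_by_speaker(demo_lines)
total_samples = sum(counts.values())
```

On the real data, an open file handle can be passed in place of demo_lines, for example a handle on metadata.jsonl in the dataset root opened with encoding="utf-8". If the summary figures hold, the result should contain 16 speakers and sum to 448.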

License

This dataset is provided under a custom ASR-only license.

Permitted use

You may use this dataset for:

  • Automatic Speech Recognition (ASR) training, evaluation, benchmarking, and research.
  • Commercial and non-commercial ASR systems.
  • Linguistic analysis related to ASR, transcription, pronunciation, or Ukrainian-English code-mixed speech.

Prohibited use

You may not use this dataset for:

  • Text-to-Speech (TTS) training.
  • Voice cloning.
  • Speaker imitation.
  • Voice conversion.
  • Speaker identification or speaker verification.
  • Biometric identification.
  • Generating synthetic speech that imitates or resembles the speaker.
  • Any non-ASR purpose without explicit written permission.

Attribution

If you use this dataset, you must provide attribution:

Dataset by lnxasd, available on Hugging Face under a custom ASR-only license.

Notification requirement

If you use this dataset, you must notify the author by email:

valerii@example.com

The notification should include:

  • Your name or organization.
  • The project or product using the dataset.
  • Whether the use is commercial or non-commercial.
  • A link to the project, paper, model, or product, if available.