---
title: Dataset Creator
emoji: 🧩
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 6.19.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
- write-repos
short_description: Combine chat-format data from multiple Hugging Face datasets
---

# Dataset Creator

Combine chat-format conversation data from multiple Hugging Face datasets into
one shuffled, ready-to-train dataset - no notebook required.

## What it does

- Add any number of HF datasets by repo id (dynamic list, no cap).
- Each one gets a schema check via the lightweight `datasets-server` API
  (falls back to a short streaming pull for gated or script-based datasets).
  Known ShareGPT/OpenAI-style conversation lists and a handful of common flat
  field pairs (`prompt`/`chosen`, `query`/`answer`, etc.) are detected
  automatically.
- Anything that doesn't match a known shape shows you the raw columns so you
  can map them by hand - no silent guesses, no half-right auto-mapping.
- Set a per-dataset sample limit and an optional system-prompt override.
- Build the combined, shuffled dataset, then either push it straight to your
  own HF account (private or public, via your own OAuth login) or download it
  as JSONL.
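
The map, limit, combine, and shuffle steps above can be sketched roughly as follows. This is a minimal stdlib-only sketch, not the code in this repo: the `FieldMapping` fields, the `system`/`prompt`/`response` key names, and the `extract_triplet`, `build_combined`, and `to_jsonl` helpers are all assumptions made for illustration.

```python
import json
import random
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FieldMapping:
    """Hypothetical stand-in for the mapping in models.py; real fields may differ."""
    prompt_field: str
    response_field: str
    system_field: Optional[str] = None


def extract_triplet(row: dict, mapping: FieldMapping,
                    system_override: Optional[str] = None) -> Optional[dict]:
    """Return a system/prompt/response dict, or None if the row has no usable text."""
    prompt = row.get(mapping.prompt_field)
    response = row.get(mapping.response_field)
    if not isinstance(prompt, str) or not isinstance(response, str):
        return None
    if not prompt.strip() or not response.strip():
        return None
    # An explicit override wins over whatever the row carries.
    system = system_override
    if system is None and mapping.system_field:
        system = row.get(mapping.system_field)
    if not isinstance(system, str):
        system = ""
    return {"system": system, "prompt": prompt, "response": response}


def build_combined(sources: Iterable[tuple], seed: int = 42) -> list:
    """Each source is (rows, mapping, limit, system_override); limit=None means no cap."""
    combined = []
    for rows, mapping, limit, system_override in sources:
        kept = 0
        for row in rows:
            if limit is not None and kept >= limit:
                break
            triplet = extract_triplet(row, mapping, system_override)
            if triplet is not None:
                combined.append(triplet)
                kept += 1
    # A seeded shuffle keeps repeated builds reproducible.
    random.Random(seed).shuffle(combined)
    return combined


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

In this sketch the limit counts rows that were actually kept, so rows skipped for missing text don't eat into a dataset's quota. Whether the real loader counts that way is not something this README states.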

## Auth

Sign in with the button at the top. Pushing to the Hub uses *your* account
through HF's OAuth flow - there's no token to paste in, and nothing is stored
server-side between sessions.

## Project layout

Everything lives flat in the repo root (no subfolders, so it uploads fine
through the Spaces web UI one file at a time):

- `app.py` - Gradio UI and event wiring.
- `models.py` - the `DatasetEntry` / `FieldMapping` dataclasses shared by
  everything else.
- `schema_detect.py` - pure schema-detection logic. No network calls in
  here, which makes it the easy part to unit test on its own.
- `field_mapper.py` - pure row-to-triplet extraction logic, also network-free.
- `hf_inspect.py` - peeks at a dataset's shape via the lightweight
  `datasets-server` API (with a streaming fallback) without downloading it.
- `hf_dataset_loader.py` - pulls the real rows, up to the per-dataset limit.
- `hf_publish.py` - pushes the combined dataset to the Hub, or writes it
  out as JSONL.
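
To illustrate the "peek without downloading" idea behind `hf_inspect.py`, here is a small sketch built around the public `datasets-server` `/first-rows` endpoint. Which endpoint the module actually calls is not stated here, so treat the endpoint choice and the helper names (`first_rows_url`, `column_names`, `sample_rows`) as assumptions. The sketch only builds the URL and reads an already-decoded JSON payload; the HTTP request itself is left out.

```python
from urllib.parse import urlencode

DATASETS_SERVER = "https://datasets-server.huggingface.co"


def first_rows_url(repo_id: str, config: str, split: str) -> str:
    """Build the /first-rows URL for one config/split of a dataset."""
    query = urlencode({"dataset": repo_id, "config": config, "split": split})
    return f"{DATASETS_SERVER}/first-rows?{query}"


def column_names(payload: dict) -> list:
    """Column names from the 'features' list of a /first-rows response."""
    return [f["name"] for f in payload.get("features", []) if "name" in f]


def sample_rows(payload: dict, limit: int = 5) -> list:
    """Up to `limit` raw row dicts from the 'rows' list of the response."""
    return [item["row"] for item in payload.get("rows", [])[:limit] if "row" in item]
```

The column names and a few sample rows are enough input for the network-free detection step, which is why the inspection and detection logic can live in separate files.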

## Known limitations

- Auto-detection is intentionally conservative. It only auto-applies a
  mapping when it recognizes both tags of a known conversation-role pair
  (`human`/`gpt`, `user`/`assistant`, `user`/`model`) or an exact known
  flat-field-name pair. Anything else routes to manual mapping on purpose - a
  near-miss column name (e.g. `question`/`chosen` instead of
  `prompt`/`chosen`) won't get silently mis-mapped. The known-pairs list is a
  plain Python list at the top of `schema_detect.py` if you want to extend it
  for patterns you hit a lot.
- Datasets that bundle extra per-row enrichment beyond a simple triplet (for
  example, appending a vulnerability note to an answer, or synthesizing a
  prompt from a code snippet plus a language tag) won't be replicated by the
  generic mapper - it pulls the two fields you point it at, verbatim. If you
  need that kind of enrichment, it's a small addition to
  `field_mapper.py`.
- Gated datasets load only if the signed-in user's account already has read
  access to them; there's no separate "request access" flow built in here.
- The `gr.OAuthToken` / `gr.OAuthProfile` injection is wired to the documented
  Gradio pattern (annotated trailing parameters, not listed in `inputs=`) but
  has only been validated structurally - no live Spaces OAuth session was
  available to confirm it end to end. Run a quick sign-in-and-push smoke test
  right after deploying, before relying on it.
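
The conservative matching described in the first limitation can be pictured with a sketch like this one. It is not the contents of `schema_detect.py`: the list names, the `from`/`role` key lookup, and both function names are assumptions, and the pairs shown are only the ones named in this README.

```python
from typing import Optional, Tuple

# Pairs named in this README; the real lists may be longer or shaped differently.
KNOWN_ROLE_PAIRS = [("human", "gpt"), ("user", "assistant"), ("user", "model")]
KNOWN_FLAT_PAIRS = [("prompt", "chosen"), ("query", "answer")]
# ShareGPT-style turns tag the speaker under "from", OpenAI-style under "role".
ROLE_KEYS = ("from", "role")


def detect_flat_pair(columns) -> Optional[Tuple[str, str]]:
    """Return a known (prompt, response) column pair only on an exact name match."""
    names = set(columns)
    for prompt_col, response_col in KNOWN_FLAT_PAIRS:
        if prompt_col in names and response_col in names:
            return (prompt_col, response_col)
    return None  # near misses such as question/chosen fall through to manual mapping


def detect_role_pair(turns) -> Optional[Tuple[str, str]]:
    """Return a known role pair only if BOTH of its tags appear in the turns."""
    roles = set()
    for turn in turns:
        if not isinstance(turn, dict):
            return None
        tag = next((turn[k] for k in ROLE_KEYS if k in turn), None)
        if isinstance(tag, str):
            roles.add(tag)
    for user_tag, assistant_tag in KNOWN_ROLE_PAIRS:
        if user_tag in roles and assistant_tag in roles:
            return (user_tag, assistant_tag)
    return None
```

Returning `None` instead of a best guess is the whole point of the design: the caller treats `None` as "show the raw columns and ask", so extending coverage means adding a pair to a list, not loosening the match.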