metadata
title: Dataset Creator
emoji: 🧩
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 6.19.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
- write-repos
short_description: Combine chat-format data from multiple Hugging Face datasets
Dataset Creator
Combine chat-format conversation data from multiple Hugging Face datasets into one shuffled, ready-to-train dataset - no notebook required.
What it does
- Add any number of HF datasets by repo id (dynamic list, no cap).
- Each one gets a schema check via the lightweight datasets-server API (falls back to a short streaming pull for gated or script-based datasets). Known ShareGPT/OpenAI-style conversation lists and a handful of common flat field pairs (prompt/chosen, query/answer, etc.) are detected automatically.
- Anything that doesn't match a known shape shows you the raw columns so you can map them by hand - no silent guesses, no half-right auto-mapping.
- Per-dataset sample limit and an optional system-prompt override.
- Build the combined, shuffled dataset, then either push it straight to your own HF account (private or public, via your own OAuth login) or download it as JSONL.
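The build step in the last bullet can be sketched roughly as below. This is a minimal illustration, not the code in this repo: the row shape (system/prompt/response keys), the combine and to_jsonl helper names, and the fixed shuffle seed are all assumptions made for the example.

```python
import json
import random
from typing import Optional

# Assumed row shape: {"system": str, "prompt": str, "response": str}
Row = dict


def combine(
    sources: list[tuple[list[Row], Optional[int], Optional[str]]],
    seed: int = 42,
) -> list[Row]:
    """Apply each source's sample limit and system-prompt override, then shuffle.

    Each source is (rows, limit, system_override); None means "no limit" or
    "keep the row's own system prompt".
    """
    combined: list[Row] = []
    for rows, limit, system_override in sources:
        kept = rows if limit is None else rows[:limit]
        for row in kept:
            row = dict(row)  # copy so the caller's rows are not mutated
            if system_override is not None:
                row["system"] = system_override
            combined.append(row)
    # A seeded Random keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(combined)
    return combined


def to_jsonl(rows: list[Row]) -> str:
    """Serialize rows as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in rows)
```

Calling combine([(rows_a, 1000, None), (rows_b, None, "You are a helpful assistant.")]) would keep the first 1000 rows of the first source, every row of the second with its system prompt replaced, and return them in shuffled order.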
Auth
Sign in with the button at the top. Pushing to the Hub uses your account through HF's OAuth flow - there's no token to paste in, and nothing is stored server-side between sessions.
Project layout
Everything lives flat in the repo root (no subfolders, so it uploads fine through the Spaces web UI one file at a time):
- app.py - Gradio UI and event wiring.
- models.py - the DatasetEntry/FieldMapping dataclasses shared by everything else.
- schema_detect.py - pure schema-detection logic. No network calls in here, which makes it the easy part to unit test on its own.
- field_mapper.py - pure row-to-triplet extraction logic, also network-free.
- hf_inspect.py - peeks at a dataset's shape via the lightweight datasets-server API (with a streaming fallback) without downloading it.
- hf_dataset_loader.py - pulls the real rows, up to the per-dataset limit.
- hf_publish.py - pushes the combined dataset to the Hub, or writes it out as JSONL.
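To give a feel for what "pure row-to-triplet extraction" means, here is a small network-free sketch. The field names on FieldMapping, the extract_triplet helper, and the reading of "triplet" as (system, prompt, response) are all assumptions for illustration; the real definitions in models.py and field_mapper.py may look different.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FieldMapping:
    # Hypothetical fields; the real dataclass in models.py may differ.
    prompt_field: str
    response_field: str
    system_field: Optional[str] = None


def extract_triplet(
    row: dict[str, Any], mapping: FieldMapping
) -> Optional[tuple[str, str, str]]:
    """Return (system, prompt, response) copied verbatim from the mapped columns.

    Returns None when the prompt or response column is missing or not a string,
    so the caller can skip the row instead of guessing.
    """
    prompt = row.get(mapping.prompt_field)
    response = row.get(mapping.response_field)
    if not isinstance(prompt, str) or not isinstance(response, str):
        return None
    system = row.get(mapping.system_field, "") if mapping.system_field else ""
    if not isinstance(system, str):
        system = ""
    return (system, prompt, response)
```

Because a function like this takes a plain dict and returns a plain tuple, it can be unit tested with hand-written rows and no Hub access.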
Known limitations
- Auto-detection is intentionally conservative. It only auto-applies a mapping when it recognizes both conversation-role tags (human/gpt, user/assistant, user/model) or an exact known flat-field-name pair. Anything else routes to manual mapping on purpose - a near-miss column name (e.g. question/chosen instead of prompt/chosen) won't get silently mis-mapped. The known-pairs list is a plain Python list at the top of schema_detect.py if you want to extend it for patterns you hit a lot.
- Datasets that bundle extra per-row enrichment beyond a simple triplet (for example, appending a vulnerability note to an answer, or synthesizing a prompt from a code snippet plus a language tag) won't be replicated by the generic mapper - it pulls the two fields you point it at, verbatim. If you need that kind of enrichment again, it's a small addition to field_mapper.py.
- Gated datasets need the signed-in user's token to actually have read access; there's no separate "request access" flow built in here.
- The gr.OAuthToken/gr.OAuthProfile injection is wired to the documented Gradio pattern (annotated trailing parameters, not listed in inputs=) but could only be validated structurally in this environment - there was no live Spaces OAuth session available to confirm end to end. Worth a quick sign-in-and-push smoke test right after deploying, before relying on it.
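The conservative auto-detection rule from the first limitation can be pictured with a sketch like the one below. It is an illustration under assumptions, not the contents of schema_detect.py: the constant and function names are made up for the example, and the flat-pair list only contains the two pairs this README names, so the real list is probably longer.

```python
from typing import Iterable, Optional

# Role-tag pairs named in this README.
KNOWN_ROLE_PAIRS = [("human", "gpt"), ("user", "assistant"), ("user", "model")]

# Flat field pairs named in this README (assumed to be a subset of the real list).
KNOWN_FLAT_PAIRS = [("prompt", "chosen"), ("query", "answer")]


def detect_role_pair(roles: Iterable[str]) -> Optional[tuple[str, str]]:
    """Return a known (user_tag, assistant_tag) pair only if both tags occur."""
    seen = set(roles)
    for user_tag, assistant_tag in KNOWN_ROLE_PAIRS:
        if user_tag in seen and assistant_tag in seen:
            return (user_tag, assistant_tag)
    return None  # caller falls back to manual mapping


def detect_flat_pair(columns: Iterable[str]) -> Optional[tuple[str, str]]:
    """Return a known (prompt_col, response_col) pair only on an exact name match."""
    cols = set(columns)
    for prompt_col, response_col in KNOWN_FLAT_PAIRS:
        if prompt_col in cols and response_col in cols:
            return (prompt_col, response_col)
    return None  # near-miss names such as question/chosen are not auto-mapped
```

With this shape, supporting a new pattern is a one-line append to the pair list, and anything not on the list goes to manual mapping rather than being guessed.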