---
title: Dataset Creator
emoji: 🧩
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 6.19.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
  - write-repos
short_description: Combine chat-format data from multiple Hugging Face datasets
---

# Dataset Creator

Combine chat-format conversation data from multiple Hugging Face datasets into one shuffled, ready-to-train dataset - no notebook required.

## What it does

- Add any number of HF datasets by repo id (dynamic list, no cap).
- Each one gets a schema check via the lightweight `datasets-server` API (falls back to a short streaming pull for gated or script-based datasets). Known ShareGPT/OpenAI-style conversation lists and a handful of common flat field pairs (`prompt`/`chosen`, `query`/`answer`, etc.) are detected automatically.
- Anything that doesn't match a known shape shows you the raw columns so you can map them by hand - no silent guesses, no half-right auto-mapping.
- Per-dataset sample limit and an optional system-prompt override.
- Build the combined, shuffled dataset, then either push it straight to your own HF account (private or public, via your own OAuth login) or download it as JSONL.

## Auth

Sign in with the button at the top. Pushing to the Hub uses *your* account through HF's OAuth flow - there's no token to paste in, and nothing is stored server-side between sessions.

## Project layout

Everything lives flat in the repo root (no subfolders, so it uploads fine through the Spaces web UI one file at a time):

- `app.py` - Gradio UI and event wiring.
- `models.py` - the `DatasetEntry` / `FieldMapping` dataclasses shared by everything else.
- `schema_detect.py` - pure schema-detection logic. No network calls in here, which makes it the easy part to unit test on its own.
- `field_mapper.py` - pure row-to-triplet extraction logic, also network-free.
- `hf_inspect.py` - peeks at a dataset's shape via the lightweight `datasets-server` API (with a streaming fallback) without downloading it.
- `hf_dataset_loader.py` - pulls the real rows, up to the per-dataset limit.
- `hf_publish.py` - pushes the combined dataset to the Hub, or writes it out as JSONL.

## Known limitations

- Auto-detection is intentionally conservative. It only auto-applies a mapping when it recognizes both conversation-role tags (`human`/`gpt`, `user`/`assistant`, `user`/`model`) or an exact known flat-field-name pair. Anything else routes to manual mapping on purpose - a near-miss column name (e.g. `question`/`chosen` instead of `prompt`/`chosen`) won't get silently mis-mapped. The known-pairs list is a plain Python list at the top of `schema_detect.py` if you want to extend it for patterns you hit a lot.
- Datasets that bundle extra per-row enrichment beyond a simple triplet (for example, appending a vulnerability note to an answer, or synthesizing a prompt from a code snippet plus a language tag) won't be replicated by the generic mapper - it pulls the two fields you point it at, verbatim. If you need that kind of enrichment again, it's a small addition to `field_mapper.py`.
- Gated datasets need the signed-in user's token to actually have read access; there's no separate "request access" flow built in here.
- The `gr.OAuthToken` / `gr.OAuthProfile` injection is wired to the documented Gradio pattern (annotated trailing parameters, not listed in `inputs=`) but could only be validated structurally in this environment - there was no live Spaces OAuth session available to confirm end to end. Worth a quick sign-in-and-push smoke test right after deploying, before relying on it.
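
For a feel of how the conservative auto-detection rule in the first limitation plays out, here is a minimal sketch. It is not the code in `schema_detect.py`: the names `KNOWN_ROLE_PAIRS`, `KNOWN_FLAT_PAIRS`, `ROLE_KEYS` and `detect_mapping` are hypothetical, the `from`/`role` key names are an assumption about ShareGPT/OpenAI-style turns, and only the pairs named in this README are listed.

```python
from __future__ import annotations

# Hypothetical stand-ins for the known-pairs lists; only pairs named in the README.
KNOWN_ROLE_PAIRS = [("human", "gpt"), ("user", "assistant"), ("user", "model")]
KNOWN_FLAT_PAIRS = [("prompt", "chosen"), ("query", "answer")]
# Assumption: the role tag of a turn sits under one of these keys.
ROLE_KEYS = ("from", "role")


def detect_conversation_roles(turns: list[dict]) -> tuple[str, str] | None:
    """Return a known (user, assistant) tag pair only if BOTH tags occur."""
    roles = set()
    for turn in turns:
        for key in ROLE_KEYS:
            value = turn.get(key)
            if isinstance(value, str):
                roles.add(value)
    for user_tag, assistant_tag in KNOWN_ROLE_PAIRS:
        if user_tag in roles and assistant_tag in roles:
            return (user_tag, assistant_tag)
    return None


def detect_flat_pair(columns: list[str]) -> tuple[str, str] | None:
    """Return a known flat pair only on an exact column-name match."""
    present = set(columns)
    for prompt_col, response_col in KNOWN_FLAT_PAIRS:
        if prompt_col in present and response_col in present:
            return (prompt_col, response_col)
    return None


def detect_mapping(sample_row: dict) -> dict | None:
    """Auto-map a sample row, or return None to route it to manual mapping."""
    # Conversation lists first: a non-empty list of dicts carrying known role tags.
    for column, value in sample_row.items():
        if isinstance(value, list) and value and all(isinstance(t, dict) for t in value):
            pair = detect_conversation_roles(value)
            if pair is not None:
                return {"kind": "conversation", "column": column, "roles": pair}
    # Then exact flat-field pairs; near misses fall through on purpose.
    flat = detect_flat_pair(list(sample_row))
    if flat is not None:
        return {"kind": "flat", "prompt": flat[0], "response": flat[1]}
    return None
```

With this sketch, a row with `prompt`/`chosen` columns gets a flat mapping, while `question`/`chosen` returns `None` and would go to manual mapping. A conversation list that only ever contains `user` turns also returns `None`, because both tags of a pair have to be present.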