---
title: Dataset Creator
emoji: 🧩
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 6.19.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
  - write-repos
short_description: Combine chat-format data from multiple Hugging Face datasets
---
# Dataset Creator

Combine chat-format conversation data from multiple Hugging Face datasets into
one shuffled, ready-to-train dataset - no notebook required.
## What it does

- Add any number of HF datasets by repo id (dynamic list, no cap).
- Each one gets a schema check via the lightweight `datasets-server` API
  (falls back to a short streaming pull for gated or script-based datasets).
  Known ShareGPT/OpenAI-style conversation lists and a handful of common flat
  field pairs (`prompt`/`chosen`, `query`/`answer`, etc.) are detected
  automatically.
- Anything that doesn't match a known shape shows you the raw columns so you
  can map them by hand - no silent guesses, no half-right auto-mapping.
- Per-dataset sample limit and an optional system-prompt override.
- Build the combined, shuffled dataset, then either push it straight to your
  own HF account (private or public, via your own OAuth login) or download it
  as JSONL.
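
As a rough illustration of the build step described above, here is a minimal sketch of combining sources, applying a per-dataset limit and system-prompt override, shuffling, and serializing to JSONL. It is not the app's actual code: the function names, the `sources` structure, and the `system`/`prompt`/`response` row keys are all assumptions made for the example.

```python
import json
import random


def build_combined(sources, seed=42):
    """Merge rows from several sources, apply per-source options, shuffle.

    `sources` is a list of dicts with hypothetical keys:
      rows          - list of {"system", "prompt", "response"} dicts
      limit         - max rows to keep from this source (None = keep all)
      system_prompt - optional override applied to every kept row
    """
    combined = []
    for source in sources:
        rows = source["rows"]
        limit = source.get("limit")
        if limit is not None:
            rows = rows[:limit]
        override = source.get("system_prompt")
        for row in rows:
            row = dict(row)  # copy so the caller's data is left untouched
            if override is not None:
                row["system"] = override
            combined.append(row)
    # A seeded RNG keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(combined)
    return combined


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


# Made-up sample data: the first source is capped at 2 rows and overridden.
sources = [
    {
        "rows": [{"system": "", "prompt": f"q{i}", "response": f"a{i}"} for i in range(5)],
        "limit": 2,
        "system_prompt": "Be brief.",
    },
    {"rows": [{"system": "s", "prompt": "hello", "response": "hi"}]},
]
combined = build_combined(sources)
jsonl_text = to_jsonl(combined)
```

Seeding the shuffle is a design choice for the sketch only; it makes two builds from the same inputs produce the same row order.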
## Auth

Sign in with the button at the top. Pushing to the Hub uses *your* account
through HF's OAuth flow - there's no token to paste in, and nothing is stored
server-side between sessions.
## Project layout

Everything lives flat in the repo root (no subfolders, so it uploads fine
through the Spaces web UI one file at a time):

- `app.py` - Gradio UI and event wiring.
- `models.py` - the `DatasetEntry` / `FieldMapping` dataclasses shared by
  everything else.
- `schema_detect.py` - pure schema-detection logic. No network calls in
  here, which makes it the easy part to unit test on its own.
- `field_mapper.py` - pure row-to-triplet extraction logic, also network-free.
- `hf_inspect.py` - peeks at a dataset's shape via the lightweight
  `datasets-server` API (with a streaming fallback) without downloading it.
- `hf_dataset_loader.py` - pulls the real rows, up to the per-dataset limit.
- `hf_publish.py` - pushes the combined dataset to the Hub, or writes it
  out as JSONL.
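
To give a feel for the kind of peek `hf_inspect.py` does, here is a hedged sketch using only the standard library. It assumes the public `datasets-server` `first-rows` endpoint and a response with `features` and `rows` keys; the helper names are hypothetical and the real module may be organized quite differently. The example payload is made up so the parsing can be tried offline.

```python
import json
import urllib.parse
import urllib.request

FIRST_ROWS_URL = "https://datasets-server.huggingface.co/first-rows"


def first_rows_url(dataset, config, split):
    """Build the query URL for a dataset's first rows."""
    query = urllib.parse.urlencode(
        {"dataset": dataset, "config": config, "split": split}
    )
    return f"{FIRST_ROWS_URL}?{query}"


def fetch_first_rows(dataset, config, split, token=None, timeout=10):
    """Fetch the first-rows payload (network call, not exercised below)."""
    request = urllib.request.Request(first_rows_url(dataset, config, split))
    if token:
        # Gated datasets need an authenticated request.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize(payload):
    """Pull column names and one sample row out of a first-rows payload."""
    columns = [feature["name"] for feature in payload.get("features", [])]
    rows = payload.get("rows", [])
    sample = rows[0]["row"] if rows else None
    return columns, sample


# Offline example shaped like a first-rows response (made-up data).
example_payload = {
    "features": [
        {"feature_idx": 0, "name": "prompt", "type": {"dtype": "string", "_type": "Value"}},
        {"feature_idx": 1, "name": "chosen", "type": {"dtype": "string", "_type": "Value"}},
    ],
    "rows": [
        {"row_idx": 0, "row": {"prompt": "Hi", "chosen": "Hello!"}, "truncated_cells": []}
    ],
}
columns, sample = summarize(example_payload)
```

Keeping the network call (`fetch_first_rows`) separate from the parsing (`summarize`) mirrors the split this README describes between the network-facing and pure modules.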
## Known limitations

- Auto-detection is intentionally conservative. It only auto-applies a
  mapping when it recognizes both conversation-role tags (`human`/`gpt`,
  `user`/`assistant`, `user`/`model`) or an exact known flat-field-name pair.
  Anything else routes to manual mapping on purpose - a near-miss column name
  (e.g. `question`/`chosen` instead of `prompt`/`chosen`) won't get silently
  mis-mapped. The known-pairs list is a plain Python list at the top of
  `schema_detect.py` if you want to extend it for patterns you hit a lot.
- Datasets that bundle extra per-row enrichment beyond a simple triplet (for
  example, appending a vulnerability note to an answer, or synthesizing a
  prompt from a code snippet plus a language tag) won't be replicated by the
  generic mapper - it pulls the two fields you point it at, verbatim. If you
  need that kind of enrichment again, it's a small addition to
  `field_mapper.py`.
- Gated datasets load only if the signed-in account has already been granted
  access on the Hub - there's no separate "request access" flow built in here.
- The `gr.OAuthToken` / `gr.OAuthProfile` injection is wired to the documented
  Gradio pattern (annotated trailing parameters, not listed in `inputs=`) but
  could only be validated structurally in this environment - there was no
  live Spaces OAuth session available to confirm end to end. Worth a quick
  sign-in-and-push smoke test right after deploying, before relying on it.
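
The conservative detection rule from the first bullet can be sketched roughly as follows. This is an illustration, not the contents of `schema_detect.py`: the function name, the returned dict shape, and the `from`/`role` key names are assumptions, and the pair lists contain only the examples named in this README.

```python
# Role-tag pairs and flat field pairs named in this README; extend as needed.
KNOWN_ROLE_PAIRS = [("human", "gpt"), ("user", "assistant"), ("user", "model")]
KNOWN_FLAT_PAIRS = [("prompt", "chosen"), ("query", "answer")]

# Assumed key names for the role tag in ShareGPT- and OpenAI-style turns.
ROLE_KEYS = ("from", "role")


def detect_mapping(row):
    """Return a mapping for a sample row, or None to route to manual mapping."""
    # 1. Conversation lists: a non-empty list of dicts in which BOTH roles
    #    of one known pair appear.
    for column, value in row.items():
        is_turn_list = (
            isinstance(value, list)
            and bool(value)
            and all(isinstance(turn, dict) for turn in value)
        )
        if not is_turn_list:
            continue
        for role_key in ROLE_KEYS:
            roles = {
                turn[role_key]
                for turn in value
                if isinstance(turn.get(role_key), str)
            }
            for user_tag, assistant_tag in KNOWN_ROLE_PAIRS:
                if user_tag in roles and assistant_tag in roles:
                    return {
                        "kind": "conversation",
                        "column": column,
                        "role_key": role_key,
                        "user": user_tag,
                        "assistant": assistant_tag,
                    }
    # 2. Flat pairs: both field names must match exactly, no fuzzy matching.
    for prompt_field, response_field in KNOWN_FLAT_PAIRS:
        if prompt_field in row and response_field in row:
            return {"kind": "flat", "prompt": prompt_field, "response": response_field}
    # 3. Anything else is left for the user to map by hand.
    return None


near_miss = detect_mapping({"question": "What is 2 + 2?", "chosen": "4"})
exact = detect_mapping({"prompt": "What is 2 + 2?", "chosen": "4"})
```

Here `near_miss` is `None` because `question`/`chosen` is not an exact known pair, while `exact` resolves to the `prompt`/`chosen` mapping.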