---
title: Dataset Creator
emoji: 🧩
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 6.19.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
- write-repos
short_description: Combine chat-format data from multiple Hugging Face datasets
---
# Dataset Creator
Combine chat-format conversation data from multiple Hugging Face datasets into
one shuffled, ready-to-train dataset - no notebook required.
## What it does
- Add any number of HF datasets by repo id (dynamic list, no cap).
- Each dataset gets a schema check via the lightweight `datasets-server` API
(falling back to a short streaming pull for gated or script-based datasets).
Known ShareGPT/OpenAI-style conversation lists and a handful of common flat
field pairs (`prompt`/`chosen`, `query`/`answer`, etc.) are detected
automatically.
- Anything that doesn't match a known shape shows you the raw columns so you
can map them by hand - no silent guesses, no half-right auto-mapping.
- Per-dataset sample limit and an optional system-prompt override.
- Build the combined, shuffled dataset, then either push it straight to your
own HF account (private or public, via your own OAuth login) or download it
as JSONL.
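The limit, override, shuffle, and JSONL steps above can be sketched in a few lines of plain Python. This is an illustration only, not the code in this repo: the function names and the `system`/`prompt`/`response` row shape are assumptions chosen for the example.

```python
import json
import random
from typing import Iterable, Optional


def prepare(rows: list[dict], limit: Optional[int] = None,
            system_override: Optional[str] = None) -> list[dict]:
    """Apply a per-dataset sample limit and an optional system-prompt override.

    Hypothetical helper; rows are assumed to be dicts with
    "system", "prompt", and "response" keys.
    """
    selected = rows if limit is None else rows[:limit]
    if system_override is None:
        return [dict(row) for row in selected]
    # Copy each row so the caller's data is never mutated.
    return [{**row, "system": system_override} for row in selected]


def combine_and_shuffle(sources: Iterable[list[dict]], seed: int = 42) -> list[dict]:
    """Concatenate the per-dataset row lists and shuffle them reproducibly."""
    combined = [row for rows in sources for row in rows]
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(combined)
    return combined


def to_jsonl(rows: list[dict]) -> str:
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)
```

Seeding the shuffle is a design choice for this sketch: it makes a rebuild from the same inputs produce the same ordering, which helps when comparing training runs.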
## Auth
Sign in with the button at the top. Pushing to the Hub uses *your* account
through HF's OAuth flow - there's no token to paste in, and nothing is stored
server-side between sessions.
## Project layout
Everything lives flat in the repo root (no subfolders, so it uploads fine
through the Spaces web UI one file at a time):
- `app.py` - Gradio UI and event wiring.
- `models.py` - the `DatasetEntry` / `FieldMapping` dataclasses shared by
everything else.
- `schema_detect.py` - pure schema-detection logic. No network calls in
here, which makes it the easy part to unit test on its own.
- `field_mapper.py` - pure row-to-triplet extraction logic, also network-free.
- `hf_inspect.py` - peeks at a dataset's shape via the lightweight
`datasets-server` API (with a streaming fallback) without downloading it.
- `hf_dataset_loader.py` - pulls the real rows, up to the per-dataset limit.
- `hf_publish.py` - pushes the combined dataset to the Hub, or writes it
out as JSONL.
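To make the split between `models.py` and `field_mapper.py` concrete, here is a minimal sketch of what the shared dataclasses and a network-free row-to-triplet extractor might look like. The class names come from this README, but every field name, the `extract_triplet` function, and the output keys are assumptions for illustration; the real files may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldMapping:
    # Hypothetical fields: which columns hold the prompt and the response.
    prompt_field: str
    response_field: str
    system_field: Optional[str] = None


@dataclass
class DatasetEntry:
    # Hypothetical fields: one entry per dataset added in the UI.
    repo_id: str
    mapping: Optional[FieldMapping] = None
    sample_limit: Optional[int] = None
    system_override: Optional[str] = None


def extract_triplet(row: dict, entry: DatasetEntry) -> Optional[dict]:
    """Return a system/prompt/response dict, or None if the row does not fit."""
    mapping = entry.mapping
    if mapping is None:
        return None  # nothing mapped yet, so nothing to extract
    prompt = row.get(mapping.prompt_field)
    response = row.get(mapping.response_field)
    if not isinstance(prompt, str) or not isinstance(response, str):
        return None  # skip rows with missing or non-text fields
    if entry.system_override is not None:
        system = entry.system_override  # the override wins over row data
    elif mapping.system_field is not None:
        value = row.get(mapping.system_field)
        system = value if isinstance(value, str) else ""
    else:
        system = ""
    # Prompt and response are copied verbatim, with no enrichment.
    return {"system": system, "prompt": prompt, "response": response}
```

Because the function only touches plain dicts and dataclasses, it can be unit tested without any Hub access, which matches the network-free intent described above.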
## Known limitations
- Auto-detection is intentionally conservative. It only auto-applies a
mapping when it recognizes both conversation-role tags (`human`/`gpt`,
`user`/`assistant`, `user`/`model`) or an exact known flat-field-name pair.
Anything else routes to manual mapping on purpose - a near-miss column name
(e.g. `question`/`chosen` instead of `prompt`/`chosen`) won't get silently
mis-mapped. The known-pairs list is a plain Python list at the top of
`schema_detect.py` if you want to extend it for patterns you hit a lot.
- Datasets that bundle extra per-row enrichment beyond a simple triplet (for
example, appending a vulnerability note to an answer, or synthesizing a
prompt from a code snippet plus a language tag) won't be replicated by the
generic mapper - it pulls the two fields you point it at, verbatim. If you
need that kind of enrichment, it's a small addition to
`field_mapper.py`.
- Gated datasets need the signed-in user's token to actually have read
access; there's no separate "request access" flow built in here.
- The `gr.OAuthToken` / `gr.OAuthProfile` injection follows the documented
Gradio pattern (annotated trailing parameters, not listed in `inputs=`), but
it has only been checked structurally - no live Spaces OAuth session was
available to confirm it end to end. Run a quick sign-in-and-push smoke test
right after deploying, before relying on it.
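
The conservative auto-detection rule from the first item above can be sketched as follows. This is not the contents of `schema_detect.py`: the constant and function names are hypothetical, and the flat-pair list holds only the two pairs named in this README. The `from` and `role` keys are the usual ShareGPT and OpenAI-style turn keys.

```python
from typing import Optional

# Hypothetical names; the real lists live at the top of schema_detect.py.
KNOWN_ROLE_PAIRS = [("human", "gpt"), ("user", "assistant"), ("user", "model")]
KNOWN_FLAT_PAIRS = [("prompt", "chosen"), ("query", "answer")]


def detect_flat_pair(columns: list[str]) -> Optional[tuple[str, str]]:
    """Return a known (prompt, response) column pair only on an exact match."""
    names = set(columns)
    for prompt_col, response_col in KNOWN_FLAT_PAIRS:
        if prompt_col in names and response_col in names:
            return (prompt_col, response_col)
    return None  # near-misses fall through to manual mapping


def detect_role_pair(turns: list[dict]) -> Optional[tuple[str, str]]:
    """Return a known role-tag pair only if both tags appear in the turns."""
    roles = set()
    for turn in turns:
        # ShareGPT-style turns use "from"; OpenAI-style turns use "role".
        tag = turn.get("from", turn.get("role"))
        if isinstance(tag, str):
            roles.add(tag)
    for user_tag, assistant_tag in KNOWN_ROLE_PAIRS:
        if user_tag in roles and assistant_tag in roles:
            return (user_tag, assistant_tag)
    return None
```

With this shape, `detect_flat_pair(["question", "chosen"])` returns `None` because `question` is not an exact match for `prompt`, so the dataset would be routed to manual mapping. Extending the detection is then a matter of appending a tuple to one of the lists.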