metadata
title: Dataset Creator
emoji: 🧩
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 6.19.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
  - write-repos
short_description: Combine chat-format data from multiple Hugging Face datasets

Dataset Creator

Combine chat-format conversation data from multiple Hugging Face datasets into one shuffled, ready-to-train dataset - no notebook required.

What it does

  • Add any number of HF datasets by repo id (dynamic list, no cap).
  • Each dataset gets a schema check via the lightweight datasets-server API, falling back to a short streaming pull for gated or script-based datasets. Known ShareGPT/OpenAI-style conversation lists and a handful of common flat field pairs (prompt/chosen, query/answer, etc.) are detected automatically.
  • Anything that doesn't match a known shape shows you the raw columns so you can map them by hand - no silent guesses, no half-right auto-mapping.
  • Per-dataset sample limit and an optional system-prompt override.
  • Build the combined, shuffled dataset, then either push it straight to your own HF account (private or public, via your own OAuth login) or download it as JSONL.
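
As a rough illustration of the build step described above (per-dataset sample limit, optional system-prompt override, shuffle, JSONL export), here is a minimal standard-library sketch. It is not the app's actual code: the function names, the `(rows, limit, system_override)` tuple layout, and the `system`/`prompt`/`response` record keys are all assumptions made for the example.

```python
import json
import random


def build_combined(sources, seed=0):
    """Merge several sources into one shuffled list of records.

    `sources` is a list of (rows, limit, system_override) tuples. This tuple
    layout and the record keys are hypothetical, chosen for illustration.
    """
    combined = []
    for rows, limit, system_override in sources:
        for index, row in enumerate(rows):
            # Stop once the per-dataset sample limit is reached (None = no limit).
            if limit is not None and index >= limit:
                break
            record = dict(row)  # copy so the caller's rows are not mutated
            if system_override is not None:
                record["system"] = system_override
            combined.append(record)
    # A seeded Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(combined)
    return combined


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Calling `build_combined([(rows_a, 2, None), (rows_b, None, "Be brief.")])` would keep the first two rows of `rows_a` unchanged, take every row of `rows_b` with its system prompt replaced, and return them in shuffled order.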

Auth

Sign in with the button at the top. Pushing to the Hub uses your account through HF's OAuth flow - there's no token to paste in, and nothing is stored server-side between sessions.

Project layout

Everything lives flat in the repo root (no subfolders, so it uploads fine through the Spaces web UI one file at a time):

  • app.py - Gradio UI and event wiring.
  • models.py - the DatasetEntry / FieldMapping dataclasses shared by everything else.
  • schema_detect.py - pure schema-detection logic. No network calls in here, which makes it the easy part to unit test on its own.
  • field_mapper.py - pure row-to-triplet extraction logic, also network-free.
  • hf_inspect.py - peeks at a dataset's shape via the lightweight datasets-server API (with a streaming fallback) without downloading it.
  • hf_dataset_loader.py - pulls the real rows, up to the per-dataset limit.
  • hf_publish.py - pushes the combined dataset to the Hub, or writes it out as JSONL.
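
To make the hf_inspect.py idea concrete, the sketch below builds a request URL for the public datasets-server `/first-rows` endpoint and reads column names out of a response payload, without making any network call itself. The helper names are hypothetical, and the payload shape (a `features` list of objects with a `name` key) reflects the endpoint's documented response as best understood here; treat it as an assumption to verify rather than a description of what hf_inspect.py does.

```python
from urllib.parse import urlencode

DATASETS_SERVER = "https://datasets-server.huggingface.co"


def first_rows_url(repo_id, config, split):
    """Build the /first-rows URL for one dataset config and split."""
    query = urlencode({"dataset": repo_id, "config": config, "split": split})
    return f"{DATASETS_SERVER}/first-rows?{query}"


def column_names(payload):
    """Pull column names from a /first-rows style payload.

    Assumes the payload carries a "features" list whose items have a "name"
    key; a missing list yields an empty result instead of raising.
    """
    return [feature["name"] for feature in payload.get("features", [])]
```

Keeping URL construction and payload parsing separate from the HTTP call means both can be unit tested without network access, in the same spirit as the network-free modules listed above.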

Known limitations

  • Auto-detection is intentionally conservative. It only auto-applies a mapping when it recognizes both conversation-role tags (human/gpt, user/assistant, user/model) or an exact known flat-field-name pair. Anything else routes to manual mapping on purpose - a near-miss column name (e.g. question/chosen instead of prompt/chosen) won't get silently mis-mapped. The known-pairs list is a plain Python list at the top of schema_detect.py if you want to extend it for patterns you hit a lot.
  • Datasets that bundle extra per-row enrichment beyond a simple triplet (for example, appending a vulnerability note to an answer, or synthesizing a prompt from a code snippet plus a language tag) won't be replicated by the generic mapper - it pulls the two fields you point it at, verbatim. If you need that kind of enrichment, it's a small addition to field_mapper.py.
  • Gated datasets load only if the signed-in user's token already has read access to them; there's no separate "request access" flow built in here.
  • The gr.OAuthToken / gr.OAuthProfile injection follows the documented Gradio pattern (annotated trailing parameters, not listed in inputs=), but it could only be validated structurally in this environment - there was no live Spaces OAuth session available to confirm it end to end. Run a quick sign-in-and-push smoke test right after deploying, before relying on it.
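
The conservative auto-detection described in the first limitation above can be sketched roughly as follows. This is an illustrative approximation, not the contents of schema_detect.py: the list names and function names are hypothetical, the flat pairs shown are only the two this README mentions, and the `from` / `role` turn keys are the usual ShareGPT and OpenAI conventions, assumed here.

```python
# Role-tag pairs named in this README, as (user_tag, assistant_tag).
ROLE_TAG_PAIRS = [("human", "gpt"), ("user", "assistant"), ("user", "model")]

# Exact flat column pairs; only the two mentioned in this README are listed.
KNOWN_FLAT_PAIRS = [("prompt", "chosen"), ("query", "answer")]


def detect_flat_pair(columns):
    """Return a known (prompt_col, response_col) pair, or None.

    Both names must match exactly, so a near miss such as question/chosen
    falls through to manual mapping.
    """
    available = set(columns)
    for prompt_col, response_col in KNOWN_FLAT_PAIRS:
        if prompt_col in available and response_col in available:
            return (prompt_col, response_col)
    return None


def detect_role_tags(turns):
    """Return the (user_tag, assistant_tag) pair seen in a conversation, or None.

    Each turn is a dict tagged under "from" (ShareGPT style) or "role"
    (OpenAI style). Both tags of a pair must be present to count as a match.
    """
    seen = {t.get("from") or t.get("role") for t in turns if isinstance(t, dict)}
    for user_tag, assistant_tag in ROLE_TAG_PAIRS:
        if user_tag in seen and assistant_tag in seen:
            return (user_tag, assistant_tag)
    return None
```

A `None` result is the signal to show the raw columns and ask for a manual mapping. Extending detection for a pattern you hit often would mean appending one tuple to the pairs list.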