	---
	title: Dataset Creator
	emoji: 🧩
	colorFrom: indigo
	colorTo: pink
	sdk: gradio
	sdk_version: 6.19.0
	app_file: app.py
	pinned: false
	hf_oauth: true
	hf_oauth_scopes:
	- write-repos
	short_description: Combine chat-format data from multiple Hugging Face datasets
	---

	# Dataset Creator

	Combine chat-format conversation data from multiple Hugging Face datasets into
	one shuffled, ready-to-train dataset - no notebook required.

	## What it does

	- Add any number of HF datasets by repo id (dynamic list, no cap).
	- Each one gets a schema check via the lightweight `datasets-server` API
	(falls back to a short streaming pull for gated or script-based datasets).
	Known ShareGPT/OpenAI-style conversation lists and a handful of common flat
	field pairs (`prompt`/`chosen`, `query`/`answer`, etc.) are detected
	automatically.
	- Anything that doesn't match a known shape shows you the raw columns so you
	can map them by hand - no silent guesses, no half-right auto-mapping.
	- Per-dataset sample limit and an optional system-prompt override.
	- Build the combined, shuffled dataset, then either push it straight to your
	own HF account (private or public, via your own OAuth login) or download it
	as JSONL.
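
The build step in the last bullet can be sketched roughly as below. This is a minimal illustration, not the code in this repo: the `build_combined` and `to_jsonl` names, the `sources` dict keys, and the `system`/`user`/`assistant` row shape are all assumptions made for the example.

```python
import json
import random


def build_combined(sources, seed=42):
    """Combine rows from several datasets, then shuffle them.

    `sources` is a list of dicts with hypothetical keys:
      rows          - list of {"system", "user", "assistant"} dicts
      limit         - max rows to take from this dataset (None or absent = all)
      system_prompt - optional override applied to every row taken
    """
    combined = []
    for src in sources:
        rows = src["rows"]
        limit = src.get("limit")
        if limit is not None:
            rows = rows[:limit]  # per-dataset sample limit
        override = src.get("system_prompt")
        for row in rows:
            out = dict(row)  # copy so the caller's rows are not mutated
            if override is not None:
                out["system"] = override  # optional system-prompt override
            combined.append(out)
    # A seeded RNG keeps the shuffle reproducible across runs.
    random.Random(seed).shuffle(combined)
    return combined


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in rows)
```

Seeding the shuffle is a design choice for this sketch only; whether the app fixes a seed is not stated here.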

	## Auth

	Sign in with the button at the top. Pushing to the Hub uses your account
	through HF's OAuth flow - there's no token to paste in, and nothing is stored
	server-side between sessions.

	## Project layout

	Everything lives flat in the repo root (no subfolders, so it uploads fine
	through the Spaces web UI one file at a time):

	- `app.py` - Gradio UI and event wiring.
	- `models.py` - the `DatasetEntry` / `FieldMapping` dataclasses shared by
	everything else.
	- `schema_detect.py` - pure schema-detection logic. No network calls in
	here, which makes it the easy part to unit test on its own.
	- `field_mapper.py` - pure row-to-triplet extraction logic, also network-free.
	- `hf_inspect.py` - peeks at a dataset's shape via the lightweight
	`datasets-server` API (with a streaming fallback) without downloading it.
	- `hf_dataset_loader.py` - pulls the real rows, up to the per-dataset limit.
	- `hf_publish.py` - pushes the combined dataset to the Hub, or writes it
	out as JSONL.
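
To make the split between `models.py` and `field_mapper.py` concrete, here is a hedged sketch of what the shared dataclasses and a network-free row extractor could look like. Only the class names `DatasetEntry` and `FieldMapping` come from this README; every field name, the `extract_triplet` function, and the assumption that a "triplet" means system/user/assistant are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FieldMapping:
    # Hypothetical fields: which columns hold the user and assistant text.
    user_field: str
    assistant_field: str
    system_field: Optional[str] = None


@dataclass
class DatasetEntry:
    # Hypothetical fields mirroring the per-dataset options in the UI.
    repo_id: str
    mapping: Optional[FieldMapping] = None
    sample_limit: Optional[int] = None
    system_prompt: Optional[str] = None


def extract_triplet(row: dict[str, Any], entry: DatasetEntry) -> Optional[dict[str, str]]:
    """Return a system/user/assistant dict, or None if the row can't be mapped."""
    m = entry.mapping
    if m is None:
        return None  # no mapping chosen yet
    user = row.get(m.user_field)
    assistant = row.get(m.assistant_field)
    if not isinstance(user, str) or not isinstance(assistant, str):
        return None  # missing or non-text field: skip rather than guess
    system = ""
    if m.system_field is not None and isinstance(row.get(m.system_field), str):
        system = row[m.system_field]
    if entry.system_prompt is not None:
        system = entry.system_prompt  # the override wins over the row's own value
    return {"system": system, "user": user, "assistant": assistant}
```

Because nothing here touches the network, a function like this can be unit tested with plain dict rows.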

	## Known limitations

	- Auto-detection is intentionally conservative. It auto-applies a mapping
	only when it recognizes both role tags of a known conversation pair
	(`human`/`gpt`, `user`/`assistant`, `user`/`model`) or an exact known
	flat-field-name pair.
	Anything else routes to manual mapping on purpose - a near-miss column name
	(e.g. `question`/`chosen` instead of `prompt`/`chosen`) won't get silently
	mis-mapped. The known-pairs list is a plain Python list at the top of
	`schema_detect.py` if you want to extend it for patterns you hit a lot.
	- Datasets that bundle extra per-row enrichment beyond a simple triplet (for
	example, appending a vulnerability note to an answer, or synthesizing a
	prompt from a code snippet plus a language tag) won't be replicated by the
	generic mapper - it pulls the two fields you point it at, verbatim. If you
	need that kind of enrichment, it's a small addition to `field_mapper.py`.
	- Gated datasets load only if the signed-in user's account already has read
	access to them; there's no "request access" flow built in here.
	- The `gr.OAuthToken` / `gr.OAuthProfile` injection is wired to the documented
	Gradio pattern (annotated trailing parameters, not listed in `inputs=`) but
	could only be validated structurally in this environment - there was no
	live Spaces OAuth session available to confirm end to end. Worth a quick
	sign-in-and-push smoke test right after deploying, before relying on it.
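
The conservative rule from the first limitation above (exact matches only, everything else goes to manual mapping) can be sketched as follows. This is an illustration under assumptions, not the contents of `schema_detect.py`: the role-tag pairs and the two flat pairs are the ones named in this README, while `detect_mapping`, the conversation column names, the role keys, and the return shape are hypothetical.

```python
from typing import Optional

# Role-tag pairs named in this README, as (user_tag, assistant_tag).
KNOWN_ROLE_PAIRS = [("human", "gpt"), ("user", "assistant"), ("user", "model")]
# Flat pairs named in this README; the real list may be longer.
KNOWN_FLAT_PAIRS = [("prompt", "chosen"), ("query", "answer")]
# Assumptions for this sketch: common ShareGPT/OpenAI-style column and key names.
CONVERSATION_COLUMNS = ("conversations", "messages")
ROLE_KEYS = ("from", "role")


def detect_mapping(columns: list[str], sample_row: dict) -> Optional[dict]:
    """Return a mapping only on an exact match; None means "map by hand"."""
    for col in CONVERSATION_COLUMNS:
        turns = sample_row.get(col)
        if col not in columns or not isinstance(turns, list):
            continue
        for role_key in ROLE_KEYS:
            tags = {
                t[role_key]
                for t in turns
                if isinstance(t, dict) and isinstance(t.get(role_key), str)
            }
            for user_tag, assistant_tag in KNOWN_ROLE_PAIRS:
                # Both tags of a pair must be present, never just one.
                if user_tag in tags and assistant_tag in tags:
                    return {
                        "kind": "conversation",
                        "column": col,
                        "role_key": role_key,
                        "user_tag": user_tag,
                        "assistant_tag": assistant_tag,
                    }
    for user_field, assistant_field in KNOWN_FLAT_PAIRS:
        # Exact column names only: a near miss falls through to None.
        if user_field in columns and assistant_field in columns:
            return {
                "kind": "flat",
                "user_field": user_field,
                "assistant_field": assistant_field,
            }
    return None
```

With this shape, a `question`/`chosen` dataset returns `None` instead of being treated as `prompt`/`chosen`, which matches the near-miss behavior described above.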