Spaces:

Luminia
/

Augmentoolkit

Sleeping

App Files Files Community

Augmentoolkit / README.md

Nekochu

Augmentoolkit - 5 pipelines, 7 question styles, CLI + WebGPU

ba79c5a 7 days ago

preview code

raw

history blame contribute delete

2.52 kB

	---
	title: Augmentoolkit CPU
	emoji: 📊
	colorFrom: green
	colorTo: blue
	sdk: gradio
	sdk_version: 6.10.0
	app_file: app.py
	pinned: false
	license: mit
	python_version: "3.10"
	---

	# Augmentoolkit — Synthetic Dataset Generator

	Generate synthetic training datasets from text using LLM APIs or WebGPU browser inference. Faithful port of [augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong.

	## Pipelines

	\| Pipeline \| Purpose \| Status \|
	\|----------\|---------\|--------\|
	\| Factual Q&A \| Educational Q&A pairs with 9-step validation \| Full \|
	\| RPToolkit \| Roleplay/character training data (scenes, stories, archetypes) \| Prompts ready, pipeline stub \|
	\| Representation Variation \| Rephrase content as essays, lists, JSON, XML, logic chains \| Prompts ready, pipeline stub \|
	\| RAG Training \| RAG success/failure conversation data \| Prompts ready, pipeline stub \|
	\| Correction \| Fix/improve generated data \| Prompts ready, pipeline stub \|

	## Factual Q&A Pipeline (9 steps)

	1. Chunk text into paragraphs
	2. Filter chunks for suitability (5 few-shot examples)
	3. Generate Q&A pairs (5 few-shot, 7 question styles)
	4. Check questions — relevance (3 few-shot)
	5. Check answer relevancy — no hallucinations (5 few-shot)
	6. Check answer accuracy — fact-check against source (5 few-shot)
	7. Context repair — REWORD/PASS/FAIL source references
	8. Conversation generation — optional multi-turn dialogue
	9. Format as ShareGPT JSONL

	### Question Styles

	Standard, Comparison, Followup, Hallucination (I don't know), Negative, Open-ended, Vague

	## Backends

	\| Backend \| Speed \| Key \|
	\|---------\|-------\|-----\|
	\| WebGPU (Nemotron-3-Nano 4B) \| 20-40 t/s \| No (browser) \|
	\| OpenRouter \| Fast \| Yes (free tier) \|
	\| Groq Free \| Very fast \| Yes (free) \|
	\| HuggingFace Inference \| Moderate \| Yes \|
	\| Custom Endpoint \| Varies \| Yes \|

	## CLI

	```bash
	python app.py cli input.txt \
	--url http://localhost:5000/v1 \
	--api-key your-key \
	--model mistralai/mistral-7b-instruct \
	--num-questions 5 \
	--chunk-size 8000 \
	--output dataset.jsonl \
	--generate-conversations
	```

	Options: `--skip-filter`, `--skip-validation`, `--temperature 0.7`, `--generate-conversations`

	## API + MCP

	- REST API: `POST /api/generate_dataset`
	- MCP Server: enabled (`mcp_server=True`)

	## Credits

	[augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong — all prompt templates faithfully ported with original few-shot examples.