--- title: Augmentoolkit CPU emoji: 📊 colorFrom: green colorTo: blue sdk: gradio sdk_version: 6.10.0 app_file: app.py pinned: false license: mit python_version: "3.10" --- # Augmentoolkit — Synthetic Dataset Generator Generate synthetic training datasets from text using LLM APIs or WebGPU browser inference. Faithful port of [augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong. ## Pipelines | Pipeline | Purpose | Status | |----------|---------|--------| | **Factual Q&A** | Educational Q&A pairs with 9-step validation | Full | | **RPToolkit** | Roleplay/character training data (scenes, stories, archetypes) | Prompts ready, pipeline stub | | **Representation Variation** | Rephrase content as essays, lists, JSON, XML, logic chains | Prompts ready, pipeline stub | | **RAG Training** | RAG success/failure conversation data | Prompts ready, pipeline stub | | **Correction** | Fix/improve generated data | Prompts ready, pipeline stub | ## Factual Q&A Pipeline (9 steps) 1. **Chunk** text into paragraphs 2. **Filter** chunks for suitability (5 few-shot examples) 3. **Generate** Q&A pairs (5 few-shot, 7 question styles) 4. **Check questions** — relevance (3 few-shot) 5. **Check answer relevancy** — no hallucinations (5 few-shot) 6. **Check answer accuracy** — fact-check against source (5 few-shot) 7. **Context repair** — REWORD/PASS/FAIL source references 8. **Conversation generation** — optional multi-turn dialogue 9. **Format** as ShareGPT JSONL ### Question Styles Standard, Comparison, Followup, Hallucination (I don't know), Negative, Open-ended, Vague ## Backends | Backend | Speed | Key | |---------|-------|-----| | WebGPU (Nemotron-3-Nano 4B) | 20-40 t/s | No (browser) | | OpenRouter | Fast | Yes (free tier) | | Groq Free | Very fast | Yes (free) | | HuggingFace Inference | Moderate | Yes | | Custom Endpoint | Varies | Yes | ## CLI ```bash python app.py cli input.txt \ --url http://localhost:5000/v1 \ --api-key your-key \ --model mistralai/mistral-7b-instruct \ --num-questions 5 \ --chunk-size 8000 \ --output dataset.jsonl \ --generate-conversations ``` Options: `--skip-filter`, `--skip-validation`, `--temperature 0.7`, `--generate-conversations` ## API + MCP - REST API: `POST /api/generate_dataset` - MCP Server: enabled (`mcp_server=True`) ## Credits [augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong — all prompt templates faithfully ported with original few-shot examples.