Spaces:
Sleeping
Sleeping
| title: Augmentoolkit CPU | |
| emoji: π | |
| colorFrom: green | |
| colorTo: blue | |
| sdk: gradio | |
| sdk_version: 6.10.0 | |
| app_file: app.py | |
| pinned: false | |
| license: mit | |
| python_version: "3.10" | |
| # Augmentoolkit β Synthetic Dataset Generator | |
| Generate synthetic training datasets from text using LLM APIs or WebGPU browser inference. Faithful port of [augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong. | |
| ## Pipelines | |
| | Pipeline | Purpose | Status | | |
| |----------|---------|--------| | |
| | **Factual Q&A** | Educational Q&A pairs with 9-step validation | Full | | |
| | **RPToolkit** | Roleplay/character training data (scenes, stories, archetypes) | Prompts ready, pipeline stub | | |
| | **Representation Variation** | Rephrase content as essays, lists, JSON, XML, logic chains | Prompts ready, pipeline stub | | |
| | **RAG Training** | RAG success/failure conversation data | Prompts ready, pipeline stub | | |
| | **Correction** | Fix/improve generated data | Prompts ready, pipeline stub | | |
| ## Factual Q&A Pipeline (9 steps) | |
| 1. **Chunk** text into paragraphs | |
| 2. **Filter** chunks for suitability (5 few-shot examples) | |
| 3. **Generate** Q&A pairs (5 few-shot, 7 question styles) | |
| 4. **Check questions** β relevance (3 few-shot) | |
| 5. **Check answer relevancy** β no hallucinations (5 few-shot) | |
| 6. **Check answer accuracy** β fact-check against source (5 few-shot) | |
| 7. **Context repair** β REWORD/PASS/FAIL source references | |
| 8. **Conversation generation** β optional multi-turn dialogue | |
| 9. **Format** as ShareGPT JSONL | |
| ### Question Styles | |
| Standard, Comparison, Followup, Hallucination (I don't know), Negative, Open-ended, Vague | |
| ## Backends | |
| | Backend | Speed | Key | | |
| |---------|-------|-----| | |
| | WebGPU (Nemotron-3-Nano 4B) | 20-40 t/s | No (browser) | | |
| | OpenRouter | Fast | Yes (free tier) | | |
| | Groq Free | Very fast | Yes (free) | | |
| | HuggingFace Inference | Moderate | Yes | | |
| | Custom Endpoint | Varies | Yes | | |
| ## CLI | |
| ```bash | |
| python app.py cli input.txt \ | |
| --url http://localhost:5000/v1 \ | |
| --api-key your-key \ | |
| --model mistralai/mistral-7b-instruct \ | |
| --num-questions 5 \ | |
| --chunk-size 8000 \ | |
| --output dataset.jsonl \ | |
| --generate-conversations | |
| ``` | |
| Options: `--skip-filter`, `--skip-validation`, `--temperature 0.7`, `--generate-conversations` | |
| ## API + MCP | |
| - REST API: `POST /api/generate_dataset` | |
| - MCP Server: enabled (`mcp_server=True`) | |
| ## Credits | |
| [augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong β all prompt templates faithfully ported with original few-shot examples. | |