Spaces:
Sleeping
Sleeping
A newer version of the Gradio SDK is available: 6.11.0
metadata
title: Augmentoolkit CPU
emoji: π
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 6.10.0
app_file: app.py
pinned: false
license: mit
python_version: '3.10'
Augmentoolkit β Synthetic Dataset Generator
Generate synthetic training datasets from text using LLM APIs or WebGPU browser inference. Faithful port of augmentoolkit by e-p-armstrong.
Pipelines
| Pipeline | Purpose | Status |
|---|---|---|
| Factual Q&A | Educational Q&A pairs with 9-step validation | Full |
| RPToolkit | Roleplay/character training data (scenes, stories, archetypes) | Prompts ready, pipeline stub |
| Representation Variation | Rephrase content as essays, lists, JSON, XML, logic chains | Prompts ready, pipeline stub |
| RAG Training | RAG success/failure conversation data | Prompts ready, pipeline stub |
| Correction | Fix/improve generated data | Prompts ready, pipeline stub |
Factual Q&A Pipeline (9 steps)
- Chunk text into paragraphs
- Filter chunks for suitability (5 few-shot examples)
- Generate Q&A pairs (5 few-shot, 7 question styles)
- Check questions β relevance (3 few-shot)
- Check answer relevancy β no hallucinations (5 few-shot)
- Check answer accuracy β fact-check against source (5 few-shot)
- Context repair β REWORD/PASS/FAIL source references
- Conversation generation β optional multi-turn dialogue
- Format as ShareGPT JSONL
Question Styles
Standard, Comparison, Followup, Hallucination (I don't know), Negative, Open-ended, Vague
Backends
| Backend | Speed | Key |
|---|---|---|
| WebGPU (Nemotron-3-Nano 4B) | 20-40 t/s | No (browser) |
| OpenRouter | Fast | Yes (free tier) |
| Groq Free | Very fast | Yes (free) |
| HuggingFace Inference | Moderate | Yes |
| Custom Endpoint | Varies | Yes |
CLI
python app.py cli input.txt \
--url http://localhost:5000/v1 \
--api-key your-key \
--model mistralai/mistral-7b-instruct \
--num-questions 5 \
--chunk-size 8000 \
--output dataset.jsonl \
--generate-conversations
Options: --skip-filter, --skip-validation, --temperature 0.7, --generate-conversations
API + MCP
- REST API:
POST /api/generate_dataset - MCP Server: enabled (
mcp_server=True)
Credits
augmentoolkit by e-p-armstrong β all prompt templates faithfully ported with original few-shot examples.