Spaces:
Sleeping
Sleeping
Commit Β·
ba79c5a
0
Parent(s):
Augmentoolkit - 5 pipelines, 7 question styles, CLI + WebGPU
Browse files- .gitignore +6 -0
- README.md +76 -0
- app.py +0 -0
- requirements.txt +4 -0
.gitignore
ADDED
|
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
.venv/
|
| 2 |
+
__pycache__/
|
| 3 |
+
*.pyc
|
| 4 |
+
.specs/
|
| 5 |
+
*.jsonl
|
| 6 |
+
|
README.md
ADDED
|
@@ -0,0 +1,76 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: Augmentoolkit CPU
|
| 3 |
+
emoji: π
|
| 4 |
+
colorFrom: green
|
| 5 |
+
colorTo: blue
|
| 6 |
+
sdk: gradio
|
| 7 |
+
sdk_version: 6.10.0
|
| 8 |
+
app_file: app.py
|
| 9 |
+
pinned: false
|
| 10 |
+
license: mit
|
| 11 |
+
python_version: "3.10"
|
| 12 |
+
---
|
| 13 |
+
|
| 14 |
+
# Augmentoolkit β Synthetic Dataset Generator
|
| 15 |
+
|
| 16 |
+
Generate synthetic training datasets from text using LLM APIs or WebGPU browser inference. Faithful port of [augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong.
|
| 17 |
+
|
| 18 |
+
## Pipelines
|
| 19 |
+
|
| 20 |
+
| Pipeline | Purpose | Status |
|
| 21 |
+
|----------|---------|--------|
|
| 22 |
+
| **Factual Q&A** | Educational Q&A pairs with 9-step validation | Full |
|
| 23 |
+
| **RPToolkit** | Roleplay/character training data (scenes, stories, archetypes) | Prompts ready, pipeline stub |
|
| 24 |
+
| **Representation Variation** | Rephrase content as essays, lists, JSON, XML, logic chains | Prompts ready, pipeline stub |
|
| 25 |
+
| **RAG Training** | RAG success/failure conversation data | Prompts ready, pipeline stub |
|
| 26 |
+
| **Correction** | Fix/improve generated data | Prompts ready, pipeline stub |
|
| 27 |
+
|
| 28 |
+
## Factual Q&A Pipeline (9 steps)
|
| 29 |
+
|
| 30 |
+
1. **Chunk** text into paragraphs
|
| 31 |
+
2. **Filter** chunks for suitability (5 few-shot examples)
|
| 32 |
+
3. **Generate** Q&A pairs (5 few-shot, 7 question styles)
|
| 33 |
+
4. **Check questions** β relevance (3 few-shot)
|
| 34 |
+
5. **Check answer relevancy** β no hallucinations (5 few-shot)
|
| 35 |
+
6. **Check answer accuracy** β fact-check against source (5 few-shot)
|
| 36 |
+
7. **Context repair** β REWORD/PASS/FAIL source references
|
| 37 |
+
8. **Conversation generation** β optional multi-turn dialogue
|
| 38 |
+
9. **Format** as ShareGPT JSONL
|
| 39 |
+
|
| 40 |
+
### Question Styles
|
| 41 |
+
|
| 42 |
+
Standard, Comparison, Followup, Hallucination (I don't know), Negative, Open-ended, Vague
|
| 43 |
+
|
| 44 |
+
## Backends
|
| 45 |
+
|
| 46 |
+
| Backend | Speed | Key |
|
| 47 |
+
|---------|-------|-----|
|
| 48 |
+
| WebGPU (Nemotron-3-Nano 4B) | 20-40 t/s | No (browser) |
|
| 49 |
+
| OpenRouter | Fast | Yes (free tier) |
|
| 50 |
+
| Groq Free | Very fast | Yes (free) |
|
| 51 |
+
| HuggingFace Inference | Moderate | Yes |
|
| 52 |
+
| Custom Endpoint | Varies | Yes |
|
| 53 |
+
|
| 54 |
+
## CLI
|
| 55 |
+
|
| 56 |
+
```bash
|
| 57 |
+
python app.py cli input.txt \
|
| 58 |
+
--url http://localhost:5000/v1 \
|
| 59 |
+
--api-key your-key \
|
| 60 |
+
--model mistralai/mistral-7b-instruct \
|
| 61 |
+
--num-questions 5 \
|
| 62 |
+
--chunk-size 8000 \
|
| 63 |
+
--output dataset.jsonl \
|
| 64 |
+
--generate-conversations
|
| 65 |
+
```
|
| 66 |
+
|
| 67 |
+
Options: `--skip-filter`, `--skip-validation`, `--temperature 0.7`, `--generate-conversations`
|
| 68 |
+
|
| 69 |
+
## API + MCP
|
| 70 |
+
|
| 71 |
+
- REST API: `POST /api/generate_dataset`
|
| 72 |
+
- MCP Server: enabled (`mcp_server=True`)
|
| 73 |
+
|
| 74 |
+
## Credits
|
| 75 |
+
|
| 76 |
+
[augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong β all prompt templates faithfully ported with original few-shot examples.
|
app.py
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
requirements.txt
ADDED
|
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
gradio[mcp]>=6.0.0
|
| 2 |
+
openai>=1.0.0
|
| 3 |
+
httpx
|
| 4 |
+
nltk
|