Nekochu commited on
Commit
ba79c5a
Β·
0 Parent(s):

Augmentoolkit - 5 pipelines, 7 question styles, CLI + WebGPU

Browse files
Files changed (4) hide show
  1. .gitignore +6 -0
  2. README.md +76 -0
  3. app.py +0 -0
  4. requirements.txt +4 -0
.gitignore ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ .venv/
2
+ __pycache__/
3
+ *.pyc
4
+ .specs/
5
+ *.jsonl
6
+
README.md ADDED
@@ -0,0 +1,76 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: Augmentoolkit CPU
3
+ emoji: πŸ“Š
4
+ colorFrom: green
5
+ colorTo: blue
6
+ sdk: gradio
7
+ sdk_version: 6.10.0
8
+ app_file: app.py
9
+ pinned: false
10
+ license: mit
11
+ python_version: "3.10"
12
+ ---
13
+
14
+ # Augmentoolkit β€” Synthetic Dataset Generator
15
+
16
+ Generate synthetic training datasets from text using LLM APIs or WebGPU browser inference. Faithful port of [augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong.
17
+
18
+ ## Pipelines
19
+
20
+ | Pipeline | Purpose | Status |
21
+ |----------|---------|--------|
22
+ | **Factual Q&A** | Educational Q&A pairs with 9-step validation | Full |
23
+ | **RPToolkit** | Roleplay/character training data (scenes, stories, archetypes) | Prompts ready, pipeline stub |
24
+ | **Representation Variation** | Rephrase content as essays, lists, JSON, XML, logic chains | Prompts ready, pipeline stub |
25
+ | **RAG Training** | RAG success/failure conversation data | Prompts ready, pipeline stub |
26
+ | **Correction** | Fix/improve generated data | Prompts ready, pipeline stub |
27
+
28
+ ## Factual Q&A Pipeline (9 steps)
29
+
30
+ 1. **Chunk** text into paragraphs
31
+ 2. **Filter** chunks for suitability (5 few-shot examples)
32
+ 3. **Generate** Q&A pairs (5 few-shot, 7 question styles)
33
+ 4. **Check questions** β€” relevance (3 few-shot)
34
+ 5. **Check answer relevancy** β€” no hallucinations (5 few-shot)
35
+ 6. **Check answer accuracy** β€” fact-check against source (5 few-shot)
36
+ 7. **Context repair** β€” REWORD/PASS/FAIL source references
37
+ 8. **Conversation generation** β€” optional multi-turn dialogue
38
+ 9. **Format** as ShareGPT JSONL
39
+
40
+ ### Question Styles
41
+
42
+ Standard, Comparison, Followup, Hallucination (I don't know), Negative, Open-ended, Vague
43
+
44
+ ## Backends
45
+
46
+ | Backend | Speed | Key |
47
+ |---------|-------|-----|
48
+ | WebGPU (Nemotron-3-Nano 4B) | 20-40 t/s | No (browser) |
49
+ | OpenRouter | Fast | Yes (free tier) |
50
+ | Groq Free | Very fast | Yes (free) |
51
+ | HuggingFace Inference | Moderate | Yes |
52
+ | Custom Endpoint | Varies | Yes |
53
+
54
+ ## CLI
55
+
56
+ ```bash
57
+ python app.py cli input.txt \
58
+ --url http://localhost:5000/v1 \
59
+ --api-key your-key \
60
+ --model mistralai/mistral-7b-instruct \
61
+ --num-questions 5 \
62
+ --chunk-size 8000 \
63
+ --output dataset.jsonl \
64
+ --generate-conversations
65
+ ```
66
+
67
+ Options: `--skip-filter`, `--skip-validation`, `--temperature 0.7`, `--generate-conversations`
68
+
69
+ ## API + MCP
70
+
71
+ - REST API: `POST /api/generate_dataset`
72
+ - MCP Server: enabled (`mcp_server=True`)
73
+
74
+ ## Credits
75
+
76
+ [augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong β€” all prompt templates faithfully ported with original few-shot examples.
app.py ADDED
The diff for this file is too large to render. See raw diff
 
requirements.txt ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ gradio[mcp]>=6.0.0
2
+ openai>=1.0.0
3
+ httpx
4
+ nltk