---
title: Augmentoolkit CPU
emoji: πŸ“Š
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 6.10.0
app_file: app.py
pinned: false
license: mit
python_version: "3.10"
---

# Augmentoolkit β€” Synthetic Dataset Generator

Generate synthetic training datasets from plain text using LLM APIs or in-browser WebGPU inference. A faithful port of [augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong.

## Pipelines

| Pipeline | Purpose | Status |
|----------|---------|--------|
| **Factual Q&A** | Educational Q&A pairs via a 9-step generate-and-validate pipeline | Full |
| **RPToolkit** | Roleplay/character training data (scenes, stories, archetypes) | Prompts ready, pipeline stub |
| **Representation Variation** | Rephrase content as essays, lists, JSON, XML, logic chains | Prompts ready, pipeline stub |
| **RAG Training** | RAG success/failure conversation data | Prompts ready, pipeline stub |
| **Correction** | Fix/improve generated data | Prompts ready, pipeline stub |

## Factual Q&A Pipeline (9 steps)

1. **Chunk** text into paragraphs
2. **Filter** chunks for suitability (5 few-shot examples)
3. **Generate** Q&A pairs (5 few-shot, 7 question styles)
4. **Check questions** β€” relevance (3 few-shot)
5. **Check answer relevancy** β€” no hallucinations (5 few-shot)
6. **Check answer accuracy** β€” fact-check against source (5 few-shot)
7. **Context repair** β€” REWORD/PASS/FAIL source references
8. **Conversation generation** β€” optional multi-turn dialogue
9. **Format** as ShareGPT JSONL
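
Step 9's output format can be sketched as follows. This is a hypothetical helper, not the app's actual code; ShareGPT JSONL stores each conversation as one JSON object per line, with turns tagged `human` and `gpt`:

```python
import json

def to_sharegpt_jsonl(qa_pairs):
    """Format (question, answer) pairs as ShareGPT-style JSONL lines."""
    lines = []
    for question, answer in qa_pairs:
        record = {
            "conversations": [
                {"from": "human", "value": question},
                {"from": "gpt", "value": answer},
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Multi-turn conversations from step 8 fit the same shape by appending further `human`/`gpt` turns to the `conversations` list.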

### Question Styles

Standard, Comparison, Followup, Hallucination ("I don't know"), Negative, Open-ended, Vague

## Backends

| Backend | Speed | API key |
|---------|-------|-----|
| WebGPU (Nemotron-3-Nano 4B) | 20-40 t/s | No (browser) |
| OpenRouter | Fast | Yes (free tier) |
| Groq Free | Very fast | Yes (free) |
| HuggingFace Inference | Moderate | Yes |
| Custom Endpoint | Varies | Yes |
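
The API-key backends above all speak the OpenAI-compatible chat completions protocol, so one request shape covers OpenRouter, Groq, HF Inference, and custom endpoints. A minimal sketch (helper name and defaults are illustrative, not the app's internals):

```python
def build_chat_request(base_url, model, prompt, api_key="", temperature=0.7):
    """Assemble URL, headers, and JSON body for an OpenAI-style
    POST /chat/completions call; send with any HTTP client."""
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return url, headers, body
```

Switching backends then amounts to swapping `base_url`, `model`, and the key.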

## CLI

```bash
python app.py cli input.txt \
  --url http://localhost:5000/v1 \
  --api-key your-key \
  --model mistralai/mistral-7b-instruct \
  --num-questions 5 \
  --chunk-size 8000 \
  --output dataset.jsonl \
  --generate-conversations
```

Options: `--skip-filter`, `--skip-validation`, `--temperature 0.7`, `--generate-conversations`
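
The effect of `--chunk-size` (step 1 of the pipeline) can be sketched as packing blank-line-separated paragraphs into chunks no larger than the limit. A simplified stand-in for the real chunker:

```python
def chunk_text(text, chunk_size=8000):
    """Split text on blank lines, then greedily pack paragraphs
    into chunks of at most chunk_size characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = current + "\n\n" + para if current else para
        if current and len(candidate) > chunk_size:
            # Current chunk is full; start a new one with this paragraph.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than the limit becomes its own oversized chunk rather than being split mid-paragraph.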

## API + MCP

- REST API: `POST /api/generate_dataset`
- MCP Server: enabled (`mcp_server=True`)

## Credits

[augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong β€” all prompt templates faithfully ported with original few-shot examples.