Spaces:

Luminia
/

Augmentoolkit

Running

App Files Files Community

Nekochu commited on Mar 28

Commit

ba79c5a

0 Parent(s):

Augmentoolkit - 5 pipelines, 7 question styles, CLI + WebGPU

Browse files

Files changed (4) hide show

.gitignore +6 -0
README.md +76 -0
app.py +0 -0
requirements.txt +4 -0

.gitignore ADDED Viewed

	@@ -0,0 +1,6 @@

+.venv/
+__pycache__/
+*.pyc
+.specs/
+*.jsonl

README.md ADDED Viewed

	@@ -0,0 +1,76 @@

+---
+title: Augmentoolkit CPU
+emoji: 📊
+colorFrom: green
+colorTo: blue
+sdk: gradio
+sdk_version: 6.10.0
+app_file: app.py
+pinned: false
+license: mit
+python_version: "3.10"
+---
+# Augmentoolkit — Synthetic Dataset Generator
+Generate synthetic training datasets from text using LLM APIs or WebGPU browser inference. Faithful port of [augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong.
+## Pipelines
+| Pipeline | Purpose | Status |
+|----------|---------|--------|
+| **Factual Q&A** | Educational Q&A pairs with 9-step validation | Full |
+| **RPToolkit** | Roleplay/character training data (scenes, stories, archetypes) | Prompts ready, pipeline stub |
+| **Representation Variation** | Rephrase content as essays, lists, JSON, XML, logic chains | Prompts ready, pipeline stub |
+| **RAG Training** | RAG success/failure conversation data | Prompts ready, pipeline stub |
+| **Correction** | Fix/improve generated data | Prompts ready, pipeline stub |
+## Factual Q&A Pipeline (9 steps)
+1. **Chunk** text into paragraphs
+2. **Filter** chunks for suitability (5 few-shot examples)
+3. **Generate** Q&A pairs (5 few-shot, 7 question styles)
+4. **Check questions** — relevance (3 few-shot)
+5. **Check answer relevancy** — no hallucinations (5 few-shot)
+6. **Check answer accuracy** — fact-check against source (5 few-shot)
+7. **Context repair** — REWORD/PASS/FAIL source references
+8. **Conversation generation** — optional multi-turn dialogue
+9. **Format** as ShareGPT JSONL
+### Question Styles
+Standard, Comparison, Followup, Hallucination (I don't know), Negative, Open-ended, Vague
+## Backends
+| Backend | Speed | Key |
+|---------|-------|-----|
+| WebGPU (Nemotron-3-Nano 4B) | 20-40 t/s | No (browser) |
+| OpenRouter | Fast | Yes (free tier) |
+| Groq Free | Very fast | Yes (free) |
+| HuggingFace Inference | Moderate | Yes |
+| Custom Endpoint | Varies | Yes |
+## CLI
+```bash
+python app.py cli input.txt \
+  --url http://localhost:5000/v1 \
+  --api-key your-key \
+  --model mistralai/mistral-7b-instruct \
+  --num-questions 5 \
+  --chunk-size 8000 \
+  --output dataset.jsonl \
+  --generate-conversations
+```
+Options: `--skip-filter`, `--skip-validation`, `--temperature 0.7`, `--generate-conversations`
+## API + MCP
+- REST API: `POST /api/generate_dataset`
+- MCP Server: enabled (`mcp_server=True`)
+## Credits
+[augmentoolkit](https://github.com/e-p-armstrong/augmentoolkit) by e-p-armstrong — all prompt templates faithfully ported with original few-shot examples.

app.py ADDED Viewed

The diff for this file is too large to render. See raw diff

requirements.txt ADDED Viewed

	@@ -0,0 +1,4 @@

+gradio[mcp]>=6.0.0
+openai>=1.0.0
+httpx
+nltk