Augmentoolkit / README.md
Nekochu's picture
Augmentoolkit - 5 pipelines, 7 question styles, CLI + WebGPU
ba79c5a

A newer version of the Gradio SDK is available: 6.11.0

Upgrade
metadata
title: Augmentoolkit CPU
emoji: πŸ“Š
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 6.10.0
app_file: app.py
pinned: false
license: mit
python_version: '3.10'

Augmentoolkit β€” Synthetic Dataset Generator

Generate synthetic training datasets from text using LLM APIs or WebGPU browser inference. Faithful port of augmentoolkit by e-p-armstrong.

Pipelines

Pipeline Purpose Status
Factual Q&A Educational Q&A pairs with 9-step validation Full
RPToolkit Roleplay/character training data (scenes, stories, archetypes) Prompts ready, pipeline stub
Representation Variation Rephrase content as essays, lists, JSON, XML, logic chains Prompts ready, pipeline stub
RAG Training RAG success/failure conversation data Prompts ready, pipeline stub
Correction Fix/improve generated data Prompts ready, pipeline stub

Factual Q&A Pipeline (9 steps)

  1. Chunk text into paragraphs
  2. Filter chunks for suitability (5 few-shot examples)
  3. Generate Q&A pairs (5 few-shot, 7 question styles)
  4. Check questions β€” relevance (3 few-shot)
  5. Check answer relevancy β€” no hallucinations (5 few-shot)
  6. Check answer accuracy β€” fact-check against source (5 few-shot)
  7. Context repair β€” REWORD/PASS/FAIL source references
  8. Conversation generation β€” optional multi-turn dialogue
  9. Format as ShareGPT JSONL

Question Styles

Standard, Comparison, Followup, Hallucination (I don't know), Negative, Open-ended, Vague

Backends

Backend Speed Key
WebGPU (Nemotron-3-Nano 4B) 20-40 t/s No (browser)
OpenRouter Fast Yes (free tier)
Groq Free Very fast Yes (free)
HuggingFace Inference Moderate Yes
Custom Endpoint Varies Yes

CLI

python app.py cli input.txt \
  --url http://localhost:5000/v1 \
  --api-key your-key \
  --model mistralai/mistral-7b-instruct \
  --num-questions 5 \
  --chunk-size 8000 \
  --output dataset.jsonl \
  --generate-conversations

Options: --skip-filter, --skip-validation, --temperature 0.7, --generate-conversations

API + MCP

  • REST API: POST /api/generate_dataset
  • MCP Server: enabled (mcp_server=True)

Credits

augmentoolkit by e-p-armstrong β€” all prompt templates faithfully ported with original few-shot examples.