tags:
- question-generation
- reward-modeling
---

# distill-pipeline — modular synthetic data engine (thinking + instruct)

`distill-pipeline` is a modular **Node.js synthetic data pipeline** that reads JSONL inputs, runs generation + verification + reward, and writes JSONL outputs. It supports both:

- **“Thinking” generators** that produce visible reasoning, and
- **“Instruct” generators** that produce direct answers,

with **separate caches and outputs** so you can compare styles without mixing artefacts.

Rather than owning retrieval, `distill-pipeline` is designed as the **middle layer** in a stack: you feed it JSONL chunks or questions (for example, from [`distill-rag`](https://huggingface.co/htaf/distill-rag) or your own tooling), and it orchestrates the LLM stages to produce clean, reusable synthetic data.
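
For orientation, an input chunk and an accepted output sample might look like this. The field names here are illustrative assumptions, not the project's documented schema:

```javascript
// Hypothetical record shapes only — check the repo's actual schema before relying on these.
const inputChunk = { id: "doc-001#3", text: "JSONL stores one JSON object per line." };

const acceptedSample = {
  question: "How does JSONL store records?",
  answer: "One JSON object per line.",
  reward: 0.92,   // score assigned by the reward stage
  verified: true, // passed the verification stage
};

// Each record is serialized to a single line of the output .jsonl file.
console.log(JSON.stringify(acceptedSample));
```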

---

## What it does

- **JSONL-first pipeline**
  - Reads JSONL chunks (default `data/rag_chunks.jsonl`) or static question seeds (`test_samples/seed_questions.jsonl`).
  - Writes accepted samples as JSONL into `gold/*.jsonl`.

- **Two pipeline modes**
  - **Thinking mode:** question → reasoning-style answer → verification → reward.
  - **Instruct mode:** instruction → direct answer pairs, for fine-tuning assistants.
  - Each mode has its own cache + output paths so you can run them independently.
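
The separation could be pictured like this; the directory names are made up for illustration, so check the repo for the real layout:

```javascript
// Illustrative only: derive per-mode cache/output paths so "thinking" and
// "instruct" runs never overwrite each other's artefacts.
function modePaths(mode) {
  if (mode !== "thinking" && mode !== "instruct") {
    throw new Error(`unknown mode: ${mode}`);
  }
  return {
    cacheDir: `cache/${mode}`,     // hypothetical cache location
    outFile: `gold/${mode}.jsonl`, // hypothetical output file
  };
}
```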

- **Retrieval-agnostic, RAG-friendly**
  - Works with plain JSONL; any RAG stack or pre-processing step that can emit JSONL chunks can plug in.
  - Optional “question-first” mode uses context chunks (or Elasticsearch) to generate questions from your corpus.

- **Stage-based and cache-heavy**
  - Questions, generations, verifications, and rewards are **cached on disk** (JSONL).
  - You can change prompts or models and reuse existing work instead of re-running everything.

- **Local-first providers**
  - Built to run locally with **Ollama** as the default provider for all stages.
  - Also supports OpenAI/HTTP-style providers, plus mock providers for tests/benchmarks.
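
As a rough idea of what a local provider call looks like, here is a request builder for Ollama's `/api/generate` endpoint. The endpoint and body shape follow Ollama's public HTTP API; the model name is just an example, and this is not necessarily how `distill-pipeline` wires its providers:

```javascript
// Build a request for Ollama's local HTTP API; pass the result to fetch().
function buildOllamaRequest(prompt, model = "llama3") {
  return {
    url: "http://localhost:11434/api/generate",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // stream: false returns a single JSON object instead of a token stream
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

// Usage (requires a running Ollama server):
//   const { url, init } = buildOllamaRequest("Say hi");
//   const data = await (await fetch(url, init)).json();
//   console.log(data.response);
```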

- **Monitoring and benchmarks**
  - Live HUD (`scripts/live_bench.mjs`) for real-time throughput/accept-rate monitoring.
  - Benchmark script (`scripts/bench_pipeline.mjs`) to measure pipeline speed without burning GPU on real models.

---

## Quickstart