htaf committed
Commit 2d8f132 · 1 Parent(s): 7ecc945

updated readme

Files changed (1)
  1. README.md +38 -8
README.md CHANGED
@@ -13,17 +13,47 @@ tags:
  - question-generation
  - reward-modeling
  ---
- # distill-pipeline — retrieval-augmented distillation (thinking + instruct)
+ # distill-pipeline — modular synthetic data engine (thinking + instruct)

- `distill-pipeline` is a modular Node.js pipeline that ingests JSONL data, runs retrieval + generation + verification + reward, and writes JSONL outputs. It supports both “thinking” (reasoning-style) and “instruct” (direct answer) generators with separate caches/outputs so you can compare styles without mixing artifacts.
+ `distill-pipeline` is a modular **Node.js synthetic data pipeline** that reads JSONL inputs, runs generation + verification + reward, and writes JSONL outputs. It supports both:
+
+ - **“Thinking” generators** that produce visible reasoning, and
+ - **“Instruct” generators** that produce direct answers,
+
+ with **separate caches and outputs** so you can compare styles without mixing artefacts.
+
+ Rather than owning retrieval, `distill-pipeline` is designed as the **middle layer** in a stack: you feed it JSONL chunks or questions (for example, from [`distill-rag`](https://huggingface.co/htaf/distill-rag) or your own tooling), and it orchestrates the LLM stages to produce clean, reusable synthetic data.
+
+ ---

  ## What it does
- - **JSONL in/out:** Reads JSONL chunks (default `data/rag_chunks.jsonl`) or static question seeds (`test_samples/seed_questions.jsonl`); writes accepted samples to JSONL (`gold/*.jsonl`).
- - **Dual modes:** Thinking generations with visible reasoning, or instruct generations for direct Q/A; each can use its own cache/output.
- - **RAG-friendly:** Question-first mode uses JSONL chunks (or Elasticsearch) to generate questions from context; static mode runs given questions.
- - **Disk caches (JSONL):** Questions, generations, verifications, and rewards are cached to skip already-completed work.
- - **Local-first providers:** Ollama/OpenAI/HTTP, plus mock providers for fast tests/benchmarks.
- - **Monitoring:** Live HUD + benchmark scripts to show pipeline speed without verbose logs.
+
+ - **JSONL-first pipeline**
+   - Reads JSONL chunks (default `data/rag_chunks.jsonl`) or static question seeds (`test_samples/seed_questions.jsonl`).
+   - Writes accepted samples as JSONL into `gold/*.jsonl`.
+
+ - **Two pipeline modes**
+   - **Thinking mode:** question → reasoning-style answer → verification → reward.
+   - **Instruct mode:** instruction → direct answer pairs, for fine-tuning assistants.
+   - Each mode has its own cache + output paths so you can run them independently.
+
+ - **Retrieval-agnostic, RAG-friendly**
+   - Works with plain JSONL; any RAG stack or pre-processing step that can emit JSONL chunks can plug in.
+   - Optional “question-first” mode uses context chunks (or Elasticsearch) to generate questions from your corpus.
+
+ - **Stage-based and cache-heavy**
+   - Questions, generations, verifications, and rewards are **cached on disk** (JSONL).
+   - You can change prompts or models and reuse existing work instead of re-running everything.
+
+ - **Local-first providers**
+   - Built to run locally with **Ollama** as the default provider for all stages.
+   - Also supports OpenAI/HTTP-style providers, plus mock providers for tests/benchmarks.
+
+ - **Monitoring and benchmarks**
+   - Live HUD (`scripts/live_bench.mjs`) for real-time throughput/accept-rate monitoring.
+   - Benchmark script (`scripts/bench_pipeline.mjs`) to measure pipeline speed without burning GPU on real models.
+
+ ---

  ## Quickstart
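The JSONL-first contract the diff describes (read JSONL records, keep accepted samples, write them back out as JSONL) can be sketched in a few lines of plain Node.js. This is a minimal illustration under assumed field names (`id`, `question`, `accepted`), not the pipeline's actual code; in the real pipeline, acceptance would come from the verification + reward stages rather than a pre-set boolean.

```javascript
// Minimal JSONL in/out sketch (hypothetical record shape, not the pipeline's code).

// Parse JSONL: one JSON object per non-empty line.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}

// Serialize records back to JSONL, one object per line.
function toJsonl(records) {
  return records.map((r) => JSON.stringify(r)).join("\n") + "\n";
}

// Stand-in for the verification + reward gate: keep only accepted samples.
function selectAccepted(records) {
  return records.filter((r) => r.accepted === true);
}

const input = [
  '{"id":1,"question":"What is JSONL?","accepted":true}',
  '{"id":2,"question":"Unverified sample","accepted":false}',
].join("\n");

const gold = toJsonl(selectAccepted(parseJsonl(input)));
console.log(gold); // only the accepted record survives
```

In the actual project, the input would come from a file such as `data/rag_chunks.jsonl` and the output would land in `gold/*.jsonl`; for large corpora, streaming line by line (e.g. with `node:readline`) would replace the in-memory split.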
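Likewise, the "cached on disk, skip already-completed work" behaviour reduces to a simple stage pattern: key each stage's results by sample id and only compute cache misses. A hypothetical in-memory sketch (the real pipeline persists these caches as JSONL files, one per stage):

```javascript
// Hypothetical cache-backed stage runner; the Map stands in for a JSONL cache file on disk.
const cache = new Map();

function runStage(samples, stageFn) {
  const results = [];
  let computed = 0; // how many samples actually required model work
  for (const sample of samples) {
    if (!cache.has(sample.id)) {
      cache.set(sample.id, stageFn(sample)); // cache miss: run the stage
      computed += 1;
    }
    results.push(cache.get(sample.id)); // hit or freshly computed
  }
  return { results, computed };
}

const samples = [{ id: "q1" }, { id: "q2" }];
const generate = (s) => ({ id: s.id, answer: `answer for ${s.id}` });

const first = runStage(samples, generate); // computes both samples
const second = runStage(samples, generate); // pure cache hits: computes nothing
console.log(first.computed, second.computed); // 2 0
```

Because each stage (questions, generations, verifications, rewards) is cached independently, changing a downstream prompt or model only re-runs the stages whose caches are cleared.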