htaf committed on
Commit f1653fe · 1 Parent(s): 2739b3a

add HF metadata and repo info

Files changed (2)
  1. README.md +84 -283
  2. package.json +11 -2
README.md CHANGED
@@ -1,305 +1,106 @@
1
- Here is a **clean, successor-ready `README.md`** for your `distill-pipeline` repo.
2
-
3
- It:
4
-
5
- * matches your actual codebase **right now**
6
- * includes the new **question generation** subsystem
7
- * documents both **static** and **question-first** seed modes
8
- * gives correct usage for `npm run pipeline`
9
- * shows environment variables clearly
10
- * stays pragmatic and Canadian-English-friendly
11
- * is concise enough for GitHub, but complete enough for onboarding a new engineer
12
-
13
- Paste it into:
14
-
15
- ```
16
- distill-pipeline/README.md
17
- ```
18
-
19
  ---
20
-
21
- # **distill-pipeline**
22
-
23
- *A modular, retrieval-augmented LLM distillation system.*
24
-
25
- This project runs a multi-stage reasoning pipeline:
26
-
27
- 1. **Question Generation** (optional)
28
- 2. **Retrieval** from a distill-rag Elasticsearch index
29
- 3. **Generator** (teacher model)
30
- 4. **Verifier** (format, alignment, tone)
31
- 5. **Reward Model** (scoring)
32
- 6. **Gold Writer** (clean JSONL dataset)
33
-
34
- The pipeline is designed for **bootstrapped distillation**, where each cycle improves the model and the dataset.
35
- All components run locally and support multiple providers (Ollama, HTTP, OpenAI, vLLM).
36
-
37
- ---
38
-
39
- # **Features**
40
-
41
- ### ✔ Retrieval-augmented generation
42
-
43
- Hybrid RRF search (BM25 + dense embeddings) via **distill-rag**.
44
-
45
- ### ✔ Modular LLM stages
46
-
47
- Each stage uses a provider implementing:
48
-
49
- ```js
50
- async generate(prompt, options?)
51
- ```
52
-
53
- ### ✔ Question generation from chunks
54
-
55
- LLM extracts focused questions directly from transcript chunks.
56
- Ideal for large-scale bootstrap distillation.
57
-
58
- ### ✔ Multiple providers
59
-
60
- Configured per-stage using environment variables:
61
-
62
- ```
63
- GENERATOR_PROVIDER
64
- VERIFIER_PROVIDER
65
- REWARD_PROVIDER
66
- QUESTION_PROVIDER
67
- ```
68
-
69
- Providers currently supported:
70
-
71
- * Ollama
72
- * OpenAI
73
- * HTTP endpoint
74
- * (future) vLLM server
75
-
76
- ### ✔ Fully tested
77
-
78
- All pure modules include Vitest coverage:
79
-
80
- * retrieval (mock + real ES)
81
- * generator, verifier, reward
82
- * question generation
83
- * provider router
84
- * pipeline integration (mock)
85
- * JSONL cache, PASS/FAIL verifier parsing, generator parsing (thought/thinking/answer)
86
-
87
  ---
88
 
89
- # **Project Structure**
90
 
91
- ```
92
- prompts/
93
- generator_prompt.txt
94
- verifier_prompt.txt
95
- reward_prompt.txt
96
- question_prompt.txt
97
-
98
- src/
99
- pipeline/
100
- pipeline.mjs
101
- pipeline_cli.mjs
102
- providers/
103
- provider.mjs
104
- ollama_provider.mjs
105
- openai_provider.mjs
106
- http_provider.mjs
107
- retrieval/
108
- retrieval.mjs
109
- generator/
110
- generator_core.mjs
111
- verifier/
112
- verifier_core.mjs
113
- reward/
114
- reward_core.mjs
115
- question/
116
- question_core.mjs
117
- question_cli.mjs
118
 
119
- test_samples/
120
- seed_questions.jsonl
121
 
122
- gold/
123
- (pipeline output)
124
-
125
- tests/
126
- *.test.mjs
127
  ```
128
-
129
- ---
130
-
131
- # **Installation**
132
-
133
- ```bash
134
- git clone https://github.com/yourname/distill-pipeline
135
- cd distill-pipeline
136
- npm install
137
- ```
138
-
139
- You also need a running **distill-rag** instance with:
140
-
141
- * Elasticsearch index
142
- * embedding server (Ollama or HTTP)
143
-
144
- ---
145
-
146
- # **Configuration**
147
-
148
- All runtime settings are configured via `.env`.
149
-
150
- A common example:
151
-
152
- ```env
153
- # Elasticsearch (from distill-rag)
154
  ES_NODE=http://localhost:9200
155
  ES_INDEX=quo_distill_index
156
-
157
- # Embedding server
158
  EMBED_URL=http://localhost:11434/api/embeddings
159
  EMBED_MODEL=mxbai-embed-large
160
 
161
- # Provider backends
162
  GENERATOR_PROVIDER=ollama
163
  VERIFIER_PROVIDER=ollama
164
  REWARD_PROVIDER=ollama
165
  QUESTION_PROVIDER=ollama
166
 
167
- # Stage-specific models
168
  GENERATOR_MODEL=qwen3-vl:8b-thinking
169
  VERIFIER_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
170
  REWARD_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
171
  QUESTION_MODEL=qwen2.5-7b-instruct
172
- ```
173
-
174
- ---
175
-
176
- # **Running the Pipeline**
177
-
178
- There are **two seed modes**.
179
-
180
- ---
181
-
182
- ## **1. Static Seed Mode** *(default)*
183
-
184
- Reads questions from:
185
-
186
- ```
187
- test_samples/seed_questions.jsonl
188
- ```
189
-
190
- Run:
191
-
192
- ```bash
193
- npm run pipeline -- --limit 20 --verbose
194
- ```
195
-
196
- ---
197
-
198
- ## **2. Question-First Mode (auto-generate questions)**
199
-
200
- The pipeline will:
201
-
202
- * fetch chunks from distill-rag,
203
- * run question extraction,
204
- * feed each question into the main pipeline.
205
-
206
- Enable this mode:
207
-
208
- ```bash
209
- PIPELINE_SEED_MODE=question-first npm run pipeline -- --limit 20 --verbose
210
- ```
211
-
212
- ---
213
-
214
- # **Outputs**
215
-
216
- Accepted samples are written to:
217
-
218
- ```
219
- gold/pipeline_gold.jsonl
220
- ```
221
-
222
- Each record contains:
223
-
224
- ```json
225
- {
226
- "question": "...",
227
- "context": [...],
228
- "sample": { ... },
229
- "verifier": { ... },
230
- "reward": { ... }
231
- }
232
- ```
233
-
234
- This file is ready for use in QLoRA SFT training.
235
-
236
- ---
237
-
238
- # **Running Tests**
239
-
240
- ```bash
241
- npm test
242
- ```
243
-
244
- All core logic modules are covered:
245
-
246
- ```
247
- 9 test files
248
- 27 tests
249
- 0 failures
250
- ```
251
-
252
- ---
253
-
254
- # **How to Extend**
255
-
256
- ## Add a new model provider
257
-
258
- Implement:
259
-
260
- ```js
261
- class MyProvider {
262
- constructor(stage) { ... }
263
- async generate(prompt, opts) { ... }
264
- }
265
- ```
266
-
267
- Then register it in:
268
-
269
- ```
270
- src/providers/provider.mjs
271
- ```
272
-
273
- ## Add a new pipeline stage
274
-
275
- Follow the existing structure:
276
-
277
- * create `src/<stage>/<stage>_core.mjs`
278
- * add prompt in `prompts/`
279
- * add a test in `tests/`
280
-
281
- ---
282
-
283
- # **Development Notes**
284
-
285
- * Avoid mixing CLI logic with pipeline logic — all pure functions are in `*_core.mjs`.
286
- * Providers must always return **JSON-parseable** output.
287
- * Retrieval expects a working **distill-rag** index with BM25 + vector embeddings.
288
- * Reward model may be swapped later for your custom HTTP reward server.
289
-
290
- ---
291
-
292
- # **License**
293
-
294
- MIT (or update as needed).
295
-
296
- ---
297
-
298
- If you want:
299
-
300
- ✓ a shorter GitHub-friendly description
301
- ✓ a more polished badge/header section
302
- ✓ install instructions tailored to your exact environment
303
- ✓ a separate `USAGE.md`
304
 
305
- Just ask.
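The provider contract described in the removed README (a class with `constructor(stage)` and `async generate(prompt, opts)` that returns JSON-parseable output, selected per stage through `<STAGE>_PROVIDER`) could look roughly like the sketch below. `EchoProvider`, `PROVIDERS`, `loadProvider`, and the `echo` default are hypothetical names for illustration; the actual registry in `src/providers/provider.mjs` may be organised differently.

```javascript
// Hypothetical provider used only for illustration: it echoes the prompt
// back as a JSON string, so downstream stages can always JSON.parse it.
class EchoProvider {
  constructor(stage) {
    this.stage = stage; // e.g. "generator", "verifier", "reward", "question"
  }

  // Same shape as the contract in the README: async generate(prompt, opts).
  async generate(prompt, opts = {}) {
    return JSON.stringify({
      stage: this.stage,
      answer: prompt,
      model: opts.model ?? null,
    });
  }
}

// Hypothetical registry keyed by provider name (assumption, not the repo's code).
const PROVIDERS = { echo: EchoProvider };

// Resolve a provider from `<STAGE>_PROVIDER`, e.g. GENERATOR_PROVIDER=echo.
// `env` is passed in explicitly so the function is easy to test;
// a caller would typically pass process.env.
function loadProvider(stage, env = {}) {
  const name = env[`${stage.toUpperCase()}_PROVIDER`] ?? 'echo';
  const Provider = PROVIDERS[name];
  if (!Provider) {
    throw new Error(`Unknown provider "${name}" for stage "${stage}"`);
  }
  return new Provider(stage);
}
```

Passing `env` as an argument rather than reading `process.env` inside the function keeps the lookup pure, which matches the note about keeping pure logic separate from CLI code.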
1
  ---
2
+ license: apache-2.0
3
+ title: distill-pipeline
4
+ tags:
5
+ - distillation
6
+ - retrieval-augmented-generation
7
+ - pipeline
8
+ - nodejs
9
+ - ollama
10
+ - instruct
11
+ - question-generation
12
  ---
13
 
14
+ # distill-pipeline
15
 
16
+ Modular, retrieval-augmented distillation with question generation, verifier, and reward stages. Supports “thinking” and “instruct” generator flows with separate caches/outputs.
17
 
18
+ ## Quickstart
19
+ - Clone: `git clone https://github.com/elspru/distill-pipeline && cd distill-pipeline`
20
+ - Install: `npm install`
21
+ - Thinking pipeline (question-first, random walk):
22
+ `PIPELINE_SEED_MODE=question-first PIPELINE_RANDOM_WALK=1 npm run pipeline -- --limit 5 --verbose`
23
+ - Instruct pipeline (separate cache/output):
24
+ `INSTRUCT_PIPELINE=1 INSTRUCT_GENERATOR_MODEL=<model> PIPELINE_CACHE_DIR=data/cache_instruct npm run pipeline -- --out gold/pipeline_gold_instruct.jsonl --verbose`
25
+ - Continuous loops (stop with Ctrl+C):
26
+ `scripts/run_thinking_continuous.sh`
27
+ `INSTRUCT_GENERATOR_MODEL=<model> scripts/run_instruct_continuous.sh`
28
 
29
+ ## Configuration (see `.env.example`)
30
  ```
31
+ # Retrieval
32
  ES_NODE=http://localhost:9200
33
  ES_INDEX=quo_distill_index
34
  EMBED_URL=http://localhost:11434/api/embeddings
35
  EMBED_MODEL=mxbai-embed-large
36
 
37
+ # Providers per stage
38
  GENERATOR_PROVIDER=ollama
39
  VERIFIER_PROVIDER=ollama
40
  REWARD_PROVIDER=ollama
41
  QUESTION_PROVIDER=ollama
42
 
43
+ # Models
44
  GENERATOR_MODEL=qwen3-vl:8b-thinking
45
  VERIFIER_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
46
  REWARD_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
47
  QUESTION_MODEL=qwen2.5-7b-instruct
48
 
49
+ # Instruct-only generator
50
+ INSTRUCT_PIPELINE=0
51
+ INSTRUCT_GENERATOR_MODEL=phi-4-instruct
52
+ INSTRUCT_GENERATOR_PROVIDER=ollama
53
+
54
+ # Pipeline knobs
55
+ PIPELINE_SEED_MODE=question-first
56
+ PIPELINE_RANDOM_WALK=0 # set 1 for shuffled chunks
57
+ QUESTION_MAX_PER_CHUNK=5
58
+ # PIPELINE_CHUNK_LIMIT=10
59
+ # PIPELINE_CACHE_DIR=data/cache # override (e.g., data/cache_instruct)
60
+ ```
61
+
62
+ ## Key scripts
63
+ - `npm run pipeline` — main pipeline CLI (`--limit`, `--out`, `--chunk-limit`, `--verbose`).
64
+ - `scripts/run_thinking_continuous.sh` — loop thinking pipeline with random walk.
65
+ - `scripts/run_instruct_continuous.sh` — loop instruct pipeline (needs `INSTRUCT_GENERATOR_MODEL`).
66
+ - `scripts/try_generator_prompt.sh` — send generator prompt with cached chunk/question (`--random`, `-r` for reasoning).
67
+ - `scripts/cache_report.mjs` — cache stats; set `CACHE_REPORT_MODE=thinking|instruct|both` or `PIPELINE_CACHE_DIR=...`.
68
+
69
+ ## Outputs
70
+ - Gold JSONL default: `gold/pipeline_gold.jsonl` (instruct default: `gold/pipeline_gold_instruct.jsonl`).
71
+ - Sample gold: `samples/pipeline_gold_sample.jsonl`.
72
+ - Cache defaults: `data/cache` (thinking) and `data/cache_instruct` (instruct); both gitignored.
73
+
74
+ ## Hugging Face / GitHub distribution
75
+ - License: Apache-2.0 (`LICENSE`).
76
+ - CI: `.github/workflows/ci.yml` runs `npm test` on push/PR.
77
+ - Push to GitHub:
78
+ ```
79
+ git remote add origin https://github.com/elspru/distill-pipeline
80
+ git push origin main
81
+ ```
82
+ - Push to Hugging Face (user: htaf):
83
+ ```
84
+ git lfs install
85
+ git remote add hf https://huggingface.co/htaf/distill-pipeline
86
+ git push origin main
87
+ git push hf main
88
+ ```
89
+ - Publish code + prompts + `samples/pipeline_gold_sample.jsonl`. Keep caches/gold outputs out (gitignored).
90
+
91
+ ## Project structure
92
+ ```
93
+ prompts/ # stage prompts
94
+ src/ # pipeline, providers, stages
95
+ tests/ # Vitest
96
+ data/ # rag chunks (jsonl), cache (ignored)
97
+ gold/ # outputs (ignored)
98
+ scripts/ # tooling + runners
99
+ samples/pipeline_gold_sample.jsonl
100
+ ```
101
+
102
+ ## Testing
103
+ `npm test`
104
+
105
+ ## License
106
+ Apache-2.0
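The new README gives separate cache and output defaults for the thinking and instruct flows (`data/cache` and `gold/pipeline_gold.jsonl` versus `data/cache_instruct` and `gold/pipeline_gold_instruct.jsonl`), with `PIPELINE_CACHE_DIR` as an override. A minimal sketch of that resolution follows. It assumes `INSTRUCT_PIPELINE=1` is what selects the instruct flow and that the override wins over either default; `resolvePaths` is a hypothetical helper, not a function confirmed to exist in the repo.

```javascript
// Hypothetical helper: derive cache and gold paths from environment-style
// settings, using the defaults listed in the README's Outputs section.
// Assumption: INSTRUCT_PIPELINE === '1' selects the instruct flow.
function resolvePaths(env = {}) {
  const instruct = env.INSTRUCT_PIPELINE === '1';
  const defaultCache = instruct ? 'data/cache_instruct' : 'data/cache';
  return {
    mode: instruct ? 'instruct' : 'thinking',
    // PIPELINE_CACHE_DIR, when set, overrides the per-mode default.
    cacheDir: env.PIPELINE_CACHE_DIR || defaultCache,
    goldPath: instruct
      ? 'gold/pipeline_gold_instruct.jsonl'
      : 'gold/pipeline_gold.jsonl',
  };
}

// Example usage: thinking defaults, then an instruct run.
console.log(resolvePaths({}));
console.log(resolvePaths({ INSTRUCT_PIPELINE: '1' }));
```

In the actual CLI the output path can also be set with `--out`, so this sketch only covers the environment-driven defaults.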
package.json CHANGED
@@ -1,7 +1,16 @@
1
  {
2
  "name": "distill-pipeline",
3
- "version": "1.0.0",
4
  "type": "module",
5
  "scripts": {
6
  "test": "vitest --run",
7
  "pipeline": "node ./src/pipeline/pipeline_cli.js",
@@ -10,4 +19,4 @@
10
  "devDependencies": {
11
  "vitest": "^1.6.0"
12
  }
13
- }
 
1
  {
2
  "name": "distill-pipeline",
3
+ "version": "1.1.0",
4
  "type": "module",
5
+ "license": "Apache-2.0",
6
+ "repository": {
7
+ "type": "git",
8
+ "url": "https://github.com/elspru/distill-pipeline.git"
9
+ },
10
+ "bugs": {
11
+ "url": "https://github.com/elspru/distill-pipeline/issues"
12
+ },
13
+ "homepage": "https://github.com/elspru/distill-pipeline#readme",
14
  "scripts": {
15
  "test": "vitest --run",
16
  "pipeline": "node ./src/pipeline/pipeline_cli.js",
 
19
  "devDependencies": {
20
  "vitest": "^1.6.0"
21
  }
22
+ }
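For anyone consuming the gold output that the README's Outputs section points to, a small JSONL reader can help check a file before training. The sketch below assumes each line is one JSON object and that records still carry the `question` and `context` fields listed in the previous README; `parseGoldJsonl` and `summarize` are hypothetical helper names.

```javascript
// Parse JSONL text: one JSON object per non-empty line.
function parseGoldJsonl(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Reduce each record to its question and the number of context chunks.
// Assumption: `context` is an array, as in the record layout shown in
// the previous README; anything else is counted as zero chunks.
function summarize(records) {
  return records.map((record) => ({
    question: record.question,
    contextCount: Array.isArray(record.context) ? record.context.length : 0,
  }));
}
```

In a Node script the text would typically come from `fs.readFileSync('gold/pipeline_gold.jsonl', 'utf8')`; it is left out here so the helpers stay free of file-system access.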