distill-pipeline / ARCHITECTURE.md
modularized pipeline
5464613
---
# **ARCHITECTURE.md**
*Distill-Pipeline — System Architecture & Successor Notes*
*(Node.js, ESM, Ollama/vLLM/OpenAI providers, Vitest-tested)*
---
# **1. Purpose**
`distill-pipeline` is a modular, retrieval-augmented LLM distillation engine.
It produces high-quality *gold data* by running each question through:
1. **retrieval** (hybrid RAG via distill-rag)
2. **generator** (teacher model)
3. **verifier** (alignment/format checker)
4. **reward model** (scoring)
5. **gold writer** (JSONL builder)
It also includes a **question generation** module to extract questions directly from RAG chunks, enabling true content-first distillation.
The system is built for offline, local distillation on consumer GPUs (an RTX 3090 + RTX 3060).
---
# **2. High-Level Flow**
```
┌────────────────┐
│  Chunk Source  │ ← distill-rag index
└───────┬────────┘
        ▼
 (optional) Question Generation
        ▼
┌───────────────┐
│   Retrieval   │ (hybrid BM25 + dense)
└───────┬───────┘
        ▼
┌───────────────┐
│   Generator   │ (LLM teacher)
└───────┬───────┘
        ▼
┌───────────────┐
│   Verifier    │ (LLM)
└───────┬───────┘
        ▼
┌───────────────┐
│ Reward Model  │ (LLM critic)
└───────┬───────┘
        ▼
┌───────────────┐
│  Gold Writer  │
└───────────────┘
```
---
# **3. Directory Layout**
Repository structure after modularization:
```
distill-pipeline/
prompts/
generator_prompt.txt
verifier_prompt.txt
reward_prompt.txt
question_prompt.txt
src/
pipeline/
pipeline.mjs
pipeline_cli.mjs
providers/
provider.mjs
ollama_provider.mjs
openai_provider.mjs
http_provider.mjs
retrieval/
retrieval.mjs
generator/
generator_core.mjs
verifier/
verifier_core.mjs
reward/
reward_core.mjs
question/
question_core.mjs
question_cli.mjs
gold/
(generated JSONL files)
test_samples/
seed_questions.jsonl ← for static mode
tests/
generator_core.test.mjs
verifier_core.test.mjs
reward_core.test.mjs
provider.mock.test.mjs
pipeline.mock.test.mjs
retrieval.real.test.mjs
retrieval.mock.test.mjs
gold_core.test.mjs
question_core.test.mjs
.env
package.json
ARCHITECTURE.md
ROADMAP.md
```
Everything is now properly separated into **pure core modules**, each with **Vitest tests**.
---
# **4. Core Modules**
Below is a top-down view.
---
## **4.1 Provider System (src/providers/)**
This system routes each pipeline stage to a backend:
* `OllamaProvider`
* `OpenAIProvider`
* `HttpProvider`
* future: `vLLMProvider`
All providers expose:
```js
async generate(prompt, options?)
```
The dispatcher:
```js
loadProviderFor("generator" | "verifier" | "reward" | "question")
```
It selects the backend per stage via environment variables:
```
GENERATOR_PROVIDER=ollama
VERIFIER_PROVIDER=ollama
REWARD_PROVIDER=ollama
QUESTION_PROVIDER=ollama
```
And uses stage-specific model names:
```
GENERATOR_MODEL=qwen3-vl:8b-thinking
VERIFIER_MODEL=patronus:8b
REWARD_MODEL=patronus:8b
QUESTION_MODEL=qwen2.5-7b-instruct
```
Each stage can swap backend and model independently, and every provider can be mocked in tests.
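A minimal sketch of the contract and the env-based dispatch, using a hypothetical `MockProvider` in place of the real backends (the real dispatcher switches on the backend name):

```js
// Sketch of the provider contract and env-driven dispatch.
// MockProvider stands in for OllamaProvider / OpenAIProvider / HttpProvider.
class MockProvider {
  constructor(backend, model) {
    this.backend = backend;
    this.model = model;
  }
  // Every provider exposes the same async generate(prompt, options?) surface.
  async generate(prompt, options = {}) {
    return `[${this.backend}:${this.model}] ${prompt.slice(0, 40)}`;
  }
}

// Pick backend and model from stage-specific env vars, e.g. GENERATOR_PROVIDER.
function loadProviderFor(stage) {
  const key = stage.toUpperCase();
  const backend = process.env[`${key}_PROVIDER`] ?? "ollama";
  const model = process.env[`${key}_MODEL`] ?? "default";
  return new MockProvider(backend, model);
}
```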
---
## **4.2 Retrieval (src/retrieval/retrieval.mjs)**
The retrieval layer connects to the **distill-rag** Elasticsearch index.
Supports:
* BM25
* Dense vector KNN
* Hybrid RRF
* optional future HyDE
The key export:
```js
export async function hybridSearch(query, k)
```
Real (Elasticsearch) and mock tests already cover this module.
✔ This module is stable.
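As an illustration of the hybrid step, here is a sketch of Reciprocal Rank Fusion over two ranked id lists. The actual fusion runs inside `hybridSearch` / Elasticsearch; the constant `k = 60` is the conventional RRF default, used here for illustration:

```js
// Sketch of Reciprocal Rank Fusion: each document scores 1/(k + rank + 1)
// per list it appears in, and the fused ranking sorts by summed score.
function rrfFuse(bm25Ids, denseIds, k = 60) {
  const scores = new Map();
  for (const [rank, id] of bm25Ids.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  for (const [rank, id] of denseIds.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Documents that both retrievers agree on rise to the top even when neither ranks them first.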
---
## **4.3 Generator (src/generator/generator_core.mjs)**
Pure function:
```js
async function runGenerator(query, contextChunks, provider)
```
Pipeline:
* loads generator prompt template
* merges context chunks into a context string
* invokes provider.generate
* JSON-parses output
* returns:
```js
{
query,
context,
raw,
parsed
}
```
✓ fully test-covered
✓ easy to replace provider/model
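A runnable sketch of this flow against a stub provider; the prompt wording and the `[n] text` context format are assumptions, not the actual template in `prompts/generator_prompt.txt`:

```js
// Sketch of runGenerator: merge chunks, call the provider, parse JSON output.
function buildContext(chunks) {
  return chunks.map((c, i) => `[${i + 1}] ${c.text}`).join("\n");
}

async function runGenerator(query, contextChunks, provider) {
  const context = buildContext(contextChunks);
  const prompt = `Context:\n${context}\n\nQuestion: ${query}\nAnswer in JSON.`;
  const raw = await provider.generate(prompt);
  let parsed = null;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Leave parsed null on malformed output; the verifier stage catches this.
  }
  return { query, context, raw, parsed };
}
```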
---
## **4.4 Verifier (src/verifier/verifier_core.mjs)**
Pure function:
```js
async function runVerifier(sample, provider)
```
Applies:
* structural JSON check
* alignment/tone check
* error correction fallback
Returns:
```js
{
ok: boolean,
raw,
parsed,
sample
}
```
✓ test-covered
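The structural check can be sketched as a pure function; the required-field list here is an assumption for illustration, not the exact set in `verifier_core.mjs`:

```js
// Sketch of the structural JSON check the verifier runs before any LLM call:
// report which required fields are missing so the caller can repair or reject.
function structuralCheck(sample) {
  const required = ["query", "parsed"];
  const missing = required.filter((key) => sample?.[key] == null);
  return { ok: missing.length === 0, missing };
}
```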
---
## **4.5 Reward Model (src/reward/reward_core.mjs)**
Pure scoring function:
```js
async function runReward(sample, provider)
```
* loads reward prompt
* calls provider
* ensures `score` is numeric
* computes `ok` based on positivity
✓ test-covered
(This will eventually be replaced with a dedicated reward server such as Skywork or Nemotron.)
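The numeric-score guarantee can be sketched like this; the zero threshold mirrors the positivity rule above but is an assumption, not the exact implementation:

```js
// Sketch of reward post-processing: coerce the model's score to a number
// and accept the sample only when it is strictly positive.
function normalizeReward(rawScore, threshold = 0) {
  const score = Number(rawScore);
  if (Number.isNaN(score)) {
    return { score: null, ok: false }; // unparseable score: reject
  }
  return { score, ok: score > threshold };
}
```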
---
## **4.6 Question Generation (src/question/question_core.mjs)**
The newest subsystem.
```js
async function runQuestionGeneration(chunk, provider, maxQuestions)
```
Flow:
1. Take a raw content chunk (from distill-rag)
2. Prompt an LLM to extract 1–N questions
3. Parse/repair JSON
4. Return array of questions
Used when:
```
PIPELINE_SEED_MODE=question-first
```
So the pipeline becomes:
```
chunk → questions → retrieval → generator → ...
```
✓ test-covered
✓ modular
✓ will become core for bootstrap distillation
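The parse/repair step (step 3) can be sketched as follows; the bracket-extraction heuristic is one plausible repair strategy, not necessarily the one in `question_core.mjs`:

```js
// Sketch of JSON parse/repair for question extraction: pull the first
// [...] span out of the raw LLM output so surrounding prose or code
// fences do not break JSON.parse.
function parseQuestions(raw, maxQuestions = 5) {
  const match = raw.match(/\[[\s\S]*\]/);
  if (!match) return [];
  try {
    const questions = JSON.parse(match[0]);
    return questions
      .filter((q) => typeof q === "string" && q.trim().length > 0)
      .slice(0, maxQuestions);
  } catch {
    return []; // unrepairable output yields no questions for this chunk
  }
}
```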
---
## **4.7 Pipeline Orchestrator (src/pipeline/pipeline.mjs)**
This is the master controller.
Key functions:
### `runPipelineStep({ question, verbose })`
Performs:
1. retrieval
2. generator
3. verifier
4. reward
and returns:
```
{
status: 'accepted' | 'generator_failed' | ...,
question,
context,
gen,
ver,
rew
}
```
Extensive verbose logging is built in:
```
[retrieval] ...
[generator] ...
[verifier] ...
[reward] ...
```
### `runPipelineBatch({ seedsPath, limit, verbose })`
Iterates over seeds:
* static seed mode (default)
* or question-first mode (pending)
Writes accepted samples via:
### `appendGoldRecord(outPath, record)`
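`appendGoldRecord` can be sketched as a one-line JSONL append; the synchronous write is a design assumption that keeps earlier records durable if a later pipeline step crashes. A hypothetical `readGoldRecords` helper shows the inverse:

```js
import fs from "node:fs";

// Sketch of the gold writer: each accepted sample becomes one JSON line.
function appendGoldRecord(outPath, record) {
  fs.appendFileSync(outPath, JSON.stringify(record) + "\n");
}

// Reading it back: split on newlines, parse each non-empty line.
function readGoldRecords(path) {
  if (!fs.existsSync(path)) return [];
  return fs
    .readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}
```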
---
# **5. Seed Modes**
There are two entry strategies:
---
## **5.1 Static Question Mode**
```
PIPELINE_SEED_MODE=static
```
Loads:
```
test_samples/seed_questions.jsonl
```
Simple and deterministic.
---
## **5.2 Question-First Mode** *(recommended)*
```
PIPELINE_SEED_MODE=question-first
```
Pipeline:
```
for each chunk:
questions = runQuestionGeneration(chunk)
for each question:
runPipelineStep(question)
```
This is the correct mode for large-scale bootstrap distillation, since a static seed list cannot cover the content of every chunk.
This mode uses:
* `QUESTION_PROVIDER`
* `QUESTION_MODEL`
---
# **6. Modularization Status**
Already modular:
* generator_core.mjs
* verifier_core.mjs
* reward_core.mjs
* provider.mjs
* question_core.mjs
* retrieval.mjs
Partially modular:
* pipeline.mjs (big but structured)
* pipeline_cli.mjs (needs handling for dynamic seed mode)
Planned:
```
pipeline/
retrieval_stage.mjs
generator_stage.mjs
verifier_stage.mjs
reward_stage.mjs
gold_writer.mjs
```
This matches the ROADMAP.
---
# **7. What Can Be Tested**
All pure modules have unit tests:
| Module              | Tested?  | Notes          |
| ------------------- | -------- | -------------- |
| generator_core      | ✓        | mock provider  |
| verifier_core       | ✓        | mock provider  |
| reward_core         | ✓        | mock provider  |
| question_core       | ✓        | mock provider  |
| provider dispatcher | ✓        | dispatch logic |
| retrieval           | ✓✓       | mock + real ES |
| pipeline (mock)     | ✓        | integration    |
| pipeline (real)     | optional | can add later  |
The test suite is healthy:
```
9 files, 27 tests → all pass
```
---
# **8. Logging & Verbose Mode**
All stages print diagnostics when run with the verbose flag:
```
npm run pipeline -- --verbose
```
Includes:
* first chunk preview
* raw LLM output
* parsed JSON
* acceptance status
* error messages
---
# **9. Future Extensions**
As per ROADMAP:
* split pipeline into smaller modules
* improved QG (HyDE, retries, JSON repair)
* dedupe (minhash)
* gold dataset quality metrics
* full distillation cycle (generator → verifier → reward → training → new generator)
---
# **10. Successor Notes**
This project is:
* entirely Node.js ESM
* fully testable end-to-end
* GPU-agnostic
* provider-agnostic
* prompt-driven
* safe to modify when modularized
Golden rule:
> Never mix CLI code with pipeline logic.
> Put everything pure into `*_core.mjs`, test it, then wrap it in CLI tools.