distill-pipeline / ARCHITECTURE.md
modularized pipeline
5464613
---
# **ARCHITECTURE.md**
*Distill-Pipeline — System Architecture & Successor Notes*
*(Node.js, ESM, Ollama/vLLM/OpenAI providers, Vitest-tested)*
---
# **1. Purpose**
`distill-pipeline` is a modular, retrieval-augmented LLM distillation engine.
It produces high-quality *gold data* by running each question through:
1. **retrieval** (hybrid RAG via distill-rag)
2. **generator** (teacher model)
3. **verifier** (alignment/format checker)
4. **reward model** (scoring)
5. **gold writer** (JSONL builder)
It also includes a **question generation** module to extract questions directly from RAG chunks, enabling true content-first distillation.
The system is built for offline, local distillation on consumer GPUs (an RTX 3090 + RTX 3060).
---
# **2. High-Level Flow**
```
┌────────────────┐
│  Chunk Source  │ ← distill-rag index
└───────┬────────┘
        ▼
 (optional) Question Generation
        ▼
┌───────────────┐
│   Retrieval   │ (hybrid BM25 + dense)
└───────┬───────┘
        ▼
┌───────────────┐
│   Generator   │ (LLM teacher)
└───────┬───────┘
        ▼
┌───────────────┐
│   Verifier    │ (LLM)
└───────┬───────┘
        ▼
┌───────────────┐
│ Reward Model  │ (LLM critic)
└───────┬───────┘
        ▼
┌───────────────┐
│  Gold Writer  │
└───────────────┘
```
---
# **3. Directory Layout**
Repository structure after modularization:
```
distill-pipeline/
prompts/
generator_prompt.txt
verifier_prompt.txt
reward_prompt.txt
question_prompt.txt
src/
pipeline/
pipeline.mjs
pipeline_cli.mjs
providers/
provider.mjs
ollama_provider.mjs
openai_provider.mjs
http_provider.mjs
retrieval/
retrieval.mjs
generator/
generator_core.mjs
verifier/
verifier_core.mjs
reward/
reward_core.mjs
question/
question_core.mjs
question_cli.mjs
gold/
(generated JSONL files)
test_samples/
seed_questions.jsonl ← for static mode
tests/
generator_core.test.mjs
verifier_core.test.mjs
reward_core.test.mjs
provider.mock.test.mjs
pipeline.mock.test.mjs
retrieval.real.test.mjs
retrieval.mock.test.mjs
gold_core.test.mjs
question_core.test.mjs
.env
package.json
ARCHITECTURE.md
ROADMAP.md
```
Everything is now properly separated into **pure core modules**, each with **Vitest tests**.
---
# **4. Core Modules**
Below is a top-down view.
---
## **4.1 Provider System (src/providers/)**
This system routes each pipeline stage to a backend:
* `OllamaProvider`
* `OpenAIProvider`
* `HttpProvider`
* future: `vLLMProvider`
All providers expose:
```js
async generate(prompt, options?)
```
The dispatcher:
```js
loadProviderFor("generator" | "verifier" | "reward" | "question")
```
It selects the backend per stage via environment variables:
```
GENERATOR_PROVIDER=ollama
VERIFIER_PROVIDER=ollama
REWARD_PROVIDER=ollama
QUESTION_PROVIDER=ollama
```
And uses stage-specific model names:
```
GENERATOR_MODEL=qwen3-vl:8b-thinking
VERIFIER_MODEL=patronus:8b
REWARD_MODEL=patronus:8b
QUESTION_MODEL=qwen2.5-7b-instruct
```
Each stage can swap backend and model independently, and every provider can be mocked in tests.
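A minimal sketch of the contract and the env-based dispatch, using a hypothetical `MockProvider` in place of the real backends (the real dispatcher switches on the backend name):

```js
// Sketch of the provider contract and env-driven dispatch.
// MockProvider stands in for OllamaProvider / OpenAIProvider / HttpProvider.
class MockProvider {
  constructor(backend, model) {
    this.backend = backend;
    this.model = model;
  }
  // Every provider exposes the same async generate(prompt, options?) surface.
  async generate(prompt, options = {}) {
    return `[${this.backend}:${this.model}] ${prompt.slice(0, 40)}`;
  }
}

// Pick backend and model from stage-specific env vars, e.g. GENERATOR_PROVIDER.
function loadProviderFor(stage) {
  const key = stage.toUpperCase();
  const backend = process.env[`${key}_PROVIDER`] ?? "ollama";
  const model = process.env[`${key}_MODEL`] ?? "default";
  return new MockProvider(backend, model);
}
```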
---
## **4.2 Retrieval (src/retrieval/retrieval.mjs)**
The retrieval layer connects to the **distill-rag** Elasticsearch index.
Supports:
* BM25
* Dense vector KNN
* Hybrid RRF
* optional future HyDE
The key export:
```js
export async function hybridSearch(query, k)
```
Real (Elasticsearch) and mock tests already cover this module.
✔ This module is stable.
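As an illustration of the hybrid step, here is a sketch of Reciprocal Rank Fusion over two ranked id lists. The actual fusion runs inside `hybridSearch` / Elasticsearch; the constant `k = 60` is the conventional RRF default, used here for illustration:

```js
// Sketch of Reciprocal Rank Fusion: each document scores 1/(k + rank + 1)
// per list it appears in, and the fused ranking sorts by summed score.
function rrfFuse(bm25Ids, denseIds, k = 60) {
  const scores = new Map();
  for (const [rank, id] of bm25Ids.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  for (const [rank, id] of denseIds.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Documents that both retrievers agree on rise to the top even when neither ranks them first.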
---
## **4.3 Generator (src/generator/generator_core.mjs)**
Pure function:
```js
async function runGenerator(query, contextChunks, provider)
```
Pipeline:
* loads generator prompt template
* merges context chunks into a context string
* invokes provider.generate
* JSON-parses output
* returns:
```js
{
query,
context,
raw,
parsed
}
```
✓ fully test-covered
✓ easy to replace provider/model
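A runnable sketch of this flow against a stub provider; the prompt wording and the `[n] text` context format are assumptions, not the actual template in `prompts/generator_prompt.txt`:

```js
// Sketch of runGenerator: merge chunks, call the provider, parse JSON output.
function buildContext(chunks) {
  return chunks.map((c, i) => `[${i + 1}] ${c.text}`).join("\n");
}

async function runGenerator(query, contextChunks, provider) {
  const context = buildContext(contextChunks);
  const prompt = `Context:\n${context}\n\nQuestion: ${query}\nAnswer in JSON.`;
  const raw = await provider.generate(prompt);
  let parsed = null;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Leave parsed null on malformed output; the verifier stage catches this.
  }
  return { query, context, raw, parsed };
}
```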
---
## **4.4 Verifier (src/verifier/verifier_core.mjs)**
Pure function:
```js
async function runVerifier(sample, provider)
```
Applies:
* structural JSON check
* alignment/tone check
* error correction fallback
Returns:
```js
{
ok: boolean,
raw,
parsed,
sample
}
```
✓ test-covered
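The structural check can be sketched as a pure function; the required-field list here is an assumption for illustration, not the exact set in `verifier_core.mjs`:

```js
// Sketch of the structural JSON check the verifier runs before any LLM call:
// report which required fields are missing so the caller can repair or reject.
function structuralCheck(sample) {
  const required = ["query", "parsed"];
  const missing = required.filter((key) => sample?.[key] == null);
  return { ok: missing.length === 0, missing };
}
```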
---
## **4.5 Reward Model (src/reward/reward_core.mjs)**
Pure scoring function:
```js
async function runReward(sample, provider)
```
* loads reward prompt
* calls provider
* ensures `score` is numeric
* computes `ok` based on positivity
✓ test-covered
(This will eventually be replaced with a dedicated reward server such as Skywork or Nemotron.)
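The numeric-score guarantee can be sketched like this; the zero threshold mirrors the positivity rule above but is an assumption, not the exact implementation:

```js
// Sketch of reward post-processing: coerce the model's score to a number
// and accept the sample only when it is strictly positive.
function normalizeReward(rawScore, threshold = 0) {
  const score = Number(rawScore);
  if (Number.isNaN(score)) {
    return { score: null, ok: false }; // unparseable score: reject
  }
  return { score, ok: score > threshold };
}
```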
---
## **4.6 Question Generation (src/question/question_core.mjs)**
The newest subsystem.
```js
async function runQuestionGeneration(chunk, provider, maxQuestions)
```
Flow:
1. Take a raw content chunk (from distill-rag)
2. Prompt an LLM to extract 1–N questions
3. Parse/repair JSON
4. Return array of questions
Used when:
```
PIPELINE_SEED_MODE=question-first
```
So the pipeline becomes:
```
chunk → questions → retrieval → generator → ...
```
✓ test-covered
✓ modular
✓ will become core for bootstrap distillation
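The parse/repair step (step 3) can be sketched as follows; the bracket-extraction heuristic is one plausible repair strategy, not necessarily the one in `question_core.mjs`:

```js
// Sketch of JSON parse/repair for question extraction: pull the first
// [...] span out of the raw LLM output so surrounding prose or code
// fences do not break JSON.parse.
function parseQuestions(raw, maxQuestions = 5) {
  const match = raw.match(/\[[\s\S]*\]/);
  if (!match) return [];
  try {
    const questions = JSON.parse(match[0]);
    return questions
      .filter((q) => typeof q === "string" && q.trim().length > 0)
      .slice(0, maxQuestions);
  } catch {
    return []; // unrepairable output yields no questions for this chunk
  }
}
```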
---
## **4.7 Pipeline Orchestrator (src/pipeline/pipeline.mjs)**
This is the master controller.
Key functions:
### `runPipelineStep({ question, verbose })`
Performs:
1. retrieval
2. generator
3. verifier
4. reward
and returns:
```
{
status: 'accepted' | 'generator_failed' | ...,
question,
context,
gen,
ver,
rew
}
```
Extensive verbose logging is built in:
```
[retrieval] ...
[generator] ...
[verifier] ...
[reward] ...
```
### `runPipelineBatch({ seedsPath, limit, verbose })`
Iterates over seeds:
* static seed mode (default)
* or question-first mode (pending)
Writes accepted samples via:
### `appendGoldRecord(outPath, record)`
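`appendGoldRecord` can be sketched as a one-line JSONL append; the synchronous write is a design assumption that keeps earlier records durable if a later pipeline step crashes. A hypothetical `readGoldRecords` helper shows the inverse:

```js
import fs from "node:fs";

// Sketch of the gold writer: each accepted sample becomes one JSON line.
function appendGoldRecord(outPath, record) {
  fs.appendFileSync(outPath, JSON.stringify(record) + "\n");
}

// Reading it back: split on newlines, parse each non-empty line.
function readGoldRecords(path) {
  if (!fs.existsSync(path)) return [];
  return fs
    .readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}
```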
---
# **5. Seed Modes**
There are two entry strategies:
---
## **5.1 Static Question Mode**
```
PIPELINE_SEED_MODE=static
```
Loads:
```
test_samples/seed_questions.jsonl
```
Simple and deterministic.
---
## **5.2 Question-First Mode** *(recommended)*
```
PIPELINE_SEED_MODE=question-first
```
Pipeline:
```
for each chunk:
questions = runQuestionGeneration(chunk)
for each question:
runPipelineStep(question)
```
This is the correct mode for large-scale bootstrap distillation, since a static seed list cannot cover the content of every chunk.
This mode uses:
* `QUESTION_PROVIDER`
* `QUESTION_MODEL`
---
# **6. Modularization Status**
Already modular:
* generator_core.mjs
* verifier_core.mjs
* reward_core.mjs
* provider.mjs
* question_core.mjs
* retrieval.mjs
Partially modular:
* pipeline.mjs (big but structured)
* pipeline_cli.mjs (needs handling for dynamic seed mode)
Planned:
```
pipeline/
retrieval_stage.mjs
generator_stage.mjs
verifier_stage.mjs
reward_stage.mjs
gold_writer.mjs
```
This matches the ROADMAP.
---
# **7. What Can Be Tested**
All pure modules have unit tests:
| Module              | Tested?  | Notes          |
| ------------------- | -------- | -------------- |
| generator_core      | ✓        | mock provider  |
| verifier_core       | ✓        | mock provider  |
| reward_core         | ✓        | mock provider  |
| question_core       | ✓        | mock provider  |
| provider dispatcher | ✓        | dispatch logic |
| retrieval           | ✓✓       | mock + real ES |
| pipeline (mock)     | ✓        | integration    |
| pipeline (real)     | optional | can add later  |
The test suite is healthy:
```
9 files, 27 tests → all pass
```
---
# **8. Logging & Verbose Mode**
All stages print diagnostics when run with the verbose flag:
```
npm run pipeline -- --verbose
```
Includes:
* first chunk preview
* raw LLM output
* parsed JSON
* acceptance status
* error messages
---
# **9. Future Extensions**
As per ROADMAP:
* split pipeline into smaller modules
* improved QG (HyDE, retries, JSON repair)
* dedupe (minhash)
* gold dataset quality metrics
* full distillation cycle (generator → verifier → reward → training → new generator)
---
# **10. Successor Notes**
This project is:
* entirely Node.js ESM
* fully testable end-to-end
* GPU-agnostic
* provider-agnostic
* prompt-driven
* safe to modify when modularized
Golden rule:
> Never mix CLI code with pipeline logic.
> Put everything pure into `*_core.mjs`, test it, then wrap it in CLI tools.