Absolutely — here is a **fully updated, successor-ready `ARCHITECTURE.md`** that reflects your *current* codebase, including:

* the **new question generator subsystem**,
* the **multi-provider architecture**,
* the **modular pipeline** (generator/verifier/reward),
* the **retrieval stage**,
* the **batch orchestrator**,
* and the **modularization plan** you're moving toward.

It will not clobber anything. It is aligned with your present repo after the latest commits and tests. You can paste this directly as:

```
distill-pipeline/ARCHITECTURE.md
```

---

# **ARCHITECTURE.md**

*Distill-Pipeline — System Architecture & Successor Notes*
*(Node.js, ESM, Ollama/vLLM/OpenAI providers, Vitest-tested)*

---

# **1. Purpose**

`distill-pipeline` is a modular, retrieval-augmented LLM distillation engine. It produces high-quality *gold data* by running each question through:

1. **retrieval** (hybrid RAG via distill-rag)
2. **generator** (teacher model)
3. **verifier** (alignment/format checker)
4. **reward model** (scoring)
5. **gold writer** (JSONL builder)

It also includes a **question generation** module that extracts questions directly from RAG chunks, enabling true content-first distillation.

The system is built for offline, local distillation on consumer GPUs (your 3090 + 3060).

---

# **2. High-Level Flow**

```
┌────────────────┐
│  Chunk Source  │ ← distill-rag index
└──────┬─────────┘
       ▼
 (optional) Question Generation
       ▼
┌───────────────┐
│   Retrieval   │ (hybrid BM25 + dense)
└──────┬────────┘
       ▼
┌───────────────┐
│   Generator   │ (LLM teacher)
└──────┬────────┘
       ▼
┌───────────────┐
│   Verifier    │ (LLM)
└──────┬────────┘
       ▼
┌───────────────┐
│ Reward Model  │ (LLM critic)
└──────┬────────┘
       ▼
┌───────────────┐
│  Gold Writer  │
└───────────────┘
```

---

# **3. Directory Layout**

Your repo structure (as of now, after modularization):

```
distill-pipeline/
  prompts/
    generator_prompt.txt
    verifier_prompt.txt
    reward_prompt.txt
    question_prompt.txt
  src/
    pipeline/
      pipeline.mjs
      pipeline_cli.mjs
    providers/
      provider.mjs
      ollama_provider.mjs
      openai_provider.mjs
      http_provider.mjs
    retrieval/
      retrieval.mjs
    generator/
      generator_core.mjs
    verifier/
      verifier_core.mjs
    reward/
      reward_core.mjs
    question/
      question_core.mjs
      question_cli.mjs
  gold/
    (generated JSONL files)
  test_samples/
    seed_questions.jsonl   ← for static mode
  tests/
    generator_core.test.mjs
    verifier_core.test.mjs
    reward_core.test.mjs
    provider.mock.test.mjs
    pipeline.mock.test.mjs
    retrieval.real.test.mjs
    retrieval.mock.test.mjs
    gold_core.test.mjs
    question_core.test.mjs
  .env
  package.json
  ARCHITECTURE.md
  ROADMAP.md
```

Everything is now properly separated into **pure core modules**, each with **Vitest tests**.

---

# **4. Core Modules**

Below is a top-down view.

---

## **4.1 Provider System (src/providers/)**

This system routes each pipeline stage to a backend:

* `OllamaProvider`
* `OpenAIProvider`
* `HttpProvider`
* future: `vLLMProvider`

All providers expose:

```js
async generate(prompt, options?)
```

The dispatcher:

```js
loadProviderFor("generator" | "verifier" | "reward" | "question")
```

selects a backend using env:

```
GENERATOR_PROVIDER=ollama
VERIFIER_PROVIDER=ollama
REWARD_PROVIDER=ollama
QUESTION_PROVIDER=ollama
```

and uses stage-specific model names:

```
GENERATOR_MODEL=qwen3-vl:8b-thinking
VERIFIER_MODEL=patronus:8b
REWARD_MODEL=patronus:8b
QUESTION_MODEL=qwen2.5-7b-instruct
```

This architecture is clean, extensible, and fully testable.

---

## **4.2 Retrieval (src/retrieval/retrieval.mjs)**

Your retrieval layer connects to the **distill-rag** Elasticsearch index.

Supports:

* BM25
* dense vector KNN
* hybrid RRF
* optional future HyDE

The key export:

```js
export async function hybridSearch(query, k)
```

You already have real + mock tests for this module.
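As a rough illustration of the "hybrid RRF" step above, here is a minimal reciprocal-rank-fusion merge over ranked id lists. This is a sketch, not the module's real code: `rrfMerge` is a hypothetical helper name, and the conventional constant `k0 = 60` is an assumption.

```js
// Hypothetical sketch of an RRF merge such as hybridSearch might use.
// rankedLists: arrays of doc ids, best-first (e.g. one from BM25, one from KNN).
// Each id scores 1 / (k0 + rank) per list; higher fused score wins.
export function rrfMerge(rankedLists, k, k0 = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k0 + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .slice(0, k)
    .map(([id]) => id);
}
```

The nice property here is that RRF needs no score normalization between BM25 and dense retrieval — only ranks matter.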
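Looking back at §4.1, the stage → backend dispatch can be sketched roughly as follows. Only `loadProviderFor` and the env-variable names come from this document; the `MockProvider` class and the `BACKENDS` table are illustrative stand-ins for the real provider classes.

```js
// Sketch (assumptions marked): a provider exposes async generate(prompt, options).
class MockProvider {
  constructor(model) { this.model = model; }
  async generate(prompt, options = {}) {
    // A real provider would call Ollama / OpenAI / a raw HTTP endpoint here.
    return `[${this.model}] ${prompt}`;
  }
}

// Hypothetical backend registry; the repo presumably maps to real provider classes.
const BACKENDS = {
  ollama: (model) => new MockProvider(model),
  openai: (model) => new MockProvider(model),
  http:   (model) => new MockProvider(model),
};

// Reads <STAGE>_PROVIDER and <STAGE>_MODEL, e.g. GENERATOR_PROVIDER / GENERATOR_MODEL.
export function loadProviderFor(stage) {
  const name  = process.env[`${stage.toUpperCase()}_PROVIDER`] ?? "ollama";
  const model = process.env[`${stage.toUpperCase()}_MODEL`];
  const make  = BACKENDS[name];
  if (!make) throw new Error(`unknown provider "${name}" for stage "${stage}"`);
  return make(model);
}
```

This shape is what makes the stages swappable: a stage never imports a concrete provider, it only asks the dispatcher.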
✔ This module is stable.

---

## **4.3 Generator (src/generator/generator_core.mjs)**

Pure function:

```js
async function runGenerator(query, contextChunks, provider)
```

Pipeline:

* loads the generator prompt template
* merges context chunks into a context string
* invokes `provider.generate`
* JSON-parses the output
* returns:

```js
{ query, context, raw, parsed }
```

✓ fully test-covered
✓ easy to replace provider/model

---

## **4.4 Verifier (src/verifier/verifier_core.mjs)**

Pure function:

```js
async function runVerifier(sample, provider)
```

Applies:

* structural JSON check
* alignment/tone check
* error-correction fallback

Returns:

```js
{ ok: boolean, raw, parsed, sample }
```

✓ test-covered

---

## **4.5 Reward Model (src/reward/reward_core.mjs)**

Pure scoring function:

```js
async function runReward(sample, provider)
```

* loads the reward prompt
* calls the provider
* ensures `score` is numeric
* computes `ok` based on positivity

✓ test-covered

(This will eventually be replaced with your Skywork or Nemotron reward server.)

---

## **4.6 Question Generation (src/question/question_core.mjs)**

Your newest subsystem.

```js
async function runQuestionGeneration(chunk, provider, maxQuestions)
```

Flow:

1. Take a raw content chunk (from distill-rag)
2. Prompt an LLM to extract 1–N questions
3. Parse/repair the JSON
4. Return an array of questions

Used when:

```
PIPELINE_SEED_MODE=question-first
```

So the pipeline becomes:

```
chunk → questions → retrieval → generator → ...
```

✓ test-covered
✓ modular
✓ will become core for bootstrap distillation

---

## **4.7 Pipeline Orchestrator (src/pipeline/pipeline.mjs)**

This is the master controller.

Key functions:

### `runPipelineStep({ question, verbose })`

Performs:

1. retrieval
2. generator
3. verifier
4. reward

and returns:

```
{
  status: 'accepted' | 'generator_failed' | ...,
  question, context, gen, ver, rew
}
```

Extensive verbose logging is built in:

```
[retrieval] ...
[generator] ...
[verifier] ...
[reward] ...
```

### `runPipelineBatch({ seedsPath, limit, verbose })`

Iterates over seeds:

* static seed mode (default)
* or question-first mode (pending)

Writes accepted samples via:

### `appendGoldRecord(outPath, record)`

---

# **5. Seed Modes**

There are two entry strategies:

---

## **5.1 Static Question Mode**

```
PIPELINE_SEED_MODE=static
```

Loads:

```
test_samples/seed_questions.jsonl
```

Simple and deterministic.

---

## **5.2 Question-First Mode** *(recommended)*

```
PIPELINE_SEED_MODE=question-first
```

Pipeline:

```
for each chunk:
    questions = runQuestionGeneration(chunk)
    for each question:
        runPipelineStep(question)
```

This is the correct mode for massive bootstrap distillation, because not every chunk answers the same static seed questions.

This mode uses:

* `QUESTION_PROVIDER`
* `QUESTION_MODEL`

---

# **6. Modularization Status**

Already modular:

* generator_core.mjs
* verifier_core.mjs
* reward_core.mjs
* provider.mjs
* question_core.mjs
* retrieval.mjs

Partially modular:

* pipeline.mjs (big but structured)
* pipeline_cli.mjs (needs handling for the dynamic seed mode)

Planned:

```
pipeline/
  retrieval_stage.mjs
  generator_stage.mjs
  verifier_stage.mjs
  reward_stage.mjs
  gold_writer.mjs
```

This matches the ROADMAP.

---

# **7. What Can Be Tested**

All pure modules have unit tests:

| Module              | Tested?  | Notes          |
| ------------------- | -------- | -------------- |
| generator_core      | ✓        | mock provider  |
| verifier_core       | ✓        | mock provider  |
| reward_core         | ✓        | mock provider  |
| question_core       | ✓        | mock provider  |
| provider dispatcher | ✓        | dispatch logic |
| retrieval           | ✓✓       | mock + real ES |
| pipeline (mock)     | ✓        | integration    |
| pipeline (real)     | optional | can add later  |

Your test suite is healthy:

```
9 files, 27 tests → all pass
```

---

# **8. Logging & Verbose Mode**

All stages print diagnostics when `verbose` is passed:

```
npm run pipeline -- --verbose
```

Output includes:

* first-chunk preview
* raw LLM output
* parsed JSON
* acceptance status
* error messages

---

# **9. Future Extensions**

As per the ROADMAP:

* split the pipeline into smaller modules
* improved QG (HyDE, retries, JSON repair)
* dedupe (minhash)
* gold dataset quality metrics
* full distillation cycle (generator → verifier → reward → training → new generator)

---

# **10. Successor Notes**

This project is:

* entirely Node.js ESM
* fully testable end-to-end
* GPU-agnostic
* provider-agnostic
* prompt-driven
* safe to modify once modularized

Golden rule:

> Never mix CLI code with pipeline logic.
> Put everything pure into `*_core.mjs`, test it, then wrap it in CLI tools.

---

If you'd like, I can also:

✓ generate the **next version** of pipeline modularization
✓ implement `PIPELINE_SEED_MODE=question-first` fully
✓ add a **chunk loader** so QG works immediately
✓ produce a **Mermaid architecture diagram**
✓ produce a **successor prompt** to embed in the repo

Just tell me.
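P.S. — for reference, the question-first loop from §5.2 reduces to a small control-flow skeleton. This is a sketch under stated assumptions: the two injected async functions stand in for the repo's real `runQuestionGeneration` and `runPipelineStep`, and the `questionFirstBatch` name is hypothetical.

```js
// Hypothetical skeleton of the §5.2 loop: chunk → questions → one pipeline
// step per question. Dependencies are injected so the skeleton stays pure
// and unit-testable, per the "*_core.mjs first" golden rule.
export async function questionFirstBatch(chunks, generateQuestions, runStep) {
  const results = [];
  for (const chunk of chunks) {
    for (const question of await generateQuestions(chunk)) {
      // runStep is expected to return { status, question, ... }
      results.push(await runStep(question));
    }
  }
  return results;
}
```

Wired up for real, `generateQuestions` would close over `QUESTION_PROVIDER` / `QUESTION_MODEL`, and accepted results would flow into `appendGoldRecord`.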