Absolutely, here is a fully updated, successor-ready ARCHITECTURE.md that reflects your current codebase, including:
- the new question generator subsystem,
- the multi-provider architecture,
- the modular pipeline (generator/verifier/reward),
- the retrieval stage,
- the batch orchestrator,
- and the modularization plan you're moving toward.
It will not clobber anything. It is aligned with your present repo after the latest commits and tests.
You can paste this directly as:
distill-pipeline/ARCHITECTURE.md
ARCHITECTURE.md
Distill-Pipeline: System Architecture & Successor Notes (Node.js, ESM, Ollama/vLLM/OpenAI providers, Vitest-tested)
1. Purpose
distill-pipeline is a modular, retrieval-augmented LLM distillation engine.
It produces high-quality gold data by running each question through:
- retrieval (hybrid RAG via distill-rag)
- generator (teacher model)
- verifier (alignment/format checker)
- reward model (scoring)
- gold writer (JSONL builder)
It also includes a question generation module to extract questions directly from RAG chunks, enabling true content-first distillation.
The system is built for offline, local distillation on consumer GPUs (your 3090 + 3060).
2. High-Level Flow
┌──────────────────┐
│   Chunk Source   │ ← distill-rag index
└────────┬─────────┘
         ▼
 (optional) Question Generation
         ▼
┌──────────────────┐
│    Retrieval     │ (hybrid BM25 + dense)
└────────┬─────────┘
         ▼
┌──────────────────┐
│    Generator     │ (LLM teacher)
└────────┬─────────┘
         ▼
┌──────────────────┐
│     Verifier     │ (LLM)
└────────┬─────────┘
         ▼
┌──────────────────┐
│   Reward Model   │ (LLM critic)
└────────┬─────────┘
         ▼
┌──────────────────┐
│   Gold Writer    │
└──────────────────┘
3. Directory Layout
Your repo structure (as of now, after modularization):
distill-pipeline/
prompts/
generator_prompt.txt
verifier_prompt.txt
reward_prompt.txt
question_prompt.txt
src/
pipeline/
pipeline.mjs
pipeline_cli.mjs
providers/
provider.mjs
ollama_provider.mjs
openai_provider.mjs
http_provider.mjs
retrieval/
retrieval.mjs
generator/
generator_core.mjs
verifier/
verifier_core.mjs
reward/
reward_core.mjs
question/
question_core.mjs
question_cli.mjs
gold/
(generated JSONL files)
test_samples/
seed_questions.jsonl ← for static mode
tests/
generator_core.test.mjs
verifier_core.test.mjs
reward_core.test.mjs
provider.mock.test.mjs
pipeline.mock.test.mjs
retrieval.real.test.mjs
retrieval.mock.test.mjs
gold_core.test.mjs
question_core.test.mjs
.env
package.json
ARCHITECTURE.md
ROADMAP.md
Everything is now properly separated into pure core modules, each with Vitest tests.
4. Core Modules
Below is a top-down view.
4.1 Provider System (src/providers/)
This system routes each pipeline stage to a backend:
- OllamaProvider
- OpenAIProvider
- HttpProvider
- future: vLLMProvider
All providers expose:
async generate(prompt, options?)
The dispatcher:
loadProviderFor("generator" | "verifier" | "reward" | "question")
Selects backend using env:
GENERATOR_PROVIDER=ollama
VERIFIER_PROVIDER=ollama
REWARD_PROVIDER=ollama
QUESTION_PROVIDER=ollama
And uses stage-specific model names:
GENERATOR_MODEL=qwen3-vl:8b-thinking
VERIFIER_MODEL=patronus:8b
REWARD_MODEL=patronus:8b
QUESTION_MODEL=qwen2.5-7b-instruct
This architecture is clean, extensible, and fully testable.
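The dispatch described above can be sketched as a small registry keyed by backend name. Everything below is illustrative: the registry contents and the stub `generate()` are assumptions, not the repo's real providers; only the `<STAGE>_PROVIDER` / `<STAGE>_MODEL` env convention is taken from the doc.

```javascript
// Hypothetical sketch of stage-based provider dispatch.
const registry = {
  ollama: (model) => ({
    name: "ollama",
    model,
    async generate(prompt, _options = {}) {
      // A real provider would call the Ollama HTTP API here; this stub echoes.
      return `[ollama:${model}] ${prompt.slice(0, 40)}`;
    },
  }),
};

function loadProviderFor(stage, env = process.env) {
  const key = stage.toUpperCase();
  const backend = env[`${key}_PROVIDER`] ?? "ollama";
  const factory = registry[backend];
  if (!factory) throw new Error(`Unknown provider backend: ${backend}`);
  return factory(env[`${key}_MODEL`]);
}
```

Passing `env` as a parameter (defaulting to `process.env`) keeps the dispatcher pure enough to unit-test without mutating global state.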
4.2 Retrieval (src/retrieval/retrieval.mjs)
Your retrieval layer connects to the distill-rag Elasticsearch index.
Supports:
- BM25
- Dense vector KNN
- Hybrid RRF
- optional future HyDE
The key export:
export async function hybridSearch(query, k)
You already have real + mock tests for this module.
✅ This module is stable.
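For reference, the hybrid RRF step can be sketched as a pure fusion function over the BM25 and dense rankings. The constant `k = 60` is the conventional RRF default, not necessarily what `retrieval.mjs` uses:

```javascript
// Reciprocal-rank fusion: each ranked list of doc ids contributes
// 1 / (k + rank) to a doc's score; higher fused score ranks first.
function rrfFuse(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((docId, i) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```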
4.3 Generator (src/generator/generator_core.mjs)
Pure function:
async function runGenerator(query, contextChunks, provider)
Pipeline:
- loads generator prompt template
- merges context chunks into a context string
- invokes provider.generate
- JSON-parses output
- returns:
{
query,
context,
raw,
parsed
}
✅ fully test-covered ✅ easy to replace provider/model
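The flow above can be sketched end to end with a mock provider. The inline prompt template here is illustrative; the real module loads `prompts/generator_prompt.txt` from disk:

```javascript
// Sketch of runGenerator: merge chunks, call the provider, parse JSON.
async function runGenerator(query, contextChunks, provider) {
  const context = contextChunks.join("\n---\n");
  const prompt = `Context:\n${context}\n\nQuestion: ${query}\nAnswer as JSON.`;
  const raw = await provider.generate(prompt);
  let parsed = null;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Leave parsed null; downstream stages decide how to handle it.
  }
  return { query, context, raw, parsed };
}
```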
4.4 Verifier (src/verifier/verifier_core.mjs)
Pure function:
async function runVerifier(sample, provider)
Applies:
- structural JSON check
- alignment/tone check
- error correction fallback
Returns:
{
ok: boolean,
raw,
parsed,
sample
}
✅ test-covered
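A minimal sketch of that contract, assuming the verifier model answers with a JSON verdict like `{"ok": true}` (the actual verdict format is defined by `prompts/verifier_prompt.txt`, so treat this shape as an assumption):

```javascript
// Sketch of runVerifier: unparseable or negative verdicts reject the sample.
async function runVerifier(sample, provider) {
  const raw = await provider.generate(`Verify this sample:\n${JSON.stringify(sample)}`);
  let parsed = null;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Unparseable verdict counts as a failed verification.
  }
  return { ok: parsed?.ok === true, raw, parsed, sample };
}
```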
4.5 Reward Model (src/reward/reward_core.mjs)
Pure scoring function:
async function runReward(sample, provider)
- loads reward prompt
- calls provider
- ensures `score` is numeric
- computes `ok` based on positivity
✅ test-covered
(This will eventually be replaced with your Skywork or Nemotron reward server.)
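The numeric-coercion step can be sketched as a small helper. The threshold of 0 ("positivity") is taken from the description above; `normalizeReward` is a hypothetical name, not a repo export:

```javascript
// Coerce the model's reported score to a number and derive `ok`.
function normalizeReward(rawScore, threshold = 0) {
  const score = Number(rawScore);
  if (!Number.isFinite(score)) {
    // Non-numeric output (e.g. "n/a") always rejects the sample.
    return { score: NaN, ok: false };
  }
  return { score, ok: score > threshold };
}
```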
4.6 Question Generation (src/question/question_core.mjs)
Your newest subsystem.
async function runQuestionGeneration(chunk, provider, maxQuestions)
Flow:
- Take a raw content chunk (from distill-rag)
- Prompt an LLM to extract 1βN questions
- Parse/repair JSON
- Return array of questions
Used when:
PIPELINE_SEED_MODE=question-first
So the pipeline becomes:
chunk → questions → retrieval → generator → ...
✅ test-covered ✅ modular ✅ will become core for bootstrap distillation
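The parse/repair step might look like the sketch below. Stripping a markdown fence is a common repair trick, not necessarily the one `question_core.mjs` uses; `parseQuestions` is a hypothetical helper name:

```javascript
// Parse the LLM's question list, repairing a wrapping markdown fence,
// and cap the result at maxQuestions.
function parseQuestions(raw, maxQuestions) {
  const cleaned = raw.replace(/^```(json)?\s*|\s*```$/g, "").trim();
  let arr;
  try {
    arr = JSON.parse(cleaned);
  } catch {
    return []; // irrecoverable output yields no questions
  }
  if (!Array.isArray(arr)) return [];
  return arr.filter((q) => typeof q === "string").slice(0, maxQuestions);
}
```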
4.7 Pipeline Orchestrator (src/pipeline/pipeline.mjs)
This is the master controller.
Key functions:
runPipelineStep({ question, verbose })
Performs:
- retrieval
- generator
- verifier
- reward
and returns:
{
status: 'accepted' | 'generator_failed' | ...,
question,
context,
gen,
ver,
rew
}
Extensive verbose logging is built in:
[retrieval] ...
[generator] ...
[verifier] ...
[reward] ...
runPipelineBatch({ seedsPath, limit, verbose })
Iterates over seeds:
- static seed mode (default)
- or question-first mode (pending)
Writes accepted samples via:
appendGoldRecord(outPath, record)
5. Seed Modes
There are two entry strategies:
5.1 Static Question Mode
PIPELINE_SEED_MODE=static
Loads:
test_samples/seed_questions.jsonl
Simple and deterministic.
5.2 Question-First Mode (recommended)
PIPELINE_SEED_MODE=question-first
Pipeline:
for each chunk:
questions = runQuestionGeneration(chunk)
for each question:
runPipelineStep(question)
This is the correct mode for massive bootstrap distillation because not every chunk answers the same static seed questions.
This mode uses:
- QUESTION_PROVIDER
- QUESTION_MODEL
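The question-first loop above, written out with the stage functions injected so it can run against mocks (`runQuestionFirst` is a hypothetical wrapper, not a repo export):

```javascript
// Drive the pipeline content-first: questions are generated per chunk,
// then each question goes through the full step.
async function runQuestionFirst(chunks, { generateQuestions, runStep }) {
  const results = [];
  for (const chunk of chunks) {
    const questions = await generateQuestions(chunk);
    for (const question of questions) {
      results.push(await runStep(question));
    }
  }
  return results;
}
```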
6. Modularization Status
Already modular:
- generator_core.mjs
- verifier_core.mjs
- reward_core.mjs
- provider.mjs
- question_core.mjs
- retrieval.mjs
Partially modular:
- pipeline.mjs (big but structured)
- pipeline_cli.mjs (needs handling for dynamic seed mode)
Planned:
pipeline/
retrieval_stage.mjs
generator_stage.mjs
verifier_stage.mjs
reward_stage.mjs
gold_writer.mjs
This matches the ROADMAP.
7. What Can Be Tested
All pure modules have unit tests:
| Module | Tested? | Notes |
|---|---|---|
| generator_core | ✅ | mock provider |
| verifier_core | ✅ | mock provider |
| reward_core | ✅ | mock provider |
| question_core | ✅ | mock provider |
| provider dispatcher | ✅ | dispatch logic |
| retrieval | ✅✅ | mock + real ES |
| pipeline (mock) | ✅ | integration |
| pipeline (real) | optional | can add later |
Your test suite is healthy:
9 files, 27 tests, all passing
8. Logging & Verbose Mode
All stages print diagnostics when verbose is passed to:
npm run pipeline -- --verbose
Includes:
- first chunk preview
- raw LLM output
- parsed JSON
- acceptance status
- error messages
9. Future Extensions
As per ROADMAP:
- split pipeline into smaller modules
- improved QG (HyDE, retries, JSON repair)
- dedupe (minhash)
- gold dataset quality metrics
- full distillation cycle (generator → verifier → reward → training → new generator)
10. Successor Notes
This project is:
- entirely Node.js ESM
- fully testable end-to-end
- GPU-agnostic
- provider-agnostic
- prompt-driven
- safe to modify when modularized
Golden rule:
Never mix CLI code with pipeline logic. Put everything pure into
*_core.mjs, test it, then wrap it in CLI tools.
If you'd like, I can also:
- generate the next version of pipeline modularization
- implement PIPELINE_SEED_MODE=question-first fully
- add a chunk loader so QG works immediately
- produce a Mermaid architecture diagram
- produce a successor prompt to embed in the repo
Just tell me.