ARCHITECTURE.md

Distill-Pipeline: System Architecture & Successor Notes (Node.js, ESM, Ollama/vLLM/OpenAI providers, Vitest-tested)


1. Purpose

distill-pipeline is a modular, retrieval-augmented LLM distillation engine. It produces high-quality gold data by running each question through:

  1. retrieval (hybrid RAG via distill-rag)
  2. generator (teacher model)
  3. verifier (alignment/format checker)
  4. reward model (scoring)
  5. gold writer (JSONL builder)

It also includes a question generation module to extract questions directly from RAG chunks, enabling true content-first distillation.

The system is built for offline, local distillation on consumer GPUs (for example, a 3090 plus a 3060).


2. High-Level Flow

              ┌────────────────┐
              │  Chunk Source  │ ← distill-rag index
              └───────┬────────┘
                      ▼
        (optional) Question Generation
                      ▼
              ┌────────────────┐
              │   Retrieval    │ (hybrid BM25 + dense)
              └───────┬────────┘
                      ▼
              ┌────────────────┐
              │   Generator    │ (LLM teacher)
              └───────┬────────┘
                      ▼
              ┌────────────────┐
              │    Verifier    │ (LLM)
              └───────┬────────┘
                      ▼
              ┌────────────────┐
              │  Reward Model  │ (LLM critic)
              └───────┬────────┘
                      ▼
              ┌────────────────┐
              │  Gold Writer   │
              └────────────────┘

3. Directory Layout

Repo structure after modularization:

distill-pipeline/
  prompts/
    generator_prompt.txt
    verifier_prompt.txt
    reward_prompt.txt
    question_prompt.txt

  src/
    pipeline/
      pipeline.mjs
      pipeline_cli.mjs
    providers/
      provider.mjs
      ollama_provider.mjs
      openai_provider.mjs
      http_provider.mjs
    retrieval/
      retrieval.mjs
    generator/
      generator_core.mjs
    verifier/
      verifier_core.mjs
    reward/
      reward_core.mjs
    question/
      question_core.mjs
      question_cli.mjs

  gold/
    (generated JSONL files)

  test_samples/
    seed_questions.jsonl   ← for static mode

  tests/
    generator_core.test.mjs
    verifier_core.test.mjs
    reward_core.test.mjs
    provider.mock.test.mjs
    pipeline.mock.test.mjs
    retrieval.real.test.mjs
    retrieval.mock.test.mjs
    gold_core.test.mjs
    question_core.test.mjs

  .env
  package.json
  ARCHITECTURE.md
  ROADMAP.md

Stage logic is separated into pure core modules, each with Vitest tests. The orchestrator and CLI are only partially modular (see section 6).


4. Core Modules

Below is a top-down view.


4.1 Provider System (src/providers/)

This system routes each pipeline stage to a backend:

  • OllamaProvider
  • OpenAIProvider
  • HttpProvider
  • future: vLLMProvider

All providers expose:

async generate(prompt, options?)

The dispatcher:

loadProviderFor("generator" | "verifier" | "reward" | "question")

Selects backend using env:

GENERATOR_PROVIDER=ollama
VERIFIER_PROVIDER=ollama
REWARD_PROVIDER=ollama
QUESTION_PROVIDER=ollama

And uses stage-specific model names:

GENERATOR_MODEL=qwen3-vl:8b-thinking
VERIFIER_MODEL=patronus:8b
REWARD_MODEL=patronus:8b
QUESTION_MODEL=qwen2.5-7b-instruct

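Because every stage depends only on the shared generate interface, backends can be swapped per stage through environment variables and replaced with mocks in tests.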
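As an illustration of the env-based routing, here is a minimal sketch of how a dispatcher could resolve a stage to a backend. It is not the code in src/providers/provider.mjs: resolveStageConfig, EchoProvider, the registry object, and the "ollama" default are all assumptions made for the example.

```javascript
// Hypothetical sketch; the real loadProviderFor may be structured differently.
const STAGES = ["generator", "verifier", "reward", "question"];

// Read <STAGE>_PROVIDER and <STAGE>_MODEL from an env-like object.
function resolveStageConfig(stage, env = process.env) {
  if (!STAGES.includes(stage)) {
    throw new Error(`Unknown pipeline stage: ${stage}`);
  }
  const prefix = stage.toUpperCase();
  return {
    stage,
    providerName: env[`${prefix}_PROVIDER`] ?? "ollama", // assumed default
    model: env[`${prefix}_MODEL`],
  };
}

// Stand-in backend exposing the shared generate(prompt, options?) contract.
class EchoProvider {
  constructor(model) {
    this.model = model;
  }
  async generate(prompt, options = {}) {
    return JSON.stringify({ model: this.model, prompt, options });
  }
}

// Registry of backend factories; real entries would wrap Ollama, OpenAI, or HTTP clients.
const registry = {
  ollama: (model) => new EchoProvider(model),
};

function loadProviderFor(stage, env = process.env) {
  const { providerName, model } = resolveStageConfig(stage, env);
  const factory = registry[providerName];
  if (!factory) {
    throw new Error(`No provider registered for "${providerName}"`);
  }
  return factory(model);
}
```

Passing env as a parameter keeps the lookup testable without mutating process.env.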


4.2 Retrieval (src/retrieval/retrieval.mjs)

The retrieval layer connects to the distill-rag Elasticsearch index.

Supports:

  • BM25
  • Dense vector KNN
  • Hybrid RRF
  • optional future HyDE

The key export:

export async function hybridSearch(query, k)

Both real and mock tests cover this module.

✔ This module is stable.
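For readers unfamiliar with hybrid RRF, the following sketch shows reciprocal rank fusion over two ranked lists of chunk ids. It is only an illustration of the technique: the function name, the rank constant of 60 (a commonly used default), and the sample ids are assumptions, and the actual fusion in retrieval.mjs may be done differently or inside Elasticsearch.

```javascript
// Reciprocal rank fusion: score(d) = sum over lists of 1 / (rankConstant + rank(d)),
// where rank starts at 1 for the top hit of each list.
function reciprocalRankFusion(rankedLists, rankConstant = 60, topK = 10) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (rankConstant + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .slice(0, topK)
    .map(([id, score]) => ({ id, score }));
}

// Hypothetical hit lists from the BM25 and dense retrievers.
const bm25Hits = ["c1", "c2", "c3"];
const denseHits = ["c3", "c1", "c4"];
const fused = reciprocalRankFusion([bm25Hits, denseHits]);
console.log(fused.map((hit) => hit.id));
```

Chunks that appear in both lists (c1 and c3 here) accumulate two terms and so rank above chunks found by only one retriever.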


4.3 Generator (src/generator/generator_core.mjs)

Pure function:

async function runGenerator(query, contextChunks, provider)

Pipeline:

  • loads generator prompt template
  • merges context chunks into a context string
  • invokes provider.generate
  • JSON-parses output
  • returns:
{
  query,
  context,
  raw,
  parsed
}

✓ fully test-covered
✓ easy to replace provider/model
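The generator steps can be sketched roughly as follows. This is an assumption-laden outline, not the contents of generator_core.mjs: the inline TEMPLATE and its {{context}} and {{query}} placeholders, the chunk.content field, and the helper names are hypothetical, and the real module loads its template from prompts/generator_prompt.txt.

```javascript
// Assumed template placeholders; the real prompt file may use different ones.
const TEMPLATE = "Context:\n{{context}}\n\nQuestion: {{query}}\nAnswer as JSON.";

// Join chunk texts into one context string, numbering each chunk.
function buildContext(contextChunks) {
  return contextChunks
    .map((chunk, i) => `[${i + 1}] ${chunk.content}`) // "content" is an assumed field
    .join("\n\n");
}

// Parse JSON without throwing; null signals a malformed model response.
function safeJsonParse(text) {
  try {
    return JSON.parse(text);
  } catch {
    return null;
  }
}

async function runGenerator(query, contextChunks, provider) {
  const context = buildContext(contextChunks);
  // Function replacers avoid "$" patterns in the inserted text being interpreted.
  const prompt = TEMPLATE
    .replace("{{context}}", () => context)
    .replace("{{query}}", () => query);
  const raw = await provider.generate(prompt);
  return { query, context, raw, parsed: safeJsonParse(raw) };
}

// Usage with a mock provider, as a unit test might do.
const mockProvider = { generate: async () => '{"answer":"42"}' };
runGenerator("What is the answer?", [{ content: "The answer is 42." }], mockProvider)
  .then((result) => console.log(result.parsed));
```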


4.4 Verifier (src/verifier/verifier_core.mjs)

Pure function:

async function runVerifier(sample, provider)

Applies:

  • structural JSON check
  • alignment/tone check
  • error correction fallback

Returns:

{
  ok: boolean,
  raw,
  parsed,
  sample
}

✓ test-covered
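A minimal sketch of the three verifier checks, under stated assumptions: the "answer" field in the generator output, the inline prompt text, the {"ok": boolean} verdict format, and the helper names are all hypothetical. The real verifier_core.mjs loads its prompt from prompts/verifier_prompt.txt and may correct errors differently.

```javascript
// Structural check: the generator output must be an object with a non-empty answer string.
// The "answer" field name is an assumption about the generator's JSON schema.
function hasValidStructure(sample) {
  const parsed = sample?.parsed;
  return (
    parsed !== null &&
    typeof parsed === "object" &&
    typeof parsed.answer === "string" &&
    parsed.answer.trim().length > 0
  );
}

// Fallback for verdicts that are not valid JSON: extract the outermost {...} span and retry.
function parseVerdict(raw) {
  try {
    return JSON.parse(raw);
  } catch {
    const match = raw.match(/\{[\s\S]*\}/);
    if (!match) return null;
    try {
      return JSON.parse(match[0]);
    } catch {
      return null;
    }
  }
}

async function runVerifier(sample, provider) {
  // Skip the LLM call entirely when the sample is structurally invalid.
  if (!hasValidStructure(sample)) {
    return { ok: false, raw: null, parsed: null, sample };
  }
  const prompt =
    'Check this answer for alignment and tone. Reply as JSON {"ok": boolean}.\n' +
    JSON.stringify(sample.parsed);
  const raw = await provider.generate(prompt);
  const parsed = parseVerdict(raw);
  return { ok: parsed?.ok === true, raw, parsed, sample };
}
```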


4.5 Reward Model (src/reward/reward_core.mjs)

Pure scoring function:

async function runReward(sample, provider)

  • loads reward prompt
  • calls provider
  • ensures score is numeric
  • sets ok to true only when the score is positive

✓ test-covered

(This will eventually be replaced with a Skywork or Nemotron reward server.)
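The scoring logic might look like the sketch below. The inline prompt, the {"score": number} reply format, the returned field set, and the toNumericScore helper are assumptions for illustration; the real reward_core.mjs loads prompts/reward_prompt.txt.

```javascript
// Coerce a model-reported score to a finite number, or null if that is not possible.
function toNumericScore(value) {
  const score = typeof value === "string" ? Number.parseFloat(value) : value;
  return typeof score === "number" && Number.isFinite(score) ? score : null;
}

async function runReward(sample, provider) {
  const prompt =
    'Score this answer. Reply as JSON {"score": number}.\n' +
    JSON.stringify(sample.parsed);
  const raw = await provider.generate(prompt);
  let parsed = null;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Malformed output: parsed stays null, so the score below is null as well.
  }
  const score = toNumericScore(parsed?.score);
  // ok is true only for a numeric, strictly positive score.
  return { ok: score !== null && score > 0, score, raw, parsed };
}
```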


4.6 Question Generation (src/question/question_core.mjs)

The newest subsystem.

async function runQuestionGeneration(chunk, provider, maxQuestions)

Flow:

  1. Take a raw content chunk (from distill-rag)
  2. Prompt an LLM to extract 1–N questions
  3. Parse/repair JSON
  4. Return array of questions

Used when:

PIPELINE_SEED_MODE=question-first

So the pipeline becomes:

chunk → questions → retrieval → generator → ...

✓ test-covered
✓ modular
✓ will become core for bootstrap distillation
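The parse/repair step is the part most likely to need care, so here is one possible shape for it. This is a sketch only: it assumes the chunk is a plain string and that the model is asked for a JSON array of strings, and parseQuestionArray, cleanQuestions, and the default of 3 questions are hypothetical names and values.

```javascript
// Try strict JSON first, then fall back to the outermost [...] span in the text.
function parseQuestionArray(raw) {
  const candidates = [raw, raw.match(/\[[\s\S]*\]/)?.[0]];
  for (const candidate of candidates) {
    if (!candidate) continue;
    try {
      const value = JSON.parse(candidate);
      if (Array.isArray(value)) return value;
    } catch {
      // Not valid JSON; try the next candidate.
    }
  }
  return [];
}

// Keep non-empty strings, trim them, drop duplicates, and cap at maxQuestions.
function cleanQuestions(values, maxQuestions) {
  const seen = new Set();
  const out = [];
  for (const value of values) {
    if (out.length >= maxQuestions) break;
    if (typeof value !== "string") continue;
    const question = value.trim();
    if (question.length === 0 || seen.has(question)) continue;
    seen.add(question);
    out.push(question);
  }
  return out;
}

async function runQuestionGeneration(chunk, provider, maxQuestions = 3) {
  const prompt =
    `Extract up to ${maxQuestions} questions answered by this text. ` +
    `Reply as a JSON array of strings.\n\n${chunk}`;
  const raw = await provider.generate(prompt);
  return cleanQuestions(parseQuestionArray(raw), maxQuestions);
}
```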


4.7 Pipeline Orchestrator (src/pipeline/pipeline.mjs)

This is the master controller.

Key functions:

runPipelineStep({ question, verbose })

Performs:

  1. retrieval
  2. generator
  3. verifier
  4. reward

and returns:

{
  status: 'accepted' | 'generator_failed' | ...,
  question,
  context,
  gen,
  ver,
  rew
}
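The control flow of a single step can be sketched as below. Two things are assumptions made so the example runs standalone: the stages are injected as a second argument (the real function takes only { question, verbose } and resolves its own stages), and the status names other than 'accepted' and 'generator_failed' are invented placeholders.

```javascript
// Hypothetical orchestration sketch with injected stage functions.
async function runPipelineStep({ question, verbose = false }, stages) {
  const log = (tag, message) => {
    if (verbose) console.log(`[${tag}] ${message}`);
  };

  const context = await stages.retrieve(question);
  log("retrieval", `${context.length} chunks`);

  const gen = await stages.generate(question, context);
  log("generator", gen.raw);
  if (!gen.parsed) {
    return { status: "generator_failed", question, context, gen, ver: null, rew: null };
  }

  const ver = await stages.verify(gen);
  log("verifier", `ok=${ver.ok}`);
  if (!ver.ok) {
    // "verifier_rejected" is an assumed status name.
    return { status: "verifier_rejected", question, context, gen, ver, rew: null };
  }

  const rew = await stages.reward(gen);
  log("reward", `ok=${rew.ok}`);
  // "reward_rejected" is an assumed status name.
  const status = rew.ok ? "accepted" : "reward_rejected";
  return { status, question, context, gen, ver, rew };
}
```

Returning early on the first failing stage avoids spending verifier and reward calls on samples that can no longer be accepted.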

Extensive verbose logging is built in:

   [retrieval] ...
   [generator] ...
   [verifier] ...
   [reward] ...

runPipelineBatch({ seedsPath, limit, verbose })

Iterates over seeds:

  • static seed mode (default)
  • or question-first mode (pending)

Writes accepted samples via:

appendGoldRecord(outPath, record)
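A gold writer for JSONL only needs to serialize one record per line and append it. The sketch below assumes that shape; toGoldLine is a hypothetical helper, the record fields in the usage comment are invented, and the real appendGoldRecord may add validation or directory handling.

```javascript
// One JSON object per line; JSON.stringify escapes embedded newlines,
// so each record always occupies exactly one line.
function toGoldLine(record) {
  return JSON.stringify(record) + "\n";
}

async function appendGoldRecord(outPath, record) {
  // Dynamic import keeps this snippet pasteable anywhere; inside the repo's
  // ESM modules a top-level import of node:fs/promises is the usual style.
  const { appendFile } = await import("node:fs/promises");
  await appendFile(outPath, toGoldLine(record), "utf8");
}

// Example call (fields are illustrative only):
// await appendGoldRecord("gold/out.jsonl", { question: "q", answer: "a", score: 0.9 });
```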


5. Seed Modes

There are two entry strategies:


5.1 Static Question Mode

PIPELINE_SEED_MODE=static

Loads:

test_samples/seed_questions.jsonl

Simple and deterministic.
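Loading static seeds amounts to parsing a JSONL file into a list of questions. A minimal sketch, assuming each line is an object with a "question" field (the actual schema of seed_questions.jsonl is not documented here, and parseSeedQuestions is a hypothetical name):

```javascript
// Parse JSONL text into question strings, skipping blank lines.
// The "question" field name is an assumption about the seed file schema.
function parseSeedQuestions(jsonlText) {
  const questions = [];
  jsonlText.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return;
    let record;
    try {
      record = JSON.parse(line);
    } catch (err) {
      throw new Error(`Invalid JSON on seed line ${index + 1}: ${err.message}`);
    }
    if (typeof record?.question === "string") questions.push(record.question);
  });
  return questions;
}

const sample = '{"question":"What is RRF?"}\n\n{"question":"What is BM25?"}\n';
console.log(parseSeedQuestions(sample));
```

Failing loudly with the line number makes a malformed seed file easy to fix before a long batch run starts.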


5.2 Question-First Mode (recommended)

PIPELINE_SEED_MODE=question-first

Pipeline:

for (const chunk of chunks) {
  const questions = await runQuestionGeneration(chunk, provider, maxQuestions);
  for (const question of questions) {
    await runPipelineStep({ question, verbose });
  }
}

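This mode suits large-scale bootstrap distillation because a fixed set of seed questions is not answerable from every chunk, whereas generated questions are grounded in the chunk they came from. Batch support for it is still pending (see 4.7).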

This mode uses:

  • QUESTION_PROVIDER
  • QUESTION_MODEL
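Since batch support for this mode is still pending, the following is only one way the driver could be shaped. runQuestionFirstBatch is a hypothetical name, and both generateQuestions and runStep are injected so the loop can be exercised with mocks; the status tally is an illustrative return value, not an existing API.

```javascript
// Hypothetical question-first driver that tallies step statuses.
async function runQuestionFirstBatch(chunks, { generateQuestions, runStep, limit = Infinity }) {
  const tally = {};
  let processed = 0;
  for (const chunk of chunks) {
    const questions = await generateQuestions(chunk);
    for (const question of questions) {
      // Stop once the overall question budget is used up.
      if (processed >= limit) return tally;
      const { status } = await runStep({ question });
      tally[status] = (tally[status] ?? 0) + 1;
      processed += 1;
    }
  }
  return tally;
}
```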

6. Modularization Status

Already modular:

  • generator_core.mjs
  • verifier_core.mjs
  • reward_core.mjs
  • provider.mjs
  • question_core.mjs
  • retrieval.mjs

Partially modular:

  • pipeline.mjs (big but structured)
  • pipeline_cli.mjs (needs handling for dynamic seed mode)

Planned:

pipeline/
  retrieval_stage.mjs
  generator_stage.mjs
  verifier_stage.mjs
  reward_stage.mjs
  gold_writer.mjs

This matches the ROADMAP.


7. What Can Be Tested

All pure modules have unit tests:

| Module | Tested? | Notes |
|---|---|---|
| generator_core | ✓ | mock provider |
| verifier_core | ✓ | mock provider |
| reward_core | ✓ | mock provider |
| question_core | ✓ | mock provider |
| provider dispatcher | ✓ | dispatch logic |
| retrieval | ✓✓ | mock + real ES |
| pipeline (mock) | ✓ | integration |
| pipeline (real) | optional | can add later |

The test suite currently passes in full:

9 files, 27 tests → all pass

8. Logging & Verbose Mode

All stages print diagnostics when the pipeline is run with the --verbose flag:

npm run pipeline -- --verbose

Includes:

  • first chunk preview
  • raw LLM output
  • parsed JSON
  • acceptance status
  • error messages

9. Future Extensions

As per ROADMAP:

  • split pipeline into smaller modules
  • improved QG (HyDE, retries, JSON repair)
  • dedupe (minhash)
  • gold dataset quality metrics
  • full distillation cycle (generator → verifier → reward → training → new generator)

10. Successor Notes

This project is:

  • entirely Node.js ESM
  • fully testable end-to-end
  • GPU-agnostic
  • provider-agnostic
  • prompt-driven
  • safe to modify when modularized

Golden rule:

Never mix CLI code with pipeline logic. Put everything pure into *_core.mjs, test it, then wrap it in CLI tools.
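To make the rule concrete, here is a sketch of a CLI wrapper that stays thin. Only --verbose appears in this document; the --limit and --seeds flags, parseCliArgs, and main are hypothetical and simply mirror the parameters of runPipelineBatch.

```javascript
// Pure: turn an argv array into an options object. No I/O, so it is trivially testable.
// --limit and --seeds are assumed flags, not documented ones.
function parseCliArgs(argv) {
  const opts = { verbose: false, limit: undefined, seedsPath: undefined };
  for (let i = 0; i < argv.length; i += 1) {
    const arg = argv[i];
    if (arg === "--verbose") opts.verbose = true;
    else if (arg === "--limit") opts.limit = Number.parseInt(argv[++i], 10);
    else if (arg === "--seeds") opts.seedsPath = argv[++i];
    else throw new Error(`Unknown argument: ${arg}`);
  }
  return opts;
}

// Thin wrapper: the only place that touches process.argv and the console.
// runBatch stands in for runPipelineBatch and is injected by the caller.
async function main(runBatch, argv = process.argv.slice(2)) {
  const summary = await runBatch(parseCliArgs(argv));
  console.log(JSON.stringify(summary));
}
```

All decisions live in parseCliArgs and the injected batch function, so tests never need to spawn a process.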

