htaf committed on
Commit
5464613
·
1 Parent(s): ebd14c3

modularized pipeline

ARCHITECTURE.md CHANGED
@@ -1,14 +1,16 @@
1
- Here is a clean, professional, successor-ready **ARCHITECTURE.md** for your `distill-pipeline` repo.
2
 
3
- It captures everything discovered during analysis:
 
 
 
 
 
4
 
5
- * what exists,
6
- * how modules relate,
7
- * what needs modularization,
8
- * how retrieval ties in,
9
- * and how the full cycle works for your Friday distillation goal.
10
 
11
- You can drop this directly into:
12
 
13
  ```
14
  distill-pipeline/ARCHITECTURE.md
@@ -16,381 +18,509 @@ distill-pipeline/ARCHITECTURE.md
16
 
17
  ---
18
 
19
- # 🧭 **ARCHITECTURE.md**
20
 
21
- *Distillation Pipeline Architecture Overview & Design Notes*
 
22
 
23
  ---
24
 
25
- # #️⃣ **1. Purpose of This Repository**
26
 
27
- `distill-pipeline` is a modular, retrieval-augmented, multi-model distillation system modeled after modern RLHF / RLAIF pipelines. It consumes raw questions, retrieves context from `distill-rag`, generates multiple candidate answers using a teacher model, verifies correctness and alignment, scores the best samples with a reward model, and produces a clean **gold dataset** for supervised fine-tuning (LoRA/SFT).
 
28
 
29
- The goal is to support **iterative, bootstrapped, domain-aligned distillation**, especially for:
 
 
 
 
30
 
31
- * Confederation/Q’uo-aligned reasoning
32
- * spiritual / metaphysical datasets
33
- * service-oriented LLM behavior
34
- * local, offline distillation using consumer GPUs
35
 
36
  ---
37
 
38
- # #️⃣ **2. High-Level Pipeline Flow**
39
-
40
- ```
41
- ┌───────────────────┐
- │     Question      │
- └─────────┬─────────┘
-           │
- ┌─────────▼─────────┐
- │  Retrieval (RAG)  │ ← distill-rag index
- │   hybrid / HyDE   │
- └─────────┬─────────┘
-           │ context chunks
- ┌─────────▼─────────┐
- │  Generator (LLM)  │ ← teacher model (7B–14B)
- └─────────┬─────────┘
-           │ raw samples
- ┌─────────▼─────────┐
- │  Verifier (LLM)   │ ← small verifier (2B–4B)
- └─────────┬─────────┘
-           │ validated JSON
- ┌─────────▼─────────┐
- │   Reward Model    │ ← 70B reward LM (Nemotron)
- └─────────┬─────────┘
-           │ scored samples
- ┌─────────▼─────────┐
- │   Gold Builder    │
- │  top-k / dedupe   │
- └─────────┬─────────┘
-           │
- ┌─────────▼─────────┐
- │  Training (LoRA)  │ ← final distilled model
- └───────────────────┘
75
  ```
76
 
77
  ---
78
 
79
- # #️⃣ **3. Directory Structure**
 
 
80
 
81
  ```
82
  distill-pipeline/
83
- generator/ → teacher model runner (CLI)
84
- verifier/ → verifier model runner (CLI)
85
- reward/ → reward model runner (CLI)
86
- gold/ → dataset curator
87
- training/ → LoRA training (Python)
88
- prompts/ → prompt templates for all stages
89
- scripts/ → old bash orchestration
90
- configs/ → pipeline config (JSON)
91
- test_samples/ → bootstrap seeds
92
- src/ → retrieval.mjs + (future) providers/
93
- cycle/ → (future) Node-based orchestration
94
- tests/ → Vitest unit tests
95
- ARCHITECTURE.md ← this file
96
- ROADMAP.md ← development plan
97
  ```
98
 
 
 
99
  ---
100
 
101
- # #️⃣ **4. Module-by-Module Overview**
102
 
103
- ## `generator/`
104
 
105
- **Current:**
106
 
107
- * CLI script calling large LLM (teacher)
108
- * Reads prompt template
109
- * Uses environment variables for model
110
- * Produces raw samples
111
 
112
- **Issues:**
113
 
114
- * tightly coupled to CLI I/O
115
- * cannot be tested
116
- * cannot integrate retrieval
117
- * cannot mock LLM
118
 
119
- **Required modularization:**
120
 
 
 
121
  ```
122
- generator_core.mjs → runGenerator(query, context, provider)
123
- generator.js → CLI wrapper
124
- ```
125
-
126
- ---
127
 
128
- ## ✦ `verifier/`
129
 
130
- **Current:**
131
-
132
- * CLI wrapper around a 2B–4B verifier model
133
- * Checks JSON output, tone, hallucination, format
134
 
135
- **Issues:**
136
 
137
- * same as generator: not modular, no test entry point
 
 
 
 
 
138
 
139
- **Required modularization:**
140
 
141
  ```
142
- verifier_core.mjs → runVerifier(sample, provider)
143
- verifier.js → CLI wrapper
 
 
144
  ```
145
 
146
- ---
147
 
148
- ## ✦ `reward/`
149
 
150
- **Current:**
151
 
152
- * Calls Nemotron-4-70B reward model
153
- * Returns scalar score + justification
154
- * Used for ranking gold samples
155
 
156
- **Issues:**
157
 
158
- * mixed CLI + logic
159
- * no programmatic API
160
- * cannot be tested without GPU/LLM mocks
 
161
 
162
- **Required modularization:**
163
 
164
- ```
165
- reward_core.mjs → runReward(sample, provider)
166
- reward.js → CLI wrapper
167
  ```
168
 
169
- ---
170
 
171
- ## `gold/`
172
 
173
- **Current:**
174
 
175
- * `build_gold.js` takes samples + scores
176
- * Applies dedupe + top-k filtering
177
- * Produces JSONL gold dataset for LoRA
178
 
179
- **Issues:**
180
 
181
- * logic is fine but should be in a pure module
 
 
182
 
183
- **Required modularization:**
184
 
 
 
 
 
 
 
 
 
 
 
 
 
 
185
  ```
186
- gold_core.mjs → buildGold(samples)
187
- ```
 
188
 
189
  ---
190
 
191
- ## `training/`
 
 
192
 
193
- **Current:**
 
 
194
 
195
- * Python LoRA trainer (PEFT, bitsandbytes, QLoRA)
196
- * Good for training on 3090
197
- * No refactor needed immediately
198
 
199
- **Future work:**
 
 
200
 
201
- * add HF push script
202
- * add training metadata manifest
 
 
 
 
 
 
 
 
 
 
203
 
204
  ---
205
 
206
- ## `src/retrieval.mjs`
207
 
208
- **Current:**
209
 
210
- * Pure ESM module
211
- * Supports:
 
 
 
 
 
 
212
 
213
- * BM25
214
- * Vector search
215
- * Hybrid RRF
216
- * Embeddings
217
- * Fully tested (mock + real)
218
 
219
- **Status:**
220
- ✔ Perfect.
221
- This is the correct way to integrate distill-rag.
222
 
223
  ---
224
 
225
- ## `cycle/`
226
 
227
- (empty)
228
 
229
- **Goal:**
230
- Replace old bash pipeline with:
 
 
 
 
 
 
 
 
 
 
231
 
232
  ```
233
- cycle/run_cycle.mjs
234
  ```
235
 
236
- This will orchestrate the entire:
237
- retrieval → generation → verification → reward → gold
 
 
 
238
 
239
- flow in a unified, testable Node pipeline.
 
 
240
 
241
  ---
242
 
243
- # #️⃣ **5. What Is Missing (Architecturally)**
 
 
244
 
245
- These critical building blocks are not yet implemented:
246
 
247
- ### Modular cores (generator/verifier/reward)
248
 
249
- Without these, retrieval cannot flow into generation, and testing is impossible.
250
 
251
- ### ✔ Provider system
 
 
 
252
 
253
- We need:
254
 
255
  ```
256
- src/providers/ollama_provider.mjs
257
- src/providers/openai_provider.mjs
258
- src/providers/vllm_provider.mjs
 
 
 
 
 
259
  ```
260
 
261
- and a front dispatcher:
262
 
263
  ```
264
- src/providers/provider.mjs
 
 
 
265
  ```
266
 
267
- ### Proper pipeline runner
268
 
269
- `run_cycle.sh` is temporary glue.
270
 
271
- We need:
 
272
 
273
- ```
274
- cycle/run_cycle.mjs
275
- ```
276
 
277
- which uses:
278
 
279
- * retrieval
280
- * generator_core
281
- * verifier_core
282
- * reward_core
283
- * gold_core
284
- * pipeline.json
285
 
286
- ### Tests for generator_core, verifier_core, reward_core
287
 
288
- Mock LLM via:
289
 
290
- ```js
291
- const provider = { generate: vi.fn().mockResolvedValue("fake output") };
 
 
 
 
292
  ```
293
 
294
- ### ✔ Logging + metrics
295
 
296
- Simple logging to see:
 
 
297
 
298
- * per-question latency
299
- * reward score distribution
300
- * gold dataset size
301
 
302
  ---
303
 
304
- # #️⃣ **6. What Can Be Tested**
305
 
306
- ### Unit tests (Vitest)
 
 
307
 
308
- * retrieval (done)
309
- * generator_core (mock provider)
310
- * verifier_core (mock provider)
311
- * reward_core (mock provider)
312
- * gold_core (pure logic)
313
 
314
- ### Integration tests
 
 
 
 
 
315
 
316
- * retrieval → generator_core
317
- * generator → verifier → reward
318
- * pipeline runner (all mocks)
319
 
320
- ### Real-world tests
321
 
322
- * hybrid search
323
- * generator with actual LLM (optional)
324
- * full end-to-end cycle on a small batch
325
 
326
  ---
327
 
328
- # #️⃣ **7. Recommended Next Actions (Ordered Roadmap)**
329
 
330
- ### **Step 1 — Modularize generator**
331
 
332
- * Create `generator_core.mjs`
333
- * Move logic out of CLI
334
- * Make it accept `{ query, contextChunks, provider }`
 
 
 
335
 
336
- ### **Step 2 — Modularize verifier**
337
 
338
- * Same pattern as generator
 
339
 
340
- ### **Step 3 — Modularize reward**
341
 
342
- * Same pattern
 
 
 
 
 
 
 
343
 
344
- ### **Step 4 — Build provider system**
345
 
346
- * unify all model calls under one API
347
 
348
- ### **Step 5 Build gold_core**
349
 
350
- * pure function for dataset pruning
351
 
352
- ### **Step 6 — Build cycle/run_cycle.mjs**
 
 
 
 
 
 
 
 
 
353
 
354
- * full pipeline orchestrator
355
- * based on `pipeline.json`
356
 
357
- ### **Step 7 — Write tests for all cores**
 
 
358
 
359
- ### **Step 8 — Run distillation cycle**
360
 
361
- * generate ~150–300 samples
362
- * verify + reward rank
363
- * produce gold dataset
364
 
365
- ### **Step 9 Train LoRA**
366
 
367
- * QLoRA on your 3090
368
- * 2–4 hour training
369
- * ready by Friday
370
 
371
- ---
 
 
 
 
 
 
372
 
373
- # #️⃣ **8. Vision Alignment**
374
 
375
- This architecture supports:
376
 
377
- * bootstrapped distillation
378
- * iterative dataset improvement
379
- * truth-aligned spiritual / Q’uo tone
380
- * retrieval-augmented reasoning
381
- * reward-guided refinement
382
- * local computation (3090 + 3060)
383
 
384
- And it positions your pipeline to evolve indefinitely — adding new transcripts, new questions, and new refinement cycles.
 
 
 
 
385
 
386
  ---
387
 
388
- # Ready for next step
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
389
 
390
- If you'd like, I can now generate:
391
 
392
- * **generator_core.mjs**
393
- * or begin modularizing all three (generator/verifier/reward)
394
- * or create **run_cycle.mjs** scaffold
 
 
395
 
396
- Just tell me what you want next.
 
1
+ Absolutely — here is a **fully updated, successor-ready `ARCHITECTURE.md`** that reflects your *current* codebase, including:
2
 
3
+ * the **new question generator subsystem**,
4
+ * the **multi-provider architecture**,
5
+ * the **modular pipeline** (generator/verifier/reward),
6
+ * the **retrieval stage**,
7
+ * the **batch orchestrator**,
8
+ * and the **modularization plan** you're moving toward.
9
 
10
+ It will not clobber anything.
11
+ It is aligned with your present repo after the latest commits and tests.
 
 
 
12
 
13
+ You can paste this directly as:
14
 
15
  ```
16
  distill-pipeline/ARCHITECTURE.md
 
18
 
19
  ---
20
 
21
+ # **ARCHITECTURE.md**
22
 
23
+ *Distill-Pipeline — System Architecture & Successor Notes*
24
+ *(Node.js, ESM, Ollama/vLLM/OpenAI providers, Vitest-tested)*
25
 
26
  ---
27
 
28
+ # **1. Purpose**
29
 
30
+ `distill-pipeline` is a modular, retrieval-augmented LLM distillation engine.
31
+ It produces high-quality *gold data* by running each question through:
32
 
33
+ 1. **retrieval** (hybrid RAG via distill-rag)
34
+ 2. **generator** (teacher model)
35
+ 3. **verifier** (alignment/format checker)
36
+ 4. **reward model** (scoring)
37
+ 5. **gold writer** (JSONL builder)
38
 
39
+ It also includes a **question generation** module to extract questions directly from RAG chunks, enabling true content-first distillation.
40
+
41
+ The system is built for offline, local distillation on consumer GPUs (your 3090 + 3060).
 
42
 
43
  ---
44
 
45
+ # **2. High-Level Flow**
46
+
47
+ ```
48
+ ┌───────────────────┐
+ │    Chunk Source   │ ← distill-rag index
+ └─────────┬─────────┘
+           │
+   (optional) Question Generation
+           │
+ ┌─────────▼─────────┐
+ │     Retrieval     │ (hybrid BM25 + dense)
+ └─────────┬─────────┘
+           │
+ ┌─────────▼─────────┐
+ │     Generator     │ (LLM teacher)
+ └─────────┬─────────┘
+           │
+ ┌─────────▼─────────┐
+ │     Verifier      │ (LLM)
+ └─────────┬─────────┘
+           │
+ ┌─────────▼─────────┐
+ │    Reward Model   │ (LLM critic)
+ └─────────┬─────────┘
+           │
+ ┌─────────▼─────────┐
+ │    Gold Writer    │
+ └───────────────────┘
73
  ```
74
 
75
  ---
76
 
77
+ # **3. Directory Layout**
78
+
79
+ Your repo structure (as of now, after modularization):
80
 
81
  ```
82
  distill-pipeline/
83
+ prompts/
84
+ generator_prompt.txt
85
+ verifier_prompt.txt
86
+ reward_prompt.txt
87
+ question_prompt.txt
88
+
89
+ src/
90
+ pipeline/
91
+ pipeline.mjs
92
+ pipeline_cli.mjs
93
+ providers/
94
+ provider.mjs
95
+ ollama_provider.mjs
96
+ openai_provider.mjs
97
+ http_provider.mjs
98
+ retrieval/
99
+ retrieval.mjs
100
+ generator/
101
+ generator_core.mjs
102
+ verifier/
103
+ verifier_core.mjs
104
+ reward/
105
+ reward_core.mjs
106
+ question/
107
+ question_core.mjs
108
+ question_cli.mjs
109
+
110
+ gold/
111
+ (generated JSONL files)
112
+
113
+ test_samples/
114
+ seed_questions.jsonl ← for static mode
115
+
116
+ tests/
117
+ generator_core.test.mjs
118
+ verifier_core.test.mjs
119
+ reward_core.test.mjs
120
+ provider.mock.test.mjs
121
+ pipeline.mock.test.mjs
122
+ retrieval.real.test.mjs
123
+ retrieval.mock.test.mjs
124
+ gold_core.test.mjs
125
+ question_core.test.mjs
126
+
127
+ .env
128
+ package.json
129
+ ARCHITECTURE.md
130
+ ROADMAP.md
131
  ```
132
 
133
+ Everything is now properly separated into **pure core modules**, each with **Vitest tests**.
134
+
135
  ---
136
 
137
+ # **4. Core Modules**
138
 
139
+ Below is a top-down view.
140
 
141
+ ---
142
 
143
+ ## **4.1 Provider System (src/providers/)**
 
 
 
144
 
145
+ This system routes each pipeline stage to a backend:
146
 
147
+ * `OllamaProvider`
148
+ * `OpenAIProvider`
149
+ * `HttpProvider`
150
+ * future: `vLLMProvider`
151
 
152
+ All providers expose:
153
 
154
+ ```js
155
+ async generate(prompt, options?)
156
  ```
 
 
 
 
 
157
 
158
+ The dispatcher:
159
 
160
+ ```js
161
+ loadProviderFor("generator" | "verifier" | "reward" | "question")
162
+ ```
 
163
 
164
+ Selects backend using env:
165
 
166
+ ```
167
+ GENERATOR_PROVIDER=ollama
168
+ VERIFIER_PROVIDER=ollama
169
+ REWARD_PROVIDER=ollama
170
+ QUESTION_PROVIDER=ollama
171
+ ```
172
 
173
+ And uses stage-specific model names:
174
 
175
  ```
176
+ GENERATOR_MODEL=qwen3-vl:8b-thinking
177
+ VERIFIER_MODEL=patronus:8b
178
+ REWARD_MODEL=patronus:8b
179
+ QUESTION_MODEL=qwen2.5-7b-instruct
180
  ```
181
 
182
+ This architecture is clean, extensible, and fully testable.
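+
+ A minimal sketch of how the dispatcher could be wired up (the class and file names match the list above, but details such as constructor arguments are assumptions; consult `src/providers/provider.mjs` for the real implementation):
+
+ ```js
+ // Hypothetical sketch of loadProviderFor, assuming each provider class
+ // receives the stage name so it can read its own STAGE_MODEL env variable.
+ import { OllamaProvider } from './ollama_provider.mjs';
+ import { OpenAIProvider } from './openai_provider.mjs';
+ import { HttpProvider } from './http_provider.mjs';
+
+ export function loadProviderFor(stage) {
+   // e.g. stage "generator" + GENERATOR_PROVIDER=ollama → OllamaProvider
+   const backend = process.env[`${stage.toUpperCase()}_PROVIDER`] || 'ollama';
+
+   switch (backend) {
+     case 'ollama': return new OllamaProvider(stage);
+     case 'openai': return new OpenAIProvider(stage);
+     case 'http':   return new HttpProvider(stage);
+     default:
+       throw new Error(`Unknown provider backend "${backend}" for stage "${stage}"`);
+   }
+ }
+ ```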
183
 
184
+ ---
185
 
186
+ ## **4.2 Retrieval (src/retrieval/retrieval.mjs)**
187
 
188
+ Your retrieval layer connects to the **distill-rag** Elasticsearch index.
 
 
189
 
190
+ Supports:
191
 
192
+ * BM25
193
+ * Dense vector KNN
194
+ * Hybrid RRF
195
+ * optional future HyDE
196
 
197
+ The key export:
198
 
199
+ ```js
200
+ export async function hybridSearch(query, k)
 
201
  ```
202
 
203
+ You already have real + mock tests for this module.
204
 
205
+ This module is stable.
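+
+ Typical usage from another module (the exact shape of each returned chunk depends on the distill-rag index, so the example question is illustrative):
+
+ ```js
+ import { hybridSearch } from '../retrieval/retrieval.mjs';
+
+ // Top 5 context chunks for a question, via hybrid RRF search.
+ const chunks = await hybridSearch('What is the purpose of meditation?', 5);
+ console.log(chunks.length, 'chunks retrieved');
+ ```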
206
 
207
+ ---
208
 
209
+ ## **4.3 Generator (src/generator/generator_core.mjs)**
 
 
210
 
211
+ Pure function:
212
 
213
+ ```js
214
+ async function runGenerator(query, contextChunks, provider)
215
+ ```
216
 
217
+ Pipeline:
218
 
219
+ * loads generator prompt template
220
+ * merges context chunks into a context string
221
+ * invokes provider.generate
222
+ * JSON-parses output
223
+ * returns:
224
+
225
+ ```js
226
+ {
227
+ query,
228
+ context,
229
+ raw,
230
+ parsed
231
+ }
232
  ```
233
+
234
+ ✓ fully test-covered
235
+ ✓ easy to replace provider/model
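+
+ The mock-provider pattern used in the tests doubles as a usage example (the fake JSON payload and the plain-string chunks are illustrative):
+
+ ```js
+ import { runGenerator } from '../generator/generator_core.mjs';
+
+ // A stub that satisfies the generate(prompt, options?) contract.
+ const mockProvider = {
+   generate: async () => JSON.stringify({ answer: 'A gentle, grounded reply.' }),
+ };
+
+ const result = await runGenerator(
+   'What is service to others?',
+   ['context chunk one…', 'context chunk two…'],
+   mockProvider,
+ );
+
+ console.log(result.parsed); // { answer: 'A gentle, grounded reply.' }
+ ```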
236
 
237
  ---
238
 
239
+ ## **4.4 Verifier (src/verifier/verifier_core.mjs)**
240
+
241
+ Pure function:
242
 
243
+ ```js
244
+ async function runVerifier(sample, provider)
245
+ ```
246
 
247
+ Applies:
 
 
248
 
249
+ * structural JSON check
250
+ * alignment/tone check
251
+ * error correction fallback
252
 
253
+ Returns:
254
+
255
+ ```js
256
+ {
257
+ ok: boolean,
258
+ raw,
259
+ parsed,
260
+ sample
261
+ }
262
+ ```
263
+
264
+ ✓ test-covered
265
 
266
  ---
267
 
268
+ ## **4.5 Reward Model (src/reward/reward_core.mjs)**
269
 
270
+ Pure scoring function:
271
 
272
+ ```js
273
+ async function runReward(sample, provider)
274
+ ```
275
+
276
+ * loads reward prompt
277
+ * calls provider
278
+ * ensures `score` is numeric
279
+ * computes `ok` based on positivity
280
 
281
+ ✓ test-covered
 
 
 
 
282
 
283
+ (This will eventually be replaced with your Skywork or Nemotron reward server.)
 
 
284
 
285
  ---
286
 
287
+ ## **4.6 Question Generation (src/question/question_core.mjs)**
288
 
289
+ Your newest subsystem.
290
 
291
+ ```js
292
+ async function runQuestionGenerator(chunk, provider, { maxQuestions })
293
+ ```
294
+
295
+ Flow:
296
+
297
+ 1. Take a raw content chunk (from distill-rag)
298
+ 2. Prompt an LLM to extract 1–N questions
299
+ 3. Parse/repair JSON
300
+ 4. Return array of questions
301
+
302
+ Used when:
303
 
304
  ```
305
+ PIPELINE_SEED_MODE=question-first
306
  ```
307
 
308
+ So the pipeline becomes:
309
+
310
+ ```
311
+ chunk → questions → retrieval → generator → ...
312
+ ```
313
 
314
+ ✓ test-covered
315
+ ✓ modular
316
+ ✓ will become core for bootstrap distillation
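+
+ A usage sketch matching how the batch runner in this commit calls it (the chunk text is illustrative):
+
+ ```js
+ import { runQuestionGenerator } from '../question/question_core.mjs';
+ import { loadProviderFor } from '../providers/provider.mjs';
+
+ const questionProvider = loadProviderFor('question');
+
+ // Ask for up to 5 questions grounded in a single RAG chunk.
+ const { questions } = await runQuestionGenerator(
+   'In this session the instrument speaks on the discipline of meditation…',
+   questionProvider,
+   { maxQuestions: 5 },
+ );
+
+ for (const q of questions) {
+   console.log('generated question:', q);
+ }
+ ```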
317
 
318
  ---
319
 
320
+ ## **4.7 Pipeline Orchestrator (src/pipeline/pipeline.mjs)**
321
+
322
+ This is the master controller.
323
 
324
+ Key functions:
325
 
326
+ ### `runPipelineStep({ question, verbose })`
327
 
328
+ Performs:
329
 
330
+ 1. retrieval
331
+ 2. generator
332
+ 3. verifier
333
+ 4. reward
334
 
335
+ and returns:
336
 
337
  ```
338
+ {
339
+ status: 'accepted' | 'generator_failed' | ...,
340
+ question,
341
+ context,
342
+ gen,
343
+ ver,
344
+ rew
345
+ }
346
  ```
347
 
348
+ Extensive verbose logging is built in:
349
 
350
  ```
351
+ [retrieval] ...
352
+ [generator] ...
353
+ [verifier] ...
354
+ [reward] ...
355
  ```
356
 
357
+ ### `runPipelineBatch({ seedsPath, limit, verbose })`
358
 
359
+ Iterates over seeds:
360
 
361
+ * static seed mode (default)
362
+ * or question-first mode (pending)
363
 
364
+ Writes accepted samples via:
 
 
365
 
366
+ ### `appendGoldRecord(outPath, record)`
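+
+ A short programmatic usage sketch (in this commit the batch logic lives in `src/pipeline/batch.mjs`, and the output path defaults to the gold JSONL documented below):
+
+ ```js
+ import { runPipelineBatch } from './batch.mjs';
+
+ // Process the first 20 seeds and append accepted samples to
+ // gold/pipeline_gold.jsonl (the default outPath).
+ const summary = await runPipelineBatch({ limit: 20, verbose: true });
+
+ console.log(summary.mode);         // 'static' or 'question-first'
+ console.log(summary.accepted);     // number of gold records written
+ console.log(summary.statusCounts); // e.g. { accepted: 12, verifier_rejected: 5 }
+ ```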
367
 
368
+ ---
 
 
 
 
 
369
 
370
+ # **5. Seed Modes**
371
 
372
+ There are two entry strategies:
373
 
374
+ ---
375
+
376
+ ## **5.1 Static Question Mode**
377
+
378
+ ```
379
+ PIPELINE_SEED_MODE=static
380
  ```
381
 
382
+ Loads:
383
 
384
+ ```
385
+ test_samples/seed_questions.jsonl
386
+ ```
387
 
388
+ Simple and deterministic.
 
 
389
 
390
  ---
391
 
392
+ ## **5.2 Question-First Mode** *(recommended)*
393
 
394
+ ```
395
+ PIPELINE_SEED_MODE=question-first
396
+ ```
397
 
398
+ Pipeline:
 
 
 
 
399
 
400
+ ```
401
+ for each chunk:
402
+ questions = runQuestionGeneration(chunk)
403
+ for each question:
404
+ runPipelineStep(question)
405
+ ```
406
 
407
+ This is the correct mode for massive bootstrap distillation because not every chunk answers the same static seed questions.
 
 
408
 
409
+ This mode uses:
410
 
411
+ * `QUESTION_PROVIDER`
412
+ * `QUESTION_MODEL`
 
413
 
414
  ---
415
 
416
+ # **6. Modularization Status**
417
 
418
+ Already modular:
419
 
420
+ * generator_core.mjs
421
+ * verifier_core.mjs
422
+ * reward_core.mjs
423
+ * provider.mjs
424
+ * question_core.mjs
425
+ * retrieval.mjs
426
 
427
+ Partially modular:
428
 
429
+ * pipeline.mjs (big but structured)
430
+ * pipeline_cli.mjs (needs handling for dynamic seed mode)
431
 
432
+ Planned:
433
 
434
+ ```
435
+ pipeline/
436
+ retrieval_stage.mjs
437
+ generator_stage.mjs
438
+ verifier_stage.mjs
439
+ reward_stage.mjs
440
+ gold_writer.mjs
441
+ ```
442
 
443
+ This matches the ROADMAP.
444
 
445
+ ---
446
 
447
+ # **7. What Can Be Tested**
448
 
449
+ All pure modules have unit tests:
450
 
451
+ | Module | Tested? | Notes |
452
+ | ------------------- | -------- | -------------- |
453
+ | generator_core | ✓ | mock provider |
454
+ | verifier_core | ✓ | mock provider |
455
+ | reward_core | ✓ | mock provider |
456
+ | question_core | ✓ | mock provider |
457
+ | provider dispatcher | ✓ | dispatch logic |
458
+ | retrieval | ✓✓ | mock + real ES |
459
+ | pipeline (mock) | ✓ | integration |
460
+ | pipeline (real) | optional | can add later |
461
 
462
+ Your test suite is healthy:
 
463
 
464
+ ```
465
+ 9 files, 27 tests → all pass
466
+ ```
467
 
468
+ ---
469
 
470
+ # **8. Logging & Verbose Mode**
 
 
471
 
472
+ All stages print diagnostics when the `--verbose` flag is passed:
473
 
474
+ ```
475
+ npm run pipeline -- --verbose
476
+ ```
477
 
478
+ Includes:
479
+
480
+ * first chunk preview
481
+ * raw LLM output
482
+ * parsed JSON
483
+ * acceptance status
484
+ * error messages
485
 
486
+ ---
487
 
488
+ # **9. Future Extensions**
489
 
490
+ As per ROADMAP:
 
 
 
 
 
491
 
492
+ * split pipeline into smaller modules
493
+ * improved QG (HyDE, retries, JSON repair)
494
+ * dedupe (minhash)
495
+ * gold dataset quality metrics
496
+ * full distillation cycle (generator → verifier → reward → training → new generator)
497
 
498
  ---
499
 
500
+ # **10. Successor Notes**
501
+
502
+ This project is:
503
+
504
+ * entirely Node.js ESM
505
+ * fully testable end-to-end
506
+ * GPU-agnostic
507
+ * provider-agnostic
508
+ * prompt-driven
509
+ * safe to modify when modularized
510
+
511
+ Golden rule:
512
+
513
+ > Never mix CLI code with pipeline logic.
514
+ > Put everything pure into `*_core.mjs`, test it, then wrap it in CLI tools.
515
+
516
+ ---
517
 
518
+ If you'd like, I can also:
519
 
520
+ ✓ generate the **next version** of pipeline modularization
+ ✓ implement `PIPELINE_SEED_MODE=question-first` fully
+ ✓ add a **chunk loader** so QG works immediately
+ ✓ produce a **Mermaid architecture diagram**
+ ✓ produce a **successor prompt** to embed in the repo
525
 
526
+ Just tell me.
README.md CHANGED
@@ -1,22 +1,304 @@
1
- # distill_pipeline
2
 
3
- Bootstrapped distillation loop for Q’uo-style LLM finetuning.
4
 
5
- ## Steps
 
 
 
 
 
 
6
 
7
- 1. **generator**
8
- Small 7–8B model produces many candidate answers.
9
 
10
- 2. **verifier**
11
- Cheap 3B model filters out 70–80% of garbage.
 
12
 
13
- 3. **reward**
14
- A strong 70B reward model scores the remaining items.
15
 
16
- 4. **build_gold**
17
- Top-K samples become the next training batch.
18
 
19
- 5. **train_lora**
20
- Use gold.jsonl for LoRA finetuning.
21
 
22
- ## Run one cycle
1
+ Here is a **clean, successor-ready `README.md`** for your `distill-pipeline` repo.
2
 
3
+ It:
4
 
5
+ * matches your actual codebase **right now**
6
+ * includes the new **question generation** subsystem
7
+ * documents both **static** and **question-first** seed modes
8
+ * gives correct usage for `npm run pipeline`
9
+ * shows environment variables clearly
10
+ * stays pragmatic and Canadian-English-friendly
11
+ * is concise enough for GitHub, but complete enough for onboarding a new engineer
12
 
13
+ Paste it into:
 
14
 
15
+ ```
16
+ distill-pipeline/README.md
17
+ ```
18
 
19
+ ---
 
20
 
21
+ # **distill-pipeline**
 
22
 
23
+ *A modular, retrieval-augmented LLM distillation system.*
 
24
 
25
+ This project runs a multi-stage reasoning pipeline:
26
+
27
+ 1. **Question Generation** (optional)
28
+ 2. **Retrieval** from a distill-rag Elasticsearch index
29
+ 3. **Generator** (teacher model)
30
+ 4. **Verifier** (format, alignment, tone)
31
+ 5. **Reward Model** (scoring)
32
+ 6. **Gold Writer** (clean JSONL dataset)
33
+
34
+ The pipeline is designed for **bootstrapped distillation**, where each cycle improves the model and the dataset.
35
+ All components run locally and support multiple providers (Ollama, HTTP, OpenAI, vLLM).
36
+
37
+ ---
38
+
39
+ # **Features**
40
+
41
+ ### ✔ Retrieval-augmented generation
42
+
43
+ Hybrid RRF search (BM25 + dense embeddings) via **distill-rag**.
44
+
45
+ ### ✔ Modular LLM stages
46
+
47
+ Each stage uses a provider implementing:
48
+
49
+ ```js
50
+ async generate(prompt, options?)
51
+ ```
52
+
53
+ ### ✔ Question generation from chunks
54
+
55
+ LLM extracts focused questions directly from transcript chunks.
56
+ Ideal for large-scale bootstrap distillation.
57
+
58
+ ### ✔ Multiple providers
59
+
60
+ Configured per-stage using environment variables:
61
+
62
+ ```
63
+ GENERATOR_PROVIDER
64
+ VERIFIER_PROVIDER
65
+ REWARD_PROVIDER
66
+ QUESTION_PROVIDER
67
+ ```
68
+
69
+ Providers currently supported:
70
+
71
+ * Ollama
72
+ * OpenAI
73
+ * HTTP endpoint
74
+ * (future) vLLM server
75
+
76
+ ### ✔ Fully tested
77
+
78
+ All pure modules include Vitest coverage:
79
+
80
+ * retrieval (mock + real ES)
81
+ * generator, verifier, reward
82
+ * question generation
83
+ * provider router
84
+ * pipeline integration (mock)
85
+
86
+ ---
87
+
88
+ # **Project Structure**
89
+
90
+ ```
91
+ prompts/
92
+ generator_prompt.txt
93
+ verifier_prompt.txt
94
+ reward_prompt.txt
95
+ question_prompt.txt
96
+
97
+ src/
98
+ pipeline/
99
+ pipeline.mjs
100
+ pipeline_cli.mjs
101
+ providers/
102
+ provider.mjs
103
+ ollama_provider.mjs
104
+ openai_provider.mjs
105
+ http_provider.mjs
106
+ retrieval/
107
+ retrieval.mjs
108
+ generator/
109
+ generator_core.mjs
110
+ verifier/
111
+ verifier_core.mjs
112
+ reward/
113
+ reward_core.mjs
114
+ question/
115
+ question_core.mjs
116
+ question_cli.mjs
117
+
118
+ test_samples/
119
+ seed_questions.jsonl
120
+
121
+ gold/
122
+ (pipeline output)
123
+
124
+ tests/
125
+ *.test.mjs
126
+ ```
127
+
128
+ ---
129
+
130
+ # **Installation**
131
+
132
+ ```bash
133
+ git clone https://github.com/yourname/distill-pipeline
134
+ cd distill-pipeline
135
+ npm install
136
+ ```
137
+
138
+ You also need a running **distill-rag** instance with:
139
+
140
+ * Elasticsearch index
141
+ * embedding server (Ollama or HTTP)
142
+
143
+ ---
144
+
145
+ # **Configuration**
146
+
147
+ All runtime settings are configured via `.env`.
148
+
149
+ A common example:
150
+
151
+ ```env
152
+ # Elasticsearch (from distill-rag)
153
+ ES_NODE=http://localhost:9200
154
+ ES_INDEX=quo_distill_index
155
+
156
+ # Embedding server
157
+ EMBED_URL=http://localhost:11434/api/embeddings
158
+ EMBED_MODEL=mxbai-embed-large
159
+
160
+ # Provider backends
161
+ GENERATOR_PROVIDER=ollama
162
+ VERIFIER_PROVIDER=ollama
163
+ REWARD_PROVIDER=ollama
164
+ QUESTION_PROVIDER=ollama
165
+
166
+ # Stage-specific models
167
+ GENERATOR_MODEL=qwen3-vl:8b-thinking
168
+ VERIFIER_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
169
+ REWARD_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
170
+ QUESTION_MODEL=qwen2.5-7b-instruct
171
+ ```
172
+
173
+ ---
174
+
175
+ # **Running the Pipeline**
176
+
177
+ There are **two seed modes**.
178
+
179
+ ---
180
+
181
+ ## **1. Static Seed Mode** *(default)*
182
+
183
+ Reads questions from:
184
+
185
+ ```
186
+ test_samples/seed_questions.jsonl
187
+ ```
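+
+ Each line is a JSON object with a `question` field (the seed loader also accepts `prompt`, `text`, or a bare string), for example:
+
+ ```
+ {"question": "What is the relationship between love and wisdom?"}
+ {"question": "How can a seeker work with catalyst in daily life?"}
+ ```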
188
+
189
+ Run:
190
+
191
+ ```bash
192
+ npm run pipeline -- --limit 20 --verbose
193
+ ```
194
+
195
+ ---
196
+
197
+ ## **2. Question-First Mode (auto-generate questions)**
198
+
199
+ The pipeline will:
200
+
201
+ * fetch chunks from distill-rag,
202
+ * run question extraction,
203
+ * feed each question into the main pipeline.
204
+
205
+ Enable this mode:
206
+
207
+ ```bash
208
+ PIPELINE_SEED_MODE=question-first npm run pipeline -- --limit 20 --verbose
209
+ ```
210
+
211
+ ---
212
+
213
+ # **Outputs**
214
+
215
+ Accepted samples are written to:
216
+
217
+ ```
218
+ gold/pipeline_gold.jsonl
219
+ ```
220
+
221
+ Each record contains:
222
+
223
+ ```json
224
+ {
225
+ "question": "...",
226
+ "context": [...],
227
+ "sample": { ... },
228
+ "verifier": { ... },
229
+ "reward": { ... }
230
+ }
231
+ ```
232
+
233
+ This file is ready for use in QLoRA SFT training.
234
+
235
+ ---
236
+
237
+ # **Running Tests**
238
+
239
+ ```bash
240
+ npm test
241
+ ```
242
+
243
+ All core logic modules are covered:
244
+
245
+ ```
246
+ 9 test files
247
+ 27 tests
248
+ 0 failures
249
+ ```
250
+
251
+ ---
252
+
253
+ # **How to Extend**
254
+
255
+ ## Add a new model provider
256
+
257
+ Implement:
258
+
259
+ ```js
260
+ class MyProvider {
261
+ constructor(stage) { ... }
262
+ async generate(prompt, opts) { ... }
263
+ }
264
+ ```
265
+
266
+ Then register it in:
267
+
268
+ ```
269
+ src/providers/provider.mjs
270
+ ```
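+
+ A hedged sketch of what a custom provider could look like (the endpoint, response shape, and the exact registration hook in `provider.mjs` are assumptions):
+
+ ```js
+ // my_provider.mjs: a hypothetical provider hitting a local HTTP endpoint.
+ export class MyProvider {
+   constructor(stage) {
+     // Stage-specific model name, e.g. GENERATOR_MODEL, VERIFIER_MODEL, ...
+     this.model = process.env[`${stage.toUpperCase()}_MODEL`];
+   }
+
+   async generate(prompt, opts = {}) {
+     const res = await fetch('http://localhost:8080/generate', {
+       method: 'POST',
+       headers: { 'Content-Type': 'application/json' },
+       body: JSON.stringify({ model: this.model, prompt, ...opts }),
+     });
+     const data = await res.json();
+     return data.text; // must be JSON-parseable text for the calling stage
+   }
+ }
+
+ // In provider.mjs, map a new backend name to the class, e.g.:
+ //   case 'my-backend': return new MyProvider(stage);
+ ```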
271
+
272
+ ## Add a new pipeline stage
273
+
274
+ Follow the existing structure:
275
+
276
+ * create `src/<stage>/<stage>_core.mjs`
277
+ * add prompt in `prompts/`
278
+ * add a test in `tests/`
279
+
280
+ ---
281
+
282
+ # **Development Notes**
283
+
284
+ * Avoid mixing CLI logic with pipeline logic — all pure functions are in `*_core.mjs`.
285
+ * Providers must always return **JSON-parseable** output.
286
+ * Retrieval expects a working **distill-rag** index with BM25 + vector embeddings.
287
+ * Reward model may be swapped later for your custom HTTP reward server.
288
+
289
+ ---
290
+
291
+ # **License**
292
+
293
+ MIT (or update as needed).
294
+
295
+ ---
296
+
297
+ If you want:
298
+
299
+ ✓ a shorter GitHub-friendly description
300
+ ✓ a more polished badge/header section
301
+ ✓ install instructions tailored to your exact environment
302
+ ✓ a separate `USAGE.md`
303
+
304
+ Just ask.
ROADMAP.md CHANGED
@@ -1,12 +1,8 @@
1
- Here’s a clean, structured **ROADMAP.md** that:
2
 
3
- * builds directly on your succession summary
4
- * includes **all** the missing and under-emphasized elements you listed
5
- * remains compact, actionable, and successor-friendly
6
- * avoids tone drift
7
- * is easy to maintain inside the repo
8
 
9
- I kept it in a practical engineering style, with Canadian English and grounded clarity, as always.
10
 
11
  ---
12
 
@@ -14,285 +10,353 @@ I kept it in a practical engineering style, with Canadian English and grounded c
14
 
15
  *distill-rag + distill-pipeline — Project Roadmap*
16
 
17
- This roadmap outlines the current state of the system, upcoming milestones, hardware strategy, and domain-specific notes (Q’uo/Confederation alignment).
18
- It assumes familiarity with the **distill-rag** and **distill-pipeline** repositories.
19
 
20
  ---
21
 
22
- # 1. **Project Overview**
23
 
24
- The system consists of:
25
 
26
- ## **A. distill-rag**
27
 
28
- A complete pipeline for extraction → cleaning → session grouping → chunking → embedding → Elasticsearch hybrid search.
29
 
30
- **Status:** Stable and public.
31
 
32
- ## **B. distill-pipeline**
 
 
 
 
 
 
33
 
34
- A multi-stage distillation ecosystem:
 
 
35
 
36
  ```
37
- retrieval (NEW)
38
- → generator
39
- → verifier
40
- → reward
41
- → gold builder
42
- → training (LoRA/SFT)
43
- → repeat cycles
44
  ```
45
 
46
- **Status:** Under active development.
47
 
48
  ---
49
 
50
- # 2. **Immediate Priorities (0–7 days)**
51
 
52
- ## **2.1 Add retrieval stage to distill-pipeline**
53
 
54
- Implement `src/retrieval.js`:
 
55
 
56
- * Connect to distill-rag’s HTTP hybrid search endpoint.
57
- * Support BM25, dense KNN, and RRF hybrid scores.
58
- * Optional HyDE generation for query expansion.
59
- * Optional multi-vector chunking for transcripts with heavy topic shifts.
 
 
 
 
 
 
60
 
61
- ## **2.2 Refactor generator/verifier/reward into testable functions**
62
 
63
- Current scripts are CLI-oriented.
64
- Introduce pure functions:
 
 
 
 
 
 
 
65
 
66
- ```js
67
- export async function runGenerator(query, context, provider) { ... }
68
- export async function runVerifier(sample, provider) { ... }
69
- export async function runReward(sample, provider) { ... }
 
 
 
 
 
 
 
 
70
  ```
71
 
72
- CLI wrappers remain in place.
73
 
74
- ## **2.3 Introduce Vitest test suite**
 
 
75
 
76
- Tests should cover:
77
 
78
- * retrieval (mock ES)
79
- * generator (mock LLM)
80
- * verifier (structural + Q’uo-tone)
81
- * reward (correct scoring, JSON validity)
82
- * gold-builder (top-k filter, dedup, JSONL append)
83
- * integration test with mocked providers
84
 
85
- Avoid real GPU calls in CI; mock HTTP endpoints.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
86
 
87
  ---
88
 
89
- # 3. **Short-Term (1–3 weeks)**
90
 
91
- ## **3.1 Pluggable model provider system**
92
 
93
- Implement one interface:
94
 
95
  ```js
96
- provider.generate(prompt, options)
97
  ```
98
 
99
- Then adapters:
100
 
101
- * OllamaProvider
102
- * OpenAIProvider
103
- * vLLMProvider ([http://localhost:8000/generate](http://localhost:8000/generate))
104
- * DeepSeek local provider (if applicable)
105
 
106
- This design makes the generator/verifier/reward easy to swap.
107
 
108
- ## **3.2 Q’uo prompt trio integration**
109
 
110
- These domain-specific prompts govern tone, accuracy, and philosophical alignment:
111
 
112
- ### **Generator Prompt**
113
 
114
- * Confederation tone (“We are those of Q’uo…”)
115
- * Humble, gentle, non-authoritarian
116
- * No prediction, no medical or legal claims
117
- * Cite session date when known
118
- * Ground in existing Ra/Q’uo material
119
 
120
- ### **Verifier Prompt**
 
 
 
121
 
122
- * Enforce canonical alignment
123
- * Reject speculation, distortions, or unreferenced claims
124
- * Flag tone mismatch (e.g., assertiveness, ego, “command” language)
125
- * Output strict JSON verdict: pass/fail + reason
126
 
127
- ### **Reward Prompt**
128
 
129
- Score 0–10 on:
130
 
131
- * metaphysical clarity
132
- * faithfulness to Confederation teaching
133
- * lack of hallucination
134
- * gentle and precise tone
135
- * internal coherence
136
 
137
- These scores help filter gold data (top 5–10%).
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
138
 
139
  ---
140
 
141
- # 4. **Hardware Strategy**
 
 
 
 
 
 
 
 
142
 
143
- You run two GPUs:
144
 
145
- * **RTX 3060 (12 GB)** → generator + verifier
146
- ~25–35 tok/s for 7–8B models
147
- Ideal for overnight sample generation (1k–1.5k samples)
148
 
149
- * **RTX 3090 (24 GB)** → reward model + training
150
- ~2 tok/s for 70B reward models
151
- ~4 tok/s for 13B–20B base models
152
- Perfect for Nemotron 70B reward scoring + LoRA training
153
 
154
- **Heuristic:**
155
- A single cycle overnight can generate **1,000–2,000 candidate samples** → filtered to **150–250 gold samples**, enough for a strong **2–4 hour QLoRA session**.
 
 
156
 
157
  ---
158
 
159
- # 5. **Medium-Term (1–2 months)**
160
 
161
- ## **5.1 Bootstrapped distillation loops**
162
 
163
- Automate multiple cycles:
164
 
165
  ```
166
- seed questions → generator → verifier → reward
167
- top-K gold dataset → train student
168
- use student as new generator → repeat
 
 
 
 
 
169
  ```
170
 
171
- Between cycles:
172
 
173
- * generator/verifier weights updated
174
- * high-score samples fed back as new training data
175
- * prompts updated (meta-learning via examples)
176
 
177
- ## **5.2 Enhanced filtering**
178
 
179
  Add:
180
 
181
- * Min perplexity threshold (e.g., ppl < 15)
182
- * LSH (Locality Sensitive Hashing) dedup
183
- * RAG cross-check with distill-rag search:
184
- verify that generated claims appear in authoritative transcripts.
185
- * Hallucination scan via Qwen2.5-72B or Mixtral-Large
 
186
 
187
  ---
188
 
189
- # 6. **Long-Term Goals (2–6 months)**
190
 
191
- ## **6.1 Confed-aligned distilled model**
192
 
193
- A small, accurate, gentle 7B–12B model aligned with Ra and Q’uo material.
 
194
 
195
- Tasks:
 
 
 
196
 
197
- * Expand dataset to all Q’uo transcripts
198
- * Include commentary from L/L Research books
199
- * Add metadata: session dates, question themes, speaker (Q’uo/Latwii)
200
- * Release on HuggingFace with full attribution
201
 
202
- ## **6.2 Real-time contextual distillation**
 
 
 
 
203
 
204
- Use distill-rag to perform:
205
 
206
- * live hybrid search
207
- * contextual distillation based on retrieved chunks
208
- * reflective improvement (teacher critiques student output)
209
- * safety alignment without censorship
 
210
 
211
  ---
212
 
213
- # 7. **Dependencies and Setup**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
214
 
215
- ### **Node.js**
 
 
216
 
217
  * axios
218
- * jsonlines
219
  * dotenv
220
  * vitest
221
- * huggingface-hub (optional)
222
-
223
- ### **Python**
224
 
225
- (for training)
226
 
227
  * transformers
228
- * datasets
229
  * accelerate
230
  * bitsandbytes
 
231
  * peft
232
  * wandb (optional)
233
- * sentencepiece
234
 
235
  ---
236
 
237
- # 8. **Common Issues + Fixes**
238
 
239
- * **OOM in generator**
240
- Switch quantization: `Q4_K_M` → `Q3_K_S`
241
- * **Verifier rejecting correct samples**
242
- Tone prompt too strict → tune thresholds
243
- * **Reward model too slow**
244
- Try smaller reward model except for final scoring pass
245
- * **Bad JSON from generator**
246
- Wrap output with streaming JSON repair (fastjsonrepair)
247
 
248
  ---
249
 
250
- # 9. **Appendix: Example Cycle (Pseudo-Code)**
251
-
252
- ```js
253
- const ctx = await retrieval.hybridSearch(query);
254
-
255
- const candidates = await runGenerator(query, ctx, generatorProvider);
256
-
257
- const verified = await Promise.all(
258
- candidates.map(c => runVerifier(c, verifierProvider))
259
- );
260
-
261
- const scored = await Promise.all(
262
- verified.filter(v => v.pass).map(v => runReward(v.sample, rewardProvider))
263
- );
264
-
265
- const gold = topK(scored, k = 50); // 5–10% of total
266
 
267
- await appendSamplesToJsonl('./gold/dataset.jsonl', gold);
268
- ```
269
-
270
- Training afterward:
271
-
272
- ```bash
273
- python training/train_lora.py --data gold/dataset.jsonl --base qwen2.5-7b
274
- ```
275
 
276
  ---
277
 
278
- # 10. **High-Level Vision**
279
 
280
- The purpose is to build a **truth-aligned**, spiritually coherent distilled model rooted in Confederation teachings, free from institutional censorship and able to respond gently and helpfully.
281
 
282
- The system enables:
283
 
284
- * transparent lineage
285
- * easy updates when new transcripts appear
286
- * clear tone alignment
287
- * practical local training on accessible GPUs
288
- * RAG-augmented accuracy using authoritative sources
289
 
290
  ---
291
 
292
  If you want, I can also generate:
293
 
294
- **✓ an `architecture.md`**
295
- **✓ a Mermaid diagram**
296
- **✓ a cleaned-up successor prompt**
297
 
298
- Just tell me.
 
1
+ Here you go — a fully updated **ROADMAP.md**, incorporating your request to modularize the pipeline, add tests per module, make successor handoff seamless, and capture everything you’ve built and planned up to this point.
2
 
3
+ I kept the style grounded, clear, and Canadian-English, and ensured it matches what’s actually happening in the repo.
 
 
 
 
4
 
5
+ You can drop this straight into the repo root.
6
 
7
  ---
8
 
 
10
 
11
  *distill-rag + distill-pipeline — Project Roadmap*
12
 
13
+ This roadmap defines the current state, upcoming milestones, technical direction, modularization plan, and long-term vision of the system.
14
+ It is written for a future maintainer (“successor”) who needs to understand the architecture quickly.
15
 
16
  ---
17
 
18
+ # **1. System Overview**
19
 
20
+ The project consists of two coordinated repositories:
21
 
22
+ ### **A. distill-rag**
23
 
24
+ A full ingestion + indexing system for Q’uo/Ra transcripts and related materials.
25
 
26
+ Pipeline:
27
 
28
+ ```
29
+ extract → clean → session group → chunk → embed → ES index → hybrid search HTTP API
30
+ ```
31
+
32
+ **Status:** Stable, production-ready.
33
+
34
+ ---
35
 
36
+ ### **B. distill-pipeline**
37
+
38
+ A multi-stage data-generation and distillation system:
39
 
40
  ```
41
+ (question-generation) → retrieval → generator → verifier → reward → gold → training → repeat
 
 
 
 
 
 
42
  ```
43
 
44
+ **Status:** Actively developed, now modular, test-covered, and extendable.
45
 
46
  ---
47
 
48
+ # **2. Immediate Priorities (0–7 days)**
49
 
50
+ ## **2.1 Fully Modularize `pipeline/` (critical)**
51
 
52
+ The current `pipeline.mjs` is too large for safe updates.
53
+ Break it into these modules:
54
 
55
+ ```
56
+ src/pipeline/
57
+ pipeline.mjs (orchestrator only)
58
+ retrieval_stage.mjs (retrieval logic)
59
+ generator_stage.mjs (calls runGenerator)
60
+ verifier_stage.mjs (calls runVerifier)
61
+ reward_stage.mjs (calls runReward)
62
+ seeds.mjs (loading + dynamic QG)
63
+ gold_writer.mjs (appendGoldRecord)
64
+ ```
65
 
66
+ Tests per module:
67
 
68
+ ```
69
+ tests/pipeline/
70
+ test_retrieval_stage.mjs
71
+ test_generator_stage.mjs
72
+ test_verifier_stage.mjs
73
+ test_reward_stage.mjs
74
+ test_gold_writer.mjs
75
+ test_integration_mocked.mjs
76
+ ```
77
 
78
+ Outcome:
79
+ The orchestrator becomes a clean 80–120 line file, easy to modify without clobbering.
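+
+ A rough sketch of the target shape of the orchestrator after the split (module and function names follow the plan above; none of them exist yet, and the statuses mirror the current `runPipelineStep`):
+
+ ```js
+ // pipeline.mjs (orchestrator only): planned shape, not current code.
+ import { retrieveContext } from './retrieval_stage.mjs';
+ import { generate } from './generator_stage.mjs';
+ import { verify } from './verifier_stage.mjs';
+ import { score } from './reward_stage.mjs';
+
+ export async function runPipelineStep({ question, verbose }) {
+   const context = await retrieveContext(question, { verbose });
+   const gen = await generate(question, context, { verbose });
+
+   const ver = await verify(gen, { verbose });
+   if (!ver.ok) return { status: 'verifier_rejected', question, context, gen, ver };
+
+   const rew = await score(gen, { verbose });
+   if (!rew.ok) return { status: 'reward_rejected', question, context, gen, ver, rew };
+
+   return { status: 'accepted', question, context, gen, ver, rew };
+ }
+ ```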
80
+
81
+ ---
82
+
83
+ ## **2.2 Pipeline entry mode: content-first**
84
+
85
+ Current limiting factor: static seed questions.
86
+ Pipeline must start with:
87
+
88
+ ```
89
+ chunk → question generation → retrieval over chunk → generator → …
90
  ```
91
 
92
+ Implement:
93
 
94
+ ```
95
+ PIPELINE_SEED_MODE = 'question-first' | 'static'
96
+ ```
97
 
98
+ Default: **question-first**.
99
 
100
+ ---
 
 
 
 
 
101
 
102
+ ## **2.3 Improve verbosity + telemetry**
103
+
104
+ Add structured logs:
105
+
106
+ ```
107
+ [pipeline] question:
108
+ [pipeline] retrieval:
109
+ [pipeline] generator:
110
+ [pipeline] verifier:
111
+ [pipeline] reward:
112
+ [pipeline] accepted/rejected:
113
+ ```
114
+
115
+ Make it possible to:
116
+
117
+ ```
118
+ npm run pipeline:verbose
119
+ ```
120
+
121
+ and see *exactly* what each model returned.
122
 
123
  ---
124
 
125
+ # **3. Short-Term (1–3 weeks)**
126
 
127
+ ## **3.1 Provider System (done but expand)**
128
 
129
+ Abstract interface:
130
 
131
  ```js
132
+ provider.generate(prompt, { temperature?, system?, format? })
133
  ```
134
 
135
+ Adapters:
136
 
137
+ * OllamaProvider (primary local backend)
138
+ * OpenAIProvider (for debugging)
139
+ * HttpProvider (for external reward servers)
140
+ * vLLMProvider (gpu-http inference)
141
 
142
+ Goal: any backend can be plugged in without touching pipeline code.
143
 
144
+ ---
145
 
146
+ ## **3.2 Question Generation Refinement**
147
 
148
+ Your QG model must be reliable, JSON-clean, and chunk-aware.
149
 
150
+ Add:
 
 
 
 
151
 
152
+ * fastjsonrepair or equivalent fallback
153
+ * retry logic if parsing fails
154
+ * score-based filtering of bad questions
155
+ * deduplication (Levenshtein or minhash)
156
 
157
+ ---
 
 
 
158
 
159
+ ## **3.3 Verifier + Reward spec finalization**
160
 
161
+ Define strict JSON schemas:
162
 
163
+ ### Verifier output
 
 
 
 
164
 
165
+ ```json
166
+ {
167
+ "ok": true,
168
+ "reason": "string",
169
+ "alignment": {
170
+ "tone": 8,
171
+ "accuracy": 7,
172
+ "faithfulness": 9
173
+ }
174
+ }
175
+ ```
176
+
177
+ ### Reward output
178
+
179
+ ```json
180
+ {
181
+ "ok": true,
182
+ "score": 8.3,
183
+ "dimensions": {
184
+ "clarity": 8,
185
+ "faithfulness": 9,
186
+ "gentleness": 10,
187
+ "hallucination": 0
188
+ }
189
+ }
190
+ ```
191
+
192
+ This gives you numerical hooks for downstream filtering.
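+
+ For example, a downstream filter over gold records might look like this (field names follow the schemas above; the thresholds are arbitrary placeholders):
+
+ ```js
+ // Keep only records whose reward output clears basic quality gates.
+ // `records` is an array of gold records as written by the gold writer.
+ export function filterGold(records) {
+   return records.filter(({ reward }) => (
+     reward.score >= 8 &&
+     reward.dimensions.hallucination === 0 &&
+     reward.dimensions.faithfulness >= 8
+   ));
+ }
+ ```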
193
 
194
  ---
195
 
196
+ # **4. Hardware Strategy**
197
+
198
+ Based on your setup:
199
+
200
+ ### **RTX 3090 (24 GB)**
201
+
202
+ * heavy reward model scoring (e.g., Nemotron 70B, Skywork 32B)
203
+ * LoRA/Q-LoRA training
204
+ * batch generation if needed
205
 
206
+ ### **RTX 3060 (12 GB)**
207
 
208
+ * generator (8B–14B models)
209
+ * verifier (7B–8B models)
210
+ * embeddings (mxbai-embed-large)
211
 
212
+ Nightly cycle can produce:
 
 
 
213
 
214
+ * **1,000–1,500 candidates**
215
+ * **150–250 gold samples**
216
+
217
+ Good for a **2–4 hr QLoRA** run per day.
218
 
219
  ---
220
 
221
+ # **5. Medium-Term (1–2 months)**
222
 
223
+ ## **5.1 Fully automated bootstrap loops**
224
 
225
+ End-to-end automation:
226
 
227
  ```
228
+ 1. Ingest new transcripts (distill-rag)
229
+ 2. QG to produce fresh questions
230
+ 3. Retrieval + pipeline generation
231
+ 4. Filtering (verifier + reward + PPL check)
232
+ 5. Append to gold dataset
233
+ 6. Train new LoRA
234
+ 7. Replace generator with improved student
235
+ 8. Repeat
236
  ```
237
 
238
+ Each iteration improves tone, accuracy, and alignment.
239
 
240
+ ---
 
 
241
 
242
+ ## **5.2 Advanced Filtering**
243
 
244
  Add:
245
 
246
+ * perplexity scoring via llama.cpp or vLLM
247
+ * RAG cross-verification (every claim must appear in indexed Q’uo text)
248
+ * semantic deduplication (minhash / LSH)
249
+ * large-model critic pass (Qwen2.5-72B, Mixtral-Large)
250
+
251
+ Goal: **zero hallucination** and **full Confederation tone integrity**.
252
 
253
  ---
254
 
255
+ # **6. Long-Term (2–6 months)**
256
 
257
+ ## **6.1 The "Confed-aligned" distilled model**
258
 
259
+ Target:
260
+ A 7B–12B model aligned with:
261
 
262
+ * Ra Material
263
+ * Q’uo transcripts
264
+ * L/L Research books and commentary
265
+ * supporting Confederation entities
266
 
267
+ Properties:
 
 
 
268
 
269
+ * gentle
270
+ * humble
271
+ * grounded in free will
272
+ * non-authoritarian
273
+ * thoughtful, careful, and precise
274
 
275
+ Releases:
276
 
277
+ * base model
278
+ * LoRA for tone
279
+ * merged ckpt
280
+ * GGUF for desktop
281
+ * HuggingFace dataset + card
282
 
283
  ---
284
 
285
+ ## **6.2 Real-time Distillation**
286
+
287
+ Combine:
288
+
289
+ * hybrid search at runtime
290
+ * small distilled model
291
+ * reward-model reflections
292
+
293
+ This gives:
294
+
295
+ ```
296
+ prompt → rag retrieve → distilled answer → critic → improved final answer
297
+ ```
298
+
299
+ A gentler, safer, cleaner chatbot.
300
+
301
+ ---
302
 
303
+ # **7. Dependencies**
304
+
305
+ ## *Node.js*
306
 
307
  * axios
 
308
  * dotenv
309
  * vitest
310
+ * fastjsonrepair
311
+ * jsonlines
 
312
 
313
+ ## *Python (training)*
314
 
315
  * transformers
 
316
  * accelerate
317
  * bitsandbytes
318
+ * datasets
319
  * peft
320
  * wandb (optional)
 
321
 
322
  ---
323
 
324
+ # **8. Common Issues + Fixes**
325
 
326
+ * **Verifier rejects everything** → relax tone threshold; adjust prompt.
327
+ * **Reward is too slow** → use a smaller reward model for interim passes.
328
+ * **Generator JSON malformed** → enable repair + retries.
329
+ * **Low acceptance rate** → improve question generation; improve retrieval.
330
+ * **GPU OOM** → use Q3_K_S; reduce max tokens.
 
 
 
331
 
332
  ---
333
 
334
+ # **9. Successor Notes**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
335
 
336
+ * Codebase is fully modular.
337
+ * Every module must have a test.
338
+ * Never hardcode model names.
339
+ * All LLM calls must return strict JSON.
340
+ * Always run verbose mode when debugging generation.
341
+ * distill-pipeline should remain side-effect-free except at gold-writer stage.
342
+ * Keep prompts in `./prompts/` and version them carefully.
 
343
 
344
  ---
345
 
346
+ # **10. High-Level Vision**
347
 
348
+ This system exists to build a **truth-aligned**, **gentle**, **non-dogmatic**, and **service-to-others oriented** distilled model rooted in Confederation teachings.
349
 
350
+ It is open, transparent, local, and self-hosted — no dependencies on corporate censorship, filters, or forced model updates.
351
 
352
+ This roadmap exists so the next builder can continue the work with clarity and confidence.
 
 
 
 
353
 
354
  ---
355
 
356
  If you want, I can also generate:
357
 
358
+ **✓ architecture diagrams**
359
+ **✓ module scaffolding for the pipeline split**
360
+ **✓ successor instructions / handoff document**
361
 
362
+ Just say the word.
package.json CHANGED
@@ -5,7 +5,7 @@
5
  "scripts": {
6
  "test": "vitest --run",
7
  "pipeline": "node ./src/pipeline/pipeline_cli.js",
8
- "qgen": "node ./src/question/question_cli.mjs"
9
  },
10
  "devDependencies": {
11
  "vitest": "^1.6.0"
 
5
  "scripts": {
6
  "test": "vitest --run",
7
  "pipeline": "node ./src/pipeline/pipeline_cli.js",
8
+ "pipeline:qg": "PIPELINE_SEED_MODE=question-first node ./src/pipeline/pipeline_cli.js --verbose"
9
  },
10
  "devDependencies": {
11
  "vitest": "^1.6.0"
prompts/question_prompt.txt CHANGED
@@ -2,17 +2,20 @@ You are a dataset-creation assistant.
2
 
3
  You will be given a CONTEXT CHUNK of text from a larger corpus.
4
 
5
- Your task:
6
 
7
  1. Read the context carefully.
8
- 2. Write up to {{MAX_QUESTIONS}} diverse, high-quality questions
9
- that can be answered ONLY from this context.
10
- 3. Prefer questions that:
11
- - are conceptually interesting,
12
- - require some reasoning or synthesis within the chunk,
13
- - are answerable without outside knowledge.
 
 
 
14
 
15
- Output STRICTLY in JSON with this shape:
16
 
17
  {
18
  "questions": [
@@ -22,9 +25,10 @@ Output STRICTLY in JSON with this shape:
22
  ]
23
  }
24
 
25
- Do NOT include answers in the JSON. Only questions.
26
 
27
  ---
28
  CONTEXT START
29
  {{CONTEXT}}
30
  CONTEXT END
 
 
2
 
3
  You will be given a CONTEXT CHUNK of text from a larger corpus.
4
 
5
+ Your goals:
6
 
7
  1. Read the context carefully.
8
+ 2. Generate up to {{MAX_QUESTIONS}} diverse, high-quality questions
9
+ that can be answered ONLY using information found inside the context.
10
+ 3. Produce questions that:
11
+ - focus strictly on the content of the chunk,
12
+ - avoid hallucinating any information not present,
13
+ - require comprehension, reasoning, or synthesis across the chunk,
14
+ - vary naturally in difficulty (some simple, some deeper),
15
+ - avoid meta or speculative questions,
16
+ - avoid yes/no questions unless they are meaningful.
17
 
18
+ Output STRICTLY this JSON structure:
19
 
20
  {
21
  "questions": [
 
25
  ]
26
  }
27
 
28
+ Do NOT include answers. Do NOT add any fields. JSON only.
29
 
30
  ---
31
  CONTEXT START
32
  {{CONTEXT}}
33
  CONTEXT END
34
+
src/pipeline/batch.mjs ADDED
@@ -0,0 +1,259 @@
1
+ // src/pipeline/batch.mjs
2
+ import fs from 'fs/promises';
3
+ import path from 'path';
4
+
5
+ import { preview } from './util.mjs';
6
+ import {
7
+ DEFAULT_SEEDS_PATH,
8
+ DEFAULT_OUT_PATH,
9
+ loadSeedQuestions,
10
+ seedToQuestion,
11
+ seedToContextText,
12
+ } from './seeds.mjs';
13
+ import { runPipelineStep } from './step.mjs';
14
+ import { loadProviderFor } from '../providers/provider.mjs';
15
+ import { runQuestionGenerator } from '../question/question_core.mjs';
16
+
17
+ /**
18
+ * Append a single accepted record to a JSONL file.
19
+ */
20
+ export async function appendGoldRecord(outPath, record) {
21
+ const line = JSON.stringify(record) + '\n';
22
+ await fs.mkdir(path.dirname(outPath), { recursive: true });
23
+ await fs.appendFile(outPath, line, 'utf8');
24
+ }
25
+
26
+ /**
27
+ * Run the pipeline over a batch of seeds and write accepted
28
+ * samples to a JSONL file.
29
+ *
30
+ * Modes:
31
+ * - static (default): seeds are questions (current behaviour)
32
+ * - question-first: seeds are chunks; we first generate questions
33
+ *
34
+ * Options:
35
+ * - seedsPath: JSONL of seeds (defaults to test_samples/seed_questions.jsonl)
36
+ * - outPath: output JSONL (defaults to gold/pipeline_gold.jsonl)
37
+ * - limit: max number of seeds to process
38
+ * - verbose: extra per-stage logging
39
+ * - logger: optional logger (defaults to console)
40
+ * - seedMode: 'static' | 'question-first' (or PIPELINE_SEED_MODE env)
41
+ *
42
+ * Returns:
43
+ * {
44
+ * mode,
45
+ * total, // number of seed lines
46
+ * processed, // number of questions run through pipeline
47
+ * accepted,
48
+ * outPath,
49
+ * statusCounts,
50
+ * processedSeeds?, // only meaningful in question-first
51
+ * processedQuestions?, // alias for processed in question-first
52
+ * }
53
+ */
54
+ export async function runPipelineBatch({
55
+ seedsPath = DEFAULT_SEEDS_PATH,
56
+ outPath = DEFAULT_OUT_PATH,
57
+ limit,
58
+ verbose = false,
59
+ logger = console,
60
+ seedMode = process.env.PIPELINE_SEED_MODE || 'static',
61
+ } = {}) {
62
+ const log = logger?.log?.bind(logger) || console.log;
63
+ const errLog = logger?.error?.bind(logger) || console.error;
64
+
65
+ const seeds = await loadSeedQuestions(seedsPath);
66
+ const maxSeeds = typeof limit === 'number' ? limit : seeds.length;
67
+
68
+ let processed = 0; // number of questions sent through runPipelineStep
69
+ let accepted = 0;
70
+ const statusCounts = {};
71
+
72
+ // ----------------------------------------
73
+ // MODE 1: existing behaviour (static questions)
74
+ // ----------------------------------------
75
+ if (seedMode === 'static') {
76
+ for (let idx = 0; idx < maxSeeds; idx++) {
77
+ const seed = seeds[idx];
78
+ const question = seedToQuestion(seed);
79
+ const label = `[${idx + 1}/${maxSeeds}]`;
80
+
81
+ log(`→ ${label} Running pipeline for: "${question}"`);
82
+
83
+ try {
84
+ const result = await runPipelineStep({
85
+ question,
86
+ verbose,
87
+ logger,
88
+ });
89
+
90
+ processed += 1;
91
+ statusCounts[result.status] =
92
+ (statusCounts[result.status] || 0) + 1;
93
+
94
+ if (verbose) {
95
+ log(` ↳ status: ${result.status}`);
96
+ }
97
+
98
+ if (result.status === 'accepted') {
99
+ const record = {
100
+ question,
101
+ context: result.context,
102
+ sample: result.gen, // generator output
103
+ verifier: result.ver,
104
+ reward: result.rew,
105
+ };
106
+
107
+ await appendGoldRecord(outPath, record);
108
+ accepted += 1;
109
+ }
110
+ } catch (e) {
111
+ const msg = e?.message || String(e);
112
+ processed += 1;
113
+ statusCounts.pipeline_error =
114
+ (statusCounts.pipeline_error || 0) + 1;
115
+ errLog(' [pipeline] ERROR:', msg);
116
+ }
117
+ }
118
+
119
+ return {
120
+ mode: 'static',
121
+ total: seeds.length,
122
+ processed,
123
+ accepted,
124
+ outPath,
125
+ statusCounts,
126
+ };
127
+ }
128
+
129
+ // ----------------------------------------
130
+ // MODE 2: question-first (generate Qs from chunks)
131
+ // ----------------------------------------
132
+ if (seedMode === 'question-first') {
133
+ const questionProvider = loadProviderFor('question');
134
+ const maxQuestionsPerChunk = Number(process.env.QUESTION_MAX || '5');
135
+
136
+ let processedSeeds = 0;
137
+
138
+ for (let idx = 0; idx < maxSeeds; idx++) {
139
+ const seed = seeds[idx];
140
+ const label = `[seed ${idx + 1}/${maxSeeds}]`;
141
+
142
+ const contextText = seedToContextText(seed);
143
+ if (!contextText || !contextText.trim()) {
144
+ if (verbose) {
145
+ log(`${label} context is empty, skipping`);
146
+ }
147
+ continue;
148
+ }
149
+
150
+ processedSeeds += 1;
151
+
152
+ if (verbose) {
153
+ log(`\n🧩 ${label} generating questions from chunk…`);
154
+ log(
155
+ ' [question] chunk preview:\n ' +
156
+ preview(contextText, 300).replace(/\n/g, '\n '),
157
+ );
158
+ log(
159
+ ` [question] using provider="question" maxQuestions=${maxQuestionsPerChunk}`,
160
+ );
161
+ }
162
+
163
+ // 1) generate questions from the chunk
164
+ let qResult;
165
+ try {
166
+ qResult = await runQuestionGenerator(
167
+ contextText,
168
+ questionProvider,
169
+ { maxQuestions: maxQuestionsPerChunk },
170
+ );
171
+ } catch (e) {
172
+ const msg = e?.message || String(e);
173
+ statusCounts.question_error =
174
+ (statusCounts.question_error || 0) + 1;
175
+ if (verbose) {
176
+ errLog(' [question] ERROR:', msg);
177
+ }
178
+ continue;
179
+ }
180
+
181
+ const questions = qResult?.questions || [];
182
+
183
+ if (verbose) {
184
+ log(
185
+ ` [question] generated ${questions.length} question(s) from this chunk`,
186
+ );
187
+ if (questions.length > 0) {
188
+ log(
189
+ ' [question] first question: "' +
190
+ preview(questions[0], 200) +
191
+ '"',
192
+ );
193
+ }
194
+ }
195
+
196
+ // 2) run full pipeline for each generated question
197
+ for (const q of questions) {
198
+ if (!q || !q.trim()) continue;
199
+
200
+ const qLabel = `[q ${processed + 1}]`;
201
+ log(` → ${qLabel} Running pipeline for generated question: "${q}"`);
202
+
203
+ try {
204
+ const result = await runPipelineStep({
205
+ question: q,
206
+ verbose,
207
+ logger,
208
+ });
209
+
210
+ processed += 1;
211
+ statusCounts[result.status] =
212
+ (statusCounts[result.status] || 0) + 1;
213
+
214
+ if (verbose) {
215
+ log(` ↳ status: ${result.status}`);
216
+ }
217
+
218
+ if (result.status === 'accepted') {
219
+ const record = {
220
+ question: q,
221
+ sourceSeed: seed, // keep origin of the question
222
+ sourceChunk: contextText, // raw chunk we asked about
223
+ context: result.context,
224
+ sample: result.gen,
225
+ verifier: result.ver,
226
+ reward: result.rew,
227
+ };
228
+
229
+ await appendGoldRecord(outPath, record);
230
+ accepted += 1;
231
+
232
+ if (verbose) {
233
+ log(' ✓ accepted and written to gold JSONL');
234
+ }
235
+ }
236
+ } catch (e) {
237
+ const msg = e?.message || String(e);
238
+ processed += 1;
239
+ statusCounts.pipeline_error =
240
+ (statusCounts.pipeline_error || 0) + 1;
241
+ errLog(' [pipeline] ERROR:', msg);
242
+ }
243
+ }
244
+ }
245
+
246
+ return {
247
+ mode: 'question-first',
248
+ total: seeds.length,
249
+ processed, // number of questions processed
250
+ processedSeeds, // how many chunks we actually used
251
+ processedQuestions: processed,
252
+ accepted,
253
+ outPath,
254
+ statusCounts,
255
+ };
256
+ }
257
+
258
+ throw new Error(`Unknown PIPELINE_SEED_MODE: ${seedMode}`);
259
+ }
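
Taken together with the static mode above, the batch runner now supports two seed strategies selected by `PIPELINE_SEED_MODE`. Below is a minimal invocation sketch; it assumes `runPipelineBatch` keeps the same option names as the previous implementation (`seedsPath`, `outPath`, `limit`, `verbose`) and that the calling script sits in the repo root, so the paths and env defaults are illustrative only. Accepted samples are appended to the gold JSONL as `{ question, context, sample, verifier, reward }` records, with `sourceSeed`/`sourceChunk` added in question-first mode.

```
// sketch only: batch run in question-first mode (option names and paths assumed)
import { runPipelineBatch } from './src/pipeline/pipeline.mjs';

process.env.PIPELINE_SEED_MODE = 'question-first'; // or 'static'
process.env.QUESTION_MAX = '3';                    // questions generated per chunk

const summary = await runPipelineBatch({
  seedsPath: 'test_samples/seed_questions.jsonl',  // chunks in question-first mode
  outPath: 'gold/pipeline_gold.jsonl',
  limit: 10,       // process at most 10 seeds
  verbose: true,
});

// e.g. { mode, total, processed, processedSeeds, processedQuestions, accepted, outPath, statusCounts }
console.log(summary.statusCounts);
```
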
src/pipeline/pipeline.mjs CHANGED
@@ -1,321 +1,9 @@
1
  // src/pipeline/pipeline.mjs
2
- import fs from 'fs/promises';
3
- import path from 'path';
4
- import { fileURLToPath } from 'url';
5
-
6
- import { loadProviderFor } from '../providers/provider.mjs';
7
- import { hybridSearch } from '../retrieval/retrieval.mjs';
8
- import { runGenerator } from '../generator/generator_core.mjs';
9
- import { runVerifier } from '../verifier/verifier_core.mjs';
10
- import { runReward } from '../reward/reward_core.mjs';
11
-
12
- const __filename = fileURLToPath(import.meta.url);
13
- const __dirname = path.dirname(__filename);
14
- const PROJECT_ROOT = path.join(__dirname, '..', '..');
15
-
16
- const DEFAULT_SEEDS_PATH = path.join(
17
- PROJECT_ROOT,
18
- 'test_samples',
19
- 'seed_questions.jsonl',
20
- );
21
-
22
- const DEFAULT_OUT_PATH = path.join(
23
- PROJECT_ROOT,
24
- 'gold',
25
- 'pipeline_gold.jsonl',
26
- );
27
-
28
- function preview(value, max = 400) {
29
- if (value == null) return '';
30
- let str = typeof value === 'string' ? value : JSON.stringify(value, null, 2);
31
- if (str.length > max) {
32
- return str.slice(0, max) + `… [truncated ${str.length - max} chars]`;
33
- }
34
- return str;
35
- }
36
-
37
- /**
38
- * Load JSONL seed questions.
39
- * Each line may be:
40
- * - { "question": "..." }
41
- * - { "prompt": "..." }
42
- * - { "text": "..." }
43
- * - or just a raw string
44
- */
45
- export async function loadSeedQuestions(seedsPath = DEFAULT_SEEDS_PATH) {
46
- const txt = await fs.readFile(seedsPath, 'utf8');
47
- return txt
48
- .split('\n')
49
- .map((l) => l.trim())
50
- .filter(Boolean)
51
- .map((line) => JSON.parse(line));
52
- }
53
-
54
- /**
55
- * Extract a question string from a seed record.
56
- */
57
- export function seedToQuestion(seed) {
58
- if (typeof seed === 'string') return seed;
59
- return seed.question || seed.prompt || seed.text || '';
60
- }
61
-
62
- /**
63
- * Run a single pipeline step for one question.
64
- *
65
- * Orchestrates:
66
- * retrieval → generator → verifier → reward
67
- *
68
- * Returns a structured result:
69
- * {
70
- * status: 'accepted' | 'invalid_question' | 'retrieval_failed'
71
- * | 'generator_failed' | 'verifier_rejected'
72
- * | 'reward_rejected' | 'verifier_error' | 'reward_error',
73
- * question,
74
- * context,
75
- * gen,
76
- * ver,
77
- * rew,
78
- * error? // optional message
79
- * }
80
- */
81
- export async function runPipelineStep({
82
- question,
83
- retrievalMode = process.env.RETRIEVAL_MODE || 'hybrid',
84
- k = Number(process.env.RETRIEVAL_K || '6'),
85
- generatorProvider,
86
- verifierProvider,
87
- rewardProvider,
88
- verbose = false,
89
- logger = console,
90
- } = {}) {
91
- const log = logger?.log?.bind(logger) || console.log;
92
- const errLog = logger?.error?.bind(logger) || console.error;
93
-
94
- if (!question || !question.trim()) {
95
- if (verbose) log(' [pipeline] empty / invalid question, skipping');
96
- return { status: 'invalid_question', question };
97
- }
98
-
99
- const genProv = generatorProvider || loadProviderFor('generator');
100
- const verProv = verifierProvider || loadProviderFor('verifier');
101
- const rewProv = rewardProvider || loadProviderFor('reward');
102
-
103
- // --- Retrieval ---
104
- let context = [];
105
- try {
106
- if (verbose) log(` [retrieval] mode=${retrievalMode} k=${k}`);
107
- context = await hybridSearch(question, k);
108
- if (verbose) {
109
- log(` [retrieval] got ${context.length} chunks`);
110
- if (context.length > 0) {
111
- const first = context[0]?.content ?? '';
112
- log(' [retrieval] first chunk:');
113
- log(' ' + preview(first, 200).replace(/\n/g, '\n '));
114
- }
115
- }
116
- } catch (e) {
117
- const msg = e?.message || String(e);
118
- if (verbose) errLog(' [retrieval] ERROR:', msg);
119
- return {
120
- status: 'retrieval_failed',
121
- question,
122
- error: msg,
123
- };
124
- }
125
-
126
- // --- Generator ---
127
- let gen;
128
- try {
129
- if (verbose) log(' [generator] calling model…');
130
- // NOTE: runGenerator(query, contextChunks, provider)
131
- gen = await runGenerator(question, context, genProv);
132
- if (verbose) {
133
- log(' [generator] raw:');
134
- log(' ' + preview(gen.raw ?? '', 400).replace(/\n/g, '\n '));
135
- log(' [generator] parsed:');
136
- log(' ' + preview(gen.parsed, 400).replace(/\n/g, '\n '));
137
- }
138
- } catch (e) {
139
- const msg = e?.message || String(e);
140
- if (verbose) errLog(' [generator] ERROR:', msg);
141
- return {
142
- status: 'generator_failed',
143
- question,
144
- context,
145
- error: msg,
146
- };
147
- }
148
-
149
- // --- Verifier ---
150
- let ver;
151
- try {
152
- if (verbose) log(' [verifier] calling model…');
153
- // NOTE: runVerifier(sample, provider)
154
- ver = await runVerifier(gen, verProv);
155
- if (verbose) {
156
- log(' [verifier] parsed:');
157
- log(' ' + preview(ver.parsed, 400).replace(/\n/g, '\n '));
158
- log(` [verifier] ok=${ver.ok === true}`);
159
- }
160
- } catch (e) {
161
- const msg = e?.message || String(e);
162
- if (verbose) errLog(' [verifier] ERROR:', msg);
163
- return {
164
- status: 'verifier_error',
165
- question,
166
- context,
167
- gen,
168
- error: msg,
169
- };
170
- }
171
-
172
- if (!ver || ver.ok !== true) {
173
- if (verbose) log(' [verifier] rejected sample');
174
- return {
175
- status: 'verifier_rejected',
176
- question,
177
- context,
178
- gen,
179
- ver,
180
- };
181
- }
182
-
183
- // --- Reward ---
184
- let rew;
185
- try {
186
- if (verbose) log(' [reward] calling model…');
187
- // NOTE: runReward(sample, provider)
188
- rew = await runReward(gen, rewProv);
189
- if (verbose) {
190
- log(' [reward] parsed:');
191
- log(' ' + preview(rew.parsed, 400).replace(/\n/g, '\n '));
192
- log(` [reward] score=${rew.score} ok=${rew.ok}`);
193
- }
194
- } catch (e) {
195
- const msg = e?.message || String(e);
196
- if (verbose) errLog(' [reward] ERROR:', msg);
197
- return {
198
- status: 'reward_error',
199
- question,
200
- context,
201
- gen,
202
- ver,
203
- error: msg,
204
- };
205
- }
206
-
207
- if (!rew || rew.ok !== true) {
208
- if (verbose) log(' [reward] rejected sample');
209
- return {
210
- status: 'reward_rejected',
211
- question,
212
- context,
213
- gen,
214
- ver,
215
- rew,
216
- };
217
- }
218
-
219
- if (verbose) log(' [pipeline] accepted ✅');
220
-
221
- return {
222
- status: 'accepted',
223
- question,
224
- context,
225
- gen,
226
- ver,
227
- rew,
228
- };
229
- }
230
-
231
- /**
232
- * Append a single accepted record to a JSONL file.
233
- */
234
- export async function appendGoldRecord(outPath, record) {
235
- const line = JSON.stringify(record) + '\n';
236
- await fs.mkdir(path.dirname(outPath), { recursive: true });
237
- await fs.appendFile(outPath, line, 'utf8');
238
- }
239
-
240
- /**
241
- * Run the pipeline over a batch of seed questions and write accepted
242
- * samples to a JSONL file.
243
- *
244
- * Options:
245
- * - seedsPath: JSONL of seeds (defaults to test_samples/seed_questions.jsonl)
246
- * - outPath: output JSONL (defaults to gold/pipeline_gold.jsonl)
247
- * - limit: max number of seeds to process
248
- * - verbose: extra per-stage logging
249
- * - logger: optional logger (defaults to console)
250
- *
251
- * Returns:
252
- * { total, processed, accepted, outPath, statusCounts }
253
- */
254
- export async function runPipelineBatch({
255
- seedsPath = DEFAULT_SEEDS_PATH,
256
- outPath = DEFAULT_OUT_PATH,
257
- limit,
258
- verbose = false,
259
- logger = console,
260
- } = {}) {
261
- const log = logger?.log?.bind(logger) || console.log;
262
- const errLog = logger?.error?.bind(logger) || console.error;
263
-
264
- const seeds = await loadSeedQuestions(seedsPath);
265
- const max = typeof limit === 'number' ? limit : seeds.length;
266
-
267
- let processed = 0;
268
- let accepted = 0;
269
- const statusCounts = {};
270
-
271
- for (let idx = 0; idx < max; idx++) {
272
- const seed = seeds[idx];
273
- const question = seedToQuestion(seed);
274
- const label = `[${idx + 1}/${max}]`;
275
-
276
- log(`→ ${label} Running pipeline for: "${question}"`);
277
-
278
- try {
279
- const result = await runPipelineStep({
280
- question,
281
- verbose,
282
- logger,
283
- });
284
-
285
- processed += 1;
286
- statusCounts[result.status] =
287
- (statusCounts[result.status] || 0) + 1;
288
-
289
- if (verbose) {
290
- log(` ↳ status: ${result.status}`);
291
- }
292
-
293
- if (result.status === 'accepted') {
294
- const record = {
295
- question,
296
- context: result.context,
297
- sample: result.gen, // generator output
298
- verifier: result.ver,
299
- reward: result.rew,
300
- };
301
-
302
- await appendGoldRecord(outPath, record);
303
- accepted += 1;
304
- }
305
- } catch (e) {
306
- const msg = e?.message || String(e);
307
- processed += 1;
308
- statusCounts.pipeline_error =
309
- (statusCounts.pipeline_error || 0) + 1;
310
- errLog(' [pipeline] ERROR:', msg);
311
- }
312
- }
313
-
314
- return {
315
- total: seeds.length,
316
- processed,
317
- accepted,
318
- outPath,
319
- statusCounts,
320
- };
321
- }
 
1
  // src/pipeline/pipeline.mjs
2
+ // Thin façade that exposes the public pipeline API by re-exporting
3
+ // from the internal modules. This keeps imports stable while the
4
+ // implementation is split into smaller files.
5
+
6
+ export * from './util.mjs';
7
+ export * from './seeds.mjs';
8
+ export * from './step.mjs';
9
+ export * from './batch.mjs';
 
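
Because the façade only re-exports, downstream imports should not need to change. A small sketch (the relative path is illustrative, and `runPipelineBatch` is assumed to be exported by `batch.mjs`, mirroring the old `pipeline.mjs`):

```
// callers keep importing from the façade…
import {
  runPipelineStep,
  runPipelineBatch,
  loadSeedQuestions,
  seedToQuestion,
} from './src/pipeline/pipeline.mjs';

// …or reach into the internal modules directly when that is clearer:
// import { runPipelineStep } from './src/pipeline/step.mjs';
```
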
src/pipeline/seeds.mjs ADDED
@@ -0,0 +1,56 @@
1
+ // src/pipeline/seeds.mjs
2
+ import fs from 'fs/promises';
3
+ import path from 'path';
4
+ import { PROJECT_ROOT } from './util.mjs';
5
+
6
+ export const DEFAULT_SEEDS_PATH = path.join(
7
+ PROJECT_ROOT,
8
+ 'test_samples',
9
+ 'seed_questions.jsonl',
10
+ );
11
+
12
+ export const DEFAULT_OUT_PATH = path.join(
13
+ PROJECT_ROOT,
14
+ 'gold',
15
+ 'pipeline_gold.jsonl',
16
+ );
17
+
18
+ /**
19
+ * Load JSONL seed questions or chunks.
20
+ * Each line may be:
21
+ * - { "question": "..." }
22
+ * - { "prompt": "..." }
23
+ * - { "text": "..." }
24
+ * - or just a raw string
25
+ * - or just a JSON-encoded string (each line is parsed with JSON.parse)
26
+ export async function loadSeedQuestions(seedsPath = DEFAULT_SEEDS_PATH) {
27
+ const txt = await fs.readFile(seedsPath, 'utf8');
28
+ return txt
29
+ .split('\n')
30
+ .map((l) => l.trim())
31
+ .filter(Boolean)
32
+ .map((line) => JSON.parse(line));
33
+ }
34
+
35
+ /**
36
+ * Extract a question string from a seed record.
37
+ */
38
+ export function seedToQuestion(seed) {
39
+ if (typeof seed === 'string') return seed;
40
+ return seed.question || seed.prompt || seed.text || '';
41
+ }
42
+
43
+ /**
44
+ * Extract a chunk of text from a seed record (for question-first mode).
45
+ */
46
+ export function seedToContextText(seed) {
47
+ if (typeof seed === 'string') return seed;
48
+ return (
49
+ seed.text ||
50
+ seed.content ||
51
+ seed.context ||
52
+ seed.question || // fallback if someone stored full Q+answer text here
53
+ seed.prompt ||
54
+ ''
55
+ );
56
+ }
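
A short sketch of how these helpers treat the record shapes documented above (the sample records are illustrative):

```
import { seedToQuestion, seedToContextText } from './src/pipeline/seeds.mjs';

// static mode: a seed is read as a question
seedToQuestion({ question: 'What is the Law of One?' }); // → 'What is the Law of One?'
seedToQuestion('What is service to others?');            // → the string itself

// question-first mode: a seed is read as a chunk of source text,
// falling back through text → content → context → question → prompt
seedToContextText({ text: 'The Confederation describes unity as…' }); // → that text
seedToContextText({ question: 'Only a question here' });              // → fallback value
```
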
src/pipeline/step.mjs ADDED
@@ -0,0 +1,176 @@
1
+ // src/pipeline/step.mjs
2
+ import { loadProviderFor } from '../providers/provider.mjs';
3
+ import { hybridSearch } from '../retrieval/retrieval.mjs';
4
+ import { runGenerator } from '../generator/generator_core.mjs';
5
+ import { runVerifier } from '../verifier/verifier_core.mjs';
6
+ import { runReward } from '../reward/reward_core.mjs';
7
+ import { preview } from './util.mjs';
8
+
9
+ /**
10
+ * Run a single pipeline step for one question.
11
+ *
12
+ * Orchestrates:
13
+ * retrieval → generator → verifier → reward
14
+ *
15
+ * Returns a structured result:
16
+ * {
17
+ * status: 'accepted' | 'invalid_question' | 'retrieval_failed'
18
+ * | 'generator_failed' | 'verifier_rejected'
19
+ * | 'reward_rejected' | 'verifier_error' | 'reward_error',
20
+ * question,
21
+ * context,
22
+ * gen,
23
+ * ver,
24
+ * rew,
25
+ * error? // optional message
26
+ * }
27
+ */
28
+ export async function runPipelineStep({
29
+ question,
30
+ retrievalMode = process.env.RETRIEVAL_MODE || 'hybrid',
31
+ k = Number(process.env.RETRIEVAL_K || '6'),
32
+ generatorProvider,
33
+ verifierProvider,
34
+ rewardProvider,
35
+ verbose = false,
36
+ logger = console,
37
+ } = {}) {
38
+ const log = logger?.log?.bind(logger) || console.log;
39
+ const errLog = logger?.error?.bind(logger) || console.error;
40
+
41
+ if (!question || !question.trim()) {
42
+ if (verbose) log(' [pipeline] empty / invalid question, skipping');
43
+ return { status: 'invalid_question', question };
44
+ }
45
+
46
+ const genProv = generatorProvider || loadProviderFor('generator');
47
+ const verProv = verifierProvider || loadProviderFor('verifier');
48
+ const rewProv = rewardProvider || loadProviderFor('reward');
49
+
50
+ // --- Retrieval ---
51
+ let context = [];
52
+ try {
53
+ if (verbose) log(` [retrieval] mode=${retrievalMode} k=${k}`);
54
+ context = await hybridSearch(question, k);
55
+ if (verbose) {
56
+ log(` [retrieval] got ${context.length} chunks`);
57
+ if (context.length > 0) {
58
+ const first = context[0]?.content ?? '';
59
+ log(' [retrieval] first chunk:');
60
+ log(' ' + preview(first, 200).replace(/\n/g, '\n '));
61
+ }
62
+ }
63
+ } catch (e) {
64
+ const msg = e?.message || String(e);
65
+ if (verbose) errLog(' [retrieval] ERROR:', msg);
66
+ return {
67
+ status: 'retrieval_failed',
68
+ question,
69
+ error: msg,
70
+ };
71
+ }
72
+
73
+ // --- Generator ---
74
+ let gen;
75
+ try {
76
+ if (verbose) log(' [generator] calling model…');
77
+ // NOTE: runGenerator(query, contextChunks, provider)
78
+ gen = await runGenerator(question, context, genProv);
79
+ if (verbose) {
80
+ log(' [generator] raw:');
81
+ log(' ' + preview(gen.raw ?? '', 400).replace(/\n/g, '\n '));
82
+ log(' [generator] parsed:');
83
+ log(' ' + preview(gen.parsed, 400).replace(/\n/g, '\n '));
84
+ }
85
+ } catch (e) {
86
+ const msg = e?.message || String(e);
87
+ if (verbose) errLog(' [generator] ERROR:', msg);
88
+ return {
89
+ status: 'generator_failed',
90
+ question,
91
+ context,
92
+ error: msg,
93
+ };
94
+ }
95
+
96
+ // --- Verifier ---
97
+ let ver;
98
+ try {
99
+ if (verbose) log(' [verifier] calling model…');
100
+ // NOTE: runVerifier(sample, provider)
101
+ ver = await runVerifier(gen, verProv);
102
+ if (verbose) {
103
+ log(' [verifier] parsed:');
104
+ log(' ' + preview(ver.parsed, 400).replace(/\n/g, '\n '));
105
+ log(` [verifier] ok=${ver.ok === true}`);
106
+ }
107
+ } catch (e) {
108
+ const msg = e?.message || String(e);
109
+ if (verbose) errLog(' [verifier] ERROR:', msg);
110
+ return {
111
+ status: 'verifier_error',
112
+ question,
113
+ context,
114
+ gen,
115
+ error: msg,
116
+ };
117
+ }
118
+
119
+ if (!ver || ver.ok !== true) {
120
+ if (verbose) log(' [verifier] rejected sample');
121
+ return {
122
+ status: 'verifier_rejected',
123
+ question,
124
+ context,
125
+ gen,
126
+ ver,
127
+ };
128
+ }
129
+
130
+ // --- Reward ---
131
+ let rew;
132
+ try {
133
+ if (verbose) log(' [reward] calling model…');
134
+ // NOTE: runReward(sample, provider)
135
+ rew = await runReward(gen, rewProv);
136
+ if (verbose) {
137
+ log(' [reward] parsed:');
138
+ log(' ' + preview(rew.parsed, 400).replace(/\n/g, '\n '));
139
+ log(` [reward] score=${rew.score} ok=${rew.ok}`);
140
+ }
141
+ } catch (e) {
142
+ const msg = e?.message || String(e);
143
+ if (verbose) errLog(' [reward] ERROR:', msg);
144
+ return {
145
+ status: 'reward_error',
146
+ question,
147
+ context,
148
+ gen,
149
+ ver,
150
+ error: msg,
151
+ };
152
+ }
153
+
154
+ if (!rew || rew.ok !== true) {
155
+ if (verbose) log(' [reward] rejected sample');
156
+ return {
157
+ status: 'reward_rejected',
158
+ question,
159
+ context,
160
+ gen,
161
+ ver,
162
+ rew,
163
+ };
164
+ }
165
+
166
+ if (verbose) log(' [pipeline] accepted ✅');
167
+
168
+ return {
169
+ status: 'accepted',
170
+ question,
171
+ context,
172
+ gen,
173
+ ver,
174
+ rew,
175
+ };
176
+ }
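
A hedged sketch of calling the step directly and branching on the returned status (provider setup is left to the `loadProviderFor` defaults; field names follow the docblock above):

```
import { runPipelineStep } from './src/pipeline/step.mjs';

const result = await runPipelineStep({
  question: 'What is the purpose of meditation?',
  verbose: true,
});

switch (result.status) {
  case 'accepted':
    // gen / ver / rew hold the generator, verifier, and reward outputs
    console.log('gold candidate:', result.gen.parsed);
    break;
  case 'verifier_rejected':
  case 'reward_rejected':
    console.log('filtered by quality gates');
    break;
  default:
    // retrieval_failed, generator_failed, verifier_error, reward_error, invalid_question
    console.error('stage failure:', result.status, result.error);
}
```
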
src/pipeline/util.mjs ADDED
@@ -0,0 +1,20 @@
1
+ // src/pipeline/util.mjs
2
+ import path from 'path';
3
+ import { fileURLToPath } from 'url';
4
+
5
+ const __filename = fileURLToPath(import.meta.url);
6
+ const __dirname = path.dirname(__filename);
7
+
8
+ export const PROJECT_ROOT = path.join(__dirname, '..', '..');
9
+
10
+ /**
11
+ * Short preview of large strings/objects for logging.
12
+ */
13
+ export function preview(value, max = 400) {
14
+ if (value == null) return '';
15
+ let str = typeof value === 'string' ? value : JSON.stringify(value, null, 2);
16
+ if (str.length > max) {
17
+ return str.slice(0, max) + `… [truncated ${str.length - max} chars]`;
18
+ }
19
+ return str;
20
+ }
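
For completeness, the helper's behavior in a nutshell (the values shown are what the code above produces):

```
import { preview } from './src/pipeline/util.mjs';

preview('a'.repeat(500), 100); // first 100 chars + '… [truncated 400 chars]'
preview({ question: 'Why?' }); // pretty-printed JSON (short, so no truncation)
preview(null);                 // ''
```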