htaf/distill-pipeline

@@ -1,305 +1,106 @@
-Here is a **clean, successor-ready `README.md`** for your `distill-pipeline` repo.
-It:
-* matches your actual codebase **right now**
-* includes the new **question generation** subsystem
-* documents both **static** and **question-first** seed modes
-* gives correct usage for `npm run pipeline`
-* shows environment variables clearly
-* stays pragmatic and Canadian-English-friendly
-* is concise enough for GitHub, but complete enough for onboarding a new engineer
-Paste it into:
-```
-distill-pipeline/README.md
-```
 ---
-# **distill-pipeline**
-*A modular, retrieval-augmented LLM distillation system.*
-This project runs a multi-stage reasoning pipeline:
-1. **Question Generation** (optional)
-2. **Retrieval** from a distill-rag Elasticsearch index
-3. **Generator** (teacher model)
-4. **Verifier** (format, alignment, tone)
-5. **Reward Model** (scoring)
-6. **Gold Writer** (clean JSONL dataset)
-The pipeline is designed for **bootstrapped distillation**, where each cycle improves the model and the dataset.
-All components run locally and support multiple providers (Ollama, HTTP, OpenAI, vLLM).
----
-# **Features**
-### ✔ Retrieval-augmented generation
-Hybrid RRF search (BM25 + dense embeddings) via **distill-rag**.
-### ✔ Modular LLM stages
-Each stage uses a provider implementing:
-```js
-async generate(prompt, options?)
-```
-### ✔ Question generation from chunks
-LLM extracts focused questions directly from transcript chunks.
-Ideal for large-scale bootstrap distillation.
-### ✔ Multiple providers
-Configured per-stage using environment variables:
-```
-GENERATOR_PROVIDER
-VERIFIER_PROVIDER
-REWARD_PROVIDER
-QUESTION_PROVIDER
-```
-Providers currently supported:
-* Ollama
-* OpenAI
-* HTTP endpoint
-* (future) vLLM server
-### ✔ Fully tested
-All pure modules include Vitest coverage:
-* retrieval (mock + real ES)
-* generator, verifier, reward
-* question generation
-* provider router
-* pipeline integration (mock)
-* JSONL cache, PASS/FAIL verifier parsing, generator parsing (thought/thinking/answer)
 ---
-# **Project Structure**
-```
-prompts/
-  generator_prompt.txt
-  verifier_prompt.txt
-  reward_prompt.txt
-  question_prompt.txt
-src/
-  pipeline/
-    pipeline.mjs
-    pipeline_cli.mjs
-  providers/
-    provider.mjs
-    ollama_provider.mjs
-    openai_provider.mjs
-    http_provider.mjs
-  retrieval/
-    retrieval.mjs
-  generator/
-    generator_core.mjs
-  verifier/
-    verifier_core.mjs
-  reward/
-    reward_core.mjs
-  question/
-    question_core.mjs
-    question_cli.mjs
-test_samples/
-  seed_questions.jsonl
-gold/
-  (pipeline output)
-tests/
-  *.test.mjs
 ```
----
-# **Installation**
-```bash
-git clone https://github.com/yourname/distill-pipeline
-cd distill-pipeline
-npm install
-```
-You also need a running **distill-rag** instance with:
-* Elasticsearch index
-* embedding server (Ollama or HTTP)
----
-# **Configuration**
-All runtime settings are configured via `.env`.
-A common example:
-```env
-# Elasticsearch (from distill-rag)
 ES_NODE=http://localhost:9200
 ES_INDEX=quo_distill_index
-# Embedding server
 EMBED_URL=http://localhost:11434/api/embeddings
 EMBED_MODEL=mxbai-embed-large
-# Provider backends
 GENERATOR_PROVIDER=ollama
 VERIFIER_PROVIDER=ollama
 REWARD_PROVIDER=ollama
 QUESTION_PROVIDER=ollama
-# Stage-specific models
 GENERATOR_MODEL=qwen3-vl:8b-thinking
 VERIFIER_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
 REWARD_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
 QUESTION_MODEL=qwen2.5-7b-instruct
-```
----
-# **Running the Pipeline**
-There are **two seed modes**.
----
-## **1. Static Seed Mode** *(default)*
-Reads questions from:
-```
-test_samples/seed_questions.jsonl
-```
-Run:
-```bash
-npm run pipeline -- --limit 20 --verbose
-```
----
-## **2. Question-First Mode (auto-generate questions)**
-The pipeline will:
-* fetch chunks from distill-rag,
-* run question extraction,
-* feed each question into the main pipeline.
-Enable this mode:
-```bash
-PIPELINE_SEED_MODE=question-first npm run pipeline -- --limit 20 --verbose
-```
----
-# **Outputs**
-Accepted samples are written to:
-```
-gold/pipeline_gold.jsonl
-```
-Each record contains:
-```json
-{
-  "question": "...",
-  "context": [...],
-  "sample": { ... },
-  "verifier": { ... },
-  "reward": { ... }
-}
-```
-This file is ready for use in QLoRA SFT training.
----
-# **Running Tests**
-```bash
-npm test
-```
-All core logic modules are covered:
-```
-9 test files
-27 tests
-0 failures
-```
----
-# **How to Extend**
-## Add a new model provider
-Implement:
-```js
-class MyProvider {
-  constructor(stage) { ... }
-  async generate(prompt, opts) { ... }
-}
-```
-Then register it in:
-```
-src/providers/provider.mjs
-```
-## Add a new pipeline stage
-Follow the existing structure:
-* create `src/<stage>/<stage>_core.mjs`
-* add prompt in `prompts/`
-* add a test in `tests/`
----
-# **Development Notes**
-* Avoid mixing CLI logic with pipeline logic — all pure functions are in `*_core.mjs`.
-* Providers must always return **JSON-parseable** output.
-* Retrieval expects a working **distill-rag** index with BM25 + vector embeddings.
-* Reward model may be swapped later for your custom HTTP reward server.
----
-# **License**
-MIT (or update as needed).
----
-If you want:
-✓ a shorter GitHub-friendly description
-✓ a more polished badge/header section
-✓ install instructions tailored to your exact environment
-✓ a separate `USAGE.md`
-Just ask.

 ---
+license: apache-2.0
+title: distill-pipeline
+tags:
+  - distillation
+  - retrieval-augmented-generation
+  - pipeline
+  - nodejs
+  - ollama
+  - instruct
+  - question-generation
 ---
+# distill-pipeline
+Modular, retrieval-augmented distillation with question generation, verifier, and reward stages. Supports “thinking” and “instruct” generator flows with separate caches/outputs.
+## Quickstart
+- Clone: `git clone https://github.com/elspru/distill-pipeline && cd distill-pipeline`
+- Install: `npm install`
+- Thinking pipeline (question-first, random walk):
+  `PIPELINE_SEED_MODE=question-first PIPELINE_RANDOM_WALK=1 npm run pipeline -- --limit 5 --verbose`
+- Instruct pipeline (separate cache/output):
+  `INSTRUCT_PIPELINE=1 INSTRUCT_GENERATOR_MODEL=<model> PIPELINE_CACHE_DIR=data/cache_instruct npm run pipeline -- --out gold/pipeline_gold_instruct.jsonl --verbose`
+- Continuous loops (stop with Ctrl+C):
+  `scripts/run_thinking_continuous.sh`
+  `INSTRUCT_GENERATOR_MODEL=<model> scripts/run_instruct_continuous.sh`
+## Configuration (see `.env.example`)
 ```
+# Retrieval
 ES_NODE=http://localhost:9200
 ES_INDEX=quo_distill_index
 EMBED_URL=http://localhost:11434/api/embeddings
 EMBED_MODEL=mxbai-embed-large
+# Providers per stage
 GENERATOR_PROVIDER=ollama
 VERIFIER_PROVIDER=ollama
 REWARD_PROVIDER=ollama
 QUESTION_PROVIDER=ollama
+# Models
 GENERATOR_MODEL=qwen3-vl:8b-thinking
 VERIFIER_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
 REWARD_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
 QUESTION_MODEL=qwen2.5-7b-instruct
+# Instruct-only generator
+INSTRUCT_PIPELINE=0
+INSTRUCT_GENERATOR_MODEL=phi-4-instruct
+INSTRUCT_GENERATOR_PROVIDER=ollama
+# Pipeline knobs
+PIPELINE_SEED_MODE=question-first
+PIPELINE_RANDOM_WALK=0        # set 1 for shuffled chunks
+QUESTION_MAX_PER_CHUNK=5
+# PIPELINE_CHUNK_LIMIT=10
+# PIPELINE_CACHE_DIR=data/cache   # override (e.g., data/cache_instruct)
+```
+## Key scripts
+- `npm run pipeline` — main pipeline CLI (`--limit`, `--out`, `--chunk-limit`, `--verbose`).
+- `scripts/run_thinking_continuous.sh` — loop thinking pipeline with random walk.
+- `scripts/run_instruct_continuous.sh` — loop instruct pipeline (needs `INSTRUCT_GENERATOR_MODEL`).
+- `scripts/try_generator_prompt.sh` — send generator prompt with cached chunk/question (`--random`, `-r` for reasoning).
+- `scripts/cache_report.mjs` — cache stats; set `CACHE_REPORT_MODE=thinking|instruct|both` or `PIPELINE_CACHE_DIR=...`.
+## Outputs
+- Gold JSONL default: `gold/pipeline_gold.jsonl` (instruct default: `gold/pipeline_gold_instruct.jsonl`).
+- Sample gold: `samples/pipeline_gold_sample.jsonl`.
+- Cache defaults: `data/cache` (thinking) and `data/cache_instruct` (instruct); both gitignored.
+## Hugging Face / GitHub distribution
+- License: Apache-2.0 (`LICENSE`).
+- CI: `.github/workflows/ci.yml` runs `npm test` on push/PR.
+- Push to GitHub:
+  ```
+  git remote add origin https://github.com/elspru/distill-pipeline
+  git push origin main
+  ```
+- Push to Hugging Face (user: htaf):
+  ```
+  git lfs install
+  git remote add hf https://huggingface.co/htaf/distill-pipeline
+  git push origin main
+  git push hf main
+  ```
+- Publish code + prompts + `samples/pipeline_gold_sample.jsonl`. Keep caches/gold outputs out (gitignored).
+## Project structure
+```
+prompts/                # stage prompts
+src/                    # pipeline, providers, stages
+tests/                  # Vitest
+data/                   # rag chunks (jsonl), cache (ignored)
+gold/                   # outputs (ignored)
+scripts/                # tooling + runners
+samples/pipeline_gold_sample.jsonl
+```
+## Testing
+`npm test`
+## License
+Apache-2.0
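
The removed README described each gold record as an object with `question`, `context`, `sample`, `verifier`, and `reward` keys, while the new README only names the output files. Below is a minimal sketch of checking JSONL text against that layout, assuming the record layout is unchanged by this commit. `parseGoldJsonl` and `missingGoldFields` are hypothetical names for illustration, not functions from the repository, and the inline sample text stands in for the contents of `gold/pipeline_gold.jsonl`.

```javascript
// Parse JSONL text into an array of records, skipping blank lines.
// Throws with a 1-based line number if a line is not valid JSON.
function parseGoldJsonl(text) {
  const records = [];
  text.split(/\r?\n/).forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed === '') return;
    try {
      records.push(JSON.parse(trimmed));
    } catch (err) {
      throw new Error(`Invalid JSON on line ${index + 1}: ${err.message}`);
    }
  });
  return records;
}

// Hypothetical helper: returns the expected keys that a record lacks.
// The key list comes from the previous README and is an assumption here.
function missingGoldFields(record) {
  const expected = ['question', 'context', 'sample', 'verifier', 'reward'];
  return expected.filter((key) => !(key in record));
}

// Inline stand-in for file contents; a real run would read the gold file.
const sampleText = [
  '{"question":"q1","context":[],"sample":{},"verifier":{},"reward":{}}',
  '',
  '{"question":"q2","context":[]}',
].join('\n');

const records = parseGoldJsonl(sampleText);
for (const record of records) {
  console.log(record.question, missingGoldFields(record));
}
```

Keeping the parser as a pure function over a string makes it easy to test without touching the gitignored `gold/` directory.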

package.json CHANGED

@@ -1,7 +1,16 @@
 {
   "name": "distill-pipeline",
-  "version": "1.0.0",
+  "version": "1.1.0",
   "type": "module",
+  "license": "Apache-2.0",
+  "repository": {
+    "type": "git",
+    "url": "https://github.com/elspru/distill-pipeline.git"
+  },
+  "bugs": {
+    "url": "https://github.com/elspru/distill-pipeline/issues"
+  },
+  "homepage": "https://github.com/elspru/distill-pipeline#readme",
   "scripts": {
     "test": "vitest --run",
     "pipeline": "node ./src/pipeline/pipeline_cli.js",
@@ -10,4 +19,4 @@
   "devDependencies": {
     "vitest": "^1.6.0"
   }
-}
+}
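
The README changes in this commit add `INSTRUCT_PIPELINE`, `INSTRUCT_GENERATOR_MODEL`, and `INSTRUCT_GENERATOR_PROVIDER` next to the existing per-stage `*_PROVIDER` and `*_MODEL` variables. The following is only a sketch of how such variables could be resolved into a stage configuration. `resolveStageConfig` is a hypothetical helper, and the fallback order and the `'ollama'` default are assumptions rather than behaviour confirmed by the repository.

```javascript
// Hypothetical helper, not taken from the repository: resolves the provider
// and model for one stage from an env-like object (for example process.env).
function resolveStageConfig(stage, env) {
  const prefix = stage.toUpperCase();
  // Assumption: instruct mode only affects the generator stage.
  const instruct = stage === 'generator' && env.INSTRUCT_PIPELINE === '1';
  if (instruct) {
    return {
      // Assumption: fall back to the regular generator provider, then 'ollama'.
      provider: env.INSTRUCT_GENERATOR_PROVIDER || env.GENERATOR_PROVIDER || 'ollama',
      model: env.INSTRUCT_GENERATOR_MODEL,
    };
  }
  return {
    provider: env[`${prefix}_PROVIDER`] || 'ollama', // assumed default
    model: env[`${prefix}_MODEL`],
  };
}

// Values copied from the configuration block in the README diff.
const exampleEnv = {
  GENERATOR_PROVIDER: 'ollama',
  GENERATOR_MODEL: 'qwen3-vl:8b-thinking',
  VERIFIER_MODEL: 'tensortemplar/patronus-lynx:8b-instruct-q4_K_M',
  INSTRUCT_PIPELINE: '1',
  INSTRUCT_GENERATOR_MODEL: 'phi-4-instruct',
};

console.log(resolveStageConfig('generator', exampleEnv));
console.log(resolveStageConfig('generator', { ...exampleEnv, INSTRUCT_PIPELINE: '0' }));
console.log(resolveStageConfig('verifier', exampleEnv));
```

Passing the environment as an argument instead of reading `process.env` directly keeps the function easy to exercise from a Vitest test.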