htaf committed
Commit 2739b3a · 1 Parent(s): ecd21e2

added new instruct pipeline for faster generation
USAGE.md CHANGED
@@ -67,6 +67,16 @@ If you need to cap questions per chunk:
 QUESTION_MAX_PER_CHUNK=3
 ```
 
+### Random-walk over chunks
+
+Shuffle the chunk order (crypto-random) to reduce ordering bias:
+
+```bash
+PIPELINE_RANDOM_WALK=1 QUESTION_MAX_PER_CHUNK=3 npm run pipeline -- --limit 3 --chunk-limit 10 --verbose
+```
+
+Equivalent toggle: `PIPELINE_CHUNK_ORDER=random`. `--chunk-limit` (or `PIPELINE_CHUNK_LIMIT`) caps how many chunks are sampled.
+
 ---
 
 # 3. **Understanding Output**
@@ -108,6 +118,11 @@ GENERATOR_MODEL=qwen3-vl:8b-thinking
 VERIFIER_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
 REWARD_MODEL=tensortemplar/patronus-lynx:8b-instruct-q4_K_M
 QUESTION_MODEL=qwen2.5-7b-instruct
+
+# instruct-only generator (optional)
+INSTRUCT_PIPELINE=0
+INSTRUCT_GENERATOR_MODEL=phi-4-instruct
+INSTRUCT_GENERATOR_PROVIDER=ollama
 ```
 
 Retrieval configuration:
@@ -124,6 +139,15 @@ EMBED_URL=http://localhost:11434/api/embeddings
 EMBED_MODEL=mxbai-embed-large
 ```
 
+General pipeline knobs:
+```
+PIPELINE_SEED_MODE=question-first   # or static
+PIPELINE_RANDOM_WALK=0              # set 1 for shuffled chunks
+QUESTION_MAX_PER_CHUNK=5
+# PIPELINE_CHUNK_LIMIT=10           # optional chunk cap
+# PIPELINE_CACHE_DIR=data/cache     # override cache dir (e.g., data/cache_instruct)
+```
+
 ---
 
 # 5. **Debugging Tips**
@@ -155,6 +179,20 @@ node ./src/question/question_cli.mjs "Give me questions about freedom"
 node ./src/generator/generator_cli.mjs "What is love?"
 ```
 
+### Try generator prompt with cached chunk/question:
+
+```bash
+# picks first cached chunk + question
+scripts/try_generator_prompt.sh
+
+# pick a specific cached chunk/question
+scripts/try_generator_prompt.sh <chunk_id> <question_index>
+
+# random cached chunk/question + reasoning mode
+scripts/try_generator_prompt.sh --random -r
+```
+Uses `data/cache/questions.jsonl` + `data/rag_chunks.jsonl` and injects them into `prompts/generator_prompt.txt`. It picks the first matching cached chunk ID in the rag file (or a random one with `--random`). Override via `PIPELINE_CACHE_DIR`, `RAG_CHUNKS_PATH`, `PROMPT_FILE`, `GENERATOR_MODEL`, `OLLAMA_URL`. If no cached chunk IDs match the rag chunks, run the pipeline once to populate the cache.
+
 ---
 
 # 6. **Full End-to-End (Question-First) Verbose Run**
@@ -174,6 +212,57 @@ You should see:
 * verifier judgement
 * reward score
 
+### Long overnight runs
+
+To let the pipeline keep going until you stop it, drop `--limit` (process all available seeds/chunks) or set a very high limit. For a basic “run all cached chunks with random walk” overnight job:
+
+```bash
+PIPELINE_SEED_MODE=question-first \
+PIPELINE_RANDOM_WALK=1 \
+QUESTION_MAX_PER_CHUNK=5 \
+npm run pipeline -- --verbose
+```
+
+If you prefer a hard cap instead of a truly unbounded run, set `--limit <N>` or `PIPELINE_CHUNK_LIMIT`. To run it in a loop until you Ctrl+C:
+
+```bash
+while true; do
+  PIPELINE_SEED_MODE=question-first PIPELINE_RANDOM_WALK=1 npm run pipeline -- --verbose
+  sleep 10
+done
+```
+
+# 7. **Instruct-only generator runs**
+
+Use an instruct model without touching the default “thinking” pipeline:
+
+```bash
+INSTRUCT_PIPELINE=1 \
+INSTRUCT_GENERATOR_MODEL=<your-instruct-model> \
+PIPELINE_CACHE_DIR=data/cache_instruct \
+npm run pipeline -- --out gold/pipeline_gold_instruct.jsonl --verbose
+```
+
+What it does:
+- switches the generator model/provider to `INSTRUCT_GENERATOR_MODEL` (and `INSTRUCT_GENERATOR_PROVIDER` if set),
+- keeps verifier/reward unchanged (configure `VERIFIER_MODEL`/`REWARD_MODEL` if you want lighter models),
+- defaults the output to `gold/pipeline_gold_instruct.jsonl` unless you pass `--out`.
+
+Keep caches separate by setting `PIPELINE_CACHE_DIR` for instruct runs (e.g., `data/cache_instruct`) so you don’t mix artifacts with the thinking pipeline. The default cache path is unchanged unless you override it.
+
+### One-liner scripts for continuous runs
+
+Thinking pipeline (random-walk, no limit, restarts every 10s):
+```
+scripts/run_thinking_continuous.sh
+```
+
+Instruct pipeline (needs `INSTRUCT_GENERATOR_MODEL`, uses separate cache/output):
+```
+INSTRUCT_GENERATOR_MODEL=<your-model> scripts/run_instruct_continuous.sh
+```
+Configure `INSTRUCT_GENERATOR_PROVIDER`, `PIPELINE_CACHE_DIR`, or `INSTRUCT_OUT` as needed; stop with Ctrl+C.
+
 ---
 
 # 7. **Output Cleanup**
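The usage notes above describe `PIPELINE_RANDOM_WALK` as a crypto-random shuffle of chunk order. The pipeline's actual shuffle code is not part of this diff, so the sketch below is only an illustration of what such a shuffle could look like (an unbiased Fisher–Yates pass over `crypto.randomInt`); the name `shuffleChunks` is ours.

```javascript
import crypto from 'crypto';

// Illustrative crypto-random shuffle over chunk order (Fisher-Yates).
// Returns a new array; the input order is left untouched.
function shuffleChunks(chunks) {
  const out = [...chunks];
  for (let i = out.length - 1; i > 0; i--) {
    const j = crypto.randomInt(i + 1); // uniform pick in [0, i]
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```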
data/cache_instruct/generations.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
data/cache_instruct/questions.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
data/cache_instruct/rewards.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
data/cache_instruct/verifications.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
scripts/cache_report.mjs CHANGED
@@ -10,14 +10,39 @@ const __filename = fileURLToPath(import.meta.url);
 const __dirname = path.dirname(__filename);
 const PROJECT_ROOT = path.join(__dirname, '..');
 
-const CACHE_DIR = (() => {
-  const custom = process.env.PIPELINE_CACHE_DIR;
-  if (custom) {
-    return path.isAbsolute(custom)
-      ? custom
-      : path.join(PROJECT_ROOT, custom);
+const DEFAULT_CACHE_DIR = path.join(PROJECT_ROOT, 'data', 'cache');
+const INSTRUCT_CACHE_DIR = path.join(PROJECT_ROOT, 'data', 'cache_instruct');
+
+const MODE = (() => {
+  const v = process.env.CACHE_REPORT_MODE;
+  if (!v) return 'both';
+  const s = String(v).toLowerCase();
+  if (['thinking', 'default'].includes(s)) return 'thinking';
+  if (['instruct'].includes(s)) return 'instruct';
+  if (['both', 'all'].includes(s)) return 'both';
+  return 'both';
+})();
+
+const customDir = process.env.PIPELINE_CACHE_DIR
+  ? (path.isAbsolute(process.env.PIPELINE_CACHE_DIR)
+      ? process.env.PIPELINE_CACHE_DIR
+      : path.join(PROJECT_ROOT, process.env.PIPELINE_CACHE_DIR))
+  : null;
+
+const CACHE_DIRS = (() => {
+  if (customDir) {
+    return [{ label: 'custom', dir: customDir }];
   }
-  return path.join(PROJECT_ROOT, 'data', 'cache');
+  if (MODE === 'thinking') {
+    return [{ label: 'thinking (default)', dir: DEFAULT_CACHE_DIR }];
+  }
+  if (MODE === 'instruct') {
+    return [{ label: 'instruct', dir: INSTRUCT_CACHE_DIR }];
+  }
+  return [
+    { label: 'thinking (default)', dir: DEFAULT_CACHE_DIR },
+    { label: 'instruct', dir: INSTRUCT_CACHE_DIR },
+  ];
 })();
 
 const FILES = {
@@ -27,8 +52,8 @@ const FILES = {
   rewards: 'rewards.jsonl',
 };
 
-async function readJsonl(fileName) {
-  const filePath = path.join(CACHE_DIR, fileName);
+async function readJsonl(cacheDir, fileName) {
+  const filePath = path.join(cacheDir, fileName);
   try {
     const txt = await fs.readFile(filePath, 'utf8');
     return txt
@@ -54,47 +79,56 @@ function uniq(arr) {
 }
 
 async function main() {
-  const questions = await readJsonl(FILES.questions);
-  const generations = await readJsonl(FILES.generations);
-  const verifications = await readJsonl(FILES.verifications);
-  const rewards = await readJsonl(FILES.rewards);
-
-  const chunkIds = uniq([
-    ...questions.map((r) => r.chunk_id),
-    ...generations.map((r) => r.chunk_id),
-    ...verifications.map((r) => r.chunk_id),
-    ...rewards.map((r) => r.chunk_id),
-  ].filter(Boolean));
-
-  const totalQuestions = questions.reduce((acc, r) => {
-    if (Array.isArray(r.questions)) return acc + r.questions.length;
-    if (Array.isArray(r.question_ids)) return acc + r.question_ids.length;
-    return acc + 1;
-  }, 0);
-
-  const totalGenerations = generations.length;
-  const totalVerifications = verifications.length;
-  const totalRewards = rewards.length;
-
-  const passedVerifications = verifications.filter((v) => v.ok === true).length;
-  const passedRewards = rewards.filter((r) => r.ok === true).length;
-
-  const rows = [
-    ['Cache dir', CACHE_DIR],
-    ['Unique chunks', chunkIds.length],
-    ['Question records', questions.length],
-    ['Questions total', totalQuestions],
-    ['Generation records', totalGenerations],
-    ['Verification records', totalVerifications],
-    ['Verifications ok', passedVerifications],
-    ['Reward records', totalRewards],
-    ['Rewards ok', passedRewards],
-  ];
+  if (customDir) {
+    console.log(`CACHE_REPORT_MODE=custom (PIPELINE_CACHE_DIR=${customDir})`);
+  } else {
+    console.log(`CACHE_REPORT_MODE=${MODE}`);
+  }
+
+  for (const { label, dir } of CACHE_DIRS) {
+    const questions = await readJsonl(dir, FILES.questions);
+    const generations = await readJsonl(dir, FILES.generations);
+    const verifications = await readJsonl(dir, FILES.verifications);
+    const rewards = await readJsonl(dir, FILES.rewards);
+
+    const chunkIds = uniq([
+      ...questions.map((r) => r.chunk_id),
+      ...generations.map((r) => r.chunk_id),
+      ...verifications.map((r) => r.chunk_id),
+      ...rewards.map((r) => r.chunk_id),
+    ].filter(Boolean));
+
+    const totalQuestions = questions.reduce((acc, r) => {
+      if (Array.isArray(r.questions)) return acc + r.questions.length;
+      if (Array.isArray(r.question_ids)) return acc + r.question_ids.length;
+      return acc + 1;
+    }, 0);
+
+    const totalGenerations = generations.length;
+    const totalVerifications = verifications.length;
+    const totalRewards = rewards.length;
+
+    const passedVerifications = verifications.filter((v) => v.ok === true).length;
+    const passedRewards = rewards.filter((r) => r.ok === true).length;
+
+    console.log(`\n== ${label} cache ==`);
+    const rows = [
+      ['Cache dir', dir],
+      ['Unique chunks', chunkIds.length],
+      ['Question records', questions.length],
+      ['Questions total', totalQuestions],
+      ['Generation records', totalGenerations],
+      ['Verification records', totalVerifications],
+      ['Verifications ok', passedVerifications],
+      ['Reward records', totalRewards],
+      ['Rewards ok', passedRewards],
+    ];
 
-  const colWidth = Math.max(...rows.map(([k]) => k.length)) + 2;
-  for (const [key, val] of rows) {
-    const pad = ' '.repeat(colWidth - key.length);
-    console.log(`${key}:${pad}${val}`);
+    const colWidth = Math.max(...rows.map(([k]) => k.length)) + 2;
+    for (const [key, val] of rows) {
+      const pad = ' '.repeat(colWidth - key.length);
+      console.log(`${key}:${pad}${val}`);
+    }
   }
 }
 
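The mode resolution added above can be exercised in isolation. This standalone restatement mirrors the `CACHE_REPORT_MODE` parsing from the diff; the function wrapper `resolveReportMode` is ours for illustration.

```javascript
// Mirrors the CACHE_REPORT_MODE parsing in scripts/cache_report.mjs:
// unset or unrecognized values fall back to reporting both caches.
function resolveReportMode(value) {
  if (!value) return 'both';
  const s = String(value).toLowerCase();
  if (['thinking', 'default'].includes(s)) return 'thinking';
  if (s === 'instruct') return 'instruct';
  if (['both', 'all'].includes(s)) return 'both';
  return 'both'; // unknown value: report both
}
```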
scripts/run_instruct_continuous.sh ADDED
@@ -0,0 +1,40 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Continuous instruct-only pipeline runner.
+# - Uses separate cache/output to avoid mixing with thinking pipeline
+# - Random-walk over chunks
+# - No limit: processes all available chunks/questions; loop restarts after completion
+#
+# Required: set INSTRUCT_GENERATOR_MODEL (and optionally INSTRUCT_GENERATOR_PROVIDER).
+# Stop with Ctrl+C.
+
+ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+
+# Load .env if present
+if [[ -f "$ROOT_DIR/.env" ]]; then
+  set -a
+  source "$ROOT_DIR/.env"
+  set +a
+fi
+
+if [[ -z "${INSTRUCT_GENERATOR_MODEL:-}" ]]; then
+  echo "❌ Please set INSTRUCT_GENERATOR_MODEL to your instruct model." >&2
+  exit 1
+fi
+
+while true; do
+  INSTRUCT_PIPELINE=1 \
+  INSTRUCT_GENERATOR_MODEL="$INSTRUCT_GENERATOR_MODEL" \
+  INSTRUCT_GENERATOR_PROVIDER="${INSTRUCT_GENERATOR_PROVIDER:-${GENERATOR_PROVIDER:-ollama}}" \
+  PIPELINE_CACHE_DIR="${PIPELINE_CACHE_DIR:-$ROOT_DIR/data/cache_instruct}" \
+  PIPELINE_SEED_MODE=question-first \
+  PIPELINE_RANDOM_WALK=1 \
+  QUESTION_MAX_PER_CHUNK="${QUESTION_MAX_PER_CHUNK:-5}" \
+  npm run pipeline -- \
+    --out "${INSTRUCT_OUT:-$ROOT_DIR/gold/pipeline_gold_instruct.jsonl}" \
+    --verbose
+
+  echo "Instruct run finished at $(date). Sleeping 10s before next loop..."
+  sleep 10
+done
scripts/run_thinking_continuous.sh ADDED
@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Continuous "thinking" pipeline runner.
+# - Uses default thinking cache/output
+# - Random-walk over chunks
+# - No limit: processes all available chunks/questions; loop restarts after completion
+#
+# Stop with Ctrl+C.
+
+ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+
+# Load .env if present
+if [[ -f "$ROOT_DIR/.env" ]]; then
+  set -a
+  source "$ROOT_DIR/.env"
+  set +a
+fi
+
+while true; do
+  PIPELINE_SEED_MODE=question-first \
+  PIPELINE_RANDOM_WALK=1 \
+  QUESTION_MAX_PER_CHUNK="${QUESTION_MAX_PER_CHUNK:-5}" \
+  npm run pipeline -- --verbose
+
+  echo "Run finished at $(date). Sleeping 10s before next loop..."
+  sleep 10
+done
scripts/try_generator_prompt.sh ADDED
@@ -0,0 +1,227 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Usage:
+#   scripts/try_generator_prompt.sh [chunk_id] [question_index] [-r] [--random]
+# - chunk_id: optional. default = first cached chunk in questions cache
+# - question_index: 0-based index into the cached question list for that chunk (default 0)
+# - -r / --reasoning: enable Ollama reasoning option
+# - --random: pick a random cached chunk and random question (ignores positional args)
+#
+# Requirements: jq, node, cache populated (data/cache/questions.jsonl) and rag chunks file (data/rag_chunks.jsonl)
+
+CHUNK_ID=""
+QUESTION_INDEX=0
+REASONING=0
+RANDOM_MODE=0
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    -r|--reasoning)
+      REASONING=1
+      shift
+      ;;
+    --random)
+      RANDOM_MODE=1
+      shift
+      ;;
+    *)
+      if [[ -z "$CHUNK_ID" ]]; then
+        CHUNK_ID="$1"
+      else
+        QUESTION_INDEX="$1"
+      fi
+      shift
+      ;;
+  esac
+done
+
+ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+CACHE_DIR="${PIPELINE_CACHE_DIR:-$ROOT_DIR/data/cache}"
+QUESTIONS_FILE="${CACHE_DIR}/questions.jsonl"
+RAG_PATH="${RAG_CHUNKS_PATH:-$ROOT_DIR/data/rag_chunks.jsonl}"
+PROMPT_FILE="${PROMPT_FILE:-$ROOT_DIR/prompts/generator_prompt.txt}"
+MODEL="${GENERATOR_MODEL:-${OLLAMA_MODEL:-qwen3-vl:8b-thinking}}"
+OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
+
+if [[ ! -f "$QUESTIONS_FILE" ]]; then
+  echo "❌ questions cache not found at $QUESTIONS_FILE" >&2
+  exit 1
+fi
+
+if [[ ! -f "$RAG_PATH" ]]; then
+  echo "❌ rag chunks file not found at $RAG_PATH" >&2
+  exit 1
+fi
+
+if [[ ! -f "$PROMPT_FILE" ]]; then
+  echo "❌ generator prompt not found at $PROMPT_FILE" >&2
+  exit 1
+fi
+
+NODE_OUTPUT="$(CHUNK_ID="$CHUNK_ID" QUESTION_INDEX="$QUESTION_INDEX" QUESTIONS_FILE="$QUESTIONS_FILE" RAG_PATH="$RAG_PATH" RANDOM_MODE="$RANDOM_MODE" node --input-type=module <<'NODE'
+import fs from 'fs';
+import crypto from 'crypto';
+
+const chunkIdArg = process.env.CHUNK_ID || '';
+const qIndex = Number(process.env.QUESTION_INDEX || '0');
+const questionsFile = process.env.QUESTIONS_FILE;
+const ragPath = process.env.RAG_PATH;
+const randomMode = process.env.RANDOM_MODE === '1';
+
+function normalizeText(text = '') {
+  return String(text).replace(/\s+/g, ' ').trim();
+}
+
+function chunkIdFromContent(content, sourceId) {
+  const base = normalizeText(content);
+  return crypto.createHash('sha256').update(`${base}|${sourceId ?? ''}`).digest('hex');
+}
+
+function fail(msg) {
+  console.error(msg);
+  process.exit(2);
+}
+
+const questionLines = fs.readFileSync(questionsFile, 'utf8')
+  .split('\n')
+  .map((l) => l.trim())
+  .filter(Boolean);
+const records = questionLines.map((l) => {
+  try {
+    return JSON.parse(l);
+  } catch {
+    return null;
+  }
+}).filter(Boolean);
+
+if (records.length === 0) fail('No cached questions found.');
+
+const ragLines = fs.readFileSync(ragPath, 'utf8')
+  .split('\n')
+  .map((l) => l.trim())
+  .filter(Boolean);
+
+const ragMap = new Map();
+ragLines.forEach((line, idx) => {
+  let obj;
+  try {
+    obj = JSON.parse(line);
+  } catch {
+    return;
+  }
+  const content =
+    obj.content ||
+    obj.text ||
+    obj.chunk ||
+    obj.body ||
+    '';
+  const sourceId =
+    obj.id ||
+    obj.session_key ||
+    obj.title ||
+    `jsonl-${idx}`;
+  const cid = chunkIdFromContent(content, sourceId);
+  ragMap.set(cid, { content, sourceId, source: obj });
+});
+
+const matchingRecords = records.filter((r) => ragMap.has(r.chunk_id));
+
+let record = null;
+if (chunkIdArg) {
+  record = records.find((r) => r.chunk_id === chunkIdArg);
+  if (!record) fail(`Chunk ${chunkIdArg} not found in questions cache.`);
+  if (!ragMap.has(record.chunk_id)) {
+    fail(`Chunk content for ${record.chunk_id} not found in ${ragPath}.`);
+  }
+} else if (randomMode) {
+  if (matchingRecords.length === 0) {
+    fail('No cached chunk IDs match rag chunks. Run the pipeline to populate cache.');
+  }
+  record = matchingRecords[crypto.randomInt(matchingRecords.length)];
+} else {
+  record = matchingRecords[0];
+  if (!record) {
+    fail('No cached chunk IDs match rag chunks. Run the pipeline to populate cache.');
+  }
+}
+
+const questions = record.questions || [];
+let chosenQIndex = qIndex;
+if (randomMode) {
+  chosenQIndex = questions.length > 0 ? crypto.randomInt(questions.length) : 0;
+}
+const question = questions?.[chosenQIndex];
+if (!question) fail(`Question index ${qIndex} out of range for chunk ${record.chunk_id}.`);
+
+const matchedChunk = ragMap.get(record.chunk_id);
+
+console.log(JSON.stringify({
+  chunkId: record.chunk_id,
+  question,
+  questionIndex: chosenQIndex,
+  chunk: matchedChunk.content,
+  source: matchedChunk.source,
+}));
+NODE
+)"
+
+CHUNK_ID_RESOLVED="$(echo "$NODE_OUTPUT" | jq -r '.chunkId')"
+QUESTION="$(echo "$NODE_OUTPUT" | jq -r '.question')"
+CHUNK="$(echo "$NODE_OUTPUT" | jq -r '.chunk')"
+QUESTION_INDEX="$(echo "$NODE_OUTPUT" | jq -r '.questionIndex')"
+
+echo "🧩 Chunk: $CHUNK_ID_RESOLVED"
+echo "   Question [$QUESTION_INDEX]: $QUESTION"
+echo "   Model: $MODEL"
+echo "   Prompt file: $PROMPT_FILE"
+echo "----------------------------------------------"
+echo "$CHUNK" | head -n 20
+echo "… (chunk truncated)"
+echo "----------------------------------------------"
+
+PROMPT="$(QUESTION="$QUESTION" CHUNK="$CHUNK" PROMPT_FILE="$PROMPT_FILE" node --input-type=module <<'NODE'
+import fs from 'fs';
+const tpl = fs.readFileSync(process.env.PROMPT_FILE, 'utf8');
+const question = process.env.QUESTION;
+const context = process.env.CHUNK;
+const out = tpl
+  .split('{{QUESTION}}').join(question)
+  .split('{{CONTEXT}}').join(context);
+process.stdout.write(out);
+NODE
+)"
+
+PROMPT_JSON=$(printf '%s' "$PROMPT" | jq -Rs .)
+
+if [[ "$REASONING" == "1" ]]; then
+  echo "🧠 Reasoning: ON"
+  OPTIONS='"options":{"reasoning":true},'
+else
+  OPTIONS=""
+fi
+
+PAYLOAD=$(cat <<EOF
+{
+  "model": "$MODEL",
+  "prompt": $PROMPT_JSON,
+  $OPTIONS
+  "stream": false
+}
+EOF
+)
+
+echo
+echo "🚀 Sending to Ollama ($MODEL)…"
+echo
+
+RAW_RESPONSE=$(mktemp)
+curl -s -X POST "$OLLAMA_URL/api/generate" \
+  -H "Content-Type: application/json" \
+  -d "$PAYLOAD" | tee "$RAW_RESPONSE" \
+  | jq 'del(.context)'
+
+echo
+echo "📝 Response text:"
+jq -r '.response // .message // .output' "$RAW_RESPONSE"
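The script's chunk matching hinges on deterministic chunk IDs. Extracted here for illustration, the hashing logic from the script's inline Node section normalizes whitespace before hashing, so a re-wrapped chunk still maps to the same ID:

```javascript
import crypto from 'crypto';

// Same derivation as in try_generator_prompt.sh: sha256 over the
// whitespace-normalized content joined with the source ID.
function normalizeText(text = '') {
  return String(text).replace(/\s+/g, ' ').trim();
}

function chunkIdFromContent(content, sourceId) {
  const base = normalizeText(content);
  return crypto.createHash('sha256').update(`${base}|${sourceId ?? ''}`).digest('hex');
}
```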
src/generator/generator_core.mjs CHANGED
@@ -22,11 +22,16 @@ export async function runGenerator(question, contextChunks, provider) {
     .replace('{{QUESTION}}', question)
     .replace('{{CONTEXT}}', ctxText);
 
-  const response = await provider.generate(prompt);
+  const response = await provider.generate(prompt, { includeJson: true });
 
   // Normalize provider output: string or { response, thinking }
   const raw = typeof response === 'string' ? response : response?.response ?? '';
   const thinkingObj = typeof response === 'object' && response?.thinking ? response.thinking : null;
+  const rawJson =
+    typeof response === 'object' && response?.fullResponse
+      ? (({ context, ...rest }) => rest)(response.fullResponse)
+      : null;
+
 
   let thought = null;
   let answer = raw?.trim?.() ?? raw;
@@ -192,7 +197,8 @@ export async function runGenerator(question, contextChunks, provider) {
     evidence,
     limitations,
     question,
-    context: contextChunks
+    context: contextChunks,
+    rawJson,
   };
 }
 
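The normalization above accepts either a plain string or the new object shape and strips the bulky `context` field from the full Ollama JSON. Restated as a standalone helper (the name `normalizeGeneratorResponse` is ours, not the module's):

```javascript
// Restates the raw/rawJson extraction from runGenerator: a string passes
// through as `raw`; an object contributes `response` plus a `fullResponse`
// copy with the token-heavy `context` field dropped.
function normalizeGeneratorResponse(response) {
  const raw = typeof response === 'string' ? response : response?.response ?? '';
  const rawJson =
    typeof response === 'object' && response?.fullResponse
      ? (({ context, ...rest }) => rest)(response.fullResponse)
      : null;
  return { raw, rawJson };
}
```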
src/pipeline/pipeline_cli.js CHANGED
@@ -18,12 +18,18 @@ const DEFAULT_OUT = path.join(
   'gold',
   'pipeline_gold.jsonl',
 );
+const DEFAULT_INSTRUCT_OUT = path.join(
+  PROJECT_ROOT,
+  'gold',
+  'pipeline_gold_instruct.jsonl',
+);
 
 function parseArgs(argv) {
   const args = argv.slice(2);
   let limit;
   let seedsPath;
   let outPath;
+  let outPathProvided = false;
   let verbose = true; // default verbose on
   let seedMode; // optional CLI override
   let chunkLimit;
@@ -39,6 +45,7 @@ function parseArgs(argv) {
       i++;
     } else if (a === '--out') {
       outPath = args[i + 1];
+      outPathProvided = true;
       i++;
     } else if (a === '--chunk-limit') {
       const v = Number(args[i + 1]);
@@ -64,6 +71,7 @@ function parseArgs(argv) {
     limit,
     seedsPath: seedsPath || DEFAULT_SEEDS,
     outPath: outPath || DEFAULT_OUT,
+    outPathProvided,
     verbose,
     seedMode,
     chunkLimit,
@@ -75,24 +83,54 @@ async function main() {
     limit,
     seedsPath,
     outPath,
+    outPathProvided,
     verbose,
     seedMode: cliSeedMode,
     chunkLimit,
   } = parseArgs(process.argv);
 
-  const generatorProvider = process.env.GENERATOR_PROVIDER || 'ollama';
-  const verifierProvider = process.env.VERIFIER_PROVIDER || generatorProvider;
-  const rewardProvider = process.env.REWARD_PROVIDER || generatorProvider;
+  const instructMode = (() => {
+    const v = process.env.INSTRUCT_PIPELINE;
+    if (!v) return false;
+    const s = String(v).toLowerCase();
+    return s === '1' || s === 'true' || s === 'yes';
+  })();
+
+  let effectiveOutPath = outPath;
+  if (instructMode && !outPathProvided && outPath === DEFAULT_OUT) {
+    effectiveOutPath = DEFAULT_INSTRUCT_OUT;
+  }
+
+  const baseGeneratorProvider = process.env.GENERATOR_PROVIDER || 'ollama';
+  const verifierProvider = process.env.VERIFIER_PROVIDER || baseGeneratorProvider;
+  const rewardProvider = process.env.REWARD_PROVIDER || baseGeneratorProvider;
+
+  const instructProvider =
+    process.env.INSTRUCT_GENERATOR_PROVIDER || baseGeneratorProvider;
+
+  const generatorModel = (() => {
+    const instructModel =
+      process.env.INSTRUCT_GENERATOR_MODEL ||
+      process.env.INSTRUCT_GENERATOR;
+    if (instructMode && instructModel) return instructModel;
+    return (
+      process.env.GENERATOR_MODEL ||
+      process.env.OLLAMA_MODEL ||
+      'qwen3-vl:8b-thinking'
+    );
+  })();
 
-  const generatorModel =
-    process.env.GENERATOR_MODEL ||
-    process.env.OLLAMA_MODEL ||
-    'qwen3-vl:8b-thinking';
   const verifierModel =
     process.env.VERIFIER_MODEL || generatorModel;
   const rewardModel =
     process.env.REWARD_MODEL || verifierModel;
 
+  if (instructMode) {
+    // steer pipeline stages to the instruct generator
+    process.env.GENERATOR_PROVIDER = instructProvider;
+    process.env.GENERATOR_MODEL = generatorModel;
+  }
+
   // Resolve mode: CLI > env > default
   const mode =
     cliSeedMode || process.env.PIPELINE_SEED_MODE || 'question-first';
@@ -101,10 +139,10 @@ async function main() {
   console.log('🚀 Starting Distillation Pipeline');
   console.log(`   Mode: ${mode}`);
   console.log(`   Seeds: ${seedsPath}`);
-  console.log(`   Output: ${outPath}`);
+  console.log(`   Output: ${effectiveOutPath}`);
   console.log(`   Providers:`);
   console.log(
-    `     generator: ${generatorProvider} (${generatorModel})`,
+    `     generator: ${instructMode ? instructProvider : baseGeneratorProvider} (${generatorModel})`,
   );
   console.log(
     `     verifier: ${verifierProvider} (${verifierModel})`,
@@ -127,7 +165,7 @@ async function main() {
   try {
     const result = await runPipelineBatch({
       seedsPath,
-      outPath,
+      outPath: effectiveOutPath,
       limit,
       chunkLimit,
       verbose,
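The instruct-mode toggle above only treats a handful of values as truthy. Restated standalone (the wrapper name `isInstructMode` is ours):

```javascript
// Mirrors the INSTRUCT_PIPELINE check in pipeline_cli.js: only '1',
// 'true', or 'yes' (case-insensitive) enable instruct mode.
function isInstructMode(value) {
  if (!value) return false;
  const s = String(value).toLowerCase();
  return s === '1' || s === 'true' || s === 'yes';
}
```

Note that values like `'on'` or `'enabled'` are deliberately ignored rather than erroring.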
src/pipeline/step.mjs CHANGED
@@ -203,6 +203,15 @@ export async function runPipelineStep({
         log(' [generator] raw response (JSON if parsable):');
         log(' ' + preview(rawDisplay, 2000).replace(/\n/g, '\n '));
       }
+      if (gen?.rawJson?.response) {
+        log(' [generator] ollama response text (full):');
+        log(' ' + preview(gen.rawJson.response, 2000).replace(/\n/g, '\n '));
+      }
+      if (gen?.rawJson) {
+        const jsonDisplay = JSON.stringify(gen.rawJson, null, 2);
+        log(' [generator] ollama full JSON:');
+        log(' ' + jsonDisplay.replace(/\n/g, '\n '));
+      }
     }
   } catch (e) {
     const msg = e?.message || String(e);
src/providers/ollama_provider.mjs CHANGED
@@ -57,7 +57,7 @@ export class OllamaProvider extends BaseProvider {
    * @param {string} prompt
    * @returns {Promise<string>} the model's response text
    */
-  async generate(prompt) {
+  async generate(prompt, { includeJson = false } = {}) {
     const url = `${this.baseUrl}/api/generate`;
 
     const body = {
@@ -82,6 +82,15 @@ export class OllamaProvider extends BaseProvider {
     }
 
     const data = await res.json();
+
+    if (includeJson) {
+      return {
+        response: data.response ?? '',
+        thinking: data.thinking,
+        fullResponse: data,
+      };
+    }
+
     // Standard Ollama /api/generate response uses `response`
     return data.response ?? '';
   }
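After this change, `generate()` has two return shapes depending on `includeJson`. The sketch below isolates that branching against a stubbed `/api/generate` payload instead of a live Ollama server; the helper name `shapeGenerateResult` is ours.

```javascript
// Sketch of the two return shapes OllamaProvider.generate() now supports,
// applied to an already-parsed /api/generate payload.
function shapeGenerateResult(data, { includeJson = false } = {}) {
  if (includeJson) {
    return {
      response: data.response ?? '',
      thinking: data.thinking,
      fullResponse: data,
    };
  }
  return data.response ?? ''; // legacy callers still get a plain string
}
```

Keeping the plain-string path as the default means existing callers (verifier, reward) are unaffected; only the generator opts into the richer object.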
tests/try_generator_prompt.test.mjs ADDED
@@ -0,0 +1,85 @@
+import { describe, it, expect } from 'vitest';
+import { mkdtempSync, writeFileSync, mkdirSync } from 'fs';
+import { tmpdir } from 'os';
+import path from 'path';
+import { execFileSync } from 'child_process';
+import { chunkIdFromContent } from '../src/pipeline/ids.mjs';
+
+describe('scripts/try_generator_prompt.sh', () => {
+  it('prints the generator response using cached chunk/question', () => {
+    const workdir = mkdtempSync(path.join(tmpdir(), 'try-gen-'));
+    const cacheDir = path.join(workdir, 'cache');
+    mkdirSync(cacheDir, { recursive: true });
+
+    const content = 'Chunk content for testing.';
+    const sourceId = 'source-1';
+    const chunkId = chunkIdFromContent(content, sourceId);
+
+    const questionsFile = path.join(cacheDir, 'questions.jsonl');
+    writeFileSync(
+      questionsFile,
+      JSON.stringify({
+        chunk_id: chunkId,
+        questions: ['What is being asked?'],
+        question_ids: ['q1'],
+        provider: 'mock',
+        model: 'mock',
+        ts: Date.now(),
+      }) + '\n',
+      'utf8',
+    );
+
+    const ragFile = path.join(workdir, 'rag.jsonl');
+    writeFileSync(
+      ragFile,
+      JSON.stringify({ id: sourceId, content }) + '\n',
+      'utf8',
+    );
+
+    const promptFile = path.join(workdir, 'prompt.txt');
+    writeFileSync(
+      promptFile,
+      'Q: {{QUESTION}}\nCTX: {{CONTEXT}}',
+      'utf8',
+    );
+
+    // Mock Ollama response via file://
+    const mockApiDir = path.join(workdir, 'mock-ollama', 'api');
+    mkdirSync(mockApiDir, { recursive: true });
+    const mockResponsePath = path.join(mockApiDir, 'generate');
+    writeFileSync(
+      mockResponsePath,
+      JSON.stringify({ response: 'mock generator answer' }),
+      'utf8',
+    );
+
+    const scriptPath = path.join(
+      path.dirname(new URL(import.meta.url).pathname),
+      '..',
+      'scripts',
+      'try_generator_prompt.sh',
+    );
+
+    const env = {
+      ...process.env,
+      PIPELINE_CACHE_DIR: cacheDir,
+      RAG_CHUNKS_PATH: ragFile,
+      PROMPT_FILE: promptFile,
+      OLLAMA_URL: `file://${path.join(workdir, 'mock-ollama')}`,
+      GENERATOR_MODEL: 'mock-model',
+    };
+
+    let output;
+    try {
+      output = execFileSync('bash', [scriptPath], {
+        env,
+        encoding: 'utf8',
+      });
+    } catch (err) {
+      output = err?.stdout?.toString?.() || '';
+    }
+
+    expect(output).toContain('mock generator answer');
+    expect(output).toContain(chunkId);
+  });
+});