htaf committed on
Commit
ecd21e2
·
1 Parent(s): b14f0ba

handoff stuff

AGENTS.md CHANGED
@@ -17,6 +17,7 @@
 - `REAL_ES=1 npm test` – exercise retrieval against a live Elasticsearch + embedding endpoint.
 - Red/green pathway: use `*_PROVIDER=mock` plus JSONL chunk source to dry-run (green) without models; switch to real providers for red runs and the cache will skip already-completed stages.
 - Verifier contract: models return JSON `{"REASONING": [...], "SCORE": <number|\"PASS\"|\"FAIL\">}`; SCORE >=0.5 or PASS → accepted. Prompt must remain unchanged; parsing is tolerant of the PASS/FAIL token format.
+- Generator output/logging: verbose runs show parsed `thought`, raw provider `thinking`, answer, confidence, evidence, limitations, and raw response (pretty-printed if JSON). Gold stores `answer`, `thought`, `raw`, `confidence`, `evidence`, `limitations`, `thinking`.
 
 ## Coding Style & Naming Conventions
 - ECMAScript modules (`type: "module"`); prefer `.mjs` for shared code.
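
The verifier contract quoted in the AGENTS.md hunk (SCORE >= 0.5 or PASS means accepted) can be illustrated with a minimal sketch. The function name `isAccepted` and the regex fallback for non-JSON output are assumptions made for illustration; the project's actual verifier parser is not part of this commit.

```javascript
// Hypothetical helper, not the project's real parser.
// Accepts the JSON contract {"REASONING": [...], "SCORE": <number|"PASS"|"FAIL">}
// and, as an assumed fallback, a bare PASS/FAIL token in noisy non-JSON output.
function isAccepted(rawVerifierOutput) {
  let score;
  try {
    score = JSON.parse(rawVerifierOutput)?.SCORE;
  } catch {
    // Not valid JSON: look for the first PASS or FAIL token in the text.
    const m = String(rawVerifierOutput).match(/\b(PASS|FAIL)\b/i);
    score = m ? m[1] : undefined;
  }
  if (typeof score === 'number') return score >= 0.5;
  if (typeof score === 'string') {
    const token = score.trim().toUpperCase();
    if (token === 'PASS') return true;
    if (token === 'FAIL') return false;
    // Numeric score delivered as a string, e.g. "0.8"
    const n = Number(token);
    return Number.isFinite(n) && n >= 0.5;
  }
  return false; // missing or unrecognized SCORE is treated as rejected
}

console.log(isAccepted('{"REASONING": ["ok"], "SCORE": 0.7}'));
console.log(isAccepted('PROMPT = PASS'));
```

Treating a missing or unrecognized SCORE as rejected is a conservative choice in this sketch; the real pipeline may retry instead.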
HANDOFF.md ADDED
@@ -0,0 +1,34 @@
+# Handoff Notes
+
+## Quick status
+- Pipeline: question-first by default (JSONL chunks), deterministic chunk IDs, cache-backed (questions/gens/verifications/rewards).
+- Generator parsing/logging: preserves provider `.thinking` plus parsed thought/answer/confidence/evidence/limitations; verbose logs show both parsed and raw (pretty-printed if JSON). Gold stores all of these fields.
+- Verifier: expects distributor prompt output; accepts `SCORE` as number or `PASS/FAIL` (even noisy `PROMPT = PASS`); raw transcript logged in verbose.
+- Scripts: `gold_preview.mjs` (shows thought/thinking/raw), `cache_report.mjs`, `regenerate_gold_from_cache.mjs`, `purge_mock_gold.mjs`.
+- Tests: passing in writable env (read-only sandboxes block Vitest temp/cache writes).
+
+## Running
+- Default verbose CLI: `npm run pipeline -- --limit N --verbose`.
+- Question-first mode: `PIPELINE_SEED_MODE=question-first npm run pipeline -- --limit N --verbose`.
+- Random walk over chunks: `PIPELINE_RANDOM_WALK=1` (or `PIPELINE_CHUNK_ORDER=random`) + optional `--chunk-limit`.
+- Question cap per chunk: `QUESTION_MAX_PER_CHUNK` (e.g., 3).
+- Preview gold: `node scripts/gold_preview.mjs --max-answer 2000 --limit 5`.
+- Regenerate gold from cache: `node scripts/regenerate_gold_from_cache.mjs`.
+- Purge mock Q1? gold: `node scripts/purge_mock_gold.mjs`.
+
+## What to watch
+- Cache/gold should stay out of git; regenerate gold via cache script.
+- Question provider must be reachable; otherwise question-first will spin through chunks with errors.
+- Verifier prompt locked; parsing tolerates PASS/FAIL tokens and logs raw output.
+- Generator prompt locked; parsing handles `.thinking` and Qwen-style answer blocks; thought/answer stay separate for verifier/reward.
+
+## Files of note
+- `src/generator/generator_core.mjs`: parses provider responses; carries `.thinking`, thought, answer, confidence, evidence, limitations.
+- `src/pipeline/step.mjs`: verbose logging of thought, thinking, answer, raw.
+- `prompts/*`: locked prompts (do not edit generator/verifier without intent).
+- `data/cache/*.jsonl`: intermediate cache (questions/gens/verifications/rewards); use `PIPELINE_CACHE_DIR` to redirect.
+- `gold/pipeline_gold.jsonl`: output; rebuild via cache if needed.
+
+## Caveats
+- Read-only environments will fail `npm test` due to /tmp and `.vite` writes; run tests where writes are allowed.
+- Long generator outputs can bloat verifier context; consider truncation or smaller verifier model if needed.
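
The last caveat in HANDOFF.md suggests truncating long generator outputs before they reach the verifier. One possible shape for that is sketched here; `truncateForVerifier`, the 4000-character default, and the marker text are all hypothetical and do not appear in the repository.

```javascript
// Hypothetical helper: keeps the head and tail of a long generator output
// and drops the middle, so the verifier still sees the opening reasoning
// and the final answer. The default budget is an arbitrary assumption.
function truncateForVerifier(text, maxChars = 4000) {
  const s = typeof text === 'string' ? text : (JSON.stringify(text) ?? '');
  if (s.length <= maxChars) return s;

  const marker = '\n[... truncated ...]\n';
  // Characters left for real content once the marker is accounted for.
  const keep = Math.max(0, maxChars - marker.length);
  const head = Math.ceil(keep / 2);
  const tail = keep - head;
  // slice(-0) would return the whole string, so guard the tail explicitly.
  return s.slice(0, head) + marker + (tail > 0 ? s.slice(-tail) : '');
}

const long = 'a'.repeat(100) + 'b'.repeat(100);
console.log(truncateForVerifier(long, 50).length);
```

Keeping both ends rather than only the head is a design choice: answer blocks tend to sit at the end of a response, so a head-only cut could remove exactly what the verifier needs to score.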
README.md CHANGED
@@ -82,7 +82,7 @@ All pure modules include Vitest coverage:
 * question generation
 * provider router
 * pipeline integration (mock)
-* JSONL cache, PASS/FAIL verifier parsing
+* JSONL cache, PASS/FAIL verifier parsing, generator parsing (thought/thinking/answer)
 
 ---
 
USAGE.md CHANGED
@@ -39,7 +39,7 @@ Run the default pipeline (static seeds):
 npm run pipeline -- --limit 10
 ```
 
-Verbose run:
+Verbose run (shows generator thought/thinking/answer/confidence/evidence/limitations and raw response):
 
 ```bash
 npm run pipeline -- --limit 10 --verbose
@@ -77,7 +77,7 @@ All accepted gold samples are written to:
 gold/pipeline_gold.jsonl
 ```
 
-Each entry includes:
+Each entry includes (generator sample contains answer, thought, raw, confidence, evidence, limitations, thinking):
 
 ```json
 {
scripts/gold_preview.mjs CHANGED
@@ -107,6 +107,12 @@ async function main() {
 
     const q = obj.question || '[no question]';
     const ans = obj.sample?.answer || obj.sample?.raw || '[no answer]';
+    const rawGen = obj.sample?.raw;
+    const thought = obj.sample?.thought;
+    const thinking = obj.sample?.thinking;
+    const confidence = obj.sample?.confidence ?? obj.sample?.confidence_level;
+    const evidence = obj.sample?.evidence;
+    const limitations = obj.sample?.limitations;
     const chunkId = obj.sourceChunkId || obj.context?.[0]?.id || '[unknown chunk]';
     const ctxSnippet = obj.context?.[0]?.content || obj.sourceChunk || '';
     const rew = obj.reward?.score ?? obj.reward?.ok;
@@ -117,6 +123,26 @@ async function main() {
     console.log(`Chunk: ${chunkId}`);
     console.log(`Q: ${preview(q, maxQuestion, full)}`);
     console.log(`A: ${preview(ans, maxAnswer, full)}`);
+    if (thought !== undefined) {
+      const tVal =
+        typeof thought === 'string'
+          ? thought
+          : JSON.stringify(thought, null, 2);
+      console.log(`Thought: ${preview(tVal, maxAnswer, full)}`);
+    }
+    if (rawGen !== undefined) {
+      console.log(`Raw: ${preview(rawGen, maxAnswer, full)}`);
+    }
+    if (confidence !== undefined) console.log(`Gen confidence: ${confidence}`);
+    if (evidence) console.log(`Evidence: ${preview(Array.isArray(evidence) ? evidence.join(' | ') : evidence, 400, full)}`);
+    if (limitations) console.log(`Limitations: ${preview(limitations, 200, full)}`);
+    if (thinking !== undefined) {
+      const tVal =
+        typeof thinking === 'string'
+          ? thinking
+          : JSON.stringify(thinking, null, 2);
+      console.log(`Thinking: ${preview(tVal, maxAnswer, full)}`);
+    }
     if (ctxSnippet) console.log(`Ctx: ${preview(ctxSnippet, maxContext, full)}`);
     if (verOk !== undefined) console.log(`Verifier ok: ${verOk}${verScore !== undefined ? ` (score: ${verScore})` : ''}`);
     if (rew !== undefined) console.log(`Reward: ${rew}`);
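
The `Thought` and `Thinking` blocks added to `scripts/gold_preview.mjs` apply the same rule: print strings as they are, pretty-print anything else as JSON. A small sketch of that rule as a shared helper follows. `toDisplayString` and `formatField` are hypothetical names, and the `preview` function here is only a stand-in, since the body of the script's real `preview` is not shown in the diff.

```javascript
// Stand-in for the script's `preview` helper (assumed behavior:
// cut to `max` characters unless `full` is set).
function preview(text, max, full = false) {
  const s = String(text ?? '');
  return full || s.length <= max ? s : s.slice(0, max) + '...';
}

// Hypothetical helper: strings pass through, other values become indented JSON.
function toDisplayString(value) {
  return typeof value === 'string' ? value : JSON.stringify(value, null, 2);
}

// Hypothetical helper: returns the formatted line, or null when the field is absent,
// mirroring the `!== undefined` checks in the diff.
function formatField(label, value, max, full = false) {
  if (value === undefined) return null;
  return `${label}: ${preview(toDisplayString(value), max, full)}`;
}

const sample = { thought: 'compare both quotes', thinking: { steps: ['read', 'answer'] } };
for (const [label, key] of [['Thought', 'thought'], ['Thinking', 'thinking'], ['Raw', 'raw']]) {
  const line = formatField(label, sample[key], 2000);
  if (line !== null) console.log(line);
}
```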
src/generator/generator_core.mjs CHANGED
@@ -58,6 +58,17 @@ export async function runGenerator(question, contextChunks, provider) {
     thought = thinkingObj;
   }
 
+  const extractThoughtBlock = (txt) => {
+    if (!txt || typeof txt !== 'string') return null;
+    const thoughtMatch = txt.match(/<\|thought\|>([\s\S]*?)<\|end_of_thought\|>/i);
+    if (thoughtMatch) return thoughtMatch[1].trim();
+
+    const understandingMatch = txt.match(/<understanding>[\s\S]*?(?=<\|answer\||<answer>|$)/i);
+    if (understandingMatch) return understandingMatch[0].trim();
+
+    return null;
+  };
+
   // Try parsing Qwen-style answer block first
   const parseAnswerBlock = (txt) => {
     if (!txt || typeof txt !== 'string') return null;
@@ -65,6 +76,29 @@ export async function runGenerator(question, contextChunks, provider) {
     const body = blockMatch ? blockMatch[1] : txt;
     const lines = body.split('\n').map((l) => l.trim()).filter(Boolean);
     const result = {};
+    // line-based fallbacks even without tags
+    const answerLine = txt.match(/^answer:\s*(.+)$/im);
+    if (answerLine) result.answer = answerLine[1].trim();
+    const confLine = txt.match(/^confidence:\s*(.+)$/im);
+    if (confLine) result.confidence = confLine[1].trim();
+    const evidenceLine = txt.match(/^evidence:\s*(.+)$/im);
+    if (evidenceLine) {
+      const evLine = evidenceLine[1].trim();
+      let ev = [];
+      const arrMatch = evLine.match(/\[(.*)\]/);
+      if (arrMatch) {
+        ev = arrMatch[1]
+          .split(/,(?=(?:[^'"]|'[^']*'|"[^"]*")*$)/)
+          .map((s) => s.replace(/^["'\s]+|["'\s]+$/g, ''))
+          .filter(Boolean);
+      } else {
+        ev = evLine.split(',').map((s) => s.replace(/^["'\s]+|["'\s]+$/g, '')).filter(Boolean);
+      }
+      result.evidence = ev;
+    }
+    const limLine = txt.match(/^limitations?:\s*(.+)$/im);
+    if (limLine) result.limitations = limLine[1].trim();
+
     for (const line of lines) {
       if (/^confidence:/i.test(line)) {
         const val = line.split(':')[1]?.trim();
@@ -98,6 +132,10 @@ export async function runGenerator(question, contextChunks, provider) {
     confidence = blockParsed.confidence ?? confidence;
     evidence = blockParsed.evidence ?? evidence;
     limitations = blockParsed.limitations ?? limitations;
+    if (!thought) {
+      const t = extractThoughtBlock(raw);
+      if (t) thought = t;
+    }
   } else {
     // fallback: parse JSON if it's actually JSON
     const parsed = safeParse(raw);
@@ -127,6 +165,9 @@ export async function runGenerator(question, contextChunks, provider) {
       if (parsed.evidence) evidence = parsed.evidence;
       if (parsed.limitations) limitations = parsed.limitations;
     } else {
+      // fallback: extract thought block or <think>
+      const tBlock = extractThoughtBlock(raw);
+      if (tBlock) thought = tBlock;
       // fallback: extract visible chain-of-thought tags if present
       const thinkMatch = typeof raw === 'string'
         ? raw.match(/<think>([\s\S]*?)<\/think>/i)
@@ -138,8 +179,13 @@ export async function runGenerator(question, contextChunks, provider) {
     }
   }
 
+  if (!thought && raw) {
+    thought = raw;
+  }
+
   return {
     raw,
+    thinking: thinkingObj,
     thought,
     answer,
     confidence,
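
The evidence fallback added to `parseAnswerBlock` splits on commas that fall outside quoted strings. The sketch below lifts that logic into a standalone function so its behavior can be tried in isolation. The wrapper name `parseEvidenceLine` is hypothetical; the two regular expressions are copied from the diff.

```javascript
// Hypothetical wrapper around the evidence-line logic shown in the diff.
// `evLine` is the text after "Evidence:", for example '["q1 (loc1)", "q2 (loc2)"]'.
function parseEvidenceLine(evLine) {
  // Strip surrounding quotes and whitespace from one item.
  const stripQuotes = (s) => s.replace(/^["'\s]+|["'\s]+$/g, '');

  const arrMatch = evLine.match(/\[(.*)\]/);
  const parts = arrMatch
    // Bracketed list: split only on commas followed by balanced quotes to the end,
    // so a comma inside "a, b" does not split the item.
    ? arrMatch[1].split(/,(?=(?:[^'"]|'[^']*'|"[^"]*")*$)/)
    // No brackets: plain comma split.
    : evLine.split(',');

  return parts.map(stripQuotes).filter(Boolean);
}

console.log(parseEvidenceLine('["quote1 (loc1)", "quote2 (loc2)"]'));
// → [ 'quote1 (loc1)', 'quote2 (loc2)' ]
console.log(parseEvidenceLine('["a, b", "c"]'));
// → [ 'a, b', 'c' ]
```

The unbracketed branch has no quote awareness, so an item containing a comma is only preserved when the provider emits the bracketed form.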
src/pipeline/batch.mjs CHANGED
@@ -131,6 +131,7 @@ export async function runPipelineBatch({
         confidence: result.gen?.confidence,
         evidence: result.gen?.evidence,
         limitations: result.gen?.limitations,
+        thinking: result.gen?.thinking,
       },
       verifier: result.ver,
       reward: result.rew,
src/pipeline/step.mjs CHANGED
@@ -165,6 +165,14 @@ export async function runPipelineStep({
         log(' [generator] thought:');
         log(' ' + preview(thoughtPreview, 500).replace(/\n/g, '\n '));
       }
+      if (gen?.thinking) {
+        const thinkingPreview =
+          typeof gen.thinking === 'string'
+            ? gen.thinking
+            : JSON.stringify(gen.thinking, null, 2);
+        log(' [generator] thinking (raw from provider):');
+        log(' ' + preview(thinkingPreview, 500).replace(/\n/g, '\n '));
+      }
       log(' [generator] answer:');
       log(' ' + preview(gen?.answer ?? '', 400).replace(/\n/g, '\n '));
       if (gen?.confidence) {
@@ -184,6 +192,17 @@ export async function runPipelineStep({
       if (gen?.limitations) {
         log(' [generator] limitations: ' + preview(gen.limitations, 200));
       }
+      if (gen?.raw) {
+        let rawDisplay = gen.raw;
+        try {
+          const parsed = JSON.parse(gen.raw);
+          rawDisplay = JSON.stringify(parsed, null, 2);
+        } catch {
+          // leave as string
+        }
+        log(' [generator] raw response (JSON if parsable):');
+        log(' ' + preview(rawDisplay, 2000).replace(/\n/g, '\n '));
+      }
     }
   } catch (e) {
     const msg = e?.message || String(e);
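
The raw-response logging added to `src/pipeline/step.mjs` pretty-prints the response when it parses as JSON and otherwise leaves it untouched. The same idea as a standalone function, under the assumption that `raw` is a string; the name `prettyIfJson` is hypothetical.

```javascript
// Hypothetical helper mirroring the try/catch in the diff:
// valid JSON is re-serialized with two-space indentation,
// anything else is returned unchanged.
function prettyIfJson(raw) {
  try {
    return JSON.stringify(JSON.parse(raw), null, 2);
  } catch {
    return raw; // not JSON, keep the original text
  }
}

console.log(prettyIfJson('{"answer":"42","confidence":"High"}'));
console.log(prettyIfJson('Answer: 42'));
```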
state_of_project.md CHANGED
@@ -6,6 +6,7 @@
 - Verifier parsing tolerates distributor format (`SCORE` as number or `PASS`/`FAIL` with noisy prefixes); caching and retry logic in place.
 - Tests: 42 passing (retrieval mock/real, generator, verifier, reward, pipeline behaviour, cache, full mock pipeline).
 - CLI defaults: verbose on, question-first, JSONL chunks; chunk/question limits respected.
+- Generator parsing/logging: preserves provider `.thinking` (structured) and parsed thought/answer/confidence/evidence/limitations; verbose mode prints both parsed and raw (JSON pretty if parsable). Gold stores all generator fields.
 
 ## What needs attention
 - Real pipeline currently fails at question generation when Ollama/question model is unreachable; run requires a live Ollama with the specified model pulled.
tests/generator_core.test.mjs CHANGED
@@ -80,8 +80,8 @@ The final answer derived from the context.`;
     );
 
     expect(result.raw).toBe('Just a direct answer with no visible reasoning.');
-    // No JSON or think tags means thought=null and answer = full output
-    expect(result.thought).toBeNull();
+    // No JSON or think tags means thought falls back to raw
+    expect(result.thought).toBe('Just a direct answer with no visible reasoning.');
     expect(result.answer).toBe('Just a direct answer with no visible reasoning.');
   });
 
@@ -102,4 +102,21 @@ The final answer derived from the context.`;
     expect(result.evidence).toEqual(['quote1 (loc1)', 'quote2 (loc2)']);
     expect(result.limitations).toBe('None');
   });
+
+  it('parses legacy reasoning tags without answer block', async () => {
+    const fakeContext = [{ content: 'ctx' }];
+    const provider = {
+      generate: vi.fn(async () =>
+        `<understanding>step A</understanding>\n<reasoning_chain>step B</reasoning_chain>\nConfidence: Medium\nAnswer: Legacy answer\nEvidence: ["ev1 (loc)"]\nLimitations: None`
+      ),
+    };
+
+    const result = await runGenerator('Test?', fakeContext, provider);
+
+    expect(typeof result.thought).toBe('string');
+    expect(result.answer).toBe('Legacy answer');
+    expect(result.confidence).toBe('Medium');
+    expect(result.evidence).toEqual(['ev1 (loc)']);
+    expect(result.limitations).toBe('None');
+  });
 });
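
The legacy-tags test only asserts that `thought` is a string. Tracing `extractThoughtBlock` in isolation on the same input suggests why a stricter assertion would be awkward. The function below is copied from the `generator_core.mjs` diff; the rest of `runGenerator` is not reproduced, so this shows only what the helper returns, not necessarily the final `thought` value.

```javascript
// Copied from the generator_core.mjs diff.
const extractThoughtBlock = (txt) => {
  if (!txt || typeof txt !== 'string') return null;
  const thoughtMatch = txt.match(/<\|thought\|>([\s\S]*?)<\|end_of_thought\|>/i);
  if (thoughtMatch) return thoughtMatch[1].trim();

  const understandingMatch = txt.match(/<understanding>[\s\S]*?(?=<\|answer\||<answer>|$)/i);
  if (understandingMatch) return understandingMatch[0].trim();

  return null;
};

// Same provider output as the legacy-tags test.
const legacy =
  '<understanding>step A</understanding>\n<reasoning_chain>step B</reasoning_chain>\n' +
  'Confidence: Medium\nAnswer: Legacy answer\nEvidence: ["ev1 (loc)"]\nLimitations: None';

// With no <answer> or <|answer| tag to stop at, the lazy match runs to the
// end of the input, so the helper returns the whole response.
console.log(extractThoughtBlock(legacy) === legacy); // → true

// When an <answer> tag is present, the match stops just before it.
console.log(extractThoughtBlock('<understanding>a</understanding><answer>b</answer>'));
// → <understanding>a</understanding>
```

For tag-less legacy output the extracted block therefore still contains the `Confidence:` and `Answer:` lines, which may matter given the HANDOFF note that thought and answer stay separate for the verifier and reward steps.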