handoff stuff

Files changed (10)

AGENTS.md +1 -0
HANDOFF.md +34 -0
README.md +1 -1
USAGE.md +2 -2
scripts/gold_preview.mjs +26 -0
src/generator/generator_core.mjs +46 -0
src/pipeline/batch.mjs +1 -0
src/pipeline/step.mjs +19 -0
state_of_project.md +1 -0
tests/generator_core.test.mjs +19 -2

AGENTS.md CHANGED

@@ -17,6 +17,7 @@
 - `REAL_ES=1 npm test` – exercise retrieval against a live Elasticsearch + embedding endpoint.
 - Red/green pathway: use `*_PROVIDER=mock` plus JSONL chunk source to dry-run (green) without models; switch to real providers for red runs and the cache will skip already-completed stages.
 - Verifier contract: models return JSON `{"REASONING": [...], "SCORE": <number|\"PASS\"|\"FAIL\">}`; SCORE >=0.5 or PASS → accepted. Prompt must remain unchanged; parsing is tolerant of the PASS/FAIL token format.
 ## Coding Style & Naming Conventions
 - ECMAScript modules (`type: "module"`); prefer `.mjs` for shared code.

 - `REAL_ES=1 npm test` – exercise retrieval against a live Elasticsearch + embedding endpoint.
 - Red/green pathway: use `*_PROVIDER=mock` plus JSONL chunk source to dry-run (green) without models; switch to real providers for red runs and the cache will skip already-completed stages.
 - Verifier contract: models return JSON `{"REASONING": [...], "SCORE": <number|\"PASS\"|\"FAIL\">}`; SCORE >=0.5 or PASS → accepted. Prompt must remain unchanged; parsing is tolerant of the PASS/FAIL token format.
+- Generator output/logging: verbose runs show parsed `thought`, raw provider `thinking`, answer, confidence, evidence, limitations, and raw response (pretty-printed if JSON). Gold stores `answer`, `thought`, `raw`, `confidence`, `evidence`, `limitations`, `thinking`.
 ## Coding Style & Naming Conventions
 - ECMAScript modules (`type: "module"`); prefer `.mjs` for shared code.
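
The verifier contract above (numeric `SCORE` >= 0.5 or `PASS` is accepted, with tolerance for noisy PASS/FAIL tokens) could be sketched roughly as follows. This is only an illustration of the acceptance rule as described in the notes; `parseVerifierScore` is a hypothetical name and not the repository's actual parser, and the handling of numeric strings is an assumption.

```javascript
// Hypothetical helper illustrating the acceptance rule; not the repo's real parser.
function parseVerifierScore(rawOutput) {
  let score;
  try {
    // Strict path: {"REASONING": [...], "SCORE": <number | "PASS" | "FAIL">}
    const parsed = JSON.parse(rawOutput);
    score = parsed?.SCORE;
  } catch {
    // Tolerant path: scan for a bare PASS/FAIL token, so noisy output
    // such as "PROMPT = PASS" is still recognised.
    const token = String(rawOutput).match(/\b(PASS|FAIL)\b/i);
    score = token ? token[1] : undefined;
  }

  if (typeof score === 'number') return { score, ok: score >= 0.5 };
  if (typeof score === 'string') {
    const upper = score.trim().toUpperCase();
    if (upper === 'PASS') return { score: upper, ok: true };
    if (upper === 'FAIL') return { score: upper, ok: false };
    // Assumption: numeric strings such as "0.8" are treated like numbers.
    const num = Number(upper);
    if (upper !== '' && !Number.isNaN(num)) return { score: num, ok: num >= 0.5 };
  }
  // Anything unrecognised is rejected rather than guessed at.
  return { score: null, ok: false };
}
```

Rejecting unrecognised output by default keeps a malformed verifier reply from being written to gold by accident.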

HANDOFF.md ADDED

@@ -0,0 +1,34 @@

+# Handoff Notes
+## Quick status
+- Pipeline: question-first by default (JSONL chunks), deterministic chunk IDs, cache-backed (questions/gens/verifications/rewards).
+- Generator parsing/logging: preserves provider `.thinking` plus parsed thought/answer/confidence/evidence/limitations; verbose logs show both parsed and raw (pretty-printed if JSON). Gold stores all of these fields.
+- Verifier: expects distributor prompt output; accepts `SCORE` as number or `PASS/FAIL` (even noisy `PROMPT = PASS`); raw transcript logged in verbose.
+- Scripts: `gold_preview.mjs` (shows thought/thinking/raw), `cache_report.mjs`, `regenerate_gold_from_cache.mjs`, `purge_mock_gold.mjs`.
+- Tests: passing in writable env (read-only sandboxes block Vitest temp/cache writes).
+## Running
+- Default verbose CLI: `npm run pipeline -- --limit N --verbose`.
+- Question-first mode: `PIPELINE_SEED_MODE=question-first npm run pipeline -- --limit N --verbose`.
+- Random walk over chunks: `PIPELINE_RANDOM_WALK=1` (or `PIPELINE_CHUNK_ORDER=random`) + optional `--chunk-limit`.
+- Question cap per chunk: `QUESTION_MAX_PER_CHUNK` (e.g., 3).
+- Preview gold: `node scripts/gold_preview.mjs --max-answer 2000 --limit 5`.
+- Regenerate gold from cache: `node scripts/regenerate_gold_from_cache.mjs`.
+- Purge mock Q1? gold: `node scripts/purge_mock_gold.mjs`.
+## What to watch
+- Cache/gold should stay out of git; regenerate gold via cache script.
+- Question provider must be reachable; otherwise question-first will spin through chunks with errors.
+- Verifier prompt locked; parsing tolerates PASS/FAIL tokens and logs raw output.
+- Generator prompt locked; parsing handles `.thinking` and Qwen-style answer blocks; thought/answer stay separate for verifier/reward.
+## Files of note
+- `src/generator/generator_core.mjs`: parses provider responses; carries `.thinking`, thought, answer, confidence, evidence, limitations.
+- `src/pipeline/step.mjs`: verbose logging of thought, thinking, answer, raw.
+- `prompts/*`: locked prompts (do not edit generator/verifier without intent).
+- `data/cache/*.jsonl`: intermediate cache (questions/gens/verifications/rewards); use `PIPELINE_CACHE_DIR` to redirect.
+- `gold/pipeline_gold.jsonl`: output; rebuild via cache if needed.
+## Caveats
+- Read-only environments will fail `npm test` due to /tmp and `.vite` writes; run tests where writes are allowed.
+- Long generator outputs can bloat verifier context; consider truncation or smaller verifier model if needed.

README.md CHANGED

@@ -82,7 +82,7 @@ All pure modules include Vitest coverage:
 * question generation
 * provider router
 * pipeline integration (mock)
-* JSONL cache, PASS/FAIL verifier parsing
 ---

 * question generation
 * provider router
 * pipeline integration (mock)
+* JSONL cache, PASS/FAIL verifier parsing, generator parsing (thought/thinking/answer)
 ---

USAGE.md CHANGED

@@ -39,7 +39,7 @@ Run the default pipeline (static seeds):
 npm run pipeline -- --limit 10
 ```
-Verbose run:
 ```bash
 npm run pipeline -- --limit 10 --verbose
@@ -77,7 +77,7 @@ All accepted gold samples are written to:
 gold/pipeline_gold.jsonl
 ```
-Each entry includes:
 ```json
 {

 npm run pipeline -- --limit 10
 ```
+Verbose run (shows generator thought/thinking/answer/confidence/evidence/limitations and raw response):
 ```bash
 npm run pipeline -- --limit 10 --verbose
 gold/pipeline_gold.jsonl
 ```
+Each entry includes (generator sample contains answer, thought, raw, confidence, evidence, limitations, thinking):
 ```json
 {

scripts/gold_preview.mjs CHANGED

@@ -107,6 +107,12 @@ async function main() {
     const q = obj.question || '[no question]';
     const ans = obj.sample?.answer || obj.sample?.raw || '[no answer]';
     const chunkId = obj.sourceChunkId || obj.context?.[0]?.id || '[unknown chunk]';
     const ctxSnippet = obj.context?.[0]?.content || obj.sourceChunk || '';
     const rew = obj.reward?.score ?? obj.reward?.ok;
@@ -117,6 +123,26 @@ async function main() {
     console.log(`Chunk: ${chunkId}`);
     console.log(`Q: ${preview(q, maxQuestion, full)}`);
     console.log(`A: ${preview(ans, maxAnswer, full)}`);
     if (ctxSnippet) console.log(`Ctx: ${preview(ctxSnippet, maxContext, full)}`);
     if (verOk !== undefined) console.log(`Verifier ok: ${verOk}${verScore !== undefined ? ` (score: ${verScore})` : ''}`);
     if (rew !== undefined) console.log(`Reward: ${rew}`);

     const q = obj.question || '[no question]';
     const ans = obj.sample?.answer || obj.sample?.raw || '[no answer]';
+    const rawGen = obj.sample?.raw;
+    const thought = obj.sample?.thought;
+    const thinking = obj.sample?.thinking;
+    const confidence = obj.sample?.confidence ?? obj.sample?.confidence_level;
+    const evidence = obj.sample?.evidence;
+    const limitations = obj.sample?.limitations;
     const chunkId = obj.sourceChunkId || obj.context?.[0]?.id || '[unknown chunk]';
     const ctxSnippet = obj.context?.[0]?.content || obj.sourceChunk || '';
     const rew = obj.reward?.score ?? obj.reward?.ok;
     console.log(`Chunk: ${chunkId}`);
     console.log(`Q: ${preview(q, maxQuestion, full)}`);
     console.log(`A: ${preview(ans, maxAnswer, full)}`);
+    if (thought !== undefined) {
+      const tVal =
+        typeof thought === 'string'
+          ? thought
+          : JSON.stringify(thought, null, 2);
+      console.log(`Thought: ${preview(tVal, maxAnswer, full)}`);
+    }
+    if (rawGen !== undefined) {
+      console.log(`Raw: ${preview(rawGen, maxAnswer, full)}`);
+    }
+    if (confidence !== undefined) console.log(`Gen confidence: ${confidence}`);
+    if (evidence) console.log(`Evidence: ${preview(Array.isArray(evidence) ? evidence.join(' | ') : evidence, 400, full)}`);
+    if (limitations) console.log(`Limitations: ${preview(limitations, 200, full)}`);
+    if (thinking !== undefined) {
+      const tVal =
+        typeof thinking === 'string'
+          ? thinking
+          : JSON.stringify(thinking, null, 2);
+      console.log(`Thinking: ${preview(tVal, maxAnswer, full)}`);
+    }
     if (ctxSnippet) console.log(`Ctx: ${preview(ctxSnippet, maxContext, full)}`);
     if (verOk !== undefined) console.log(`Verifier ok: ${verOk}${verScore !== undefined ? ` (score: ${verScore})` : ''}`);
     if (rew !== undefined) console.log(`Reward: ${rew}`);

src/generator/generator_core.mjs CHANGED

@@ -58,6 +58,17 @@ export async function runGenerator(question, contextChunks, provider) {
     thought = thinkingObj;
   }
   // Try parsing Qwen-style answer block first
   const parseAnswerBlock = (txt) => {
     if (!txt || typeof txt !== 'string') return null;
@@ -65,6 +76,29 @@ export async function runGenerator(question, contextChunks, provider) {
     const body = blockMatch ? blockMatch[1] : txt;
     const lines = body.split('\n').map((l) => l.trim()).filter(Boolean);
     const result = {};
     for (const line of lines) {
       if (/^confidence:/i.test(line)) {
         const val = line.split(':')[1]?.trim();
@@ -98,6 +132,10 @@ export async function runGenerator(question, contextChunks, provider) {
     confidence = blockParsed.confidence ?? confidence;
     evidence = blockParsed.evidence ?? evidence;
     limitations = blockParsed.limitations ?? limitations;
   } else {
     // fallback: parse JSON if it's actually JSON
     const parsed = safeParse(raw);
@@ -127,6 +165,9 @@ export async function runGenerator(question, contextChunks, provider) {
       if (parsed.evidence) evidence = parsed.evidence;
       if (parsed.limitations) limitations = parsed.limitations;
     } else {
       // fallback: extract visible chain-of-thought tags if present
       const thinkMatch = typeof raw === 'string'
         ? raw.match(/<think>([\s\S]*?)<\/think>/i)
@@ -138,8 +179,13 @@ export async function runGenerator(question, contextChunks, provider) {
     }
   }
   return {
     raw,
     thought,
     answer,
     confidence,

     thought = thinkingObj;
   }
+  const extractThoughtBlock = (txt) => {
+    if (!txt || typeof txt !== 'string') return null;
+    const thoughtMatch = txt.match(/<\|thought\|>([\s\S]*?)<\|end_of_thought\|>/i);
+    if (thoughtMatch) return thoughtMatch[1].trim();
+    const understandingMatch = txt.match(/<understanding>[\s\S]*?(?=<\|answer\||<answer>|$)/i);
+    if (understandingMatch) return understandingMatch[0].trim();
+    return null;
+  };
   // Try parsing Qwen-style answer block first
   const parseAnswerBlock = (txt) => {
     if (!txt || typeof txt !== 'string') return null;
     const body = blockMatch ? blockMatch[1] : txt;
     const lines = body.split('\n').map((l) => l.trim()).filter(Boolean);
     const result = {};
+    // line-based fallbacks even without tags
+    const answerLine = txt.match(/^answer:\s*(.+)$/im);
+    if (answerLine) result.answer = answerLine[1].trim();
+    const confLine = txt.match(/^confidence:\s*(.+)$/im);
+    if (confLine) result.confidence = confLine[1].trim();
+    const evidenceLine = txt.match(/^evidence:\s*(.+)$/im);
+    if (evidenceLine) {
+      const evLine = evidenceLine[1].trim();
+      let ev = [];
+      const arrMatch = evLine.match(/\[(.*)\]/);
+      if (arrMatch) {
+        ev = arrMatch[1]
+          .split(/,(?=(?:[^'"]|'[^']*'|"[^"]*")*$)/)
+          .map((s) => s.replace(/^["'\s]+|["'\s]+$/g, ''))
+          .filter(Boolean);
+      } else {
+        ev = evLine.split(',').map((s) => s.replace(/^["'\s]+|["'\s]+$/g, '')).filter(Boolean);
+      }
+      result.evidence = ev;
+    }
+    const limLine = txt.match(/^limitations?:\s*(.+)$/im);
+    if (limLine) result.limitations = limLine[1].trim();
     for (const line of lines) {
       if (/^confidence:/i.test(line)) {
         const val = line.split(':')[1]?.trim();
     confidence = blockParsed.confidence ?? confidence;
     evidence = blockParsed.evidence ?? evidence;
     limitations = blockParsed.limitations ?? limitations;
+    if (!thought) {
+      const t = extractThoughtBlock(raw);
+      if (t) thought = t;
+    }
   } else {
     // fallback: parse JSON if it's actually JSON
     const parsed = safeParse(raw);
       if (parsed.evidence) evidence = parsed.evidence;
       if (parsed.limitations) limitations = parsed.limitations;
     } else {
+      // fallback: extract thought block or <think>
+      const tBlock = extractThoughtBlock(raw);
+      if (tBlock) thought = tBlock;
       // fallback: extract visible chain-of-thought tags if present
       const thinkMatch = typeof raw === 'string'
         ? raw.match(/<think>([\s\S]*?)<\/think>/i)
     }
   }
+  if (!thought && raw) {
+    thought = raw;
+  }
   return {
     raw,
+    thinking: thinkingObj,
     thought,
     answer,
     confidence,
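
To make the thought fallback in this file easier to follow, here is a standalone sketch of the order the hunks above suggest for plain-text responses: tagged thought block, then `<think>` tags, then the raw text itself. `extractThoughtBlock` is copied from the diff; `resolveThought` is a hypothetical wrapper, and the relative priority of the tagged block versus `<think>` is an assumption, since the full function body is not visible here.

```javascript
// Copied from the diff: pulls a tagged reasoning block out of raw text.
function extractThoughtBlock(txt) {
  if (!txt || typeof txt !== 'string') return null;
  const thoughtMatch = txt.match(/<\|thought\|>([\s\S]*?)<\|end_of_thought\|>/i);
  if (thoughtMatch) return thoughtMatch[1].trim();
  // Legacy tags: take everything from <understanding> up to the answer tag
  // (or the end of the text when no answer tag follows).
  const understandingMatch = txt.match(/<understanding>[\s\S]*?(?=<\|answer\||<answer>|$)/i);
  if (understandingMatch) return understandingMatch[0].trim();
  return null;
}

// Hypothetical wrapper showing the fallback chain for non-JSON responses.
function resolveThought(raw) {
  const tagged = extractThoughtBlock(raw);
  if (tagged) return tagged;
  const think = typeof raw === 'string' ? raw.match(/<think>([\s\S]*?)<\/think>/i) : null;
  if (think) return think[1].trim();
  // Last resort, matching the new `if (!thought && raw)` branch in the diff.
  return raw || null;
}
```

One consequence worth noting: with the final fallback, a response with no visible reasoning yields `thought === raw`, which is what the updated test in `tests/generator_core.test.mjs` asserts.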

src/pipeline/batch.mjs CHANGED

@@ -131,6 +131,7 @@ export async function runPipelineBatch({
               confidence: result.gen?.confidence,
               evidence: result.gen?.evidence,
               limitations: result.gen?.limitations,
             },
             verifier: result.ver,
             reward: result.rew,

               confidence: result.gen?.confidence,
               evidence: result.gen?.evidence,
               limitations: result.gen?.limitations,
+              thinking: result.gen?.thinking,
             },
             verifier: result.ver,
             reward: result.rew,

src/pipeline/step.mjs CHANGED

@@ -165,6 +165,14 @@ export async function runPipelineStep({
           log('   [generator] thought:');
           log('   ' + preview(thoughtPreview, 500).replace(/\n/g, '\n   '));
         }
         log('   [generator] answer:');
         log('   ' + preview(gen?.answer ?? '', 400).replace(/\n/g, '\n   '));
         if (gen?.confidence) {
@@ -184,6 +192,17 @@ export async function runPipelineStep({
         if (gen?.limitations) {
           log('   [generator] limitations: ' + preview(gen.limitations, 200));
         }
       }
     } catch (e) {
       const msg = e?.message || String(e);

           log('   [generator] thought:');
           log('   ' + preview(thoughtPreview, 500).replace(/\n/g, '\n   '));
         }
+        if (gen?.thinking) {
+          const thinkingPreview =
+            typeof gen.thinking === 'string'
+              ? gen.thinking
+              : JSON.stringify(gen.thinking, null, 2);
+          log('   [generator] thinking (raw from provider):');
+          log('   ' + preview(thinkingPreview, 500).replace(/\n/g, '\n   '));
+        }
         log('   [generator] answer:');
         log('   ' + preview(gen?.answer ?? '', 400).replace(/\n/g, '\n   '));
         if (gen?.confidence) {
         if (gen?.limitations) {
           log('   [generator] limitations: ' + preview(gen.limitations, 200));
         }
+        if (gen?.raw) {
+          let rawDisplay = gen.raw;
+          try {
+            const parsed = JSON.parse(gen.raw);
+            rawDisplay = JSON.stringify(parsed, null, 2);
+          } catch {
+            // leave as string
+          }
+          log('   [generator] raw response (JSON if parsable):');
+          log('   ' + preview(rawDisplay, 2000).replace(/\n/g, '\n   '));
+        }
       }
     } catch (e) {
       const msg = e?.message || String(e);
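
The "pretty-print if JSON" step added above could also be factored into a small helper. The sketch below is only a suggestion: `formatRawForLog` is a hypothetical name, and the handling of non-string and missing values is an assumption that goes slightly beyond the inline version in the diff.

```javascript
// Hypothetical helper: returns a log-friendly string for a raw provider response.
function formatRawForLog(raw) {
  if (raw == null) return '';
  // Assumption: a non-string raw value (e.g. an already-parsed object) is serialised directly.
  if (typeof raw !== 'string') return JSON.stringify(raw, null, 2);
  try {
    // Parsable JSON is re-serialised with two-space indentation.
    return JSON.stringify(JSON.parse(raw), null, 2);
  } catch {
    // Not JSON: leave the text untouched.
    return raw;
  }
}
```

Always returning a string here means the caller can pass the result straight to a string-based truncation helper.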

state_of_project.md CHANGED

@@ -6,6 +6,7 @@
 - Verifier parsing tolerates distributor format (`SCORE` as number or `PASS`/`FAIL` with noisy prefixes); caching and retry logic in place.
 - Tests: 42 passing (retrieval mock/real, generator, verifier, reward, pipeline behaviour, cache, full mock pipeline).
 - CLI defaults: verbose on, question-first, JSONL chunks; chunk/question limits respected.
 ## What needs attention
 - Real pipeline currently fails at question generation when Ollama/question model is unreachable; run requires a live Ollama with the specified model pulled.

 - Verifier parsing tolerates distributor format (`SCORE` as number or `PASS`/`FAIL` with noisy prefixes); caching and retry logic in place.
 - Tests: 42 passing (retrieval mock/real, generator, verifier, reward, pipeline behaviour, cache, full mock pipeline).
 - CLI defaults: verbose on, question-first, JSONL chunks; chunk/question limits respected.
+- Generator parsing/logging: preserves provider `.thinking` (structured) and parsed thought/answer/confidence/evidence/limitations; verbose mode prints both parsed and raw (JSON pretty if parsable). Gold stores all generator fields.
 ## What needs attention
 - Real pipeline currently fails at question generation when Ollama/question model is unreachable; run requires a live Ollama with the specified model pulled.

tests/generator_core.test.mjs CHANGED

@@ -80,8 +80,8 @@ The final answer derived from the context.`;
     );
     expect(result.raw).toBe('Just a direct answer with no visible reasoning.');
-    // No JSON or think tags means thought=null and answer = full output
-    expect(result.thought).toBeNull();
     expect(result.answer).toBe('Just a direct answer with no visible reasoning.');
   });
@@ -102,4 +102,21 @@ The final answer derived from the context.`;
     expect(result.evidence).toEqual(['quote1 (loc1)', 'quote2 (loc2)']);
     expect(result.limitations).toBe('None');
   });
 });

     );
     expect(result.raw).toBe('Just a direct answer with no visible reasoning.');
+    // No JSON or think tags means thought falls back to raw
+    expect(result.thought).toBe('Just a direct answer with no visible reasoning.');
     expect(result.answer).toBe('Just a direct answer with no visible reasoning.');
   });
     expect(result.evidence).toEqual(['quote1 (loc1)', 'quote2 (loc2)']);
     expect(result.limitations).toBe('None');
   });
+  it('parses legacy reasoning tags without answer block', async () => {
+    const fakeContext = [{ content: 'ctx' }];
+    const provider = {
+      generate: vi.fn(async () =>
+        `<understanding>step A</understanding>\n<reasoning_chain>step B</reasoning_chain>\nConfidence: Medium\nAnswer: Legacy answer\nEvidence: ["ev1 (loc)"]\nLimitations: None`
+      ),
+    };
+    const result = await runGenerator('Test?', fakeContext, provider);
+    expect(typeof result.thought).toBe('string');
+    expect(result.answer).toBe('Legacy answer');
+    expect(result.confidence).toBe('Medium');
+    expect(result.evidence).toEqual(['ev1 (loc)']);
+    expect(result.limitations).toBe('None');
+  });
 });
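
The `evidence` assertion in the new test depends on the line-based evidence parsing added in `generator_core.mjs`. As a rough standalone illustration, the same logic can be isolated like this; `parseEvidenceLine` is a hypothetical name, while the regexes are taken from the diff.

```javascript
// Hypothetical standalone version of the evidence-line parsing from the diff.
function parseEvidenceLine(evLine) {
  // Strip surrounding quotes and whitespace from each item.
  const strip = (s) => s.replace(/^["'\s]+|["'\s]+$/g, '');
  const arrMatch = evLine.match(/\[(.*)\]/);
  if (arrMatch) {
    // Bracketed form: split only on commas that sit outside quoted strings.
    return arrMatch[1]
      .split(/,(?=(?:[^'"]|'[^']*'|"[^"]*")*$)/)
      .map(strip)
      .filter(Boolean);
  }
  // Unbracketed form: plain comma split.
  return evLine.split(',').map(strip).filter(Boolean);
}
```

Note that the quote-aware split treats `'` as a quote character, so an evidence item containing an unpaired apostrophe inside brackets may not split where expected; a test case for that might be worth adding.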