mangubee Claude committed on
Commit c64957b · 1 Parent(s): 9ce1b76

feat: add Chain of Thought for LLM synthesis debugging


Implement CoT format to expose LLM reasoning process for debugging synthesis failures.

Changes:
- Updated system_prompt for all 3 providers (HF, Groq, Claude) to request REASONING + FINAL ANSWER format
- Increased max_tokens from 256 to 1024 to accommodate reasoning
- Added parsing logic to extract FINAL ANSWER from response
- Enhanced log file format to save full response with reasoning
- Updated CHANGELOG.md

Result: Can now inspect LLM's thought process in log/llm_context_*.txt files

Co-Authored-By: Claude <noreply@anthropic.com>
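The parsing step described in the changes above can be sketched as a small helper. This is a sketch only: the name `split_cot_response` is illustrative and not a function in this repository, but the delimiter strings match the commit's CoT format.

```python
# Sketch of the "extract FINAL ANSWER" parsing described in the commit
# message. Assumes the two-part "REASONING: ... / FINAL ANSWER: ..." format;
# falls back to the raw text when the model ignores the format.

def split_cot_response(full_response: str) -> tuple[str, str]:
    """Split a CoT-formatted LLM response into (reasoning, answer)."""
    if "FINAL ANSWER:" in full_response:
        parts = full_response.split("FINAL ANSWER:")
        answer = parts[-1].strip()
        reasoning = parts[0].replace("REASONING:", "").strip()
    else:
        # Fallback if the model doesn't follow the requested format
        answer = full_response.strip()
        reasoning = "No reasoning provided (format not followed)"
    return reasoning, answer
```

Only the answer would go back to the agent; the reasoning half is what gets persisted to the log file for debugging.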

Files changed (2)
  1. CHANGELOG.md +48 -0
  2. src/agent/llm_client.py +81 -51
CHANGELOG.md CHANGED
@@ -1,5 +1,53 @@
 # Session Changelog
 
+## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Chain of Thought for LLM Synthesis Debugging
+
+**Problem:** LLM returns "Unable to answer" with no reasoning. Can't debug why synthesis fails despite having complete transcript evidence.
+
+**Solution:** Implemented Chain of Thought (CoT) format - LLM now provides reasoning before final answer.
+
+**Response Format:**
+```
+REASONING: [Step-by-step thought process]
+- What information is in the evidence?
+- What is the question asking for?
+- How do you extract the answer?
+- Any ambiguities or uncertainties?
+
+FINAL ANSWER: [Factoid answer]
+```
+
+**Implementation:**
+
+1. **Updated system_prompt** (all 3 providers: HF, Groq, Claude)
+   - Request two-part response: REASONING + FINAL ANSWER
+   - Clear examples showing expected format
+   - Instructions for handling insufficient evidence
+
+2. **Increased max_tokens** from 256 → 1024
+   - Accommodate longer reasoning text
+   - Allow space for both reasoning and answer
+
+3. **Added parsing logic** to extract FINAL ANSWER
+   - Split response on "FINAL ANSWER:" delimiter
+   - Return only answer to agent (short for UI)
+   - Save full response (with reasoning) to log file
+
+4. **Enhanced log file format** (log/llm_context_TIMESTAMP.txt)
+   - Full LLM response with reasoning
+   - Extracted final answer
+   - Clear separation markers
+
+**Modified Files:**
+- **src/agent/llm_client.py** (~50 lines modified)
+  - Updated `synthesize_answer_hf()` - CoT prompt, max_tokens=1024, parsing
+  - Updated `synthesize_answer_groq()` - Same changes
+  - Updated `synthesize_answer_claude()` - Same changes
+
+**Result:** Can now inspect LLM's thought process in log files to debug synthesis failures
+
+---
+
 ## [2026-01-13] [Infrastructure] [COMPLETED] Logging Standard - Console + File Separation
 
 **Problem:** Logs were too verbose (14k-16k tokens), making debugging difficult and expensive.
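The enhanced log-entry layout from implementation step 4 can be sketched as follows. This is a minimal sketch: the helper name `append_llm_log` is illustrative (the repository writes these blocks inline in each `synthesize_answer_*` function), and the caller supplies a path like `log/llm_context_TIMESTAMP.txt`.

```python
# Sketch of the enhanced log entry (changelog step 4): separator bars,
# the full CoT response, then the extracted answer on its own line.

SEP = "=" * 80

def append_llm_log(path: str, full_response: str, answer: str) -> None:
    """Append one synthesis record to a llm_context-style log file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("\n" + SEP + "\n")
        f.write("LLM RESPONSE (with reasoning):\n")
        f.write(SEP + "\n")
        f.write(full_response)
        f.write("\n" + SEP + "\n")
        f.write(f"\nEXTRACTED FINAL ANSWER: {answer}\n")
        f.write(SEP + "\n")
```

Appending (mode `"a"`) keeps every synthesis attempt from a session in one file, so failed runs can be inspected alongside successful ones.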
src/agent/llm_client.py CHANGED
@@ -968,22 +968,28 @@ def synthesize_answer_claude(question: str, evidence: List[str]) -> str:
 
 Your task is to extract a factoid answer from the provided evidence.
 
-CRITICAL - Answer format requirements:
-1. Answers must be factoids: a number, a few words, or a comma-separated list
-2. Be concise - no explanations, just the answer
-3. If evidence conflicts, evaluate source credibility and recency
-4. If evidence is insufficient, state "Unable to answer"
-
-Examples of good factoid answers:
-- "42"
-- "Paris"
-- "Albert Einstein"
-- "red, blue, green"
-- "1969-07-20"
-
-Examples of bad answers (too verbose):
-- "The answer is 42 because..."
-- "Based on the evidence, it appears that..."
+CRITICAL - Response format (two parts):
+1. **REASONING** - Show your step-by-step thought process:
+   - What information is in the evidence?
+   - What is the question asking for?
+   - How do you extract the answer from the evidence?
+   - Any ambiguities or uncertainties?
+
+2. **FINAL ANSWER** - The factoid answer only:
+   - A number, a few words, or a comma-separated list
+   - No explanations, just the answer
+   - If evidence is insufficient, state "Unable to answer"
+
+Response format:
+REASONING: [Your step-by-step thought process here]
+FINAL ANSWER: [The factoid answer]
+
+Examples:
+REASONING: The evidence mentions the population of Tokyo is 13.9 million. The question asks for the city with highest population. Tokyo is listed as the highest.
+FINAL ANSWER: Tokyo
+
+REASONING: The transcript mentions "giant petrel", "emperor", and "adelie" (with typo "deli"). These are three different bird species present in the same scene.
+FINAL ANSWER: 3
 """
 
     user_prompt = f"""Question: {question}
@@ -1083,22 +1089,28 @@ def synthesize_answer_hf(question: str, evidence: List[str]) -> str:
 
 Your task is to extract a factoid answer from the provided evidence.
 
-CRITICAL - Answer format requirements:
-1. Answers must be factoids: a number, a few words, or a comma-separated list
-2. Be concise - no explanations, just the answer
-3. If evidence conflicts, evaluate source credibility and recency
-4. If evidence is insufficient, state "Unable to answer"
-
-Examples of good factoid answers:
-- "42"
-- "Paris"
-- "Albert Einstein"
-- "red, blue, green"
-- "1969-07-20"
-
-Examples of bad answers (too verbose):
-- "The answer is 42 because..."
-- "Based on the evidence, it appears that..."
+CRITICAL - Response format (two parts):
+1. **REASONING** - Show your step-by-step thought process:
+   - What information is in the evidence?
+   - What is the question asking for?
+   - How do you extract the answer from the evidence?
+   - Any ambiguities or uncertainties?
+
+2. **FINAL ANSWER** - The factoid answer only:
+   - A number, a few words, or a comma-separated list
+   - No explanations, just the answer
+   - If evidence is insufficient, state "Unable to answer"
+
+Response format:
+REASONING: [Your step-by-step thought process here]
+FINAL ANSWER: [The factoid answer]
+
+Examples:
+REASONING: The evidence mentions the population of Tokyo is 13.9 million. The question asks for the city with highest population. Tokyo is listed as the highest.
+FINAL ANSWER: Tokyo
+
+REASONING: The transcript mentions "giant petrel", "emperor", and "adelie" (with typo "deli"). These are three different bird species present in the same scene.
+FINAL ANSWER: 3
 """
 
     user_prompt = f"""Question: {question}
@@ -1151,21 +1163,33 @@ Extract the factoid answer from the evidence above. Return only the factoid, not
 
     response = client.chat_completion(
         messages=messages,
-        max_tokens=256,  # Factoid answers are short
+        max_tokens=1024,  # Increased for CoT reasoning
        temperature=TEMPERATURE,
     )
 
-    answer = response.choices[0].message.content.strip()
-    logger.info(f"[synthesize_answer_hf] Generated answer: {answer}")
+    full_response = response.choices[0].message.content.strip()
 
-    # Append answer to context file
+    # Extract FINAL ANSWER from response (format: "REASONING: ...\nFINAL ANSWER: ...")
+    if "FINAL ANSWER:" in full_response:
+        parts = full_response.split("FINAL ANSWER:")
+        answer = parts[-1].strip()
+        reasoning = parts[0].replace("REASONING:", "").strip()
+    else:
+        # Fallback if LLM doesn't follow format
+        answer = full_response
+        reasoning = "No reasoning provided (format not followed)"
+
+    logger.info(f"[synthesize_answer_hf] Answer: {answer}")
+
+    # Append full response to context file (includes reasoning)
     with open(context_file, "a", encoding="utf-8") as f:
         f.write("\n" + "=" * 80 + "\n")
-        f.write("LLM ANSWER:\n")
+        f.write("LLM RESPONSE (with reasoning):\n")
         f.write("=" * 80 + "\n")
-        f.write(answer)
+        f.write(full_response)
         f.write("\n" + "=" * 80 + "\n")
-        logger.info(f"[synthesize_answer_hf] Answer appended to context file")
+        f.write(f"\nEXTRACTED FINAL ANSWER: {answer}\n")
+        f.write("=" * 80 + "\n")
 
     return answer
 
@@ -1188,22 +1212,28 @@ def synthesize_answer_groq(question: str, evidence: List[str]) -> str:
 
 Your task is to extract a factoid answer from the provided evidence.
 
-CRITICAL - Answer format requirements:
-1. Answers must be factoids: a number, a few words, or a comma-separated list
-2. Be concise - no explanations, just the answer
-3. If evidence conflicts, evaluate source credibility and recency
-4. If evidence is insufficient, state "Unable to answer"
-
-Examples of good factoid answers:
-- "42"
-- "Paris"
-- "Albert Einstein"
-- "red, blue, green"
-- "1969-07-20"
-
-Examples of bad answers (too verbose):
-- "The answer is 42 because..."
-- "Based on the evidence, it appears that..."
+CRITICAL - Response format (two parts):
+1. **REASONING** - Show your step-by-step thought process:
+   - What information is in the evidence?
+   - What is the question asking for?
+   - How do you extract the answer from the evidence?
+   - Any ambiguities or uncertainties?
+
+2. **FINAL ANSWER** - The factoid answer only:
+   - A number, a few words, or a comma-separated list
+   - No explanations, just the answer
+   - If evidence is insufficient, state "Unable to answer"
+
+Response format:
+REASONING: [Your step-by-step thought process here]
+FINAL ANSWER: [The factoid answer]
+
+Examples:
+REASONING: The evidence mentions the population of Tokyo is 13.9 million. The question asks for the city with highest population. Tokyo is listed as the highest.
+FINAL ANSWER: Tokyo
+
+REASONING: The transcript mentions "giant petrel", "emperor", and "adelie" (with typo "deli"). These are three different bird species present in the same scene.
+FINAL ANSWER: 3
 """
 
     user_prompt = f"""Question: {question}