mangubee Claude committed on
Commit c64957b · 1 Parent(s): 9ce1b76

feat: add Chain of Thought for LLM synthesis debugging


Implement CoT format to expose LLM reasoning process for debugging synthesis failures.

Changes:
- Updated system_prompt for all 3 providers (HF, Groq, Claude) to request REASONING + FINAL ANSWER format
- Increased max_tokens from 256 to 1024 to accommodate reasoning
- Added parsing logic to extract FINAL ANSWER from response
- Enhanced log file format to save full response with reasoning
- Updated CHANGELOG.md

Result: Can now inspect LLM's thought process in log/llm_context_*.txt files

Co-Authored-By: Claude <noreply@anthropic.com>
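The parsing step described in the changes above can be sketched as a small helper. This is a sketch only: the name `split_cot_response` is illustrative and not a function in this repository, but the delimiter strings match the commit's CoT format.

```python
# Sketch of the "extract FINAL ANSWER" parsing described in the commit
# message. Assumes the two-part "REASONING: ... / FINAL ANSWER: ..." format;
# falls back to the raw text when the model ignores the format.

def split_cot_response(full_response: str) -> tuple[str, str]:
    """Split a CoT-formatted LLM response into (reasoning, answer)."""
    if "FINAL ANSWER:" in full_response:
        parts = full_response.split("FINAL ANSWER:")
        answer = parts[-1].strip()
        reasoning = parts[0].replace("REASONING:", "").strip()
    else:
        # Fallback if the model doesn't follow the requested format
        answer = full_response.strip()
        reasoning = "No reasoning provided (format not followed)"
    return reasoning, answer
```

Only the answer would go back to the agent; the reasoning half is what gets persisted to the log file for debugging.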

Files changed (2)
  1. CHANGELOG.md +48 -0
  2. src/agent/llm_client.py +81 -51
CHANGELOG.md CHANGED
@@ -1,5 +1,53 @@
 # Session Changelog
 
+## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Chain of Thought for LLM Synthesis Debugging
+
+**Problem:** LLM returns "Unable to answer" with no reasoning. Can't debug why synthesis fails despite having complete transcript evidence.
+
+**Solution:** Implemented Chain of Thought (CoT) format - LLM now provides reasoning before final answer.
+
+**Response Format:**
+```
+REASONING: [Step-by-step thought process]
+- What information is in the evidence?
+- What is the question asking for?
+- How do you extract the answer?
+- Any ambiguities or uncertainties?
+
+FINAL ANSWER: [Factoid answer]
+```
+
+**Implementation:**
+
+1. **Updated system_prompt** (all 3 providers: HF, Groq, Claude)
+   - Request two-part response: REASONING + FINAL ANSWER
+   - Clear examples showing expected format
+   - Instructions for handling insufficient evidence
+
+2. **Increased max_tokens** from 256 → 1024
+   - Accommodate longer reasoning text
+   - Allow space for both reasoning and answer
+
+3. **Added parsing logic** to extract FINAL ANSWER
+   - Split response on "FINAL ANSWER:" delimiter
+   - Return only answer to agent (short for UI)
+   - Save full response (with reasoning) to log file
+
+4. **Enhanced log file format** (log/llm_context_TIMESTAMP.txt)
+   - Full LLM response with reasoning
+   - Extracted final answer
+   - Clear separation markers
+
+**Modified Files:**
+- **src/agent/llm_client.py** (~50 lines modified)
+  - Updated `synthesize_answer_hf()` - CoT prompt, max_tokens=1024, parsing
+  - Updated `synthesize_answer_groq()` - Same changes
+  - Updated `synthesize_answer_claude()` - Same changes
+
+**Result:** Can now inspect LLM's thought process in log files to debug synthesis failures
+
+---
+
 ## [2026-01-13] [Infrastructure] [COMPLETED] Logging Standard - Console + File Separation
 
 **Problem:** Logs were too verbose (14k-16k tokens), making debugging difficult and expensive.
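The enhanced log-entry layout from implementation step 4 can be sketched as follows. This is a minimal sketch: the helper name `append_llm_log` is illustrative (the repository writes these blocks inline in each `synthesize_answer_*` function), and the caller supplies a path like `log/llm_context_TIMESTAMP.txt`.

```python
# Sketch of the enhanced log entry (changelog step 4): separator bars,
# the full CoT response, then the extracted answer on its own line.

SEP = "=" * 80

def append_llm_log(path: str, full_response: str, answer: str) -> None:
    """Append one synthesis record to a llm_context-style log file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("\n" + SEP + "\n")
        f.write("LLM RESPONSE (with reasoning):\n")
        f.write(SEP + "\n")
        f.write(full_response)
        f.write("\n" + SEP + "\n")
        f.write(f"\nEXTRACTED FINAL ANSWER: {answer}\n")
        f.write(SEP + "\n")
```

Appending (mode `"a"`) keeps every synthesis attempt from a session in one file, so failed runs can be inspected alongside successful ones.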
src/agent/llm_client.py CHANGED
@@ -968,22 +968,28 @@ def synthesize_answer_claude(question: str, evidence: List[str]) -> str:
 
 Your task is to extract a factoid answer from the provided evidence.
 
-CRITICAL - Answer format requirements:
-1. Answers must be factoids: a number, a few words, or a comma-separated list
-2. Be concise - no explanations, just the answer
-3. If evidence conflicts, evaluate source credibility and recency
-4. If evidence is insufficient, state "Unable to answer"
-
-Examples of good factoid answers:
-- "42"
-- "Paris"
-- "Albert Einstein"
-- "red, blue, green"
-- "1969-07-20"
-
-Examples of bad answers (too verbose):
-- "The answer is 42 because..."
-- "Based on the evidence, it appears that..."
+CRITICAL - Response format (two parts):
+1. **REASONING** - Show your step-by-step thought process:
+   - What information is in the evidence?
+   - What is the question asking for?
+   - How do you extract the answer from the evidence?
+   - Any ambiguities or uncertainties?
+
+2. **FINAL ANSWER** - The factoid answer only:
+   - A number, a few words, or a comma-separated list
+   - No explanations, just the answer
+   - If evidence is insufficient, state "Unable to answer"
+
+Response format:
+REASONING: [Your step-by-step thought process here]
+FINAL ANSWER: [The factoid answer]
+
+Examples:
+REASONING: The evidence mentions the population of Tokyo is 13.9 million. The question asks for the city with highest population. Tokyo is listed as the highest.
+FINAL ANSWER: Tokyo
+
+REASONING: The transcript mentions "giant petrel", "emperor", and "adelie" (with typo "deli"). These are three different bird species present in the same scene.
+FINAL ANSWER: 3
 """
 
     user_prompt = f"""Question: {question}
@@ -1083,22 +1089,28 @@ def synthesize_answer_hf(question: str, evidence: List[str]) -> str:
 
 Your task is to extract a factoid answer from the provided evidence.
 
-CRITICAL - Answer format requirements:
-1. Answers must be factoids: a number, a few words, or a comma-separated list
-2. Be concise - no explanations, just the answer
-3. If evidence conflicts, evaluate source credibility and recency
-4. If evidence is insufficient, state "Unable to answer"
-
-Examples of good factoid answers:
-- "42"
-- "Paris"
-- "Albert Einstein"
-- "red, blue, green"
-- "1969-07-20"
-
-Examples of bad answers (too verbose):
-- "The answer is 42 because..."
-- "Based on the evidence, it appears that..."
+CRITICAL - Response format (two parts):
+1. **REASONING** - Show your step-by-step thought process:
+   - What information is in the evidence?
+   - What is the question asking for?
+   - How do you extract the answer from the evidence?
+   - Any ambiguities or uncertainties?
+
+2. **FINAL ANSWER** - The factoid answer only:
+   - A number, a few words, or a comma-separated list
+   - No explanations, just the answer
+   - If evidence is insufficient, state "Unable to answer"
+
+Response format:
+REASONING: [Your step-by-step thought process here]
+FINAL ANSWER: [The factoid answer]
+
+Examples:
+REASONING: The evidence mentions the population of Tokyo is 13.9 million. The question asks for the city with highest population. Tokyo is listed as the highest.
+FINAL ANSWER: Tokyo
+
+REASONING: The transcript mentions "giant petrel", "emperor", and "adelie" (with typo "deli"). These are three different bird species present in the same scene.
+FINAL ANSWER: 3
 """
 
     user_prompt = f"""Question: {question}
@@ -1151,21 +1163,33 @@ Extract the factoid answer from the evidence above. Return only the factoid, not
 
     response = client.chat_completion(
         messages=messages,
-        max_tokens=256,  # Factoid answers are short
+        max_tokens=1024,  # Increased for CoT reasoning
        temperature=TEMPERATURE,
     )
 
-    answer = response.choices[0].message.content.strip()
-    logger.info(f"[synthesize_answer_hf] Generated answer: {answer}")
+    full_response = response.choices[0].message.content.strip()
 
-    # Append answer to context file
+    # Extract FINAL ANSWER from response (format: "REASONING: ...\nFINAL ANSWER: ...")
+    if "FINAL ANSWER:" in full_response:
+        parts = full_response.split("FINAL ANSWER:")
+        answer = parts[-1].strip()
+        reasoning = parts[0].replace("REASONING:", "").strip()
+    else:
+        # Fallback if LLM doesn't follow format
+        answer = full_response
+        reasoning = "No reasoning provided (format not followed)"
+
+    logger.info(f"[synthesize_answer_hf] Answer: {answer}")
+
+    # Append full response to context file (includes reasoning)
     with open(context_file, "a", encoding="utf-8") as f:
         f.write("\n" + "=" * 80 + "\n")
-        f.write("LLM ANSWER:\n")
+        f.write("LLM RESPONSE (with reasoning):\n")
         f.write("=" * 80 + "\n")
-        f.write(answer)
+        f.write(full_response)
         f.write("\n" + "=" * 80 + "\n")
-        logger.info(f"[synthesize_answer_hf] Answer appended to context file")
+        f.write(f"\nEXTRACTED FINAL ANSWER: {answer}\n")
+        f.write("=" * 80 + "\n")
 
     return answer
 
@@ -1188,22 +1212,28 @@ def synthesize_answer_groq(question: str, evidence: List[str]) -> str:
 
 Your task is to extract a factoid answer from the provided evidence.
 
-CRITICAL - Answer format requirements:
-1. Answers must be factoids: a number, a few words, or a comma-separated list
-2. Be concise - no explanations, just the answer
-3. If evidence conflicts, evaluate source credibility and recency
-4. If evidence is insufficient, state "Unable to answer"
-
-Examples of good factoid answers:
-- "42"
-- "Paris"
-- "Albert Einstein"
-- "red, blue, green"
-- "1969-07-20"
-
-Examples of bad answers (too verbose):
-- "The answer is 42 because..."
-- "Based on the evidence, it appears that..."
+CRITICAL - Response format (two parts):
+1. **REASONING** - Show your step-by-step thought process:
+   - What information is in the evidence?
+   - What is the question asking for?
+   - How do you extract the answer from the evidence?
+   - Any ambiguities or uncertainties?
+
+2. **FINAL ANSWER** - The factoid answer only:
+   - A number, a few words, or a comma-separated list
+   - No explanations, just the answer
+   - If evidence is insufficient, state "Unable to answer"
+
+Response format:
+REASONING: [Your step-by-step thought process here]
+FINAL ANSWER: [The factoid answer]
+
+Examples:
+REASONING: The evidence mentions the population of Tokyo is 13.9 million. The question asks for the city with highest population. Tokyo is listed as the highest.
+FINAL ANSWER: Tokyo
+
+REASONING: The transcript mentions "giant petrel", "emperor", and "adelie" (with typo "deli"). These are three different bird species present in the same scene.
+FINAL ANSWER: 3
 """
 
     user_prompt = f"""Question: {question}