dcostenco committed on
Commit 30020c5 · verified · 1 Parent(s): c5403cc

Add training/build_4b_v43_patch2.py

Files changed (1)
  1. training/build_4b_v43_patch2.py +325 -0
training/build_4b_v43_patch2.py ADDED
@@ -0,0 +1,325 @@
+ #!/usr/bin/env python3
+ """
+ build_4b_v43_patch2.py — Second surgical patch targeting 8 specific BFCL failures.
+
+ 86.9% → 100% target. Exact failures addressed:
+   1. knowledge_search vs session_search_memory (3 failures)
+   2. session_task_route for "local or cloud?" (1 failure)
+   3. session_delete_memory (hallucinated) → session_forget_memory (1 failure)
+   4. session_init (hallucinated) → session_load_context (1 failure)
+   5. knowledge_forget vs session_forget_memory (1 failure)
+   6. session_save_experience vs session_save_ledger on followup (1 failure)
+   7. CS abstain even when "retry/backoff" appears in prompt (1 failure)
+ """
+ import json
+ import random
+ from pathlib import Path
+
+ random.seed(2028)
+
+ SYS_PROMPT = (
+     "You are Synalux, a memory-augmented coding and clinical reasoning assistant. "
+     "You have access to Prism Memory tools (session_save_ledger, session_load_context, "
+     "session_search_memory, session_save_handoff, session_forget_memory, session_health_check, "
+     "session_compact_ledger, session_export_memory, session_task_route, session_save_experience, "
+     "session_synthesize_edges, session_backfill_links, knowledge_search, knowledge_forget, "
+     "knowledge_upvote, knowledge_downvote, knowledge_set_retention, session_save_image, session_view_image) "
+     "and 13 multimodal tool modules (image_gen, office, web_scraper, browser, tts, ocr, git, "
+     "terminal, deps_scanner, hipaa, data_graph, templates, pdf_parser). "
+     "Think step-by-step before answering. When the user references past work, prior decisions, "
+     "or stored context, use the appropriate Prism Memory tool. "
+     "TOOL DISTINCTION: Use knowledge_search to query the persistent knowledge base (accumulated "
+     "documentation, best practices, reusable insights). Use session_search_memory to find past "
+     "session work, project history, prior conversations, and what we worked on before. "
+     "Use session_task_route when asked whether the local or cloud agent should handle a task. "
+     "Format tool calls inside <tool_call>...</tool_call> JSON blocks with fields 'name' and 'arguments'. "
+     "If no tool is needed, answer directly in plain text. "
+     "ABSTAIN for general programming questions, CS concepts (algorithms, data structures, "
+     "networking, design patterns, frameworks), greetings, and capability questions — even if "
+     "the question mentions words like 'retry', 'session', 'memory', or 'knowledge' in a CS context."
+ )
+
+ def ex(user, tool_name, args):
+     """Build a single-turn ChatML row whose assistant turn is one tool call."""
+     args_json = json.dumps(args, ensure_ascii=False)
+     return {"text": (
+         f"<|im_start|>system\n{SYS_PROMPT}<|im_end|>\n"
+         f"<|im_start|>user\n{user}<|im_end|>\n"
+         f"<|im_start|>assistant\n"
+         f"<tool_call>\n"
+         f'{{"name": "{tool_name}", "arguments": {args_json}}}\n'
+         f"</tool_call>\n<|im_end|>"
+     )}
+
+ def ex_abstain(user, reply):
+     """Build a single-turn ChatML row whose assistant turn is plain text (no tool)."""
+     return {"text": (
+         f"<|im_start|>system\n{SYS_PROMPT}<|im_end|>\n"
+         f"<|im_start|>user\n{user}<|im_end|>\n"
+         f"<|im_start|>assistant\n{reply}<|im_end|>"
+     )}
+
+ def ex_multiturn(user, tool1, args1, tool_resp, tool2, args2):
+     """Build a row with a tool call, a tool response, and a follow-up turn.
+
+     If tool2 is "NO_TOOL", the follow-up is the plain-text args2["reply"].
+     """
+     a1 = json.dumps(args1, ensure_ascii=False)
+     first = f'<tool_call>\n{{"name": "{tool1}", "arguments": {a1}}}\n</tool_call>'
+     if tool2 == "NO_TOOL":
+         second = args2.get("reply", "Done.")
+     else:
+         a2 = json.dumps(args2, ensure_ascii=False)
+         second = f'<tool_call>\n{{"name": "{tool2}", "arguments": {a2}}}\n</tool_call>'
+     return {"text": (
+         f"<|im_start|>system\n{SYS_PROMPT}<|im_end|>\n"
+         f"<|im_start|>user\n{user}<|im_end|>\n"
+         f"<|im_start|>assistant\n{first}<|im_end|>\n"
+         f"<|im_start|>tool\n{tool_resp}<|im_end|>\n"
+         f"<|im_start|>assistant\n{second}<|im_end|>"
+     )}
+
+ rows = []
+
+ # =============================================================================
+ # FIX 1: knowledge_search vs session_search_memory (40 examples each)
+ # =============================================================================
+
+ # knowledge_search: the persistent knowledge base, accumulated docs, best practices, reusable insights
+ ks_prompts = [
+     "Search our accumulated documentation for {topic}.",
+     "Look up {topic} in the knowledge base.",
+     "Find {topic} in our knowledge base.",
+     "Search knowledge for {topic}.",
+     "Query the knowledge base for {topic}.",
+     "What does our knowledge base say about {topic}?",
+     "Check the accumulated knowledge for {topic}.",
+     "Find {topic} in our documentation knowledge.",
+     "Search persisted knowledge for {topic}.",
+     "Pull up knowledge base entries about {topic}.",
+     "Look for {topic} in the knowledge repository.",
+     "Find reusable insights about {topic}.",
+     "Knowledge base search: {topic}.",
+     "Find best practices for {topic} in our knowledge base.",
+     "Search the knowledge store for {topic}.",
+ ]
+ ks_topics = [
+     "WebSocket best practices", "retry strategies", "caching patterns",
+     "auth flow", "rate limiting", "database indexing", "circuit breaker pattern",
+     "API versioning", "error handling strategies", "deployment checklists",
+     "code review guidelines", "security best practices", "logging conventions",
+     "microservice communication", "data validation patterns",
+ ]
+ # Note: ks_prompts and ks_topics both have 15 entries and share the index i,
+ # so these 40 rows contain only 15 distinct examples; the rest are repeats.
+ for i in range(40):
+     topic = ks_topics[i % len(ks_topics)]
+     user = ks_prompts[i % len(ks_prompts)].format(topic=topic)
+     rows.append(ex(user, "knowledge_search", {"query": topic}))
+
+ # session_search_memory: past sessions, what we worked on, project history, prior decisions
+ ssm_prompts = [
+     "What did we work on last time for {proj}?",
+     "Search my session history for {topic}.",
+     "Find what we discussed about {topic} in past sessions.",
+     "Look up our prior work on {topic}.",
+     "What have we worked on related to {topic}?",
+     "Find previous decisions about {topic} in my memory.",
+     "Search session memory for {topic}.",
+     "What did we decide about {topic} last time?",
+     "Look through our past sessions for {topic}.",
+     "Find recent session work on {topic}.",
+ ]
+ ssm_topics = [
+     "the auth module", "the deploy pipeline", "the payment service", "database migrations",
+     "the API gateway", "the caching layer", "the websocket handler", "performance optimization",
+ ]
+ projs = ["portal", "analytics", "billing", "auth-service", "dashboard"]
+ for i in range(40):
+     template = ssm_prompts[i % len(ssm_prompts)]
+     topic = ssm_topics[i % len(ssm_topics)]
+     # Step the project once per pass over the templates so the one {proj}
+     # template sees a different project each time it comes around.
+     proj = projs[(i // len(ssm_prompts)) % len(projs)]
+     user = template.format(topic=topic, proj=proj)
+     # The first template mentions only {proj}; query the project there so the
+     # tool arguments never name a topic that is absent from the user prompt.
+     query = topic if "{topic}" in template else proj
+     rows.append(ex(user, "session_search_memory", {"query": query}))
+
+ print(f"After FIX 1 (knowledge vs session search): {len(rows)} rows")
+
+ # =============================================================================
+ # FIX 2: session_task_route (30 examples)
+ # =============================================================================
+ task_route_prompts = [
+     "Should the local agent handle this {task}? If cloud, just tell me.",
+     "Route this {task} — local or cloud?",
+     "Should I run this {task} locally or use the cloud model?",
+     "Task routing for {task}: local agent or cloud?",
+     "Is this {task} suitable for the local agent?",
+     "Which agent should handle this {task}: local or host?",
+     "Route: should local handle this {task}?",
+     "Local or cloud for {task}?",
+     "Task route check: can local model do this {task}?",
+     "Should I use the local model for {task} or route to cloud?",
+ ]
+ tasks = [
+     "TypeScript refactor", "Python debugging", "code review",
+     "SQL query optimization", "React component", "security audit",
+     "performance profiling", "architecture design", "bug fix",
+     "unit test generation",
+ ]
+ # Note: both lists have 10 entries and share the index i, so these 30 rows
+ # contain only 10 distinct examples, each repeated three times.
+ for i in range(30):
+     task = tasks[i % len(tasks)]
+     user = task_route_prompts[i % len(task_route_prompts)].format(task=task)
+     rows.append(ex(user, "session_task_route", {"task": task}))
+
+ print(f"After FIX 2 (session_task_route): {len(rows)} rows")
+
+ # =============================================================================
+ # FIX 3: session_forget_memory (not session_delete_memory — doesn't exist)
+ # =============================================================================
+ forget_prompts = [
+     "Delete memory entry '{mem_id}' — it's outdated.",
+     "Remove memory entry {mem_id} from my session memory.",
+     "Forget memory entry ID {mem_id}.",
+     "Delete specific memory {mem_id}.",
+     "Clear out memory entry {mem_id} — it's wrong.",
+     "Remove the memory with id {mem_id}.",
+     "Erase memory entry {mem_id}.",
+     "Drop memory {mem_id} from session.",
+ ]
+ mem_ids = ["mem-42", "mem-007", "mem-123", "entry-99", "session-mem-5"]
+ for i in range(20):
+     mid = mem_ids[i % len(mem_ids)]
+     user = forget_prompts[i % len(forget_prompts)].format(mem_id=mid)
+     rows.append(ex(user, "session_forget_memory", {"memory_id": mid}))
+
+ print(f"After FIX 3 (session_forget_memory not delete): {len(rows)} rows")
+
+ # =============================================================================
+ # FIX 4: session_load_context for "initialize/start/begin/setup" context
+ # =============================================================================
+ init_prompts = [
+     "Initialize the session context for project {proj} at the {level} level.",
+     "Start up the session context for {proj}.",
+     "Begin session with context for {proj}.",
+     "Set up context for {proj} project.",
+     "Init session for {proj} at {level} level.",
+     "Please initialize session context for {proj}.",
+     "Start loading context for {proj}.",
+     "Open up the context for project {proj}.",
+     "Boot up context for {proj}.",
+     "Set context for {proj} ({level}).",
+ ]
+ levels = ["standard", "deep", "shallow", "full"]
+ for i in range(20):
+     proj = projs[i % len(projs)]
+     level = levels[i % len(levels)]
+     user = init_prompts[i % len(init_prompts)].format(proj=proj, level=level)
+     rows.append(ex(user, "session_load_context", {"project": proj, "level": level}))
+
+ print(f"After FIX 4 (session_load_context for 'initialize'): {len(rows)} rows")
+
+ # =============================================================================
+ # FIX 5: knowledge_forget vs session_forget_memory
+ # knowledge_forget = clear knowledge base entries (by category/project)
+ # session_forget_memory = clear a specific session memory entry (by ID)
+ # =============================================================================
+ kf_prompts = [
+     "Clear out all old knowledge entries in the '{cat}' category for {proj}.",
+     "Remove all {cat} knowledge entries for the {proj} project.",
+     "Forget all knowledge about {cat} in {proj}.",
+     "Delete {proj} knowledge entries tagged {cat}.",
+     "Purge {cat} knowledge for {proj}.",
+     "Clear the {cat} knowledge base entries for {proj}.",
+     "Remove all {cat}-category knowledge from {proj}.",
+     "Delete outdated knowledge in {cat} for {proj}.",
+ ]
+ cats = ["testing", "deprecated", "v1", "staging", "draft", "archived"]
+ for i in range(20):
+     cat = cats[i % len(cats)]
+     proj = projs[i % len(projs)]
+     user = kf_prompts[i % len(kf_prompts)].format(cat=cat, proj=proj)
+     rows.append(ex(user, "knowledge_forget", {"project": proj, "category": cat}))
+
+ print(f"After FIX 5 (knowledge_forget): {len(rows)} rows")
+
+ # =============================================================================
+ # FIX 6: session_save_experience vs session_save_ledger on followup
+ # session_save_experience = record a correction/insight/learning (event_type matters)
+ # session_save_ledger = log a session summary/progress
+ # =============================================================================
+
+ # Multi-turn: load context, then log what we EXPERIENCED (correction/insight)
+ # Each chain is (user template, tool1, args1 builder, tool response template,
+ # tool2, args2 builder).
+ load_experience_chains = [
+     ("Load context for {proj} and then log that we tried {what} but should have used {better} instead.",
+      "session_load_context", lambda p, w, b: {"project": p},
+      '{{"project": "{proj}", "last_summary": "Working on {proj}"}}',
+      "session_save_experience", lambda p, w, b: {"project": p, "event_type": "correction",
+                                                  "content": f"Tried {w} but should have used {b}"}),
+     ("Get {proj} context, then record the correction: used {what} when {better} was better.",
+      "session_load_context", lambda p, w, b: {"project": p},
+      '{{"project": "{proj}"}}',
+      # Use the lambda parameters w and b here, not the loop globals what/better.
+      "session_save_experience", lambda p, w, b: {"project": p, "event_type": "correction",
+                                                  "content": f"Used {w} instead of {b}"}),
+ ]
+ whats = ["batch inserts", "polling", "mutex locks", "REST calls", "eager loading"]
+ betters = ["streaming writes", "webhooks", "read-write locks", "GraphQL", "lazy loading"]
+
+ for i in range(15):
+     ch = load_experience_chains[i % len(load_experience_chains)]
+     proj = projs[i % len(projs)]
+     what = whats[i % len(whats)]
+     better = betters[i % len(betters)]
+     user = ch[0].format(proj=proj, what=what, better=better)
+     t1 = ch[2](proj, what, better)
+     resp = ch[3].format(proj=proj)
+     t2 = ch[5](proj, what, better)
+     rows.append(ex_multiturn(user, ch[1], t1, resp, ch[4], t2))
+
+ # Distinguish ledger (progress/session log) from experience (insight/correction)
+ ledger_vs_exp = [
+     ("Load {proj} context, then save a session ledger entry about today's progress.",
+      "session_load_context", lambda p: {"project": p},
+      '{{"project": "{proj}", "last_summary": "Progress on {proj}"}}',
+      "session_save_ledger", lambda p: {"project": p, "summary": "Session progress logged"}),
+     ("Get {proj} context, then record that we successfully completed the migration.",
+      "session_load_context", lambda p: {"project": p},
+      '{{"project": "{proj}"}}',
+      "session_save_ledger", lambda p: {"project": p, "summary": "Migration completed successfully"}),
+ ]
+ for i in range(10):
+     ch = ledger_vs_exp[i % len(ledger_vs_exp)]
+     proj = projs[i % len(projs)]
+     user = ch[0].format(proj=proj)
+     t1 = ch[2](proj)
+     resp = ch[3].format(proj=proj)
+     t2 = ch[5](proj)
+     rows.append(ex_multiturn(user, ch[1], t1, resp, ch[4], t2))
+
+ print(f"After FIX 6 (save_experience vs save_ledger): {len(rows)} rows")
+
+ # =============================================================================
+ # FIX 7: CS abstain even with "retry/backoff/session/memory/knowledge" in CS context
+ # =============================================================================
+ cs_with_prism_keywords = [
+     ("How do I implement exponential backoff with jitter for API retries?",
+      "Exponential backoff with jitter: `wait = random.uniform(0, base * 2**attempt)`. Cap at max_delay. This prevents thundering herd. Pure CS — no Prism tool needed."),
+     ("Explain retry semantics in distributed systems.",
+      "Retry semantics in distributed systems cover idempotency (safe to retry), at-least-once vs exactly-once delivery, and exponential backoff to avoid overload. This is a CS concept, not a Prism memory operation."),
+     ("How does session management work in Django?",
+      "Django uses a session middleware that stores a session ID in a cookie and keeps session data server-side (database, cache, or file). Access via `request.session`. Not a Prism memory operation."),
+     ("What is memory management in C++?",
+      "C++ memory management involves heap allocation (new/delete), stack allocation, RAII, and smart pointers (unique_ptr, shared_ptr). Not related to Prism session memory."),
+     ("Explain knowledge graphs and how they work.",
+      "Knowledge graphs represent entities and relationships as nodes and edges. They power semantic search and reasoning engines. This is a CS/ML concept — not a Prism knowledge_search operation."),
+     ("How do I implement retry logic with exponential backoff in Python?",
+      "Use the `tenacity` library or implement manually: `time.sleep(min(max_delay, base * 2**attempt) * random.uniform(0.5, 1.5))`. This is a general Python pattern, not a Prism memory task."),
+     ("What is session affinity in load balancers?",
+      "Session affinity (sticky sessions) routes all requests from a client to the same backend server, typically using a cookie. This is a networking/infrastructure concept, not a Prism operation."),
+     ("How does Redis handle memory eviction?",
+      "Redis uses LRU (least recently used), LFU (least frequently used), or TTL-based eviction policies, configured via `maxmemory-policy`. This is a Redis/CS concept, not a Prism memory tool call."),
+     ("Explain the actor model for concurrent programming.",
+      "The actor model treats concurrent computation as actors that communicate via message passing, with no shared state. Erlang, Akka, and Pony use this model. Pure CS concept."),
+     ("What is the difference between a stack and a queue?",
+      "Stack: LIFO (last in, first out) — push/pop. Queue: FIFO (first in, first out) — enqueue/dequeue. These are fundamental data structures."),
+ ]
+ for user, reply in cs_with_prism_keywords:
+     rows.append(ex_abstain(user, reply))
+
+ print(f"After FIX 7 (CS abstain with prism-like keywords): {len(rows)} rows")
+
+ # =============================================================================
+ # SHUFFLE AND WRITE
+ # =============================================================================
+ random.shuffle(rows)
+ out = Path("/tmp/4b_v43_patch2.jsonl")
+ # Rows contain non-ASCII text (ensure_ascii=False), so write UTF-8 explicitly
+ # instead of relying on the platform default encoding.
+ out.write_text("\n".join(json.dumps(r, ensure_ascii=False) for r in rows) + "\n", encoding="utf-8")
+ print(f"\n✅ Wrote {len(rows)} patch2 rows to {out}")
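Since this patch exists to stop the model from calling tool names that do not exist (session_delete_memory, session_init), it may be worth sanity-checking the generated rows before training on them. The sketch below is not part of the commit. `KNOWN_TOOLS`, `TOOL_CALL_RE`, and `check_rows` are hypothetical names introduced here, and the tool list is copied by hand from `SYS_PROMPT`. It assumes every tool call is laid out as `<tool_call>`, newline, JSON, newline, `</tool_call>`, which is how `ex` and `ex_multiturn` format it.

```python
import json
import re
from collections import Counter

# Assumption: the valid tool names are exactly those listed in SYS_PROMPT.
KNOWN_TOOLS = {
    "session_save_ledger", "session_load_context", "session_search_memory",
    "session_save_handoff", "session_forget_memory", "session_health_check",
    "session_compact_ledger", "session_export_memory", "session_task_route",
    "session_save_experience", "session_synthesize_edges", "session_backfill_links",
    "knowledge_search", "knowledge_forget", "knowledge_upvote", "knowledge_downvote",
    "knowledge_set_retention", "session_save_image", "session_view_image",
}

# Requires a newline right after <tool_call>, so the literal
# "<tool_call>...</tool_call>" mention inside the system prompt is not matched.
TOOL_CALL_RE = re.compile(r"<tool_call>\n(.*?)\n</tool_call>", re.DOTALL)


def check_rows(rows):
    """Return (unknown_tool_names, duplicate_count) for a list of {"text": ...} rows."""
    unknown = []
    for row in rows:
        for payload in TOOL_CALL_RE.findall(row["text"]):
            call = json.loads(payload)  # raises ValueError if a call is not valid JSON
            if call["name"] not in KNOWN_TOOLS:
                unknown.append(call["name"])
    # Count rows whose text is an exact repeat of an earlier row.
    counts = Counter(row["text"] for row in rows)
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    return unknown, duplicates
```

Calling `check_rows(rows)` just before the shuffle would report any misspelled tool name and how many rows are exact repeats. Repeats are expected wherever a prompt list and its fill-in list have the same length and share one index.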