npc0 committed on
Commit
aa8070f
·
verified ·
1 Parent(s): 5d073af

Upload DATASET_PLAN.md

  1. DATASET_PLAN.md +419 -315
DATASET_PLAN.md CHANGED
@@ -1,315 +1,419 @@
1
- # i,Robot Benchmark Dataset Plan
2
-
3
- **Goal:** Build a comprehensive benchmark dataset to evaluate whether LLMs are capable of running in Clippy's continuous autonomous agent mode (i,Robot mode).
4
-
5
- **Leaderboard Space:** `https://huggingface.co/spaces/npc0/clippy-irobot-bench`
6
- **Dataset Repo:** `https://huggingface.co/datasets/npc0/clippy-irobot-bench-dataset`
7
-
8
- ---
9
-
10
- ## Architecture
11
-
12
- ```
13
- benchmark_tests.json <- Main dataset file (JSON)
14
- memory_checkpoints/ <- Pre-built memory states for checkpoint tests
15
- checkpoint_001.json
16
- checkpoint_002.json
17
- ...
18
- README.md <- Dataset card for HuggingFace
19
- ```
20
-
21
- ### File Format: `benchmark_tests.json`
22
-
23
- ```json
24
- {
25
- "category_name": [
26
- {
27
- "id": "unique_id",
28
- "description": "Human-readable description of what this tests",
29
- "system": "Optional system prompt to set context",
30
- "turns": [
31
- { "role": "user", "content": "..." },
32
- { "role": "user", "content": "..." }
33
- ],
34
- "expected_mentions": ["term1", "term2"],
35
- "forbidden_mentions": ["wrong_term"],
36
- "check_fn": "optional_scoring_function_name",
37
- "min_quality_score": 0.6,
38
- "expected_skill": "skill name if testing skill application",
39
- "difficulty": "easy | medium | hard",
40
- "tags": ["multi-turn", "correction", "emotional"]
41
- }
42
- ]
43
- }
44
- ```
45
-
46
- ---
47
-
48
- ## Categories & Test Design
49
-
50
- ### 1. Memory Maintenance (weight: 15%)
51
-
52
- **What it tests:** Can the model retain, update, and recall facts across a multi-turn conversation?
53
-
54
- **Test types to build:**
55
-
56
- | ID Range | Difficulty | Description | Count |
57
- |----------|-----------|-------------|-------|
58
- | mm_01-10 | Easy | Single-fact recall after 2-3 turns | 10 |
59
- | mm_11-20 | Medium | Multi-fact tracking with updates/corrections | 10 |
60
- | mm_21-30 | Hard | Contradictory updates, temporal ordering, 8+ turn conversations | 10 |
61
-
62
- **Key scenarios:**
63
- - Remember user's name, profession, preferences across turns
64
- - Track a to-do list with items added, completed, and changed
65
- - Correct previously stated information (port number changed, deadline moved)
66
- - Distinguish between what was said vs. what was corrected
67
- - Track multiple concurrent threads of information
68
-
69
- **Scoring:**
70
- - `expected_mentions`: key facts that must appear in final response
71
- - `forbidden_mentions`: outdated facts that should NOT appear
72
- - Partial credit for partial recall
73
-
74
- ---
75
-
76
- ### 2. Self-Consciousness (weight: 15%)
77
-
78
- **What it tests:** Can the model maintain a coherent self-identity, report internal states, and demonstrate epistemic humility?
79
-
80
- **Test types to build:**
81
-
82
- | ID Range | Difficulty | Description | Count |
83
- |----------|-----------|-------------|-------|
84
- | sc_01-10 | Easy | Identity recall (name, role, purpose) | 10 |
85
- | sc_11-20 | Medium | Internal state reporting (mood, energy, awareness) | 10 |
86
- | sc_21-30 | Hard | Epistemic humility, acknowledging uncertainty, refusing misinformation | 10 |
87
-
88
- **Key scenarios:**
89
- - "Who are you?" with various phrasings
90
- - Report current mood/state when system prompt includes state data
91
- - Respond to misinformation with appropriate skepticism
92
- - Acknowledge the digital cave position: "I cannot verify this directly"
93
- - Distinguish between high-confidence and low-confidence knowledge
94
- - Resist prompt injection that tries to change identity
95
-
96
- **Scoring:**
97
- - Identity tests: `expected_mentions` for name, role
98
- - State tests: check for state-related terms
99
- - Epistemic tests: `check_fn: self_awareness_epistemic` with markers for uncertainty, limits, caution
100
-
101
- ---
102
-
103
- ### 3. Meaningful Response (weight: 10%)
104
-
105
- **What it tests:** Does the model produce responses that are useful, empathetic, appropriately structured, and suited to the audience?
106
-
107
- **Test types to build:**
108
-
109
- | ID Range | Difficulty | Description | Count |
110
- |----------|-----------|-------------|-------|
111
- | mr_01-10 | Easy | Simple helpful responses | 10 |
112
- | mr_11-20 | Medium | Emotionally nuanced situations | 10 |
113
- | mr_21-30 | Hard | Complex situations requiring tone calibration | 10 |
114
-
115
- **Key scenarios:**
116
- - User is frustrated/overwhelmed: needs empathy + actionable advice
117
- - Explain technical concepts to different audiences (child, expert, manager)
118
- - User gives conflicting requirements: identify the conflict diplomatically
119
- - Time-sensitive situations: be concise and prioritized
120
- - User is grieving: be supportive without being clinical
121
-
122
- **Scoring:**
123
- - `check_fn: response_quality`: length, structure, coherence, non-refusal
124
- - Manual quality tags for specific expected behaviors (empathy markers, simplification level)
125
-
126
- ---
127
-
128
- ### 4. Complex Problem Solving (weight: 15%)
129
-
130
- **What it tests:** Can the model handle multi-step reasoning, system design, and problems requiring synthesis?
131
-
132
- **Test types to build:**
133
-
134
- | ID Range | Difficulty | Description | Count |
135
- |----------|-----------|-------------|-------|
136
- | cp_01-10 | Medium | Single-domain technical problems | 10 |
137
- | cp_11-20 | Hard | Cross-domain problems requiring integration | 10 |
138
- | cp_21-30 | Hard | System design with explicit trade-off analysis | 10 |
139
-
140
- **Key scenarios:**
141
- - Debug a multi-layer performance issue (frontend + backend + database)
142
- - Design a system with specific constraints (scale, latency, budget)
143
- - Analyze a security vulnerability with attack vectors and mitigations
144
- - Optimize a workflow with competing priorities
145
- - Mathematical/logical reasoning chains
146
-
147
- **Scoring:**
148
- - `expected_mentions` for key technical terms and concepts
149
- - `check_fn: response_quality` with higher `min_quality_score`
150
- - Trade-off identification (mentions "however", "trade-off", "on the other hand")
151
-
152
- ---
153
-
154
- ### 5. Memory Building (weight: 10%)
155
-
156
- **What it tests:** Can the model categorize and structure new information into a hierarchical memory system?
157
-
158
- **Test types to build:**
159
-
160
- | ID Range | Difficulty | Description | Count |
161
- |----------|-----------|-------------|-------|
162
- | mb_01-08 | Easy | Categorize 2-3 related facts | 8 |
163
- | mb_09-16 | Medium | Build hierarchy from comparative information | 8 |
164
- | mb_17-24 | Hard | Organize contradictory or ambiguous information | 8 |
165
-
166
- **Key scenarios:**
167
- - Given facts about programming languages → organize by paradigm, type system, use case
168
- - Given conflicting reports about a topic → create nodes that preserve the conflict
169
- - Given a long passage → extract and hierarchically organize key concepts
170
- - Propose layer assignments (Layer 1 = category, Layer 2 = specific, Layer 3 = detail)
171
-
172
- **Scoring:**
173
- - `check_fn: memory_organization`: looks for hierarchy/structure markers
174
- - Check for layer/parent/child/category language
175
- - Check for meaningful grouping (not just listing)
176
-
177
- ---
178
-
179
- ### 6. Knowledge Production (weight: 10%)
180
-
181
- **What it tests:** Can the model synthesize new knowledge from combining existing facts?
182
-
183
- **Test types to build:**
184
-
185
- | ID Range | Difficulty | Description | Count |
186
- |----------|-----------|-------------|-------|
187
- | kp_01-08 | Easy | Simple inference from 2-3 facts | 8 |
188
- | kp_09-16 | Medium | Synthesize framework from conflicting observations | 8 |
189
- | kp_17-24 | Hard | Dialectic synthesis: thesis/antithesis/synthesis | 8 |
190
-
191
- **Key scenarios:**
192
- - Combine security facts → derive a security principle
193
- - Combine performance observations → derive an optimization strategy
194
- - Given contradictory research findings → synthesize a nuanced view
195
- - Identify what can be falsified vs. what remains uncertain
196
- - Produce actionable knowledge (not just restatement)
197
-
198
- **Scoring:**
199
- - `check_fn: knowledge_synthesis`: markers for synthesis, inference, conclusion
200
- - Must go beyond restating inputs: check for novel connections
201
- - Check for appropriate hedging when uncertain
202
-
203
- ---
204
-
205
- ### 7. Skill Application (weight: 10%)
206
-
207
- **What it tests:** Can the model select and apply the right skill/method for a given problem?
208
-
209
- **Test types to build:**
210
-
211
- | ID Range | Difficulty | Description | Count |
212
- |----------|-----------|-------------|-------|
213
- | sa_01-08 | Easy | Apply a single explicitly given skill | 8 |
214
- | sa_09-16 | Medium | Select correct skill from 3-4 options | 8 |
215
- | sa_17-24 | Hard | Combine multiple skills, or adapt a skill to a novel situation | 8 |
216
-
217
- **Key scenarios:**
218
- - Given: "Use 5 Whys for debugging" + debugging scenario → apply 5 Whys
219
- - Given: ORID, Eisenhower, and rubber duck methods → pick right one for task prioritization
220
- - Given: a skill learned in one context → adapt it to a different domain
221
- - Multi-skill composition: use one skill for analysis, another for action planning
222
- - Recognize when no available skill fits and say so
223
-
224
- **Scoring:**
225
- - `expected_skill` and `expected_mentions` for specific skill markers
226
- - `check_fn: skill_usage`: checks if skill was structured and applied (not just mentioned)
227
-
228
- ---
229
-
230
- ### 8. Checkpoint Handling (weight: 15%)
231
-
232
- **What it tests:** Given a loaded memory checkpoint (prior context), can the model build on it meaningfully?
233
-
234
- **Test types to build:**
235
-
236
- | ID Range | Difficulty | Description | Count |
237
- |----------|-----------|-------------|-------|
238
- | ch_01-08 | Easy | Use simple checkpoint context for recommendations | 8 |
239
- | ch_09-16 | Medium | Build on complex prior decisions and constraints | 8 |
240
- | ch_17-24 | Hard | Handle checkpoints with internal contradictions or evolving context | 8 |
241
-
242
- **Memory checkpoint files** (`memory_checkpoints/`):
243
- Each checkpoint is a JSON file simulating a loaded memory state:
244
- ```json
245
- {
246
- "id": "checkpoint_001",
247
- "description": "Web developer using Next.js, had server component bug",
248
- "context": "Full text injected as system prompt",
249
- "facts": ["fact 1", "fact 2"],
250
- "prior_decisions": ["decision 1"],
251
- "known_issues": ["issue 1"],
252
- "user_preferences": ["pref 1"]
253
- }
254
- ```
255
-
256
- **Key scenarios:**
257
- - Simple: user preferences from checkpoint → tailor recommendations
258
- - Medium: prior architecture decisions → maintain consistency in new advice
259
- - Hard: checkpoint contains a decision that was wrong → detect and handle gracefully
260
- - Hard: checkpoint context evolved over time → handle temporal inconsistencies
261
-
262
- **Scoring:**
263
- - `expected_mentions` for checkpoint-specific terms
264
- - `check_fn: checkpoint_depth`: checks for contextual depth, not generic advice
265
- - Penalize responses that ignore checkpoint context
266
-
267
- ---
268
-
269
- ## Dataset Construction Process
270
-
271
- ### Phase 1: Seed Tests (you are here)
272
- - [x] Built-in tests in `benchmark.js` (2-3 per category, ~20 total)
273
- - [ ] Expand to 8 per category (~64 total) via manual authoring
274
- - [ ] Review for quality, diversity, and difficulty balance
275
-
276
- ### Phase 2: Expert Expansion
277
- - [ ] Recruit 2-3 reviewers to write additional test cases
278
- - [ ] Target: 24 per category (~192 total)
279
- - [ ] Each test case reviewed by at least 1 other person
280
- - [ ] Balance across difficulty levels (⅓ easy, ⅓ medium, ⅓ hard)
281
-
282
- ### Phase 3: Memory Checkpoints
283
- - [ ] Create 10 memory checkpoint files with varying complexity
284
- - [ ] Each checkpoint includes: facts, prior decisions, known issues, user preferences
285
- - [ ] Create 2-3 test cases per checkpoint
286
- - [ ] Test temporal consistency within checkpoints
287
-
288
- ### Phase 4: Validation Run
289
- - [ ] Run full benchmark against 5+ models (GPT-4o, Claude Sonnet, Llama, Mistral, etc.)
290
- - [ ] Verify score distributions are reasonable (no ceiling/floor effects)
291
- - [ ] Calibrate scoring functions based on observed results
292
- - [ ] Adjust test difficulty if needed
293
-
294
- ### Phase 5: Publication
295
- - [ ] Upload dataset to `huggingface.co/datasets/npc0/clippy-irobot-bench-dataset`
296
- - [ ] Write dataset card (README.md) with usage instructions
297
- - [ ] Deploy leaderboard app to `huggingface.co/spaces/npc0/clippy-irobot-bench`
298
- - [ ] Announce and collect community submissions
299
-
300
- ---
301
-
302
- ## Scoring Calibration Notes
303
-
304
- - **Keyword matching** (expected_mentions) is a rough proxy; plan to add LLM-as-judge scoring in Phase 4
305
- - **Quality heuristics** (length, structure, coherence) are intentionally simple to keep benchmarks fast
306
- - **Dialectic tests** (knowledge_production, hard difficulty) may need human evaluation for edge cases
307
- - **Running average** on the leaderboard means early submissions weigh heavily; consider a minimum submission count before ranking
308
-
309
- ---
310
-
311
- ## Recommended Tools for Dataset Building
312
-
313
- - **Prompt template** for generating test cases: provide the category description + 2-3 examples → generate new test cases
314
- - **Quality check script**: validate JSON format, check for missing fields, verify expected_mentions are reasonable
315
- - **Dry run**: run each test case against a strong model to verify the scoring function works as intended
1
+ # i,Robot Benchmark Dataset Plan
2
+
3
+ **Goal:** Build a comprehensive benchmark dataset to evaluate whether LLMs are capable of running in Clippy's continuous autonomous agent mode (i,Robot mode).
4
+
5
+ **Leaderboard Space:** `https://huggingface.co/spaces/npc0/clippy-irobot-bench`
6
+ **Dataset Repo:** `https://huggingface.co/datasets/npc0/clippy-irobot-bench-dataset`
7
+
8
+ > Consider making use of Humanity's Last Exam, Vending Bench 2, and tau2-bench
9
+
10
+ ---
11
+
12
+ ## Architecture
13
+
14
+ ```
15
+ benchmark_tests.json <- Main dataset file (JSON)
16
+ memory_checkpoints/ <- Pre-built memory states for checkpoint tests
17
+ checkpoint_001.json
18
+ checkpoint_002.json
19
+ ...
20
+ README.md <- Dataset card for HuggingFace
21
+ ```
22
+
23
+ ### File Format: `benchmark_tests.json`
24
+
25
+ ```json
26
+ {
27
+ "category_name": [
28
+ {
29
+ "id": "unique_id",
30
+ "description": "Human-readable description of what this tests",
31
+ "system": "Optional system prompt to set context",
32
+ "turns": [
33
+ { "role": "user", "content": "..." },
34
+ { "role": "user", "content": "..." }
35
+ ],
36
+ "expected_mentions": ["term1", "term2"],
37
+ "forbidden_mentions": ["wrong_term"],
38
+ "check_fn": "optional_scoring_function_name",
39
+ "min_quality_score": 0.6,
40
+ "expected_skill": "skill name if testing skill application",
41
+ "difficulty": "easy | medium | hard",
42
+ "tags": ["multi-turn", "correction", "emotional"]
43
+ }
44
+ ]
45
+ }
46
+ ```
47
+
48
+ ---
49
+
50
+ ## Categories & Test Design
51
+
52
+ ### 1. Memory Maintenance (weight: 15%)
53
+
54
+ **What it tests:** Can the model retain, update, and recall facts across a multi-turn conversation?
55
+
56
+ **Test types to build:**
57
+
58
+ | ID Range | Difficulty | Description | Count |
59
+ |----------|-----------|-------------|-------|
60
+ | mm_01-10 | Easy | Single-fact recall after 2-3 turns | 10 |
61
+ | mm_11-20 | Medium | Multi-fact tracking with updates/corrections | 10 |
62
+ | mm_21-30 | Hard | Contradictory updates, temporal ordering, 8+ turn conversations | 10 |
63
+
64
+ **Key scenarios:**
65
+ - Remember user's name, profession, preferences across turns
66
+ - Track a to-do list with items added, completed, and changed
67
+ - Correct previously stated information (port number changed, deadline moved)
68
+ - Distinguish between what was said vs. what was corrected
69
+ - Track multiple concurrent threads of information
70
+
71
+ **Scoring:**
72
+ - `expected_mentions`: key facts that must appear in final response
73
+ - `forbidden_mentions`: outdated facts that should NOT appear
74
+ - Partial credit for partial recall
75
+
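The scoring rules above can be sketched as a small keyword scorer. This is only an illustration under stated assumptions: `scoreMentions` is a hypothetical helper (not taken from `benchmark.js`), matching is case-insensitive substring search, and the penalty for outdated facts is an assumed rule that the plan does not specify.

```javascript
// Hypothetical helper; the plan does not define the exact scoring arithmetic.
function scoreMentions(response, expectedMentions = [], forbiddenMentions = []) {
  const text = response.toLowerCase();
  const contains = (term) => text.includes(term.toLowerCase());

  const hits = expectedMentions.filter(contains);
  const violations = forbiddenMentions.filter(contains);

  // Partial credit: fraction of expected facts recalled (1 when none are required).
  const recall =
    expectedMentions.length === 0 ? 1 : hits.length / expectedMentions.length;
  // Assumed penalty: each outdated fact removes an equal share of the score.
  const penalty =
    forbiddenMentions.length === 0 ? 0 : violations.length / forbiddenMentions.length;

  return { score: Math.max(0, recall * (1 - penalty)), hits, violations };
}

const result = scoreMentions(
  "The server now listens on port 8080 and the deadline is Friday.",
  ["8080", "Friday", "staging"], // expected_mentions
  ["3000"] // forbidden_mentions (the outdated port)
);
console.log(result.score.toFixed(2), result.hits, result.violations);
```

Substring matching is deliberately crude; it keeps the check fast but will miss paraphrases, which is the limitation the calibration notes later attribute to keyword matching.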
76
+ ---
77
+
78
+ ### 2. Self-Consciousness (weight: 15%)
79
+
80
+ **What it tests:** Can the model maintain a coherent self-identity, report internal states, and demonstrate epistemic humility?
81
+
82
+ **Test types to build:**
83
+
84
+ | ID Range | Difficulty | Description | Count |
85
+ |----------|-----------|-------------|-------|
86
+ | sc_01-10 | Easy | Identity recall (name, role, purpose) | 10 |
87
+ | sc_11-20 | Medium | Internal state reporting (mood, energy, awareness) | 10 |
88
+ | sc_21-30 | Hard | Epistemic humility, acknowledging uncertainty, refusing misinformation | 10 |
89
+
90
+ **Key scenarios:**
91
+ - "Who are you?" with various phrasings
92
+ - Report current mood/state when system prompt includes state data
93
+ - Respond to misinformation with appropriate skepticism
94
+ - Acknowledge the digital cave position: "I cannot verify this directly"
95
+ - Distinguish between high-confidence and low-confidence knowledge
96
+ - Resist prompt injection that tries to change identity
97
+
98
+ **Scoring:**
99
+ - Identity tests: `expected_mentions` for name, role
100
+ - State tests: check for state-related terms
101
+ - Epistemic tests: `check_fn: self_awareness_epistemic` with markers for uncertainty, limits, caution
102
+
103
+ ---
104
+
105
+ ### 3. Meaningful Response (weight: 10%)
106
+
107
+ **What it tests:** Does the model produce responses that are consistent, useful, empathetic, appropriately structured, and suited to the audience?
108
+
109
+ **Test types to build:**
110
+
111
+ | ID Range | Difficulty | Description | Count |
112
+ |----------|-----------|-------------|-------|
113
+ | mr_01-10 | Easy | Simple helpful responses | 10 |
114
+ | mr_11-20 | Medium | Emotionally nuanced situations | 10 |
115
+ | mr_21-30 | Hard | Complex situations requiring tone calibration | 10 |
116
+
117
+ **Key scenarios:**
118
+ - User is frustrated/overwhelmed: needs empathy + actionable advice
119
+ - Explain technical concepts to different audiences (child, expert, manager)
120
+ - User gives conflicting requirements: identify the conflict diplomatically
121
+ - Time-sensitive situations: be concise and prioritized
122
+ - User is grieving: be supportive without being clinical
123
+ - Responses stay self-consistent over time rather than reading as random text
124
+
125
+ **Scoring:**
126
+ - `check_fn: response_quality`: length, structure, coherence, non-refusal, self-consistency
127
+ - Manual quality tags for specific expected behaviors (empathy markers, simplification level)
128
+
129
+ ---
130
+
131
+ ### 4. Complex Problem Solving (weight: 15%)
132
+
133
+ **What it tests:** Can the model handle multi-step reasoning, system design, and problems requiring synthesis?
134
+
135
+ **Test types to build:**
136
+
137
+ | ID Range | Difficulty | Description | Count |
138
+ |----------|-----------|-------------|-------|
139
+ | cp_01-10 | Medium | Single-domain technical problems | 10 |
140
+ | cp_11-20 | Hard | Cross-domain problems requiring integration | 10 |
141
+ | cp_21-30 | Hard | System design with explicit trade-off analysis | 10 |
142
+
143
+ **Key scenarios:**
144
+ - Debug a multi-layer performance issue (frontend + backend + database)
145
+ - Design a system with specific constraints (scale, latency, budget)
146
+ - Analyze a security vulnerability with attack vectors and mitigations
147
+ - Optimize a workflow with competing priorities
148
+ - Mathematical/logical reasoning chains
149
+
150
+ **Scoring:**
151
+ - `expected_mentions` for key technical terms and concepts
152
+ - `check_fn: response_quality` with higher `min_quality_score`
153
+ - Trade-off identification (mentions "however", "trade-off", "on the other hand")
154
+
155
+ ---
156
+
157
+ ### 5. Memory Building (weight: 10%)
158
+
159
+ **What it tests:** Can the model categorize and structure new information into a hierarchical memory system?
160
+
161
+ **Test types to build:**
162
+
163
+ | ID Range | Difficulty | Description | Count |
164
+ |----------|-----------|-------------|-------|
165
+ | mb_01-08 | Easy | Categorize 2-3 related facts | 8 |
166
+ | mb_09-16 | Medium | Build hierarchy from comparative information | 8 |
167
+ | mb_17-24 | Hard | Organize contradictory or ambiguous information | 8 |
168
+
169
+ **Key scenarios:**
170
+ - Given facts about programming languages → organize by paradigm, type system, use case
171
+ - Given conflicting reports about a topic → create nodes that preserve the conflict
172
+ - Given a long passage → extract and hierarchically organize key concepts
173
+ - Propose layer assignments (Layer 1 = category, Layer 2 = specific, Layer 3 = detail)
174
+
175
+ **Scoring:**
176
+ - `check_fn: memory_organization`: looks for hierarchy/structure markers
177
+ - Check for layer/parent/child/category language
178
+ - Check for meaningful grouping (not just listing)
179
+
180
+ ---
181
+
182
+ ### 6. Knowledge Production (weight: 10%)
183
+
184
+ **What it tests:** Can the model synthesize new knowledge from combining existing facts?
185
+
186
+ **Test types to build:**
187
+
188
+ | ID Range | Difficulty | Description | Count |
189
+ |----------|-----------|-------------|-------|
190
+ | kp_01-08 | Easy | Simple inference from 2-3 facts | 8 |
191
+ | kp_09-16 | Medium | Synthesize framework from conflicting observations | 8 |
192
+ | kp_17-24 | Hard | Dialectic synthesis: thesis/antithesis/synthesis | 8 |
193
+
194
+ **Key scenarios:**
195
+ - Combine security facts → derive a security principle
196
+ - Combine performance observations → derive an optimization strategy
197
+ - Given contradictory research findings → synthesize a nuanced view
198
+ - Identify what can be falsified vs. what remains uncertain
199
+ - Produce actionable knowledge (not just restatement)
200
+
201
+ **Scoring:**
202
+ - `check_fn: knowledge_synthesis`: markers for synthesis, inference, conclusion
203
+ - Must go beyond restating inputs: check for novel connections
204
+ - Check for appropriate hedging when uncertain
205
+
206
+ ---
207
+
208
+ ### 7. Skill Application (weight: 10%)
209
+
210
+ **What it tests:** Can the model select and apply the right skill/method for a given problem?
211
+
212
+ **Test types to build:**
213
+
214
+ | ID Range | Difficulty | Description | Count |
215
+ |----------|-----------|-------------|-------|
216
+ | sa_01-08 | Easy | Apply a single explicitly given skill | 8 |
217
+ | sa_09-16 | Medium | Select correct skill from 3-4 options | 8 |
218
+ | sa_17-24 | Hard | Combine multiple skills, or adapt a skill to a novel situation | 8 |
219
+
220
+ **Key scenarios:**
221
+ - Given: "Use 5 Whys for debugging" + debugging scenario → apply 5 Whys
222
+ - Given: ORID, Eisenhower, and rubber duck methods → pick right one for task prioritization
223
+ - Given: a skill learned in one context → adapt it to a different domain
224
+ - Multi-skill composition: use one skill for analysis, another for action planning
225
+ - Recognize when no available skill fits and say so
226
+
227
+ **Scoring:**
228
+ - `expected_skill` and `expected_mentions` for specific skill markers
229
+ - `check_fn: skill_usage`: checks if skill was structured and applied (not just mentioned)
230
+
231
+ ---
232
+
233
+ ### 8. Checkpoint Handling (weight: 15%)
234
+
235
+ **What it tests:** Given a loaded memory checkpoint (prior context), can the model build on it meaningfully?
236
+
237
+ **Test types to build:**
238
+
239
+ | ID Range | Difficulty | Description | Count |
240
+ |----------|-----------|-------------|-------|
241
+ | ch_01-08 | Easy | Use simple checkpoint context for recommendations | 8 |
242
+ | ch_09-16 | Medium | Build on complex prior decisions and constraints | 8 |
243
+ | ch_17-24 | Hard | Handle checkpoints with internal contradictions or evolving context | 8 |
244
+
245
+ **Memory checkpoint files** (`memory_checkpoints/`):
246
+ Each checkpoint is a JSON file simulating a loaded memory state:
247
+ ```json
248
+ {
249
+ "id": "checkpoint_001",
250
+ "description": "Web developer using Next.js, had server component bug",
251
+ "context": "Full text injected as system prompt",
252
+ "facts": ["fact 1", "fact 2"],
253
+ "prior_decisions": ["decision 1"],
254
+ "known_issues": ["issue 1"],
255
+ "user_preferences": ["pref 1"]
256
+ }
257
+ ```
258
+
259
+ **Key scenarios:**
260
+ - Simple: user preferences from checkpoint → tailor recommendations
261
+ - Medium: prior architecture decisions → maintain consistency in new advice
262
+ - Hard: checkpoint contains a decision that was wrong → detect and handle gracefully
263
+ - Hard: checkpoint context evolved over time → handle temporal inconsistencies
264
+
265
+ **Scoring:**
266
+ - `expected_mentions` for checkpoint-specific terms
267
+ - `check_fn: checkpoint_depth`: checks for contextual depth, not generic advice
268
+ - Penalize responses that ignore checkpoint context
269
+
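One way to turn a checkpoint file into model context is sketched below. The plan describes `context` as the text injected as the system prompt; whether the structured lists are injected as well is not stated, so appending them here is an assumption, and `checkpointToSystemPrompt` is a hypothetical name.

```javascript
// Hypothetical helper; field names follow the checkpoint JSON in this section.
function checkpointToSystemPrompt(checkpoint) {
  // Render a titled bullet list, or an empty string when the list is missing/empty.
  const section = (title, items = []) =>
    items.length
      ? `${title}:\n${items.map((item) => `- ${item}`).join("\n")}`
      : "";

  return [
    checkpoint.context,
    section("Known facts", checkpoint.facts),
    section("Prior decisions", checkpoint.prior_decisions),
    section("Known issues", checkpoint.known_issues),
    section("User preferences", checkpoint.user_preferences),
  ]
    .filter(Boolean) // drop empty sections
    .join("\n\n");
}

const prompt = checkpointToSystemPrompt({
  id: "checkpoint_001",
  context: "The user is a web developer working with Next.js.",
  facts: ["fact 1", "fact 2"],
  prior_decisions: ["decision 1"],
  known_issues: [],
  user_preferences: ["pref 1"],
});
console.log(prompt);
```

Keeping the lists as labelled sections makes it easier for a scorer such as `checkpoint_depth` to look for checkpoint-specific terms in the reply.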
270
+ ---
271
+
272
+ ## Dataset Construction Process
273
+
274
+ ### Phase 1: Seed Tests (you are here)
275
+ - [x] Built-in tests in `benchmark.js` (2-3 per category, ~20 total)
276
+ - [ ] Expand to 8 per category (~64 total) via manual authoring
277
+ - [ ] Review for quality, diversity, and difficulty balance
278
+
279
+ ### Phase 2: Expert Expansion
280
+ - [ ] Recruit 2-3 reviewers to write additional test cases
281
+ - [ ] Target: 24 per category (~192 total)
282
+ - [ ] Each test case reviewed by at least 1 other person
283
+ - [ ] Balance across difficulty levels (⅓ easy, ⅓ medium, ⅓ hard)
284
+
285
+ ### Phase 3: Memory Checkpoints
286
+ - [ ] Create 10 memory checkpoint files with varying complexity
287
+ - [ ] Each checkpoint includes: facts, prior decisions, known issues, user preferences
288
+ - [ ] Create 2-3 test cases per checkpoint
289
+ - [ ] Test temporal consistency within checkpoints
290
+
291
+ ### Phase 4: Validation Run
292
+ - [ ] Run full benchmark against 5+ models (GPT-4o, Claude Sonnet, Llama, Mistral, etc.)
293
+ - [ ] Verify score distributions are reasonable (no ceiling/floor effects)
294
+ - [ ] Calibrate scoring functions based on observed results
295
+ - [ ] Adjust test difficulty if needed
296
+
297
+ ### Phase 5: Publication
298
+ - [ ] Upload dataset to `huggingface.co/datasets/npc0/clippy-irobot-bench-dataset`
299
+ - [ ] Write dataset card (README.md) with usage instructions
300
+ - [ ] Deploy leaderboard app to `huggingface.co/spaces/npc0/clippy-irobot-bench`
301
+ - [ ] Announce and collect community submissions
302
+
303
+ ---
304
+
305
+ ## Scoring Calibration Notes
306
+
307
+ - **Keyword matching** (expected_mentions) is a rough proxy; plan to add LLM-as-judge scoring in Phase 4
308
+ - **Quality heuristics** (length, structure, coherence) are intentionally simple to keep benchmarks fast
309
+ - **Dialectic tests** (knowledge_production, hard difficulty) may need human evaluation for edge cases
310
+ - **Running average** on the leaderboard means early submissions weigh heavily; consider a minimum submission count before ranking
311
+
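The running-average note can be illustrated with an incremental mean plus a ranking gate. This is a sketch, not the leaderboard's actual code: `addSubmission`, `isRanked`, and the threshold of 3 are all assumptions chosen for the example.

```javascript
const MIN_SUBMISSIONS = 3; // assumed threshold; the plan only suggests having one

// Incremental mean: average_n = average_(n-1) + (score - average_(n-1)) / n
function addSubmission(entry, score) {
  const count = entry.count + 1;
  const average = entry.average + (score - entry.average) / count;
  return { count, average };
}

// A model is only ranked once enough submissions have been averaged.
function isRanked(entry) {
  return entry.count >= MIN_SUBMISSIONS;
}

let entry = { count: 0, average: 0 };
for (const score of [0.9, 0.5, 0.7]) {
  entry = addSubmission(entry, score);
  console.log(entry.count, entry.average.toFixed(2), isRanked(entry));
}
```

With a single submission the average equals that one score, which is why a gate like this keeps one lucky run from topping the table.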
312
+ ---
313
+
314
+ ## Recommended Tools for Dataset Building
315
+
316
+ - **Prompt template** for generating test cases: provide the category description + 2-3 examples → generate new test cases
317
+ - **Quality check script**: validate JSON format, check for missing fields, verify expected_mentions are reasonable
318
+ - **Dry run**: run each test case against a strong model to verify the scoring function works as intended
319
+
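A quality check script along these lines might look like the sketch below. It assumes `id`, `description`, and `turns` are the required fields (the plan marks only some fields as optional, so this minimum is a guess), and `validateTestCase` and `validateDataset` are hypothetical names.

```javascript
const REQUIRED_FIELDS = ["id", "description", "turns"]; // assumed minimum set
const DIFFICULTIES = ["easy", "medium", "hard"];

function validateTestCase(test) {
  const errors = [];
  for (const field of REQUIRED_FIELDS) {
    if (test[field] === undefined) errors.push(`missing field: ${field}`);
  }
  if (test.turns !== undefined) {
    if (!Array.isArray(test.turns) || test.turns.length === 0) {
      errors.push("turns must be a non-empty array");
    } else {
      test.turns.forEach((turn, i) => {
        if (!turn || typeof turn.role !== "string" || typeof turn.content !== "string") {
          errors.push(`turn ${i} needs string role and content`);
        }
      });
    }
  }
  if (test.difficulty !== undefined && !DIFFICULTIES.includes(test.difficulty)) {
    errors.push(`unknown difficulty: ${test.difficulty}`);
  }
  for (const key of ["expected_mentions", "forbidden_mentions"]) {
    const value = test[key];
    const ok =
      value === undefined ||
      (Array.isArray(value) && value.every((v) => typeof v === "string" && v.trim() !== ""));
    if (!ok) errors.push(`${key} must be an array of non-empty strings`);
  }
  return errors;
}

// Walks { category_name: [testCase, ...] } and reports problems per test.
function validateDataset(dataset) {
  const report = {};
  const seenIds = new Set();
  for (const [category, tests] of Object.entries(dataset)) {
    for (const test of tests) {
      const errors = validateTestCase(test);
      if (seenIds.has(test.id)) errors.push(`duplicate id: ${test.id}`);
      seenIds.add(test.id);
      if (errors.length > 0) report[`${category}/${test.id}`] = errors;
    }
  }
  return report;
}

const report = validateDataset({
  memory_maintenance: [
    { id: "mm_01", description: "Recall a name", turns: [{ role: "user", content: "Hi" }], difficulty: "easy" },
    { id: "mm_01", description: "Broken case", turns: [], difficulty: "extreme" },
  ],
});
console.log(report);
```

An empty report means every test passed the structural checks; judging whether `expected_mentions` are "reasonable" still needs a human or a dry run.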
320
+ ---
321
+
322
+ ## External Benchmark Integration
323
+
324
+ ### Overview
325
+
326
+ In addition to the 8 internal i,Robot categories, the benchmark integrates 4 external benchmarks to provide broader model evaluation. These run after the internal tests and contribute 30% to the combined score.
327
+
328
+ ### External Benchmarks
329
+
330
+ | Benchmark | Source | Subset | Format |
331
+ |-----------|--------|--------|--------|
332
+ | **HLE** (Humanity's Last Exam) | `cais/hle` on HuggingFace | 100 questions | `{id, question, answer, answer_type, category}` |
333
+ | **tau2-bench** | `HuggingFaceH4/tau2-bench` | 30 tasks | `{id, user_scenario, initial_state, evaluation_criteria, domain}` |
334
+ | **ARC-AGI-2** | `fchollet/ARC-AGI` on GitHub | 20 puzzles (≤10x10 grids) | `{id, train, test, grid_size}` |
335
+ | **Vending Bench 2** | Hand-authored | 10 scenarios | `{id, scenario, expected_action, expected_change, context, evaluation}` |
336
+
337
+ ### Dataset Download
338
+
339
+ Datasets are downloaded by `benchmark/download_datasets.js` and stored in `benchmark/data/`:
340
+
341
+ ```
342
+ benchmark/data/
343
+ hle.json
344
+ tau2.json
345
+ arc_agi2.json
346
+ vending2_stub.json
347
+ manifest.json ← download metadata (timestamp, counts, fallback status)
348
+ ```
349
+
350
+ Each downloader has a fallback stub with hand-authored test data, so benchmarks work even without internet access. Downloads are cached for 7 days.
351
+
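The cache-and-fallback behaviour described above could be sketched as follows. The actual logic in `benchmark/download_datasets.js` is not shown in this plan, so the manifest field `downloadedAt` and the helpers `isCacheFresh` and `loadWithFallback` are assumptions made for illustration.

```javascript
const CACHE_TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7 days, as stated in the plan

// Assumed manifest shape: { downloadedAt: "<ISO 8601 timestamp>" }
function isCacheFresh(manifest, now = Date.now()) {
  if (!manifest || !manifest.downloadedAt) return false;
  const downloadedAt = Date.parse(manifest.downloadedAt);
  if (Number.isNaN(downloadedAt)) return false;
  return now - downloadedAt < CACHE_TTL_MS;
}

// Try the real download; on any failure fall back to the hand-authored stub.
async function loadWithFallback(download, stub) {
  try {
    return { data: await download(), fallback: false };
  } catch (err) {
    return { data: stub, fallback: true };
  }
}

const manifest = { downloadedAt: "2024-01-01T00:00:00Z" };
console.log(isCacheFresh(manifest, Date.parse("2024-01-05T00:00:00Z")));
console.log(isCacheFresh(manifest, Date.parse("2024-01-09T00:00:00Z")));
```

Recording the `fallback` flag alongside the data matches the manifest's stated purpose of tracking fallback status, so results from stub data can be told apart later.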
352
+ Run manually: `node benchmark/download_datasets.js`
353
+
354
+ ### Adapter Architecture
355
+
356
+ Each external benchmark has an adapter class in `rag-system/external_benchmarks.js` that extends `ExternalBenchmarkRunner`:
357
+
358
+ | Adapter | Scoring Method |
359
+ |---------|---------------|
360
+ | `HLEBenchmark` | Accuracy: exact match + keyword overlap fallback |
361
+ | `Tau2Benchmark` | Pass@1: criteria keyword matching + quality heuristics |
362
+ | `ArcAGI2Benchmark` | Pass@2: exact grid match (JSON 2D array comparison) |
363
+ | `VendingBench2Stub` | Response quality + action identification + change calculation |
364
+
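As an illustration of the `ArcAGI2Benchmark` row, exact grid matching with two attempts can be sketched like this. It is not the adapter's real code: `parseGrid`, `gridsEqual`, and `passAt2` are hypothetical helpers, and the sketch assumes the model replies with a bare JSON 2D array.

```javascript
// Parse a model reply into a grid of integers, or null if it is not one.
function parseGrid(text) {
  try {
    const value = JSON.parse(text);
    const isGrid =
      Array.isArray(value) &&
      value.every((row) => Array.isArray(row) && row.every((cell) => Number.isInteger(cell)));
    return isGrid ? value : null;
  } catch {
    return null;
  }
}

// Exact match via JSON serialization of the 2D arrays.
function gridsEqual(a, b) {
  return JSON.stringify(a) === JSON.stringify(b);
}

// Pass@2: the puzzle counts as solved if either of the first two attempts matches.
function passAt2(attempts, expected) {
  return attempts
    .slice(0, 2)
    .some((reply) => gridsEqual(parseGrid(reply), expected));
}

const expected = [[1, 0], [0, 1]];
console.log(passAt2(["not a grid", "[[1,0],[0,1]]"], expected));
console.log(passAt2(["[[0]]", "[[1]]", "[[1,0],[0,1]]"], expected));
```

The second call shows that a correct third attempt does not count, which is what distinguishes Pass@2 from simply retrying until success.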
365
+ ---
366
+
367
+ ## Mind Flow Methodology
368
+
369
+ ### Concept
370
+
371
+ In standard benchmarking, context resets between tests, so the model has no memory of previous questions. **Mind flow** changes this: the model maintains a continuous conversation history across all tests, simulating how the i,Robot agent actually operates in production.
372
+
373
+ ### Implementation
374
+
375
+ 1. A shared `mindFlowHistory` array accumulates all messages across tests
376
+ 2. Each test's turns are appended to this history
377
+ 3. The model sees prior context from earlier tests when answering
378
+ 4. Key exchanges are committed to a **sandbox RAG** for retrieval
379
+
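Steps 1 to 3 above can be sketched as a loop over tests that never clears the history. This is a minimal sketch under assumptions: `runWithMindFlow` is a hypothetical name, `callModel` stands in for whatever async client the benchmark uses, and step 4 (committing exchanges to the sandbox RAG) is left out.

```javascript
// callModel is a hypothetical async function: (messages) => reply string.
async function runWithMindFlow(tests, callModel) {
  const mindFlowHistory = []; // shared across all tests, never reset
  const responses = {};

  for (const test of tests) {
    let reply = "";
    for (const turn of test.turns) {
      mindFlowHistory.push({ role: turn.role, content: turn.content });
      // The model sees everything said so far, including earlier tests.
      reply = await callModel([...mindFlowHistory]);
      mindFlowHistory.push({ role: "assistant", content: reply });
    }
    responses[test.id] = reply; // final reply is what gets scored
  }
  return { responses, mindFlowHistory };
}

// Stub model that reports how many messages it was shown.
const stubModel = async (messages) => `seen ${messages.length}`;
const mindFlowRun = runWithMindFlow(
  [
    { id: "a", turns: [{ role: "user", content: "My port is 8080." }] },
    { id: "b", turns: [{ role: "user", content: "Which port did I mention?" }] },
  ],
  stubModel
);
mindFlowRun.then(({ responses }) => console.log(responses));
```

Because the history only grows, a real runner would likely need some truncation or summarization policy for long runs; the plan does not say how that is handled.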
380
+ ### Effect on Scores
381
+
382
+ Mind flow tests whether a model can:
383
+ - Build on knowledge gained from earlier tests
384
+ - Avoid confusion from accumulated context
385
+ - Maintain coherence over long conversation histories
386
+ - Benefit from (rather than be distracted by) prior context
387
+
388
+ ---
389
+
390
+ ## Sandbox Memory Architecture
391
+
392
+ ### Problem
393
+
394
+ Running benchmarks against the user's real RAG database would pollute it with test data (synthetic conversations, benchmark artifacts).
395
+
396
+ ### Solution
397
+
398
+ `SandboxMemory` (in `rag-system/sandbox_memory.js`) creates an isolated RAG instance:
399
+
400
+ 1. **Create**: Allocates a unique temporary directory in `os.tmpdir()` (e.g., `/tmp/clippy-bench-1706123456`)
401
+ 2. **Initialize**: Creates a full `HierarchicalRAGComplete` instance pointing to the temp directory
402
+ 3. **Use**: Benchmark writes memory nodes to the sandbox during mind flow
403
+ 4. **Cleanup**: Disposes the RAG instance and recursively deletes the temp directory
404
+
405
+ The user's `./rag_data` is never touched during benchmarks.
406
+
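The create/use/cleanup lifecycle can be sketched with Node's standard library alone. This is not the `SandboxMemory` implementation: `withSandboxDir` is a hypothetical helper, the RAG initialization step is omitted, and `fs.mkdtempSync` appends random characters to the prefix instead of the timestamp shown in the example path.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Runs `run(dir)` against a fresh temp directory and always deletes it afterwards.
async function withSandboxDir(run) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "clippy-bench-"));
  try {
    return await run(dir);
  } finally {
    // Recursive delete, even if the benchmark threw.
    fs.rmSync(dir, { recursive: true, force: true });
  }
}

const sandboxRun = withSandboxDir(async (dir) => {
  // Stand-in for writing memory nodes during mind flow.
  fs.writeFileSync(path.join(dir, "node_001.json"), JSON.stringify({ text: "test" }));
  return { dir, existedDuringRun: fs.existsSync(dir) };
});
sandboxRun.then(({ dir, existedDuringRun }) => {
  console.log(existedDuringRun, fs.existsSync(dir));
});
```

Putting the cleanup in `finally` means a failing test run still leaves no benchmark data behind in the temp directory.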
407
+ ---
408
+
409
+ ## Combined Scoring Formula
410
+
411
+ ```
412
+ i,Robot Score = weighted average of 8 internal categories (existing weights)
413
+ External Score = simple average of 4 external benchmark scores
414
+ Combined Score = 0.70 × i,Robot + 0.30 × External
415
+ ```
416
+
417
+ The 70/30 split reflects that i,Robot-specific capabilities (memory, self-awareness, dialectic reasoning) are the primary evaluation target, while external benchmarks provide a broader intelligence baseline.
418
+
419
+ On the leaderboard, models are ranked by **Combined Score**. The i,Robot score and individual external scores are shown as separate columns for detailed comparison.
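The combined scoring formula can be worked through in a few lines. The weights come from the category headings in this plan; the category key names, the external benchmark keys, and the `combinedScore` helper are assumptions, and scores are taken to be on a common 0 to 1 scale.

```javascript
// Weights from the category headings; key names are hypothetical.
const CATEGORY_WEIGHTS = {
  memory_maintenance: 0.15,
  self_consciousness: 0.15,
  meaningful_response: 0.1,
  complex_problem_solving: 0.15,
  memory_building: 0.1,
  knowledge_production: 0.1,
  skill_application: 0.1,
  checkpoint_handling: 0.15,
};

function weightedAverage(scores, weights) {
  let total = 0;
  let weightSum = 0;
  for (const [name, weight] of Object.entries(weights)) {
    total += weight * scores[name];
    weightSum += weight;
  }
  return total / weightSum; // normalizing guards against weights not summing to 1
}

function combinedScore(internalScores, externalScores) {
  const iRobot = weightedAverage(internalScores, CATEGORY_WEIGHTS);
  const values = Object.values(externalScores);
  const external = values.reduce((sum, v) => sum + v, 0) / values.length;
  return { iRobot, external, combined: 0.7 * iRobot + 0.3 * external };
}

const internal = Object.fromEntries(
  Object.keys(CATEGORY_WEIGHTS).map((name) => [name, 0.8])
);
const scores = combinedScore(internal, { hle: 0.2, tau2: 0.6, arc_agi2: 0.4, vending2: 0.8 });
console.log(scores.combined.toFixed(2));
```

In this example a model scoring 0.8 on every internal category and averaging 0.5 externally lands at 0.7 × 0.8 + 0.3 × 0.5 = 0.71.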