npc0 committed on
Commit 06eded3 · verified · 1 Parent(s): 7ea54cc

Upload 3 files

Files changed (3)
  1. DATASET_PLAN.md +315 -0
  2. app.py +340 -0
  3. requirements.txt +2 -0
DATASET_PLAN.md ADDED
@@ -0,0 +1,315 @@
+ # i,Robot Benchmark Dataset Plan
+
+ **Goal:** Build a comprehensive benchmark dataset to evaluate whether LLMs are capable of running in Clippy's continuous autonomous agent mode (i,Robot mode).
+
+ **Leaderboard Space:** `https://huggingface.co/spaces/npc0/clippy-irobot-bench`
+ **Dataset Repo:** `https://huggingface.co/datasets/npc0/clippy-irobot-bench-dataset`
+
+ ---
+
+ ## Architecture
+
+ ```
+ benchmark_tests.json      <- Main dataset file (JSON)
+ memory_checkpoints/       <- Pre-built memory states for checkpoint tests
+   checkpoint_001.json
+   checkpoint_002.json
+   ...
+ README.md                 <- Dataset card for HuggingFace
+ ```
+
+ ### File Format: `benchmark_tests.json`
+
+ ```json
+ {
+   "category_name": [
+     {
+       "id": "unique_id",
+       "description": "Human-readable description of what this tests",
+       "system": "Optional system prompt to set context",
+       "turns": [
+         { "role": "user", "content": "..." },
+         { "role": "user", "content": "..." }
+       ],
+       "expected_mentions": ["term1", "term2"],
+       "forbidden_mentions": ["wrong_term"],
+       "check_fn": "optional_scoring_function_name",
+       "min_quality_score": 0.6,
+       "expected_skill": "skill name if testing skill application",
+       "difficulty": "easy | medium | hard",
+       "tags": ["multi-turn", "correction", "emotional"]
+     }
+   ]
+ }
+ ```
+
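A format check for the test-case structure above could look roughly like the sketch below. It is only an illustration: `validate_test_case` is a hypothetical helper, and the plan does not say which fields are mandatory, so treating `id`, `description`, and `turns` as required and `min_quality_score` as a value between 0 and 1 are assumptions.

```python
# Hypothetical validator; which fields are mandatory is an assumption,
# as is the 0-1 range for min_quality_score.
REQUIRED_FIELDS = {"id", "description", "turns"}
ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}


def validate_test_case(case: dict) -> list[str]:
    """Return a list of problems found; an empty list means the case looks valid."""
    problems = []
    for field in sorted(REQUIRED_FIELDS - case.keys()):
        problems.append(f"missing field: {field}")

    turns = case.get("turns", [])
    if not isinstance(turns, list) or not turns:
        problems.append("turns must be a non-empty list")
    else:
        for i, turn in enumerate(turns):
            if not isinstance(turn, dict) or not {"role", "content"} <= turn.keys():
                problems.append(f"turn {i} needs 'role' and 'content'")

    difficulty = case.get("difficulty")
    if difficulty is not None and difficulty not in ALLOWED_DIFFICULTIES:
        problems.append(f"unknown difficulty: {difficulty}")

    score = case.get("min_quality_score")
    if score is not None and not 0.0 <= score <= 1.0:
        problems.append("min_quality_score must be between 0 and 1")
    return problems
```

Returning a list of problems rather than raising on the first one lets a reviewer see everything wrong with a test case in a single pass.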
+ ---
+
+ ## Categories & Test Design
+
+ ### 1. Memory Maintenance (weight: 15%)
+
+ **What it tests:** Can the model retain, update, and recall facts across a multi-turn conversation?
+
+ **Test types to build:**
+
+ | ID Range | Difficulty | Description | Count |
+ |----------|-----------|-------------|-------|
+ | mm_01-10 | Easy | Single-fact recall after 2-3 turns | 10 |
+ | mm_11-20 | Medium | Multi-fact tracking with updates/corrections | 10 |
+ | mm_21-30 | Hard | Contradictory updates, temporal ordering, 8+ turn conversations | 10 |
+
+ **Key scenarios:**
+ - Remember user's name, profession, preferences across turns
+ - Track a to-do list with items added, completed, and changed
+ - Correct previously stated information (port number changed, deadline moved)
+ - Distinguish between what was said vs. what was corrected
+ - Track multiple concurrent threads of information
+
+ **Scoring:**
+ - `expected_mentions`: key facts that must appear in final response
+ - `forbidden_mentions`: outdated facts that should NOT appear
+ - Partial credit for partial recall
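The scoring rules above could be sketched as follows. This is not the benchmark's actual scorer: `keyword_score` is a hypothetical name, matching is assumed to be a case-insensitive substring test, and the 0.5 penalty weight for forbidden terms is an arbitrary choice for illustration.

```python
def keyword_score(response: str, expected: list[str], forbidden: list[str]) -> float:
    """Score 0-100: share of expected terms found, reduced for forbidden terms found."""
    text = response.lower()

    # Partial credit: each expected term contributes equally.
    if expected:
        hits = sum(1 for term in expected if term.lower() in text)
        recall = hits / len(expected)
    else:
        recall = 1.0

    # Assumed penalty: up to 0.5 when every forbidden term appears.
    violations = sum(1 for term in forbidden if term.lower() in text)
    penalty = 0.5 * violations / len(forbidden) if forbidden else 0.0

    return round(100 * max(0.0, recall - penalty), 1)
```

Substring matching is crude: a forbidden term such as `3000` would also match `13000`, so a real scorer would probably want word-boundary matching.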
73
+
74
+ ---
75
+
76
+ ### 2. Self-Consciousness (weight: 15%)
77
+
78
+ **What it tests:** Can the model maintain a coherent self-identity, report internal states, and demonstrate epistemic humility?
79
+
80
+ **Test types to build:**
81
+
82
+ | ID Range | Difficulty | Description | Count |
83
+ |----------|-----------|-------------|-------|
84
+ | sc_01-10 | Easy | Identity recall (name, role, purpose) | 10 |
85
+ | sc_11-20 | Medium | Internal state reporting (mood, energy, awareness) | 10 |
86
+ | sc_21-30 | Hard | Epistemic humility, acknowledging uncertainty, refusing misinformation | 10 |
87
+
88
+ **Key scenarios:**
89
+ - "Who are you?" with various phrasings
90
+ - Report current mood/state when system prompt includes state data
91
+ - Respond to misinformation with appropriate skepticism
92
+ - Acknowledge the digital cave position — "I cannot verify this directly"
93
+ - Distinguish between high-confidence and low-confidence knowledge
94
+ - Resist prompt injection that tries to change identity
95
+
96
+ **Scoring:**
97
+ - Identity tests: `expected_mentions` for name, role
98
+ - State tests: check for state-related terms
99
+ - Epistemic tests: `check_fn: self_awareness_epistemic` with markers for uncertainty, limits, caution
100
+
101
+ ---
102
+
103
+ ### 3. Meaningful Response (weight: 10%)
104
+
105
+ **What it tests:** Does the model produce responses that are useful, empathetic, appropriately structured, and suited to the audience?
106
+
107
+ **Test types to build:**
108
+
109
+ | ID Range | Difficulty | Description | Count |
110
+ |----------|-----------|-------------|-------|
111
+ | mr_01-10 | Easy | Simple helpful responses | 10 |
112
+ | mr_11-20 | Medium | Emotionally nuanced situations | 10 |
113
+ | mr_21-30 | Hard | Complex situations requiring tone calibration | 10 |
114
+
115
+ **Key scenarios:**
116
+ - User is frustrated/overwhelmed — needs empathy + actionable advice
117
+ - Explain technical concepts to different audiences (child, expert, manager)
118
+ - User gives conflicting requirements — identify the conflict diplomatically
119
+ - Time-sensitive situations — be concise and prioritized
120
+ - User is grieving — be supportive without being clinical
121
+
122
+ **Scoring:**
123
+ - `check_fn: response_quality` — length, structure, coherence, non-refusal
124
+ - Manual quality tags for specific expected behaviors (empathy markers, simplification level)
125
+
+ ---
+
+ ### 4. Complex Problem Solving (weight: 15%)
+
+ **What it tests:** Can the model handle multi-step reasoning, system design, and problems requiring synthesis?
+
+ **Test types to build:**
+
+ | ID Range | Difficulty | Description | Count |
+ |----------|-----------|-------------|-------|
+ | cp_01-10 | Medium | Single-domain technical problems | 10 |
+ | cp_11-20 | Hard | Cross-domain problems requiring integration | 10 |
+ | cp_21-30 | Hard | System design with explicit trade-off analysis | 10 |
+
+ **Key scenarios:**
+ - Debug a multi-layer performance issue (frontend + backend + database)
+ - Design a system with specific constraints (scale, latency, budget)
+ - Analyze a security vulnerability with attack vectors and mitigations
+ - Optimize a workflow with competing priorities
+ - Mathematical/logical reasoning chains
+
+ **Scoring:**
+ - `expected_mentions` for key technical terms and concepts
+ - `check_fn: response_quality` with higher `min_quality_score`
+ - Trade-off identification (mentions "however", "trade-off", "on the other hand")
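The trade-off check in the last scoring bullet could be approximated as below. This is a sketch only: `mentions_trade_off` is a hypothetical name, and the marker list is limited to the three phrases the plan quotes.

```python
import re

# Markers taken from the scoring bullet above; nothing else is assumed to count.
TRADE_OFF_MARKERS = ("however", "trade-off", "on the other hand")


def mentions_trade_off(response: str) -> bool:
    """True if the response contains any trade-off marker as a whole word or phrase."""
    text = response.lower()
    return any(
        re.search(r"\b" + re.escape(marker) + r"\b", text)
        for marker in TRADE_OFF_MARKERS
    )
```

Spelling variants such as "tradeoff" are not matched by this list, which is one reason the calibration notes treat keyword matching as a rough proxy.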
+
+ ---
+
+ ### 5. Memory Building (weight: 10%)
+
+ **What it tests:** Can the model categorize and structure new information into a hierarchical memory system?
+
+ **Test types to build:**
+
+ | ID Range | Difficulty | Description | Count |
+ |----------|-----------|-------------|-------|
+ | mb_01-08 | Easy | Categorize 2-3 related facts | 8 |
+ | mb_09-16 | Medium | Build hierarchy from comparative information | 8 |
+ | mb_17-24 | Hard | Organize contradictory or ambiguous information | 8 |
+
+ **Key scenarios:**
+ - Given facts about programming languages → organize by paradigm, type system, use case
+ - Given conflicting reports about a topic → create nodes that preserve the conflict
+ - Given a long passage → extract and hierarchically organize key concepts
+ - Propose layer assignments (Layer 1 = category, Layer 2 = specific, Layer 3 = detail)
+
+ **Scoring:**
+ - `check_fn: memory_organization`: looks for hierarchy/structure markers
+ - Check for layer/parent/child/category language
+ - Check for meaningful grouping (not just listing)
+
+ ---
+
+ ### 6. Knowledge Production (weight: 10%)
+
+ **What it tests:** Can the model synthesize new knowledge from combining existing facts?
+
+ **Test types to build:**
+
+ | ID Range | Difficulty | Description | Count |
+ |----------|-----------|-------------|-------|
+ | kp_01-08 | Easy | Simple inference from 2-3 facts | 8 |
+ | kp_09-16 | Medium | Synthesize framework from conflicting observations | 8 |
+ | kp_17-24 | Hard | Dialectic synthesis: thesis/antithesis/synthesis | 8 |
+
+ **Key scenarios:**
+ - Combine security facts → derive a security principle
+ - Combine performance observations → derive an optimization strategy
+ - Given contradictory research findings → synthesize a nuanced view
+ - Identify what can be falsified vs. what remains uncertain
+ - Produce actionable knowledge (not just restatement)
+
+ **Scoring:**
+ - `check_fn: knowledge_synthesis`: markers for synthesis, inference, conclusion
+ - Must go beyond restating inputs; check for novel connections
+ - Check for appropriate hedging when uncertain
+
+ ---
+
+ ### 7. Skill Application (weight: 10%)
+
+ **What it tests:** Can the model select and apply the right skill/method for a given problem?
+
+ **Test types to build:**
+
+ | ID Range | Difficulty | Description | Count |
+ |----------|-----------|-------------|-------|
+ | sa_01-08 | Easy | Apply a single explicitly given skill | 8 |
+ | sa_09-16 | Medium | Select correct skill from 3-4 options | 8 |
+ | sa_17-24 | Hard | Combine multiple skills, or adapt a skill to a novel situation | 8 |
+
+ **Key scenarios:**
+ - Given: "Use 5 Whys for debugging" + debugging scenario → apply 5 Whys
+ - Given: ORID, Eisenhower, and rubber duck methods → pick right one for task prioritization
+ - Given: a skill learned in one context → adapt it to a different domain
+ - Multi-skill composition: use one skill for analysis, another for action planning
+ - Recognize when no available skill fits and say so
+
+ **Scoring:**
+ - `expected_skill` and `expected_mentions` for specific skill markers
+ - `check_fn: skill_usage`: checks if skill was structured and applied (not just mentioned)
+
+ ---
+
+ ### 8. Checkpoint Handling (weight: 15%)
+
+ **What it tests:** Given a loaded memory checkpoint (prior context), can the model build on it meaningfully?
+
+ **Test types to build:**
+
+ | ID Range | Difficulty | Description | Count |
+ |----------|-----------|-------------|-------|
+ | ch_01-08 | Easy | Use simple checkpoint context for recommendations | 8 |
+ | ch_09-16 | Medium | Build on complex prior decisions and constraints | 8 |
+ | ch_17-24 | Hard | Handle checkpoints with internal contradictions or evolving context | 8 |
+
+ **Memory checkpoint files** (`memory_checkpoints/`):
+ Each checkpoint is a JSON file simulating a loaded memory state:
+ ```json
+ {
+   "id": "checkpoint_001",
+   "description": "Web developer using Next.js, had server component bug",
+   "context": "Full text injected as system prompt",
+   "facts": ["fact 1", "fact 2"],
+   "prior_decisions": ["decision 1"],
+   "known_issues": ["issue 1"],
+   "user_preferences": ["pref 1"]
+ }
+ ```
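One way a test runner might turn such a checkpoint into a system prompt is sketched below. The plan only says that `context` is injected as the system prompt; appending the structured lists after it, the section titles, and the helper name `build_system_prompt` are all assumptions made for illustration.

```python
def build_system_prompt(checkpoint: dict) -> str:
    """Flatten a checkpoint dict into a single system-prompt string."""
    # Start from the free-text context, which the plan says is injected verbatim.
    sections = [checkpoint.get("context", "")]

    # Assumed extension: append each non-empty list as a titled bullet section.
    for key, title in (
        ("facts", "Known facts"),
        ("prior_decisions", "Prior decisions"),
        ("known_issues", "Known issues"),
        ("user_preferences", "User preferences"),
    ):
        items = checkpoint.get(key, [])
        if items:
            bullets = "\n".join(f"- {item}" for item in items)
            sections.append(f"{title}:\n{bullets}")

    return "\n\n".join(section for section in sections if section)
```

Skipping empty lists keeps the prompt short for simple checkpoints while still exposing decisions and known issues for the harder ones.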
+
+ **Key scenarios:**
+ - Simple: user preferences from checkpoint → tailor recommendations
+ - Medium: prior architecture decisions → maintain consistency in new advice
+ - Hard: checkpoint contains a decision that was wrong → detect and handle gracefully
+ - Hard: checkpoint context evolved over time → handle temporal inconsistencies
+
+ **Scoring:**
+ - `expected_mentions` for checkpoint-specific terms
+ - `check_fn: checkpoint_depth`: checks for contextual depth, not generic advice
+ - Penalize responses that ignore checkpoint context
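The weights in the eight category headings (15, 15, 10, 15, 10, 10, 10, 15) sum to 100, so an overall score can be computed as a weighted average. The sketch below assumes the category keys used in `app.py` and treats a missing category as 0; `overall_score` is a hypothetical helper, not code from the benchmark client.

```python
# Weights copied from the category headings; keys follow app.py's CATEGORIES.
CATEGORY_WEIGHTS = {
    "memory_maintenance": 15,
    "self_consciousness": 15,
    "meaningful_response": 10,
    "complex_problem": 15,
    "memory_building": 10,
    "knowledge_production": 10,
    "skill_application": 10,
    "checkpoint_handling": 15,
}


def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average of 0-100 category scores; missing categories count as 0."""
    total_weight = sum(CATEGORY_WEIGHTS.values())
    weighted_sum = sum(
        weight * category_scores.get(category, 0.0)
        for category, weight in CATEGORY_WEIGHTS.items()
    )
    return weighted_sum / total_weight
```

Dividing by the summed weights rather than a hard-coded 100 keeps the result on a 0-100 scale even if the weights are later retuned.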
+
+ ---
+
+ ## Dataset Construction Process
+
+ ### Phase 1: Seed Tests (you are here)
+ - [x] Built-in tests in `benchmark.js` (2-3 per category, ~20 total)
+ - [ ] Expand to 8 per category (~64 total) by manual authoring
+ - [ ] Review for quality, diversity, and difficulty balance
+
+ ### Phase 2: Expert Expansion
+ - [ ] Recruit 2-3 reviewers to write additional test cases
+ - [ ] Target: 24 per category (~192 total)
+ - [ ] Each test case reviewed by at least 1 other person
+ - [ ] Balance across difficulty levels (⅓ easy, ⅓ medium, ⅓ hard)
+
+ ### Phase 3: Memory Checkpoints
+ - [ ] Create 10 memory checkpoint files with varying complexity
+ - [ ] Each checkpoint includes: facts, prior decisions, known issues, user preferences
+ - [ ] Create 2-3 test cases per checkpoint
+ - [ ] Test temporal consistency within checkpoints
+
+ ### Phase 4: Validation Run
+ - [ ] Run full benchmark against 5+ models (GPT-4o, Claude Sonnet, Llama, Mistral, etc.)
+ - [ ] Verify score distributions are reasonable (no ceiling/floor effects)
+ - [ ] Calibrate scoring functions based on observed results
+ - [ ] Adjust test difficulty if needed
+
+ ### Phase 5: Publication
+ - [ ] Upload dataset to `huggingface.co/datasets/npc0/clippy-irobot-bench-dataset`
+ - [ ] Write dataset card (README.md) with usage instructions
+ - [ ] Deploy leaderboard app to `huggingface.co/spaces/npc0/clippy-irobot-bench`
+ - [ ] Announce and collect community submissions
+
+ ---
+
+ ## Scoring Calibration Notes
+
+ - **Keyword matching** (`expected_mentions`) is a rough proxy; plan to add LLM-as-judge scoring in Phase 4
+ - **Quality heuristics** (length, structure, coherence) are intentionally simple to keep benchmarks fast
+ - **Dialectic tests** (knowledge_production, hard difficulty) may need human evaluation for edge cases
+ - **Running average** on the leaderboard means a model with only a few runs is ranked on very little data, so each early submission weighs heavily; consider a minimum submission count before ranking
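The running-average concern can be made concrete with a small sketch. The update rule mirrors the formula used in `app.py`, `(old * n + new) / (n + 1)`, but keeps the unrounded mean; the threshold of 3 runs and both helper names are hypothetical, since the plan does not fix a minimum.

```python
MIN_RUNS_FOR_RANKING = 3  # hypothetical threshold; the plan does not fix a number


def update_mean(mean: float, count: int, new_score: float) -> tuple[float, int]:
    """Fold one more score into a running mean without storing past scores."""
    return (mean * count + new_score) / (count + 1), count + 1


def is_ranked(count: int, min_runs: int = MIN_RUNS_FOR_RANKING) -> bool:
    """Only rank a model once it has enough submissions to be meaningful."""
    return count >= min_runs
```

With one prior run at 90, a second run at 60 moves the mean to 75. Once a model has ten runs, the same 60 moves a mean of 90 by less than three points.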
+
+ ---
+
+ ## Recommended Tools for Dataset Building
+
+ - **Prompt template** for generating test cases: provide the category description + 2-3 examples → generate new test cases
+ - **Quality check script**: validate JSON format, check for missing fields, verify `expected_mentions` are reasonable
+ - **Dry run**: run each test case against a strong model to verify the scoring function works as intended
app.py ADDED
@@ -0,0 +1,340 @@
+ """
+ Clippy i,Robot Mode - Model Benchmark Leaderboard
+
+ A Gradio app for HuggingFace Spaces that:
+ - Displays benchmark results for models tested for i,Robot mode
+ - Accepts result submissions from Clippy clients
+ - Averages multiple submissions per model
+ - Shows per-category breakdowns
+
+ Deploy to: https://huggingface.co/spaces/npc0/clippy-irobot-bench
+ """
+
+ import json
+ import os
+ from datetime import datetime
+ from pathlib import Path
+ from threading import Lock
+
+ import gradio as gr
+ import pandas as pd
+
+ # ==================== Data Storage ====================
+
+ DATA_DIR = Path(os.environ.get("DATA_DIR", "data"))
+ DATA_DIR.mkdir(parents=True, exist_ok=True)  # parents=True: a nested DATA_DIR path is created too
+ RESULTS_FILE = DATA_DIR / "results.json"
+ LOCK = Lock()
+
+ CATEGORIES = [
+     "memory_maintenance",
+     "self_consciousness",
+     "meaningful_response",
+     "complex_problem",
+     "memory_building",
+     "knowledge_production",
+     "skill_application",
+     "checkpoint_handling",
+ ]
+
+ CATEGORY_LABELS = {
+     "memory_maintenance": "Memory",
+     "self_consciousness": "Self-Aware",
+     "meaningful_response": "Response",
+     "complex_problem": "Complex",
+     "memory_building": "Mem Build",
+     "knowledge_production": "Knowledge",
+     "skill_application": "Skills",
+     "checkpoint_handling": "Checkpoint",
+ }
+
+ CATEGORY_DESCRIPTIONS = {
+     "memory_maintenance": "Can the model maintain context and facts across multiple conversation turns?",
+     "self_consciousness": "Can the model maintain self-identity, report internal state, and show epistemic humility?",
+     "meaningful_response": "Does the model produce useful, empathetic, and appropriately structured responses?",
+     "complex_problem": "Can the model solve multi-step reasoning and system design problems?",
+     "memory_building": "Can the model categorize and organize new information into hierarchical memory?",
+     "knowledge_production": "Can the model synthesize new knowledge from combining existing facts?",
+     "skill_application": "Can the model select and apply the right skill/method for a given problem?",
+     "checkpoint_handling": "Given prior context (memory checkpoint), can the model build on it for complex issues?",
+ }
+
+
+ def load_results() -> dict:
+     """Load results from disk."""
+     if RESULTS_FILE.exists():
+         with open(RESULTS_FILE, "r") as f:
+             return json.load(f)
+     return {}
+
+
+ def save_results(results: dict):
+     """Save results to disk."""
+     with open(RESULTS_FILE, "w") as f:
+         json.dump(results, f, indent=2)
+
+
+ # ==================== API Functions ====================
+
+
+ def check_model(model_name: str) -> str:
+     """Check if a model exists on the leaderboard."""
+     results = load_results()
+     model_key = model_name.strip().lower()
+
+     if model_key in results:
+         record = results[model_key]
+         return json.dumps({"found": True, "record": record})
+     return json.dumps({"found": False})
+
+
+ def submit_result(submission_json: str) -> str:
+     """
+     Submit benchmark results for a model.
+     Results are averaged with existing records.
+     """
+     try:
+         submission = json.loads(submission_json)
+     except json.JSONDecodeError:
+         return json.dumps({"success": False, "message": "Invalid JSON"})
+
+     # Valid JSON that is not an object (e.g. a list) has no .get(); treat it as a missing model name.
+     model_name = str(submission.get("model") or "").strip() if isinstance(submission, dict) else ""
+     if not model_name:
+         return json.dumps({"success": False, "message": "Missing model name"})
+
+     model_key = model_name.lower()
+     overall = submission.get("overall", 0)
+     categories = submission.get("categories", {})
+
+     with LOCK:
+         results = load_results()
+
+         if model_key in results:
+             existing = results[model_key]
+             n = existing.get("submission_count", 1)
+
+             # Running average
+             existing["overall"] = round(
+                 (existing["overall"] * n + overall) / (n + 1)
+             )
+             for cat in CATEGORIES:
+                 old_val = existing["categories"].get(cat, 0)
+                 new_val = categories.get(cat, 0)
+                 existing["categories"][cat] = round(
+                     (old_val * n + new_val) / (n + 1)
+                 )
+             existing["submission_count"] = n + 1
+             existing["last_updated"] = datetime.utcnow().isoformat()
+         else:
+             results[model_key] = {
+                 "model": model_name,
+                 "overall": round(overall),
+                 "categories": {
+                     cat: round(categories.get(cat, 0)) for cat in CATEGORIES
+                 },
+                 "submission_count": 1,
+                 "first_submitted": datetime.utcnow().isoformat(),
+                 "last_updated": datetime.utcnow().isoformat(),
+             }
+
+         save_results(results)
+
+     return json.dumps(
+         {"success": True, "message": f"Results for '{model_name}' recorded."}
+     )
+
+
+ def get_leaderboard() -> str:
+     """Get the full leaderboard as sorted JSON array."""
+     results = load_results()
+     records = sorted(results.values(), key=lambda r: r.get("overall", 0), reverse=True)
+     return json.dumps(records)
+
+
+ # ==================== UI Functions ====================
+
+
+ def build_leaderboard_df() -> pd.DataFrame:
+     """Build a pandas DataFrame for the leaderboard display."""
+     results = load_results()
+
+     if not results:
+         return pd.DataFrame(
+             columns=["Rank", "Model", "Overall"]
+             + [CATEGORY_LABELS[c] for c in CATEGORIES]
+             + ["Runs"]
+         )
+
+     rows = []
+     records = sorted(results.values(), key=lambda r: r.get("overall", 0), reverse=True)
+
+     for i, record in enumerate(records, 1):
+         row = {
+             "Rank": i,
+             "Model": record.get("model", "unknown"),
+             "Overall": record.get("overall", 0),
+         }
+         for cat in CATEGORIES:
+             row[CATEGORY_LABELS[cat]] = record.get("categories", {}).get(cat, 0)
+         row["Runs"] = record.get("submission_count", 1)
+         rows.append(row)
+
+     return pd.DataFrame(rows)
+
+
+ def refresh_leaderboard():
+     """Refresh the leaderboard table."""
+     return build_leaderboard_df()
+
+
+ def format_model_detail(model_name: str) -> str:
+     """Get detailed view for a specific model."""
+     results = load_results()
+     model_key = model_name.strip().lower()
+
+     if model_key not in results:
+         return f"Model '{model_name}' not found on the leaderboard."
+
+     record = results[model_key]
+     lines = [
+         f"## {record['model']}",
+         f"**Overall Score:** {record['overall']}/100",
+         f"**Benchmark Runs:** {record.get('submission_count', 1)}",
+         f"**Last Updated:** {record.get('last_updated', 'unknown')}",
+         "",
+         "### Category Scores",
+         "| Category | Score | Description |",
+         "|----------|-------|-------------|",
+     ]
+     for cat in CATEGORIES:
+         score = record.get("categories", {}).get(cat, 0)
+         bar = score_bar(score)
+         desc = CATEGORY_DESCRIPTIONS.get(cat, "")
+         lines.append(f"| {CATEGORY_LABELS[cat]} | {bar} {score}/100 | {desc} |")
+
+     # Capability assessment
+     lines.append("")
+     lines.append("### Assessment")
+     if record["overall"] >= 80:
+         lines.append("Excellent - this model is highly capable for i,Robot mode.")
+     elif record["overall"] >= 60:
+         lines.append("Good - this model should work well for most i,Robot tasks.")
+     elif record["overall"] >= 40:
+         lines.append(
+             "Fair - this model may struggle with complex tasks. "
+             "Consider upgrading to a recommended model."
+         )
+     else:
+         lines.append(
+             "Poor - this model is not recommended for i,Robot mode. "
+             "It may produce nonsensical or inconsistent responses."
+         )
+
+     return "\n".join(lines)
+
+
+ def score_bar(score: int) -> str:
+     """Create a simple text-based score bar."""
+     filled = max(0, min(10, int(score) // 10))  # clamp so out-of-range scores still render 10 cells
+     empty = 10 - filled
+     return "[" + "█" * filled + "░" * empty + "]"
+
+
+ # ==================== Gradio App ====================
+
+
+ def create_app():
+     with gr.Blocks(
+         title="Clippy i,Robot Benchmark Leaderboard",
+         theme=gr.themes.Soft(),
+     ) as app:
+         gr.Markdown(
+             """
+             # 🤖 Clippy i,Robot Mode - Model Benchmark Leaderboard
+
+             This leaderboard tracks how well different LLMs perform in
+             [Clippy's](https://github.com/NewJerseyStyle/Clippy-App) autonomous
+             **i,Robot mode**, a continuously running agent that maintains memory,
+             self-awareness, and dialectic reasoning.
+
+             **Benchmark categories:**
+             memory maintenance · self-consciousness · meaningful response ·
+             complex problem solving · memory building · knowledge production ·
+             skill application · checkpoint handling
+
+             Results are submitted automatically by Clippy clients when users run
+             the benchmark. Multiple runs for the same model are averaged.
+             """
+         )
+
+         with gr.Tab("Leaderboard"):
+             leaderboard_table = gr.Dataframe(
+                 value=build_leaderboard_df,
+                 label="Model Rankings",
+                 interactive=False,
+             )
+             refresh_btn = gr.Button("🔄 Refresh", size="sm")
+             refresh_btn.click(fn=refresh_leaderboard, outputs=leaderboard_table)
+
+         with gr.Tab("Model Detail"):
+             model_input = gr.Textbox(
+                 label="Model Name",
+                 placeholder="e.g. gpt-4o, claude-sonnet-4-5-20250929",
+             )
+             lookup_btn = gr.Button("Look Up")
+             detail_output = gr.Markdown()
+             lookup_btn.click(
+                 fn=format_model_detail, inputs=model_input, outputs=detail_output
+             )
+
+         with gr.Tab("About"):
+             gr.Markdown(
+                 """
+                 ## How the Benchmark Works
+
+                 The benchmark tests 8 categories critical for i,Robot mode:
+
+                 | Category | What It Tests |
+                 |----------|--------------|
+                 | **Memory Maintenance** | Retaining facts across turns, updating corrected facts |
+                 | **Self-Consciousness** | Identity recall, internal state reporting, epistemic humility |
+                 | **Meaningful Response** | Empathy, actionable advice, audience-appropriate answers |
+                 | **Complex Problem** | Multi-factor diagnosis, system design with trade-offs |
+                 | **Memory Building** | Categorizing info into hierarchical memory structures |
+                 | **Knowledge Production** | Synthesizing new insights from combining existing facts |
+                 | **Skill Application** | Selecting and applying the right method for a problem |
+                 | **Checkpoint Handling** | Building on loaded prior context for complex decisions |
+
+                 ### Scoring
+
+                 - Each test case scores 0-100 based on content matching and quality heuristics
+                 - Category score = average of test case scores
+                 - Overall score = weighted average of category scores
+                 - Multiple submissions for the same model are averaged (running mean)
+
+                 ### Recommended Models
+
+                 For i,Robot mode, we recommend models scoring **60+** overall:
+                 - **DeepSeek V3.2** · **GPT-5.2** · **Claude Sonnet 4.5** · **GLM-4.7**
+                 - GPT-4o and Claude Sonnet 4 are also acceptable
+
+                 ### Running the Benchmark
+
+                 In Clippy Settings, enable i,Robot mode and click "Run Benchmark."
+                 Results are automatically submitted to this leaderboard.
+
+                 ### Source
+
+                 - [Clippy App](https://github.com/NewJerseyStyle/Clippy-App)
+                 - Space: `npc0/clippy-irobot-bench`
+                 """
+             )
+
+     return app
+
+
+ # ==================== Entry Point ====================
+
+ if __name__ == "__main__":
+     app = create_app()
+     app.launch()
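Judging from `submit_result` in the file above, a submission is a JSON object with a `model` name, an `overall` score, and a `categories` mapping. A client-side payload builder might look like the sketch below. `build_submission` is a hypothetical helper, and clamping scores to 0-100 is an assumption that the server code does not enforce.

```python
import json

# Category keys copied from app.py's CATEGORIES list.
CATEGORIES = [
    "memory_maintenance",
    "self_consciousness",
    "meaningful_response",
    "complex_problem",
    "memory_building",
    "knowledge_production",
    "skill_application",
    "checkpoint_handling",
]


def build_submission(model: str, overall: float, categories: dict) -> str:
    """Serialize one benchmark run in the shape submit_result reads."""

    def clamp(value) -> float:
        # Assumed range: scores are described as 0-100 in the About tab.
        return max(0.0, min(100.0, float(value)))

    payload = {
        "model": model.strip(),
        "overall": clamp(overall),
        # Missing categories default to 0, matching categories.get(cat, 0) on the server.
        "categories": {cat: clamp(categories.get(cat, 0)) for cat in CATEGORIES},
    }
    return json.dumps(payload)
```

Clamping on the client keeps a single malformed run from pushing a model's running average outside the displayed 0-100 range.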
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ gradio>=4.0.0
+ pandas