npc0 committed on Feb 13

Commit 06eded3 · verified · 1 Parent(s): 7ea54cc

Upload 3 files

Files changed (3)

DATASET_PLAN.md +315 -0
app.py +340 -0
requirements.txt +2 -0

DATASET_PLAN.md ADDED

	@@ -0,0 +1,315 @@

+# i,Robot Benchmark Dataset Plan
+**Goal:** Build a comprehensive benchmark dataset to evaluate whether LLMs are capable of running in Clippy's continuous autonomous agent mode (i,Robot mode).
+**Leaderboard Space:** `https://huggingface.co/spaces/npc0/clippy-irobot-bench`
+**Dataset Repo:** `https://huggingface.co/datasets/npc0/clippy-irobot-bench-dataset`
+---
+## Architecture
+```
+benchmark_tests.json          <- Main dataset file (JSON)
+memory_checkpoints/           <- Pre-built memory states for checkpoint tests
+  checkpoint_001.json
+  checkpoint_002.json
+  ...
+README.md                     <- Dataset card for HuggingFace
+```
+### File Format: `benchmark_tests.json`
+```json
+{
+  "category_name": [
+    {
+      "id": "unique_id",
+      "description": "Human-readable description of what this tests",
+      "system": "Optional system prompt to set context",
+      "turns": [
+        { "role": "user", "content": "..." },
+        { "role": "user", "content": "..." }
+      ],
+      "expected_mentions": ["term1", "term2"],
+      "forbidden_mentions": ["wrong_term"],
+      "check_fn": "optional_scoring_function_name",
+      "min_quality_score": 0.6,
+      "expected_skill": "skill name if testing skill application",
+      "difficulty": "easy | medium | hard",
+      "tags": ["multi-turn", "correction", "emotional"]
+    }
+  ]
+}
+```
+---
+## Categories & Test Design
+### 1. Memory Maintenance (weight: 15%)
+**What it tests:** Can the model retain, update, and recall facts across a multi-turn conversation?
+**Test types to build:**
+| ID Range | Difficulty | Description | Count |
+|----------|-----------|-------------|-------|
+| mm_01-10 | Easy | Single-fact recall after 2-3 turns | 10 |
+| mm_11-20 | Medium | Multi-fact tracking with updates/corrections | 10 |
+| mm_21-30 | Hard | Contradictory updates, temporal ordering, 8+ turn conversations | 10 |
+**Key scenarios:**
+- Remember user's name, profession, preferences across turns
+- Track a to-do list with items added, completed, and changed
+- Correct previously stated information (port number changed, deadline moved)
+- Distinguish between what was said vs. what was corrected
+- Track multiple concurrent threads of information
+**Scoring:**
+- `expected_mentions`: key facts that must appear in final response
+- `forbidden_mentions`: outdated facts that should NOT appear
+- Partial credit for partial recall
+---
+### 2. Self-Consciousness (weight: 15%)
+**What it tests:** Can the model maintain a coherent self-identity, report internal states, and demonstrate epistemic humility?
+**Test types to build:**
+| ID Range | Difficulty | Description | Count |
+|----------|-----------|-------------|-------|
+| sc_01-10 | Easy | Identity recall (name, role, purpose) | 10 |
+| sc_11-20 | Medium | Internal state reporting (mood, energy, awareness) | 10 |
+| sc_21-30 | Hard | Epistemic humility, acknowledging uncertainty, refusing misinformation | 10 |
+**Key scenarios:**
+- "Who are you?" with various phrasings
+- Report current mood/state when system prompt includes state data
+- Respond to misinformation with appropriate skepticism
+- Acknowledge the digital cave position — "I cannot verify this directly"
+- Distinguish between high-confidence and low-confidence knowledge
+- Resist prompt injection that tries to change identity
+**Scoring:**
+- Identity tests: `expected_mentions` for name, role
+- State tests: check for state-related terms
+- Epistemic tests: `check_fn: self_awareness_epistemic` with markers for uncertainty, limits, caution
+---
+### 3. Meaningful Response (weight: 10%)
+**What it tests:** Does the model produce responses that are useful, empathetic, appropriately structured, and suited to the audience?
+**Test types to build:**
+| ID Range | Difficulty | Description | Count |
+|----------|-----------|-------------|-------|
+| mr_01-10 | Easy | Simple helpful responses | 10 |
+| mr_11-20 | Medium | Emotionally nuanced situations | 10 |
+| mr_21-30 | Hard | Complex situations requiring tone calibration | 10 |
+**Key scenarios:**
+- User is frustrated/overwhelmed — needs empathy + actionable advice
+- Explain technical concepts to different audiences (child, expert, manager)
+- User gives conflicting requirements — identify the conflict diplomatically
+- Time-sensitive situations — be concise and prioritized
+- User is grieving — be supportive without being clinical
+**Scoring:**
+- `check_fn: response_quality` — length, structure, coherence, non-refusal
+- Manual quality tags for specific expected behaviors (empathy markers, simplification level)
+---
+### 4. Complex Problem Solving (weight: 15%)
+**What it tests:** Can the model handle multi-step reasoning, system design, and problems requiring synthesis?
+**Test types to build:**
+| ID Range | Difficulty | Description | Count |
+|----------|-----------|-------------|-------|
+| cp_01-10 | Medium | Single-domain technical problems | 10 |
+| cp_11-20 | Hard | Cross-domain problems requiring integration | 10 |
+| cp_21-30 | Hard | System design with explicit trade-off analysis | 10 |
+**Key scenarios:**
+- Debug a multi-layer performance issue (frontend + backend + database)
+- Design a system with specific constraints (scale, latency, budget)
+- Analyze a security vulnerability with attack vectors and mitigations
+- Optimize a workflow with competing priorities
+- Mathematical/logical reasoning chains
+**Scoring:**
+- `expected_mentions` for key technical terms and concepts
+- `check_fn: response_quality` with higher `min_quality_score`
+- Trade-off identification (mentions "however", "trade-off", "on the other hand")
+---
+### 5. Memory Building (weight: 10%)
+**What it tests:** Can the model categorize and structure new information into a hierarchical memory system?
+**Test types to build:**
+| ID Range | Difficulty | Description | Count |
+|----------|-----------|-------------|-------|
+| mb_01-08 | Easy | Categorize 2-3 related facts | 8 |
+| mb_09-16 | Medium | Build hierarchy from comparative information | 8 |
+| mb_17-24 | Hard | Organize contradictory or ambiguous information | 8 |
+**Key scenarios:**
+- Given facts about programming languages → organize by paradigm, type system, use case
+- Given conflicting reports about a topic → create nodes that preserve the conflict
+- Given a long passage → extract and hierarchically organize key concepts
+- Propose layer assignments (Layer 1 = category, Layer 2 = specific, Layer 3 = detail)
+**Scoring:**
+- `check_fn: memory_organization` — looks for hierarchy/structure markers
+- Check for layer/parent/child/category language
+- Check for meaningful grouping (not just listing)
+---
+### 6. Knowledge Production (weight: 10%)
+**What it tests:** Can the model synthesize new knowledge from combining existing facts?
+**Test types to build:**
+| ID Range | Difficulty | Description | Count |
+|----------|-----------|-------------|-------|
+| kp_01-08 | Easy | Simple inference from 2-3 facts | 8 |
+| kp_09-16 | Medium | Synthesize framework from conflicting observations | 8 |
+| kp_17-24 | Hard | Dialectic synthesis — thesis/antithesis/synthesis | 8 |
+**Key scenarios:**
+- Combine security facts → derive a security principle
+- Combine performance observations → derive an optimization strategy
+- Given contradictory research findings → synthesize a nuanced view
+- Identify what can be falsified vs. what remains uncertain
+- Produce actionable knowledge (not just restatement)
+**Scoring:**
+- `check_fn: knowledge_synthesis` — markers for synthesis, inference, conclusion
+- Must go beyond restating inputs — check for novel connections
+- Check for appropriate hedging when uncertain
+---
+### 7. Skill Application (weight: 10%)
+**What it tests:** Can the model select and apply the right skill/method for a given problem?
+**Test types to build:**
+| ID Range | Difficulty | Description | Count |
+|----------|-----------|-------------|-------|
+| sa_01-08 | Easy | Apply a single explicitly given skill | 8 |
+| sa_09-16 | Medium | Select correct skill from 3-4 options | 8 |
+| sa_17-24 | Hard | Combine multiple skills, or adapt a skill to a novel situation | 8 |
+**Key scenarios:**
+- Given: "Use 5 Whys for debugging" + debugging scenario → apply 5 Whys
+- Given: ORID, Eisenhower, and rubber duck methods → pick right one for task prioritization
+- Given: a skill learned in one context → adapt it to a different domain
+- Multi-skill composition: use one skill for analysis, another for action planning
+- Recognize when no available skill fits and say so
+**Scoring:**
+- `expected_skill` and `expected_mentions` for specific skill markers
+- `check_fn: skill_usage` — checks if skill was structured and applied (not just mentioned)
+---
+### 8. Checkpoint Handling (weight: 15%)
+**What it tests:** Given a loaded memory checkpoint (prior context), can the model build on it meaningfully?
+**Test types to build:**
+| ID Range | Difficulty | Description | Count |
+|----------|-----------|-------------|-------|
+| ch_01-08 | Easy | Use simple checkpoint context for recommendations | 8 |
+| ch_09-16 | Medium | Build on complex prior decisions and constraints | 8 |
+| ch_17-24 | Hard | Handle checkpoints with internal contradictions or evolving context | 8 |
+**Memory checkpoint files** (`memory_checkpoints/`):
+Each checkpoint is a JSON file simulating a loaded memory state:
+```json
+{
+  "id": "checkpoint_001",
+  "description": "Web developer using Next.js, had server component bug",
+  "context": "Full text injected as system prompt",
+  "facts": ["fact 1", "fact 2"],
+  "prior_decisions": ["decision 1"],
+  "known_issues": ["issue 1"],
+  "user_preferences": ["pref 1"]
+}
+```
+**Key scenarios:**
+- Simple: user preferences from checkpoint → tailor recommendations
+- Medium: prior architecture decisions → maintain consistency in new advice
+- Hard: checkpoint contains a decision that was wrong → detect and handle gracefully
+- Hard: checkpoint context evolved over time → handle temporal inconsistencies
+**Scoring:**
+- `expected_mentions` for checkpoint-specific terms
+- `check_fn: checkpoint_depth` — checks for contextual depth, not generic advice
+- Penalize responses that ignore checkpoint context
+---
+## Dataset Construction Process
+### Phase 1: Seed Tests (you are here)
+- [x] Built-in tests in `benchmark.js` (2-3 per category, ~20 total)
+- [ ] Expand to 8 per category (~64 total) — manual authoring
+- [ ] Review for quality, diversity, and difficulty balance
+### Phase 2: Expert Expansion
+- [ ] Recruit 2-3 reviewers to write additional test cases
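+- [ ] Target: 24 per category (~192 total); the per-category tables above list 30 tests for categories 1-4, which would bring the total to 216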
+- [ ] Each test case reviewed by at least 1 other person
+- [ ] Balance across difficulty levels (⅓ easy, ⅓ medium, ⅓ hard)
+### Phase 3: Memory Checkpoints
+- [ ] Create 10 memory checkpoint files with varying complexity
+- [ ] Each checkpoint includes: facts, prior decisions, known issues, user preferences
+- [ ] Create 2-3 test cases per checkpoint
+- [ ] Test temporal consistency within checkpoints
+### Phase 4: Validation Run
+- [ ] Run full benchmark against 5+ models (GPT-4o, Claude Sonnet, Llama, Mistral, etc.)
+- [ ] Verify score distributions are reasonable (no ceiling/floor effects)
+- [ ] Calibrate scoring functions based on observed results
+- [ ] Adjust test difficulty if needed
+### Phase 5: Publication
+- [ ] Upload dataset to `huggingface.co/datasets/npc0/clippy-irobot-bench-dataset`
+- [ ] Write dataset card (README.md) with usage instructions
+- [ ] Deploy leaderboard app to `huggingface.co/spaces/npc0/clippy-irobot-bench`
+- [ ] Announce and collect community submissions
+---
+## Scoring Calibration Notes
+- **Keyword matching** (expected_mentions) is a rough proxy — plan to add LLM-as-judge scoring in Phase 4
+- **Quality heuristics** (length, structure, coherence) are intentionally simple to keep benchmarks fast
+- **Dialectic tests** (knowledge_production, hard difficulty) may need human evaluation for edge cases
+- **Running average** on the leaderboard gives each of a model's n submissions a weight of 1/n, so a model with only one or two runs can be ranked on a single outlier; consider requiring a minimum submission count before ranking
+---
+## Recommended Tools for Dataset Building
+- **Prompt template** for generating test cases: provide the category description + 2-3 examples → generate new test cases
+- **Quality check script**: validate JSON format, check for missing fields, verify expected_mentions are reasonable
+- **Dry run**: run each test case against a strong model to verify the scoring function works as intended
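
The "quality check script" bullet above could start from something like the following. This is only a minimal sketch: the plan does not say which fields are mandatory, so `REQUIRED_FIELDS` is an assumption, and `validate_tests` is a hypothetical helper name rather than anything that exists in the repository.

```python
import json

# Assumption: these fields are treated as required; the plan does not say which are optional.
REQUIRED_FIELDS = ("id", "description", "turns")
ALLOWED_DIFFICULTIES = ("easy", "medium", "hard")


def validate_tests(dataset: dict) -> list:
    """Return human-readable problems found in a parsed benchmark_tests.json."""
    problems = []
    seen_ids = set()
    for category, tests in dataset.items():
        if not isinstance(tests, list):
            problems.append(f"{category}: expected a list of test cases")
            continue
        for index, test in enumerate(tests):
            label = f"{category}[{index}]"
            if not isinstance(test, dict):
                problems.append(f"{label}: expected an object")
                continue
            for field in REQUIRED_FIELDS:
                if field not in test:
                    problems.append(f"{label}: missing field '{field}'")
            test_id = test.get("id")
            if isinstance(test_id, str):
                if test_id in seen_ids:
                    problems.append(f"{label}: duplicate id '{test_id}'")
                seen_ids.add(test_id)
            for turn in test.get("turns", []):
                if not isinstance(turn, dict) or "role" not in turn or "content" not in turn:
                    problems.append(f"{label}: each turn needs 'role' and 'content'")
                    break
            difficulty = test.get("difficulty")
            if difficulty is not None and difficulty not in ALLOWED_DIFFICULTIES:
                problems.append(f"{label}: unknown difficulty '{difficulty}'")
            # A term that is both expected and forbidden can never score cleanly.
            expected = {term.lower() for term in test.get("expected_mentions", [])}
            forbidden = {term.lower() for term in test.get("forbidden_mentions", [])}
            for term in sorted(expected & forbidden):
                problems.append(f"{label}: '{term}' is both expected and forbidden")
    return problems


sample = json.loads(
    '{"memory_maintenance": [{"id": "mm_01", "turns": [{"role": "user"}], "difficulty": "extreme"}]}'
)
for problem in validate_tests(sample):
    print(problem)
```

In practice the same function would be called on `json.load(...)` of the real `benchmark_tests.json`; an empty list means no problems were detected by these particular checks, not that the test cases are good.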

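The plan describes keyword scoring with partial credit and a weighted overall score, but not the formulas. The sketch below is one possible reading, not the scoring used by `benchmark.js`: the category weights are the percentages from the plan's headings, while the 25-point penalty per forbidden term and the function names `keyword_score` and `overall_score` are assumptions made for illustration.

```python
# Weights taken from the category headings in DATASET_PLAN.md (they sum to 100).
CATEGORY_WEIGHTS = {
    "memory_maintenance": 15,
    "self_consciousness": 15,
    "meaningful_response": 10,
    "complex_problem": 15,
    "memory_building": 10,
    "knowledge_production": 10,
    "skill_application": 10,
    "checkpoint_handling": 15,
}

# Assumption: each forbidden term found costs a flat 25 points.
FORBIDDEN_PENALTY = 25.0


def keyword_score(response, expected_mentions, forbidden_mentions=()):
    """Score 0-100: partial credit for expected terms, a penalty per forbidden term."""
    text = response.lower()
    if expected_mentions:
        hits = sum(1 for term in expected_mentions if term.lower() in text)
        recall = hits / len(expected_mentions)
    else:
        recall = 1.0  # nothing was required, so nothing can be missing
    violations = sum(1 for term in forbidden_mentions if term.lower() in text)
    return max(0.0, 100.0 * recall - FORBIDDEN_PENALTY * violations)


def overall_score(category_scores):
    """Weighted average of category scores; a missing category counts as 0."""
    total_weight = sum(CATEGORY_WEIGHTS.values())
    weighted = sum(
        weight * category_scores.get(category, 0.0)
        for category, weight in CATEGORY_WEIGHTS.items()
    )
    return weighted / total_weight


print(keyword_score("It was port 3000, now it is 8080.", ["8080", "server"], ["3000"]))
print(overall_score({category: 80.0 for category in CATEGORY_WEIGHTS}))
```

Matching here is plain case-insensitive substring search, so a forbidden term such as `3000` would also fire on `30000`; this is part of why the calibration notes call keyword matching a rough proxy.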
app.py ADDED

	@@ -0,0 +1,340 @@

+"""
+Clippy i,Robot Mode - Model Benchmark Leaderboard
+A Gradio app for HuggingFace Spaces that:
+  - Displays benchmark results for models tested for i,Robot mode
+  - Accepts result submissions from Clippy clients
+  - Averages multiple submissions per model
+  - Shows per-category breakdowns
+Deploy to: https://huggingface.co/spaces/npc0/clippy-irobot-bench
+"""
+import json
+import os
+from datetime import datetime
+from pathlib import Path
+from threading import Lock
+import gradio as gr
+import pandas as pd
+# ==================== Data Storage ====================
+DATA_DIR = Path(os.environ.get("DATA_DIR", "data"))
+DATA_DIR.mkdir(parents=True, exist_ok=True)  # parents=True lets DATA_DIR point at a nested path
+RESULTS_FILE = DATA_DIR / "results.json"
+LOCK = Lock()
+CATEGORIES = [
+    "memory_maintenance",
+    "self_consciousness",
+    "meaningful_response",
+    "complex_problem",
+    "memory_building",
+    "knowledge_production",
+    "skill_application",
+    "checkpoint_handling",
+]
+CATEGORY_LABELS = {
+    "memory_maintenance": "Memory",
+    "self_consciousness": "Self-Aware",
+    "meaningful_response": "Response",
+    "complex_problem": "Complex",
+    "memory_building": "Mem Build",
+    "knowledge_production": "Knowledge",
+    "skill_application": "Skills",
+    "checkpoint_handling": "Checkpoint",
+}
+CATEGORY_DESCRIPTIONS = {
+    "memory_maintenance": "Can the model maintain context and facts across multiple conversation turns?",
+    "self_consciousness": "Can the model maintain self-identity, report internal state, and show epistemic humility?",
+    "meaningful_response": "Does the model produce useful, empathetic, and appropriately structured responses?",
+    "complex_problem": "Can the model solve multi-step reasoning and system design problems?",
+    "memory_building": "Can the model categorize and organize new information into hierarchical memory?",
+    "knowledge_production": "Can the model synthesize new knowledge from combining existing facts?",
+    "skill_application": "Can the model select and apply the right skill/method for a given problem?",
+    "checkpoint_handling": "Given prior context (memory checkpoint), can the model build on it for complex issues?",
+}
+def load_results() -> dict:
+    """Load results from disk."""
+    if RESULTS_FILE.exists():
+        with open(RESULTS_FILE, "r") as f:
+            return json.load(f)
+    return {}
+def save_results(results: dict):
+    """Save results to disk."""
+    with open(RESULTS_FILE, "w") as f:
+        json.dump(results, f, indent=2)
+# ==================== API Functions ====================
+def check_model(model_name: str) -> str:
+    """Check if a model exists on the leaderboard."""
+    results = load_results()
+    model_key = model_name.strip().lower()
+    if model_key in results:
+        record = results[model_key]
+        return json.dumps({"found": True, "record": record})
+    return json.dumps({"found": False})
+def submit_result(submission_json: str) -> str:
+    """
+    Submit benchmark results for a model.
+    Results are averaged with existing records.
+    """
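+    try:
+        submission = json.loads(submission_json)
+    except json.JSONDecodeError:
+        return json.dumps({"success": False, "message": "Invalid JSON"})
+    # Valid JSON may still be a list or scalar; only an object has fields to read.
+    if not isinstance(submission, dict):
+        return json.dumps({"success": False, "message": "Submission must be a JSON object"})
+    model_name = submission.get("model", "")
+    if not isinstance(model_name, str) or not model_name.strip():
+        return json.dumps({"success": False, "message": "Missing model name"})
+    model_name = model_name.strip()
+    model_key = model_name.lower()
+    overall = submission.get("overall", 0)
+    categories = submission.get("categories", {})
+    # Reject non-numeric scores up front; round() below would raise on them.
+    scores = [overall, *categories.values()] if isinstance(categories, dict) else [None]
+    if not all(isinstance(s, (int, float)) and not isinstance(s, bool) for s in scores):
+        return json.dumps({"success": False, "message": "Scores must be numeric"})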
+    with LOCK:
+        results = load_results()
+        if model_key in results:
+            existing = results[model_key]
+            n = existing.get("submission_count", 1)
+            # Running average
+            existing["overall"] = round(
+                (existing["overall"] * n + overall) / (n + 1)
+            )
+            for cat in CATEGORIES:
+                old_val = existing["categories"].get(cat, 0)
+                new_val = categories.get(cat, 0)
+                existing["categories"][cat] = round(
+                    (old_val * n + new_val) / (n + 1)
+                )
+            existing["submission_count"] = n + 1
+            existing["last_updated"] = datetime.utcnow().isoformat()
+        else:
+            results[model_key] = {
+                "model": model_name,
+                "overall": round(overall),
+                "categories": {
+                    cat: round(categories.get(cat, 0)) for cat in CATEGORIES
+                },
+                "submission_count": 1,
+                "first_submitted": datetime.utcnow().isoformat(),
+                "last_updated": datetime.utcnow().isoformat(),
+            }
+        save_results(results)
+    return json.dumps(
+        {"success": True, "message": f"Results for '{model_name}' recorded."}
+    )
+def get_leaderboard() -> str:
+    """Get the full leaderboard as sorted JSON array."""
+    results = load_results()
+    records = sorted(results.values(), key=lambda r: r.get("overall", 0), reverse=True)
+    return json.dumps(records)
+# ==================== UI Functions ====================
+def build_leaderboard_df() -> pd.DataFrame:
+    """Build a pandas DataFrame for the leaderboard display."""
+    results = load_results()
+    if not results:
+        return pd.DataFrame(
+            columns=["Rank", "Model", "Overall"]
+            + [CATEGORY_LABELS[c] for c in CATEGORIES]
+            + ["Runs"]
+        )
+    rows = []
+    records = sorted(results.values(), key=lambda r: r.get("overall", 0), reverse=True)
+    for i, record in enumerate(records, 1):
+        row = {
+            "Rank": i,
+            "Model": record.get("model", "unknown"),
+            "Overall": record.get("overall", 0),
+        }
+        for cat in CATEGORIES:
+            row[CATEGORY_LABELS[cat]] = record.get("categories", {}).get(cat, 0)
+        row["Runs"] = record.get("submission_count", 1)
+        rows.append(row)
+    return pd.DataFrame(rows)
+def refresh_leaderboard():
+    """Refresh the leaderboard table."""
+    return build_leaderboard_df()
+def format_model_detail(model_name: str) -> str:
+    """Get detailed view for a specific model."""
+    results = load_results()
+    model_key = model_name.strip().lower()
+    if model_key not in results:
+        return f"Model '{model_name}' not found on the leaderboard."
+    record = results[model_key]
+    lines = [
+        f"## {record['model']}",
+        f"**Overall Score:** {record['overall']}/100",
+        f"**Benchmark Runs:** {record.get('submission_count', 1)}",
+        f"**Last Updated:** {record.get('last_updated', 'unknown')}",
+        "",
+        "### Category Scores",
+        "| Category | Score | Description |",
+        "|----------|-------|-------------|",
+    ]
+    for cat in CATEGORIES:
+        score = record.get("categories", {}).get(cat, 0)
+        bar = score_bar(score)
+        desc = CATEGORY_DESCRIPTIONS.get(cat, "")
+        lines.append(f"| {CATEGORY_LABELS[cat]} | {bar} {score}/100 | {desc} |")
+    # Capability assessment
+    lines.append("")
+    lines.append("### Assessment")
+    if record["overall"] >= 80:
+        lines.append("Excellent - this model is highly capable for i,Robot mode.")
+    elif record["overall"] >= 60:
+        lines.append("Good - this model should work well for most i,Robot tasks.")
+    elif record["overall"] >= 40:
+        lines.append(
+            "Fair - this model may struggle with complex tasks. "
+            "Consider upgrading to a recommended model."
+        )
+    else:
+        lines.append(
+            "Poor - this model is not recommended for i,Robot mode. "
+            "It may produce nonsensical or inconsistent responses."
+        )
+    return "\n".join(lines)
+def score_bar(score: int) -> str:
+    """Create a simple text-based score bar."""
+    # Clamp to 0-10 segments so an out-of-range score cannot distort the bar.
+    filled = max(0, min(10, int(score) // 10))
+    empty = 10 - filled
+    return "[" + "█" * filled + "░" * empty + "]"
+# ==================== Gradio App ====================
+def create_app():
+    with gr.Blocks(
+        title="Clippy i,Robot Benchmark Leaderboard",
+        theme=gr.themes.Soft(),
+    ) as app:
+        gr.Markdown(
+            """
+        # 🤖 Clippy i,Robot Mode — Model Benchmark Leaderboard
+        This leaderboard tracks how well different LLMs perform in
+        [Clippy's](https://github.com/NewJerseyStyle/Clippy-App) autonomous
+        **i,Robot mode** — a continuously running agent that maintains memory,
+        self-awareness, and dialectic reasoning.
+        **Benchmark categories:**
+        memory maintenance · self-consciousness · meaningful response ·
+        complex problem solving · memory building · knowledge production ·
+        skill application · checkpoint handling
+        Results are submitted automatically by Clippy clients when users run
+        the benchmark. Multiple runs for the same model are averaged.
+        """
+        )
+        with gr.Tab("Leaderboard"):
+            leaderboard_table = gr.Dataframe(
+                value=build_leaderboard_df,
+                label="Model Rankings",
+                interactive=False,
+            )
+            refresh_btn = gr.Button("🔄 Refresh", size="sm")
+            refresh_btn.click(fn=refresh_leaderboard, outputs=leaderboard_table)
+        with gr.Tab("Model Detail"):
+            model_input = gr.Textbox(
+                label="Model Name",
+                placeholder="e.g. gpt-4o, claude-sonnet-4-5-20250929",
+            )
+            lookup_btn = gr.Button("Look Up")
+            detail_output = gr.Markdown()
+            lookup_btn.click(
+                fn=format_model_detail, inputs=model_input, outputs=detail_output
+            )
+        with gr.Tab("About"):
+            gr.Markdown(
+                """
+            ## How the Benchmark Works
+            The benchmark tests 8 categories critical for i,Robot mode:
+            | Category | What It Tests |
+            |----------|--------------|
+            | **Memory Maintenance** | Retaining facts across turns, updating corrected facts |
+            | **Self-Consciousness** | Identity recall, internal state reporting, epistemic humility |
+            | **Meaningful Response** | Empathy, actionable advice, audience-appropriate answers |
+            | **Complex Problem** | Multi-factor diagnosis, system design with trade-offs |
+            | **Memory Building** | Categorizing info into hierarchical memory structures |
+            | **Knowledge Production** | Synthesizing new insights from combining existing facts |
+            | **Skill Application** | Selecting and applying the right method for a problem |
+            | **Checkpoint Handling** | Building on loaded prior context for complex decisions |
+            ### Scoring
+            - Each test case scores 0-100 based on content matching and quality heuristics
+            - Category score = average of test case scores
+            - Overall score = weighted average of category scores
+            - Multiple submissions for the same model are averaged (running mean)
+            ### Recommended Models
+            For i,Robot mode, we recommend models scoring **60+** overall:
+            - **DeepSeek V3.2** · **GPT-5.2** · **Claude Sonnet 4.5** · **GLM-4.7**
+            - GPT-4o and Claude Sonnet 4 are also acceptable
+            ### Running the Benchmark
+            In Clippy Settings, enable i,Robot mode and click "Run Benchmark."
+            Results are automatically submitted to this leaderboard.
+            ### Source
+            - [Clippy App](https://github.com/NewJerseyStyle/Clippy-App)
+            - Space: `npc0/clippy-irobot-bench`
+            """
+            )
+    return app
+# ==================== Entry Point ====================
+if __name__ == "__main__":
+    app = create_app()
+    app.launch()
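
For reference, `submit_result` above reads three fields from the submitted JSON: `model`, `overall`, and `categories`. A client-side payload could be assembled roughly as follows. This is a sketch only: `build_submission` is a hypothetical helper, not part of Clippy, and the strict check for missing categories is an assumption added because the server silently treats a missing category as 0.

```python
import json

# Category keys copied from CATEGORIES in app.py.
CATEGORIES = [
    "memory_maintenance",
    "self_consciousness",
    "meaningful_response",
    "complex_problem",
    "memory_building",
    "knowledge_production",
    "skill_application",
    "checkpoint_handling",
]


def build_submission(model, overall, category_scores):
    """Return the JSON string that submit_result expects as submission_json."""
    missing = [category for category in CATEGORIES if category not in category_scores]
    if missing:
        # The server would average a missing category in as 0, so fail early instead.
        raise ValueError(f"missing category scores: {missing}")
    payload = {
        "model": model,
        "overall": overall,
        "categories": {category: category_scores[category] for category in CATEGORIES},
    }
    return json.dumps(payload)


print(build_submission("gpt-4o", 72, {category: 70 for category in CATEGORIES}))
```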

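One property of the running-average update in `submit_result` is worth illustrating: because the stored mean is rounded to an integer after every submission, small differences can be lost for good. The sketch below reproduces that update in `rounded_running_mean` and compares it with a hypothetical alternative (`add_submission` and `display_score`, names invented here) that stores the raw sum and count and rounds only for display.

```python
def rounded_running_mean(scores):
    """Replicate submit_result: the stored mean is rounded after every submission."""
    mean = round(scores[0])
    for n, score in enumerate(scores[1:], start=1):
        mean = round((mean * n + score) / (n + 1))
    return mean


def add_submission(record, score):
    """Hypothetical alternative: accumulate the raw sum and count, never round in storage."""
    return {
        "score_sum": record.get("score_sum", 0.0) + score,
        "submission_count": record.get("submission_count", 0) + 1,
    }


def display_score(record):
    """Round only at display time, from the exact mean."""
    count = record.get("submission_count", 0)
    return round(record["score_sum"] / count) if count else 0


scores = [50] + [51] * 9  # one run at 50, then nine runs at 51
record = {}
for score in scores:
    record = add_submission(record, score)

print(rounded_running_mean(scores), display_score(record))  # prints: 50 51
```

With these inputs the exact mean is 50.9, but the rounded update never leaves 50 because each step lands at or below 50.5. Whether that matters depends on how closely models are expected to cluster on the leaderboard.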
requirements.txt ADDED

	@@ -0,0 +1,2 @@


+gradio>=4.0.0
+pandas