Space: AMA-bench/AMA-bench-Leaderboard

NorahYujieZhao committed on Mar 3

Commit d8b2e03 (1 parent: e839e6a)

the new version

Files changed (18)

UPDATES_v2.md +275 -0
app.py +999 -225
assets/model_colors.json +30 -0
content.py +56 -0
data/agent_capability.json +270 -0
data/agent_domain.json +404 -0
data/method_data.json +0 -160
data/model_capability.json +586 -0
data/model_data.json +0 -94
data/model_domain.json +404 -0
gaia-leaderboard +1 -0
lmgame_bench +1 -0
requirements.txt +4 -1
scorer.py +166 -0
utils.py +224 -0
validate_jsonl.py +205 -0
view_samples.py +181 -0
visualization.py +664 -0

UPDATES_v2.md ADDED

	@@ -0,0 +1,275 @@

+# AMA-Bench Leaderboard Updates v2.0
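+## ✅ Completed Updates
+### 1. **Summary Table Improvements**
+- ✅ **New Rank column**: shows the rank as the first column
+- ✅ **Medal markers**: the top three automatically get 🥇🥈🥉 medals
+- ✅ **Removed the Categories column**: simplifies the table, keeping only key information
+- ✅ **Column layout**: Rank | Agent/Model | Avg Accuracy | Avg F1
+### 2. **Color Scheme Upgrade**
+Updated to a more distinguishable color scheme, based on the original figure: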
+```python
+COLORS = [
+    'rgba(135, 160, 220, 0.5)',  # Light Blue
+    'rgba(230, 150, 120, 0.5)',  # Orange
+    'rgba(180, 180, 180, 0.5)',  # Gray
+    'rgba(255, 215, 100, 0.5)',  # Yellow
+    'rgba(140, 180, 220, 0.5)',  # Sky Blue
+    'rgba(140, 200, 150, 0.5)',  # Green
+    'rgba(200, 160, 140, 0.5)',  # Brown
+    'rgba(130, 140, 200, 0.5)',  # Purple-Blue
+    'rgba(255, 180, 150, 0.5)',  # Coral
+    'rgba(150, 220, 180, 0.5)',  # Mint Green
+]
+```
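+**Features**:
+- 10 clearly distinct colors
+- Better visual differentiation
+- Suitable for radar charts and bar charts
+### 3. **Dynamic Top N Selection**
+Every chart now has a slider control:
+- **Range**: 1-10
+- **Default**: 8
+- **Live updates**: dragging the slider refreshes the chart immediately
+- **Applies to**:
+  - Agent Domain Performance (radar chart)
+  - Agent Capability Performance (2x2 bar chart)
+  - Model Domain Performance (radar chart)
+  - Model Capability Performance (2x2 bar chart)
+## 📊 New Feature Showcase
+### Summary Table Example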
+```
+Rank    Agent           Avg Accuracy    Avg F1
+🥇 1    Long context    54.21%         34.61%
+🥈 2    Hipporag2       44.86%         20.32%
+🥉 3    GRAPHRAG        34.63%         27.58%
+4       Memorybank      35.64%         28.59%
+5       Amem            33.14%         26.31%
+```
+### Top N Slider
+```
+┌────────────────────────────────┐
+│ Show Top N Agents              │
+│ ┣━━━━━━━●━━━━┫ 8               │
+│ Select how many top agents     │
+│ to display (1-10)              │
+└────────────────────────────────┘
+```
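+## 🎨 Visual Improvements
+### Radar Chart
+- ✅ Shows the top N best-performing items
+- ✅ Uses the new color scheme for easier differentiation
+- ✅ Display count can be changed dynamically
+- ✅ Keeps interactivity (click the legend to toggle)
+### 2x2 Bar Chart
+- ✅ Each subplot shows the top N items
+- ✅ Sorted by accuracy in descending order
+- ✅ Uses the new color scheme
+- ✅ Display count adjusts dynamically
+## 🚀 Usage
+### 1. Launch the app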
+```bash
+python3 app.py
+```
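+### 2. Select Top N
+1. Open any chart page
+2. Use the slider to choose how many items to display (1-10)
+3. The chart updates automatically
+### 3. View rankings
+1. Open the Summary Statistics accordion
+2. Check the Rank column; the top three carry medal markers
+3. The table is sorted by Avg Accuracy in descending order
+## 📝 Technical Details
+### Ranking calculation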
+```python
+# Sort by average accuracy
+df = df.sort_values(by="_acc_sort", ascending=False)
+# Add ranks and medals
+medals = ["🥇", "🥈", "🥉"]
+ranks = []
+for i in range(len(df)):
+    if i < 3:
+        ranks.append(f"{medals[i]} {i+1}")
+    else:
+        ranks.append(str(i+1))
+```
+### Top N filtering
+```python
+# Compute the average score for each item
+item_avg_scores = {}
+for item in all_items:
+    scores = [...]
+    item_avg_scores[item] = np.mean(scores)
+# Get the top N
+sorted_items = sorted(item_avg_scores.items(),
+                      key=lambda x: x[1],
+                      reverse=True)
+top_items = [item[0] for item in sorted_items[:top_n]]
+```
+### Dynamic updates
+```python
+# Update the chart when the slider changes
+agent_domain_top_n.change(
+    fn=lambda n: create_radar_chart_from_dict(
+        AGENT_DOMAIN,
+        "Agent Performance Across Domains",
+        top_n=int(n)
+    ),
+    inputs=[agent_domain_top_n],
+    outputs=[agent_domain_chart]
+)
+```
+## 🎯 Interface Structure
+```
+🤖 Agent Performance
+├── 🎯 Domain Performance
+│   ├── Slider: Show Top N Agents (1-10)
+│   ├── Radar Chart (dynamically shows top N)
+│   └── 📊 Summary Statistics (with Rank and medals)
+└── ⚡ Capability Performance
+    ├── Slider: Show Top N Agents (1-10)
+    ├── 2x2 Bar Chart (top N per subplot)
+    └── 📊 Summary Statistics (with Rank and medals)
+🔬 Model Performance
+├── 🎯 Domain Performance
+│   ├── Slider: Show Top N Models (1-10)
+│   ├── Radar Chart (dynamically shows top N)
+│   └── 📊 Summary Statistics (with Rank and medals)
+└── ⚡ Capability Performance
+    ├── Slider: Show Top N Models (1-10)
+    ├── 2x2 Bar Chart (top N per subplot)
+    └── 📊 Summary Statistics (with Rank and medals)
+ℹ️ About
+└── Full documentation
+```
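+## ✨ Highlights
+### 1. Smart ranking system
+- Automatically computes average scores
+- Sorted by accuracy in descending order
+- Special markers (medals) for the top three
+- Clear numeric ranks
+### 2. Flexible display control
+- Adjustable range of 1-10
+- Responds in real time
+- Each chart is controlled independently
+- Shows the top 8 by default
+### 3. Improved color scheme
+- 10 clearly distinguishable colors
+- 50% opacity (lines/markers)
+- 15% opacity (fill areas)
+- Follows visual design guidelines
+### 4. Full interactivity
+- Click the legend to toggle traces
+- Double-click to isolate a single item
+- Hover to see detailed values
+- Zoom and pan
+## 📈 Data Examples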
+### Agent Domain JSON
+```json
+{
+  "Game": {
+    "Long context": {
+      "accuracy": 0.5321,
+      "f1": 0.3285
+    },
+    "Hipporag2": {
+      "accuracy": 0.5934,
+      "f1": 0.2289
+    }
+  }
+}
+```
+### Summary Table Output
+| Rank | Agent | Avg Accuracy | Avg F1 |
+|------|-------|--------------|--------|
+| 🥇 1 | Long context | 54.21% | 34.61% |
+| 🥈 2 | Hipporag2 | 44.86% | 20.32% |
+| 🥉 3 | GRAPHRAG | 34.63% | 27.58% |
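+## 🔍 Before and After
+### Old version
+```
+Table columns: Agent | Avg Accuracy | Avg F1 | Categories
+Colors: 15 similar blue-green shades
+Display: all items, no filtering
+```
+### New version
+```
+Table columns: Rank | Agent | Avg Accuracy | Avg F1
+Colors: 10 clearly distinct colors
+Display: selectable top 1-10, adjusted dynamically
+Medals: 🥇🥈🥉 for top 3
+```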
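+## 💡 Usage Tips
+1. **Compare a few top performers**: set Top 3-5
+2. **See overall performance**: set Top 8-10
+3. **Focus on the leader**: set Top 1
+4. **See detailed rankings**: expand Summary Statistics
+## 📦 Files
+- **app.py** - main application file (fully rewritten)
+- **data/agent_capability.json** - agent capability data
+- **data/agent_domain.json** - agent domain data
+- **data/model_capability.json** - model capability data
+- **data/model_domain.json** - model domain data
+## 🎓 Code Highlights
+### Highly modular
+- `create_radar_chart_from_dict()` - radar chart generation
+- `create_capability_subplots()` - 2x2 bar chart generation
+- `create_summary_table()` - table generation
+- All functions accept a `top_n` parameter
+### Smart sorting
+- Automatically computes averages
+- Multi-dimensional sorting
+- Medals assigned automatically
+### Responsive design
+- Slider updates in real time
+- No page refresh needed
+- Smooth user experience
+---
+**Version**: v2.0
+**Updated**: 2026-03-02
+**Status**: ✅ All features implemented and tested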
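
The ranking and Top N steps described in UPDATES_v2.md can be sketched as one small function. This is a minimal, standard-library-only illustration and not the code in app.py: `average_scores` and `build_ranked_rows` are hypothetical names, the input is assumed to follow the `{category: {item: {"accuracy": ..., "f1": ...}}}` layout from the Agent Domain JSON example, the averages are plain unweighted means, and the "Web" numbers in the sample are made up for illustration.

```python
MEDALS = ["🥇", "🥈", "🥉"]


def average_scores(data_dict):
    """Return {item: {"accuracy": mean, "f1": mean}} averaged over categories.

    Assumption: an item missing from a category is simply skipped for that
    category (it is not counted as zero).
    """
    collected = {}
    for category_data in data_dict.values():
        for item, scores in category_data.items():
            entry = collected.setdefault(item, {"accuracy": [], "f1": []})
            entry["accuracy"].append(scores.get("accuracy", 0.0))
            entry["f1"].append(scores.get("f1", 0.0))
    return {
        item: {metric: sum(vals) / len(vals) for metric, vals in entry.items()}
        for item, entry in collected.items()
    }


def build_ranked_rows(data_dict, top_n=8):
    """Sort by average accuracy (descending), keep top_n, add medal ranks."""
    averages = average_scores(data_dict)
    ordered = sorted(
        averages.items(), key=lambda kv: kv[1]["accuracy"], reverse=True
    )
    rows = []
    for i, (name, avg) in enumerate(ordered[:top_n]):
        # The first three rows get a medal in front of the numeric rank.
        rank = f"{MEDALS[i]} {i + 1}" if i < len(MEDALS) else str(i + 1)
        rows.append({
            "Rank": rank,
            "Agent": name,
            "Avg Accuracy": f"{avg['accuracy'] * 100:.2f}%",
            "Avg F1": f"{avg['f1'] * 100:.2f}%",
        })
    return rows


if __name__ == "__main__":
    sample = {
        "Game": {
            "Long context": {"accuracy": 0.5321, "f1": 0.3285},
            "Hipporag2": {"accuracy": 0.5934, "f1": 0.2289},
        },
        # Hypothetical second category, values invented for the example.
        "Web": {
            "Long context": {"accuracy": 0.60, "f1": 0.35},
            "Hipporag2": {"accuracy": 0.40, "f1": 0.20},
        },
    }
    for row in build_ranked_rows(sample, top_n=2):
        print(row)
```

Returning plain dictionaries keeps the sketch independent of pandas; the same rows could be passed to `pd.DataFrame(rows)` if a table widget needs a DataFrame.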

app.py CHANGED

@@ -1,324 +1,1098 @@
 import gradio as gr
 import pandas as pd
 import json
-import numpy as np
 import plotly.graph_objects as go
 # ---------------------------------------------------------------------------
 # Data loading
 # ---------------------------------------------------------------------------
-def load_data(path):
     with open(path, "r", encoding="utf-8") as f:
         return json.load(f)
-MODEL_DATA = load_data("data/model_data.json")
-METHOD_DATA = load_data("data/method_data.json")
 METRICS = ["Recall", "Causal Inference", "State Updating", "State Abstraction"]
-ALL_METRICS = METRICS + ["Average"]
 # ---------------------------------------------------------------------------
-# DataFrame helpers
 # ---------------------------------------------------------------------------
-def build_dataframe(data):
-    """Build a pandas DataFrame showing Accuracy (F1) for each metric."""
-    rows = []
-    for entry in data["entries"]:
-        row = {"Method": entry["method"]}
-        if entry.get("category"):
-            row["Category"] = entry["category"]
-        for m in ALL_METRICS:
-            acc = entry["scores"][m]["accuracy"]
-            f1 = entry["scores"][m]["f1"]
-            row[m] = f"{acc:.4f} ({f1:.4f})"
-        # Store raw average accuracy for sorting
-        row["_sort_avg"] = entry["scores"]["Average"]["accuracy"]
-        rows.append(row)
-    df = pd.DataFrame(rows)
-    df = df.sort_values("_sort_avg", ascending=False).reset_index(drop=True)
-    df = df.drop(columns=["_sort_avg"])
-    return df
-def build_chart_dataframe(data):
-    """Build a DataFrame with raw numeric Accuracy values for charting."""
-    rows = []
-    for entry in data["entries"]:
-        row = {"Method": entry["method"]}
-        for m in ALL_METRICS:
-            row[f"{m} (Acc)"] = entry["scores"][m]["accuracy"]
-        row["_sort_avg"] = entry["scores"]["Average"]["accuracy"]
-        rows.append(row)
-    df = pd.DataFrame(rows)
-    df = df.sort_values("_sort_avg", ascending=False).reset_index(drop=True)
-    df = df.drop(columns=["_sort_avg"])
-    return df
-def add_medals(df):
-    """Add medal emojis to the top-3 Method names."""
-    df = df.copy()
-    medals = ["\U0001f947", "\U0001f948", "\U0001f949"]
-    for i in range(min(3, len(df))):
-        df.loc[i, "Method"] = f"{medals[i]} {df.loc[i, 'Method']}"
-    return df
 # ---------------------------------------------------------------------------
-# Chart helpers
 # ---------------------------------------------------------------------------
-BAR_COLORS = ["#636EFA", "#EF553B", "#00CC96", "#AB63FA"]
-def make_bar_chart(chart_df, title=""):
-    """Create a grouped vertical bar chart showing Accuracy per metric."""
     fig = go.Figure()
-    for i, m in enumerate(METRICS):
-        fig.add_trace(go.Bar(
-            x=chart_df["Method"],
-            y=chart_df[f"{m} (Acc)"],
-            name=m,
-            marker_color=BAR_COLORS[i % len(BAR_COLORS)],
-        ))
-    # Wrap long titles to 2 lines
-    if len(title) > 60:
-        mid = len(title) // 2
-        space_pos = title.find(" ", mid)
-        if space_pos == -1:
-            space_pos = title.rfind(" ", 0, mid)
-        if space_pos != -1:
-            title = title[:space_pos] + "<br>" + title[space_pos + 1:]
     fig.update_layout(
-        barmode="group",
-        title=dict(text=title, x=0.5, font=dict(size=14)),
-        yaxis=dict(title="Accuracy", range=[0, 1]),
-        xaxis=dict(tickangle=-45),
-        height=500,
-        margin=dict(l=60, r=40, t=100, b=140),
         legend=dict(
-            orientation="h", yanchor="bottom", y=1.02,
-            xanchor="center", x=0.5, font=dict(size=12),
         ),
-        bargap=0.2,
-        bargroupgap=0.05,
     )
     return fig
-# ---------------------------------------------------------------------------
-# Update functions
-# ---------------------------------------------------------------------------
-def update_leaderboard(data, top_n):
-    """Return (display_df, bar_fig) for a given data source."""
-    df = build_dataframe(data)
-    chart_df = build_chart_dataframe(data)
-    df = df.head(int(top_n))
-    chart_df = chart_df.head(int(top_n))
-    display_df = add_medals(df)
-    title = data.get("title", "Score Breakdown")
-    bar = make_bar_chart(chart_df, title)
-    return display_df, bar
-def update_model_leaderboard(top_n):
-    return update_leaderboard(MODEL_DATA, top_n)
-def update_method_leaderboard(top_n):
-    return update_leaderboard(METHOD_DATA, top_n)
 # ---------------------------------------------------------------------------
-# App
 # ---------------------------------------------------------------------------
-CSS = """
-html, body {
-    overflow-y: auto !important;
-    width: 100% !important;
-}
-.gradio-container {
-    max-width: 1200px !important;
-    margin: auto !important;
-}
-.header-banner {
-    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
-    color: white;
-    padding: 24px 32px;
-    border-radius: 12px;
-    margin-bottom: 16px;
-    text-align: center;
-}
-.header-banner h1 { margin: 0 0 8px 0; font-size: 2em; }
-.header-banner p { margin: 0; font-size: 1.1em; opacity: 0.9; }
-.dark .header-banner {
-    background: linear-gradient(135deg, #434190 0%, #553c6b 100%);
-}
-.table-container {
-    border-radius: 8px;
-    box-shadow: 0 2px 10px rgba(0,0,0,0.08);
-}
-.tip-text {
-    font-size: 13px; color: #666; font-style: italic; margin-top: 4px;
-}
-.dark .tip-text { color: #aaa; }
-.metric-note {
-    background: #f0f4ff; padding: 10px 16px; border-radius: 8px;
-    border-left: 4px solid #667eea; margin-bottom: 12px; font-size: 14px;
-}
-.dark .metric-note {
-    background: #2d2d44; border-left-color: #764ba2;
-}
-"""
-def build_app():
-    with gr.Blocks(css=CSS, title="AMA-Bench Leaderboard") as demo:
         # Header
         gr.HTML("""
-        <div class="header-banner">
-            <h1>AMA-Bench Leaderboard</h1>
-            <p>Agent Memory Assessment Benchmark &mdash; Evaluating LLMs and Memory Methods on Cognitive Tasks</p>
         </div>
         """)
         with gr.Tabs():
             # ============================================================
-            # Tab 1: Model Leaderboard
             # ============================================================
-            with gr.Tab("Model Leaderboard"):
                 gr.Markdown("""
-                <div class="metric-note">
-                Comparing <strong>LLM models</strong> across 4 cognitive tasks: Recall, Causal Inference, State Updating, and State Abstraction.
-                Results are reported as <strong>Accuracy (F1)</strong>. Sorted by Average Accuracy.
-                </div>
                 """)
-                with gr.Row():
-                    model_top_n = gr.Slider(
-                        minimum=1,
-                        maximum=len(MODEL_DATA["entries"]),
-                        step=1,
-                        value=len(MODEL_DATA["entries"]),
-                        label="Number of models to display",
-                    )
-                # Chart
-                with gr.Row():
-                    gr.Markdown("### Data Visualization")
-                model_bar = gr.Plot(label="Score Breakdown")
-                gr.Markdown("*Click a legend entry to isolate that metric. Double-click to add more for comparison.*", elem_classes="tip-text")
-                # Table
-                with gr.Row():
-                    gr.Markdown("### Detailed Results")
-                init_model_df, _ = update_model_leaderboard(len(MODEL_DATA["entries"]))
-                model_table = gr.DataFrame(
-                    value=init_model_df,
-                    elem_classes="table-container",
-                    show_row_numbers=True,
-                    show_fullscreen_button=True,
-                    show_search="search",
-                    interactive=False,
-                )
-                # Wire events
-                model_top_n.change(
-                    update_model_leaderboard,
-                    inputs=[model_top_n],
-                    outputs=[model_table, model_bar],
-                )
-                demo.load(
-                    update_model_leaderboard,
-                    inputs=[model_top_n],
-                    outputs=[model_table, model_bar],
-                )
             # ============================================================
-            # Tab 2: Method Leaderboard
             # ============================================================
-            with gr.Tab("Method Leaderboard"):
                 gr.Markdown("""
-                <div class="metric-note">
-                Comparing <strong>RAG &amp; Agent Memory methods</strong> (base model: Qwen-32B) across 4 cognitive tasks.
-                Results are reported as <strong>Accuracy (F1)</strong>. Sorted by Average Accuracy.
-                </div>
                 """)
                 with gr.Row():
-                    method_top_n = gr.Slider(
-                        minimum=1,
-                        maximum=len(METHOD_DATA["entries"]),
-                        step=1,
-                        value=len(METHOD_DATA["entries"]),
-                        label="Number of methods to display",
-                    )
-                # Chart
-                with gr.Row():
-                    gr.Markdown("### Data Visualization")
-                method_bar = gr.Plot(label="Score Breakdown")
-                gr.Markdown("*Click a legend entry to isolate that metric. Double-click to add more for comparison.*", elem_classes="tip-text")
-                # Table
                 with gr.Row():
-                    gr.Markdown("### Detailed Results")
-                init_method_df, _ = update_method_leaderboard(len(METHOD_DATA["entries"]))
-                method_table = gr.DataFrame(
-                    value=init_method_df,
-                    elem_classes="table-container",
-                    show_row_numbers=True,
-                    show_fullscreen_button=True,
-                    show_search="search",
-                    interactive=False,
-                )
-                # Wire events
-                method_top_n.change(
-                    update_method_leaderboard,
-                    inputs=[method_top_n],
-                    outputs=[method_table, method_bar],
-                )
-                demo.load(
-                    update_method_leaderboard,
-                    inputs=[method_top_n],
-                    outputs=[method_table, method_bar],
                 )
             # ============================================================
-            # Tab 3: About
             # ============================================================
-            with gr.Tab("About"):
                 gr.Markdown("""
 ## AMA-Bench: Agent Memory Assessment Benchmark
 AMA-Bench evaluates memory capabilities of LLMs and memory-augmented agents across four cognitive dimensions:
-**Recall** (retrieving stored info), **Causal Inference** (cause-and-effect reasoning), **State Updating** (tracking evolving states), and **State Abstraction** (forming higher-level representations).
-**Benchmarks** &mdash; We evaluate on two complementary subsets:
-(1) **Real-world Subset:** 2,496 QA pairs.
-(2) **Synthetic Subset:** 1,200 QA pairs stratified across five trajectory lengths (8K, 16K, 32K, 64K, and 128K tokens), with 240 samples per interval.
-**Leaderboard Tabs** &mdash; *Model Leaderboard* compares LLM models directly; *Method Leaderboard* compares RAG and Agent Memory methods using Qwen-32B as the base model.
-**Metrics** &mdash; Results are reported as **Accuracy (F1)**.
 ---
 *For questions or submissions, please open a discussion in the Community tab.*
                 """)

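The new version of app.py ranks entries by a weighted average in which each category's weight is proportional to its question count, renormalized over the categories actually present, with a fallback to equal weights when no count is known. The sketch below illustrates that normalization in isolation, under simplifying assumptions: `normalized_weights` and `weighted_accuracy` are hypothetical names, the alias tables are omitted, and only the domain counts (which sum to 2463) are included.

```python
# Question counts per domain, copied from the DOMAIN_RATIO numerators.
DOMAIN_COUNTS = {
    "TEXT2SQL": 612,
    "SOFTWARE_ENGINEER": 432,
    "WEB": 372,
    "EMBODIED_AI": 360,
    "OPENWORLD_QA": 360,
    "GAME": 327,
}


def _normalize_key(name):
    """Uppercase snake-style key, e.g. 'Embodied AI' -> 'EMBODIED_AI'."""
    return str(name).strip().upper().replace(" ", "_").replace("-", "_")


def normalized_weights(categories, counts=DOMAIN_COUNTS):
    """Weights proportional to question counts, summing to 1."""
    categories = list(categories)
    if not categories:
        return {}
    raw = {c: counts.get(_normalize_key(c), 0) for c in categories}
    total = sum(raw.values())
    if total <= 0:
        # No category matched a known count: fall back to equal weights.
        return {c: 1.0 / len(categories) for c in categories}
    return {c: n / total for c, n in raw.items()}


def weighted_accuracy(accuracy_by_category, counts=DOMAIN_COUNTS):
    """Weighted mean of per-category accuracies."""
    weights = normalized_weights(accuracy_by_category, counts)
    return sum(acc * weights[c] for c, acc in accuracy_by_category.items())


if __name__ == "__main__":
    # Hypothetical accuracies for two domains.
    print(weighted_accuracy({"Game": 0.5, "Web": 1.0}))
```

Renormalizing over the present categories means a partial result file still yields weights that sum to 1, at the cost of scores that are not directly comparable with entries evaluated on all domains.
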
 import gradio as gr
 import pandas as pd
 import json
 import plotly.graph_objects as go
+from plotly.subplots import make_subplots
+import numpy as np
+import os
+import datetime
+from email.utils import parseaddr
+# Optional imports with fallbacks
+try:
+    from content import format_error, format_warning, format_log
+except ImportError:
+    def format_error(msg): return f"❌ **Error:** {msg}"
+    def format_warning(msg): return f"⚠️ **Warning:** {msg}"
+    def format_log(msg): return f"✅ {msg}"
+try:
+    from scorer import score_submission, extract_uppercase_letters
+except ImportError:
+    score_submission = None
+    extract_uppercase_letters = None
+try:
+    from utils import load_groundtruth, validate_submission_file
+except ImportError:
+    load_groundtruth = None
+    validate_submission_file = None
+# Configuration
+TOKEN = os.environ.get("TOKEN", None)
+OWNER = "Pettingllms"
+GROUNDTRUTH_PATH = f"{OWNER}/AMA-bench"
+LOCAL_DEBUG = True
 # ---------------------------------------------------------------------------
 # Data loading
 # ---------------------------------------------------------------------------
+def load_json_data(path):
+    """Load JSON data from file."""
     with open(path, "r", encoding="utf-8") as f:
         return json.load(f)
+# Load all data files
+AGENT_CAPABILITY = load_json_data("data/agent_capability.json")
+AGENT_DOMAIN = load_json_data("data/agent_domain.json")
+MODEL_CAPABILITY = load_json_data("data/model_capability.json")
+MODEL_DOMAIN = load_json_data("data/model_domain.json")
 METRICS = ["Recall", "Causal Inference", "State Updating", "State Abstraction"]
+# Weighted ratios (from benchmark data distribution)
+# Exact ratios from counts
+# Domain counts total = 2463
+DOMAIN_RATIO = {
+    "TEXT2SQL": 612 / 2463,
+    "SOFTWARE_ENGINEER": 432 / 2463,
+    "WEB": 372 / 2463,
+    "EMBODIED_AI": 360 / 2463,
+    "OPENWORLD_QA": 360 / 2463,
+    "GAME": 327 / 2463,
+}
+# Problem-type counts total = 2462
+# Type A/B/C/D -> Recall/Causal Inference/State Updating/State Abstraction
+PROBLEM_TYPE_RATIO = {
+    "RECALL": 835 / 2462,            # Type A
+    "CAUSAL_INFERENCE": 578 / 2462,  # Type B
+    "STATE_UPDATING": 635 / 2462,    # Type C
+    "STATE_ABSTRACTION": 414 / 2462, # Type D
+}
+DOMAIN_ALIASES = {
+    "TEXT2SQL": "TEXT2SQL",
+    "SOFTWARE": "SOFTWARE_ENGINEER",
+    "SOFTWARE_ENGINEER": "SOFTWARE_ENGINEER",
+    "WEB": "WEB",
+    "EMBODIED_AI": "EMBODIED_AI",
+    "OPENWORLD_QA": "OPENWORLD_QA",
+    "GAME": "GAME",
+    "GAMING": "GAME",
+}
+PROBLEM_TYPE_ALIASES = {
+    "TYPE_A": "RECALL",
+    "TYPE_B": "CAUSAL_INFERENCE",
+    "TYPE_C": "STATE_UPDATING",
+    "TYPE_D": "STATE_ABSTRACTION",
+    "RECALL": "RECALL",
+    "CAUSAL": "CAUSAL_INFERENCE",
+    "CAUSAL_INFERENCE": "CAUSAL_INFERENCE",
+    "STATE": "STATE_UPDATING",
+    "STATE_UPDATING": "STATE_UPDATING",
+    "ABSTRACTION": "STATE_ABSTRACTION",
+    "STATE_ABSTRACTION": "STATE_ABSTRACTION",
+}
+def _normalize_category_key(name: str) -> str:
+    """Normalize category key to uppercase snake-style for robust matching."""
+    return str(name).strip().upper().replace(" ", "_").replace("-", "_")
+def get_category_weights(categories):
+    """Return normalized per-category weights based on configured ratios."""
+    if not categories:
+        return {}
+    normalized = [_normalize_category_key(c) for c in categories]
+    domain_hits = sum(1 for c in normalized if c in DOMAIN_ALIASES)
+    type_hits = sum(1 for c in normalized if c in PROBLEM_TYPE_ALIASES)
+    # Detect whether current dict is domain-based or capability/problem-type-based
+    use_domain = domain_hits >= type_hits
+    weights = {}
+    for original in categories:
+        key = _normalize_category_key(original)
+        if use_domain:
+            canonical = DOMAIN_ALIASES.get(key, "")
+            weight = DOMAIN_RATIO.get(canonical, 0.0)
+        else:
+            canonical = PROBLEM_TYPE_ALIASES.get(key, "")
+            weight = PROBLEM_TYPE_RATIO.get(canonical, 0.0)
+        weights[original] = weight
+    total = sum(weights.values())
+    if total <= 0:
+        equal_weight = 1.0 / len(categories)
+        return {c: equal_weight for c in categories}
+    return {c: w / total for c, w in weights.items()}
+def filter_data_by_items(data_dict, allowed_items):
+    """Filter nested score dict to only keep specified items for each category."""
+    allowed_set = set(allowed_items)
+    filtered = {}
+    for category, category_data in data_dict.items():
+        filtered[category] = {
+            item: item_data
+            for item, item_data in category_data.items()
+            if item in allowed_set
+        }
+    return filtered
+# Color palette: Distinct colors for better differentiation
+COLORS = [
+    'rgba(135, 160, 220, 0.5)',  # Light Blue
+    'rgba(230, 150, 120, 0.5)',  # Orange
+    'rgba(180, 180, 180, 0.5)',  # Gray
+    'rgba(255, 215, 100, 0.5)',  # Yellow
+    'rgba(140, 180, 220, 0.5)',  # Sky Blue
+    'rgba(140, 200, 150, 0.5)',  # Green
+    'rgba(200, 160, 140, 0.5)',  # Brown
+    'rgba(130, 140, 200, 0.5)',  # Purple-Blue
+    'rgba(255, 180, 150, 0.5)',  # Coral
+    'rgba(150, 220, 180, 0.5)',  # Mint Green
+]
 # ---------------------------------------------------------------------------
+# Submission processing functions
 # ---------------------------------------------------------------------------
+def calculate_f1_score(predictions, references):
+    """Calculate F1 score for multi-label classification."""
+    if not predictions or not references:
+        return 0.0
+    if extract_uppercase_letters is None:
+        # Fallback implementation
+        def extract_letters(text):
+            return ''.join(sorted(set(c for c in str(text) if c.isupper() and c.isalpha())))
+        extract_fn = extract_letters
+    else:
+        extract_fn = extract_uppercase_letters
+    total_precision = 0.0
+    total_recall = 0.0
+    count = 0
+    for pred, ref in zip(predictions, references):
+        pred_set = set(extract_fn(pred))
+        ref_set = set(extract_fn(ref))
+        if not pred_set and not ref_set:
+            total_precision += 1.0
+            total_recall += 1.0
+            count += 1
+        elif not pred_set or not ref_set:
+            count += 1
+        else:
+            intersection = len(pred_set & ref_set)
+            precision = intersection / len(pred_set) if pred_set else 0
+            recall = intersection / len(ref_set) if ref_set else 0
+            total_precision += precision
+            total_recall += recall
+            count += 1
+    if count == 0:
+        return 0.0
+    avg_precision = total_precision / count
+    avg_recall = total_recall / count
+    if avg_precision + avg_recall == 0:
+        return 0.0
+    f1 = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall)
+    return f1
+def update_json_with_submission(model_name, scores_by_metric, scored_submissions, is_agent=False, model_family=""):
+    """Update JSON files with new submission data."""
+    try:
+        if is_agent:
+            capability_file = "data/agent_capability.json"
+            domain_file = "data/agent_domain.json"
+        else:
+            capability_file = "data/model_capability.json"
+            domain_file = "data/model_domain.json"
+        # Load existing data
+        with open(capability_file, 'r', encoding='utf-8') as f:
+            capability_data = json.load(f)
+        # Update capability data
+        for capability in METRICS:
+            if capability in scores_by_metric and capability in capability_data:
+                metric_data = scores_by_metric[capability]
+                # Get submissions for this capability
+                capability_submissions = [
+                    s for s in scored_submissions
+                    if s.get('metric_category') == capability
+                ]
+                # Calculate F1
+                if capability_submissions:
+                    predictions = [s.get('answer', '') for s in capability_submissions]
+                    references = [s.get('reference_answer', '') for s in capability_submissions]
+                    f1 = calculate_f1_score(predictions, references)
+                else:
+                    f1 = 0.0
+                capability_data[capability][model_name] = {
+                    "accuracy": metric_data['accuracy'],
+                    "model_family": model_family,
+                    "f1": f1
+                }
+        # Save updated data
+        with open(capability_file, 'w', encoding='utf-8') as f:
+            json.dump(capability_data, f, indent=2, ensure_ascii=False)
+        print(f"✓ Updated {capability_file}")
+        return True
+    except Exception as e:
+        print(f"Error updating JSON files: {e}")
+        import traceback
+        traceback.print_exc()
+        return False
+def add_new_submission(model, submission_type, url, file, organisation, mail, model_family=""):
+    """Process and evaluate a new model/agent submission."""
+    try:
+        # Validate inputs
+        if file is None:
+            return format_warning("Please attach a file.")
+        _, parsed_mail = parseaddr(mail)
+        if "@" not in parsed_mail:
+            return format_warning("Please provide a valid email address.")
+        if not model or not submission_type or not organisation:
+            return format_warning("Please fill in all required fields.")
+        print(f"Processing submission from {organisation}/{model}")
+        # Check if functions are available
+        if validate_submission_file is None or score_submission is None or load_groundtruth is None:
+            return format_warning(
+                "Submission processing modules are not fully available. "
+                "Please ensure scorer.py and utils.py are present."
+            )
+        # Validate file
+        is_valid, error_msg, submissions = validate_submission_file(file.name)
+        if not is_valid:
+            return format_error(error_msg)
+        print(f"✓ Validated {len(submissions)} submissions")
+        # Load ground truth
+        groundtruth = load_groundtruth(GROUNDTRUTH_PATH, TOKEN)
+        if not groundtruth:
+            return format_warning(
+                "Ground truth data could not be loaded. "
+                "Submission received but cannot be scored automatically."
+            )
+        print(f"✓ Loaded {len(groundtruth)} ground truth Q&A pairs")
+        # Score submissions
+        result = score_submission(submissions, groundtruth)
+        scores_by_metric = result["scores"]
+        scored_submissions = result["scored_submissions"]
+        average_accuracy = scores_by_metric["Average"]["accuracy"]
+        print(f"✓ Overall accuracy: {average_accuracy:.4f}")
+        for metric_name, metric_data in scores_by_metric.items():
+            if metric_name != "Average":
+                print(f"  {metric_name}: {metric_data['accuracy']:.4f} ({metric_data['correct']}/{metric_data['count']})")
+        # Save locally
+        submission_dir = f"submissions/{organisation}_{model}"
+        os.makedirs(submission_dir, exist_ok=True)
+        timestamp = datetime.datetime.today().strftime('%Y%m%d_%H%M%S')
+        # Save files
+        scored_file = f"{submission_dir}/submission_scored_{timestamp}.jsonl"
+        with open(scored_file, 'w', encoding='utf-8') as f:
+            for submission in scored_submissions:
+                f.write(json.dumps(submission, ensure_ascii=False) + "\n")
+        metadata = {
+            "model": model,
+            "submission_type": submission_type,
+            "url": url,
+            "organisation": organisation,
+            "timestamp": timestamp,
+            "overall_accuracy": float(average_accuracy),
+            "scores_by_metric": {
+                metric_name: {
+                    "accuracy": float(metric_data["accuracy"]),
+                    "count": int(metric_data["count"]),
+                    "correct": int(metric_data["correct"])
+                }
+                for metric_name, metric_data in scores_by_metric.items()
+            }
+        }
+        metadata_file = f"{submission_dir}/metadata_{timestamp}.json"
+        with open(metadata_file, 'w', encoding='utf-8') as f:
+            json.dump(metadata, f, indent=2, ensure_ascii=False)
+        print(f"✓ Saved results to {submission_dir}")
+        # Update JSON files
+        is_agent = (submission_type.lower() == "agent")
+        update_success = update_json_with_submission(
+            model, scores_by_metric, scored_submissions, is_agent=is_agent, model_family=model_family
+        )
+        if update_success:
+            print("✓ Updated leaderboard JSON files")
+            # Reload data
+            global AGENT_CAPABILITY, AGENT_DOMAIN, MODEL_CAPABILITY, MODEL_DOMAIN
+            if is_agent:
+                AGENT_CAPABILITY = load_json_data("data/agent_capability.json")
+                AGENT_DOMAIN = load_json_data("data/agent_domain.json")
+            else:
+                MODEL_CAPABILITY = load_json_data("data/model_capability.json")
+                MODEL_DOMAIN = load_json_data("data/model_domain.json")
+        # Format message
+        message = f"✅ **Submission successful!**\n\n"
+        message += f"**{'Agent' if is_agent else 'Model'}:** {model}\n"
+        message += f"**Organisation:** {organisation}\n"
+        message += f"**Overall Accuracy:** {average_accuracy:.4f}\n\n"
+        message += "**Scores by Capability:**\n"
+        for metric_name in METRICS:
+            if metric_name in scores_by_metric:
+                metric_data = scores_by_metric[metric_name]
+                message += f"- **{metric_name}:** {metric_data['accuracy']:.4f} ({metric_data['correct']}/{metric_data['count']})\n"
+        message += f"\n**Submission ID:** {timestamp}\n"
+        if update_success:
+            message += f"\n*The leaderboard has been updated. Refresh the page to see changes.*"
+        return format_log(message)
+    except Exception as e:
+        import traceback
+        traceback.print_exc()
+        return format_error(f"An error occurred: {str(e)}")
 # ---------------------------------------------------------------------------
+# Visualization functions
 # ---------------------------------------------------------------------------
+def create_radar_chart_from_dict(data_dict, title="Performance Radar Chart", top_n=10):
+    """
+    Create radar chart from dictionary data showing top N entries.
+    Args:
+        data_dict: Dictionary with structure {category: {item_name: {accuracy: x, f1: y}}}
+        title: Chart title
+        top_n: Number of top entries to display (default 10)
+    Returns:
+        Plotly Figure with radar chart (showing only accuracy)
+    """
+    if not data_dict:
+        fig = go.Figure()
+        fig.update_layout(title="No data available")
+        return fig
+    # Extract categories and items
+    categories = list(data_dict.keys())
+    all_items = set()
+    for category_data in data_dict.values():
+        all_items.update(category_data.keys())
+    # Calculate weighted average accuracy for each item to determine top N
+    category_weights = get_category_weights(categories)
+    item_avg_scores = {}
+    for item in all_items:
+        weighted_sum = 0.0
+        weight_sum = 0.0
+        for category in categories:
+            item_data = data_dict[category].get(item, {})
+            accuracy = item_data.get('accuracy', 0) if isinstance(item_data, dict) else item_data
+            weight = category_weights.get(category, 0.0)
+            weighted_sum += accuracy * weight
+            weight_sum += weight
+        item_avg_scores[item] = (weighted_sum / weight_sum) if weight_sum > 0 else 0
+    # Get top N items by average accuracy
+    sorted_items = sorted(item_avg_scores.items(), key=lambda x: x[1], reverse=True)
+    top_items = [item[0] for item in sorted_items[:top_n]]
     fig = go.Figure()
+    # Add trace for each top item
+    for idx, item in enumerate(top_items):
+        values = []
+        for category in categories:
+            item_data = data_dict[category].get(item, {})
+            # Extract accuracy value only
+            accuracy = item_data.get('accuracy', 0) if isinstance(item_data, dict) else item_data
+            values.append(accuracy * 100)  # Convert to percentage
+        # Close the polygon
+        values_closed = values + [values[0]]
+        categories_closed = categories + [categories[0]]
+        color = COLORS[idx % len(COLORS)]
+        fig.add_trace(go.Scatterpolar(
+            r=values_closed,
+            theta=categories_closed,
+            mode='lines+markers',
+            fill='toself',
+            name=item,
+            line=dict(color=color, width=2),
+            marker=dict(color=color, size=8),
+            fillcolor=color.replace('0.5', '0.15'),
+            hovertemplate='<b>%{fullData.name}</b><br>%{theta}: %{r:.2f}%<extra></extra>'
+        ))
+    # Update layout
     fig.update_layout(
+        title=dict(
+            text=title,
+            x=0.5,
+            xanchor='center',
+            font=dict(size=20, color='#2c3e50')
+        ),
+        polar=dict(
+            radialaxis=dict(
+                visible=True,
+                range=[0, 100],
+                ticksuffix='%',
+                tickfont=dict(size=11),
+                gridcolor='rgba(200, 200, 200, 0.3)',
+                gridwidth=1
+            ),
+            angularaxis=dict(
+                tickfont=dict(size=13, weight='bold', color='#2c3e50')
+            ),
+            bgcolor='rgba(245, 245, 245, 0.5)'
+        ),
         legend=dict(
+            font=dict(size=11),
+            title=dict(text="Items", font=dict(size=13)),
+            x=1.02,
+            y=1,
+            xanchor='left',
+            yanchor='top',
+            bgcolor='rgba(255,255,255,0.8)',
+            bordercolor='rgba(100,100,100,0.3)',
+            borderwidth=1,
+            itemclick="toggleothers",
+            itemdoubleclick="toggle"
         ),
+        height=600,
+        margin=dict(l=80, r=250, t=100, b=80),
+        paper_bgcolor='white',
+        font=dict(color='#2c3e50')
     )
     return fig
+def create_capability_subplots(data_dict, title="Capability Performance", top_n=10):
+    """
+    Create 2x2 subplot layout with one bar chart per capability, showing top N entries.
+    Optimized for responsive sizing with equal spacing across all subplots.
+    Args:
+        data_dict: Dictionary with structure {capability: {item_name: {accuracy: x, f1: y}}}
+        title: Overall chart title
+        top_n: Number of top entries to display per subplot (default 10)
+    Returns:
+        Plotly Figure with 2x2 subplots (showing only accuracy)
+    """
+    if not data_dict:
+        fig = go.Figure()
+        fig.update_layout(title="No data available")
+        return fig
+    # Extract capabilities
+    capabilities = list(data_dict.keys())
+    # Create 2x2 subplot with optimized spacing for full window coverage
+    fig = make_subplots(
+        rows=2, cols=2,
+        subplot_titles=capabilities[:4],
+        vertical_spacing=0.15,  # Increased for better separation
+        horizontal_spacing=0.12,  # Balanced horizontal spacing
+        specs=[[{"secondary_y": False}, {"secondary_y": False}],
+               [{"secondary_y": False}, {"secondary_y": False}]]
+    )
+    # Position mapping for 2x2 grid
+    positions = [(1, 1), (1, 2), (2, 1), (2, 2)]
+    # Get all unique items across all capabilities for consistent coloring
+    all_items = set()
+    for capability_data in data_dict.values():
+        all_items.update(capability_data.keys())
+    all_items = sorted(list(all_items))
+    # Create a bar chart for each capability
+    for idx, capability in enumerate(capabilities[:4]):
+        row, col = positions[idx]
+        capability_data = data_dict[capability]
+        # Sort items by accuracy score for this capability and get top N
+        sorted_items = sorted(
+            capability_data.items(),
+            key=lambda x: x[1].get('accuracy', 0) if isinstance(x[1], dict) else x[1],
+            reverse=True
+        )[:top_n]
+        item_names = [item[0] for item in sorted_items]
+        item_scores = [
+            (item[1].get('accuracy', 0) if isinstance(item[1], dict) else item[1]) * 100
+            for item in sorted_items
+        ]
+        # Assign colors based on global item index
+        colors = [COLORS[all_items.index(name) % len(COLORS)] for name in item_names]
+        fig.add_trace(
+            go.Bar(
+                x=item_names,
+                y=item_scores,
+                marker=dict(
+                    color=colors,
+                    line=dict(color='rgba(50, 50, 50, 0.5)', width=1)
+                ),
+                showlegend=False,
+                hovertemplate='<b>%{x}</b><br>Score: %{y:.2f}%<extra></extra>',
+                width=0.7
+            ),
+            row=row, col=col
+        )
+        # Update axes with consistent styling
+        fig.update_xaxes(
+            tickangle=-45,
+            tickfont=dict(size=9),
+            tickmode='linear',
+            row=row, col=col,
+            showgrid=False,
+            showline=True,
+            linewidth=1,
+            linecolor='rgba(200, 200, 200, 0.5)'
+        )
+        fig.update_yaxes(
+            range=[0, 100],
+            title_text="Performance (%)",
+            title_font=dict(size=12),
+            tickfont=dict(size=10),
+            gridcolor='rgba(200, 200, 200, 0.3)',
+            row=row, col=col,
+            showline=True,
+            linewidth=1,
+            linecolor='rgba(200, 200, 200, 0.5)'
+        )
+    # Update overall layout with fully responsive sizing
+    fig.update_layout(
+        title=dict(
+            text=title,
+            x=0.5,
+            xanchor='center',
+            font=dict(size=20, color='#2c3e50')
+        ),
+        height=900,  # Increased height for better proportions
+        autosize=True,
+        showlegend=False,
+        plot_bgcolor='rgba(245, 245, 245, 0.5)',
+        paper_bgcolor='white',
+        font=dict(color='#2c3e50', family="Arial, sans-serif"),
+        margin=dict(l=80, r=80, t=100, b=120),  # Increased margins for better spacing
+        hovermode='closest'
+    )
+    # Update subplot titles styling
+    for annotation in fig['layout']['annotations']:
+        annotation['font'] = dict(size=14, color='#2c3e50')
+        annotation['xanchor'] = 'center'
+        annotation['showarrow'] = False
+    return fig
+def create_summary_table(data_dict, type_name="Agent"):
+    """
+    Create summary table showing rank, average accuracy and F1 scores.
+    Args:
+        data_dict: Dictionary with structure {category: {item_name: {accuracy: x, f1: y}}}
+        type_name: "Agent" or "Model"
+    Returns:
+        pandas DataFrame with rank, accuracy and F1 columns
+    """
+    if not data_dict:
+        return pd.DataFrame()
+    # Calculate average scores for each item
+    items = set()
+    for category_data in data_dict.values():
+        items.update(category_data.keys())
+    categories = list(data_dict.keys())
+    category_weights = get_category_weights(categories)
+    rows = []
+    for item in sorted(items):
+        weighted_accuracy_sum = 0.0
+        weighted_f1_sum = 0.0
+        used_weight_sum = 0.0
+        model_family = ""
+        for category, category_data in data_dict.items():
+            if item in category_data:
+                item_data = category_data[item]
+                weight = category_weights.get(category, 0.0)
+                if isinstance(item_data, dict):
+                    weighted_accuracy_sum += item_data.get('accuracy', 0) * weight
+                    weighted_f1_sum += item_data.get('f1', 0) * weight
+                    used_weight_sum += weight
+                    if not model_family:
+                        model_family = item_data.get('model_family', '')
+                else:
+                    weighted_accuracy_sum += item_data * weight
+                    used_weight_sum += weight
+        avg_accuracy = (weighted_accuracy_sum / used_weight_sum) if used_weight_sum > 0 else 0
+        avg_f1 = (weighted_f1_sum / used_weight_sum) if used_weight_sum > 0 else 0
+        rows.append({
+            type_name: item,
+            "Model Family": model_family,
+            "Avg Accuracy": avg_accuracy,
+            "Avg F1": avg_f1,
+            "_acc_sort": avg_accuracy
+        })
+    df = pd.DataFrame(rows)
+    df = df.sort_values(by="_acc_sort", ascending=False).reset_index(drop=True)
+    # Add rank column with medals for top 3
+    medals = ["🥇", "🥈", "🥉"]
+    ranks = []
+    for i in range(len(df)):
+        if i < 3:
+            ranks.append(f"{medals[i]} {i+1}")
+        else:
+            ranks.append(str(i+1))
+    df.insert(0, "Rank", ranks)
+    # Format accuracy and F1 as percentages
+    df["Avg Accuracy"] = df["Avg Accuracy"].apply(lambda x: f"{x * 100:.2f}%")
+    df["Avg F1"] = df["Avg F1"].apply(lambda x: f"{x * 100:.2f}%")
+    # Drop sorting column
+    df = df.drop(columns=["_acc_sort"])
+    return df
 # ---------------------------------------------------------------------------
+# Build Gradio interface
 # ---------------------------------------------------------------------------
+def build_app():
+    """Build the Gradio application."""
+    CSS = """
+    .markdown-text {
+        font-size: 16px !important;
+    }
+    .intro-box {
+        background: linear-gradient(135deg, rgba(26, 188, 156, 0.1) 0%, rgba(52, 152, 219, 0.1) 100%);
+        padding: 25px;
+        border-radius: 10px;
+        margin: 20px 0;
+        border-left: 4px solid #1abc9c;
+    }
+    """
+    # Keep Model Domain view strictly model-only (prevents accidental agent entries)
+    model_items = set()
+    for capability_data in MODEL_CAPABILITY.values():
+        model_items.update(capability_data.keys())
+    model_domain_filtered = filter_data_by_items(MODEL_DOMAIN, model_items)
+    if not any(len(category_data) > 0 for category_data in model_domain_filtered.values()):
+        # If model_domain.json is polluted with non-model entries, avoid showing wrong (agent) curves
+        model_domain_filtered = {}
+    with gr.Blocks(css=CSS, title="AMA-Bench Leaderboard", theme=gr.themes.Soft()) as demo:
         # Header
         gr.HTML("""
+        <div style="text-align: center; padding: 10px 20px; margin-bottom: 20px;">
+            <h1 style="margin: 0; font-size: 48px; font-weight: 700; color: #1a1a2e;">
+                🤖 AMA-Bench: Leaderboard
+            </h1>
+            <p style="font-size: 18px; color: #666; margin-top: 10px;">
+                Agent Memory Assessment Benchmark - Performance Visualization
+            </p>
+        </div>
+        """)
+        # Welcome Banner
+        gr.HTML("""
+        <div class="intro-box">
+            <h3 style="margin: 0 0 15px 0; color: #1abc9c; font-size: 24px;">
+                🎯 Welcome to AMA-Bench!
+            </h3>
+            <p style="margin: 15px 0; color: #2c3e50; font-size: 22px; font-weight: 700; line-height: 1.6;">
+                Evaluate agent memory itself, not just dialogue.
+            </p>
+            <p style="margin: 10px 0; color: #2c3e50; font-size: 16px; line-height: 1.6;">
+                Built from real agent environment streams and scalable long-horizon trajectories across
+                representative domains, AMA-Bench tests whether LLM agents can <strong>recall</strong>,
+                perform <strong>causal inference</strong>, <strong>update state</strong>, and
+                <strong>abstract</strong> state information over long runs.
+            </p>
+            <p style="margin: 10px 0; color: #34495e; font-size: 14px;">
+                📄 Paper: <a href="https://arxiv.org/abs/2602.22769" style="color: #3498db;">https://arxiv.org/abs/2602.22769</a>
+            </p>
         </div>
         """)
         with gr.Tabs():
             # ============================================================
+            # Tab 1: Agent Performance
             # ============================================================
+            with gr.Tab("🤖 Agent Performance"):
                 gr.Markdown("""
+                ### Agent Performance Analysis
+                Explore agent performance across different domains and capabilities.
                 """)
+                with gr.Tabs():
+                    # Domain Sub-tab (Radar Chart)
+                    with gr.Tab("🎯 Domain Performance"):
+                        gr.Markdown("""
+                        **Radar chart** showing agent performance across different domains.
+                        Click legend items to isolate specific agents.
+                        """)
+                        with gr.Row():
+                            agent_domain_top_n = gr.Slider(
+                                minimum=1,
+                                maximum=10,
+                                value=8,
+                                step=1,
+                                label="Show Top N Agents",
+                                info="Select how many top agents to display (1-10)"
+                            )
+                        agent_domain_chart = gr.Plot(
+                            value=create_radar_chart_from_dict(
+                                AGENT_DOMAIN,
+                                "Agent Performance Across Domains",
+                                top_n=8
+                            )
+                        )
+                        with gr.Accordion("📊 Summary Statistics", open=True):
+                            agent_domain_table = gr.Dataframe(
+                                value=create_summary_table(AGENT_DOMAIN, "Agent"),
+                                label="Average Domain Scores"
+                            )
+                        # Update chart when slider changes
+                        agent_domain_top_n.change(
+                            fn=lambda n: create_radar_chart_from_dict(
+                                AGENT_DOMAIN,
+                                "Agent Performance Across Domains",
+                                top_n=int(n)
+                            ),
+                            inputs=[agent_domain_top_n],
+                            outputs=[agent_domain_chart]
+                        )
+                    # Capability Sub-tab (Bar Chart)
+                    with gr.Tab("⚡ Capability Performance"):
+                        gr.Markdown("""
+                        **Bar charts** showing agent performance for each capability.
+                        Each subplot represents one capability with comparative performance across all agents.
+                        """)
+                        with gr.Row():
+                            agent_capability_top_n = gr.Slider(
+                                minimum=1,
+                                maximum=10,
+                                value=8,
+                                step=1,
+                                label="Show Top N Agents",
+                                info="Select how many top agents to display per capability (1-10)"
+                            )
+                        agent_capability_chart = gr.Plot(
+                            value=create_capability_subplots(
+                                AGENT_CAPABILITY,
+                                "Agent Performance by Capability",
+                                top_n=8
+                            )
+                        )
+                        with gr.Accordion("📊 Summary Statistics", open=True):
+                            agent_capability_table = gr.Dataframe(
+                                value=create_summary_table(AGENT_CAPABILITY, "Agent"),
+                                label="Average Capability Scores"
+                            )
+                        # Update chart when slider changes
+                        agent_capability_top_n.change(
+                            fn=lambda n: create_capability_subplots(
+                                AGENT_CAPABILITY,
+                                "Agent Performance by Capability",
+                                top_n=int(n)
+                            ),
+                            inputs=[agent_capability_top_n],
+                            outputs=[agent_capability_chart]
+                        )
             # ============================================================
+            # Tab 2: Model Performance
             # ============================================================
+            with gr.Tab("🔬 Model Performance"):
                 gr.Markdown("""
+                ### Model Performance Analysis
+                Explore model performance across different domains and capabilities.
+                """)
+                with gr.Tabs():
+                    # Domain Sub-tab (Radar Chart)
+                    with gr.Tab("🎯 Domain Performance"):
+                        gr.Markdown("""
+                        **Radar chart** showing model performance across different domains.
+                        Click legend items to isolate specific models.
+                        """)
+                        with gr.Row():
+                            model_domain_top_n = gr.Slider(
+                                minimum=1,
+                                maximum=10,
+                                value=8,
+                                step=1,
+                                label="Show Top N Models",
+                                info="Select how many top models to display (1-10)"
+                            )
+                        model_domain_chart = gr.Plot(
+                            value=create_radar_chart_from_dict(
+                                model_domain_filtered,
+                                "Model Performance Across Domains",
+                                top_n=8
+                            )
+                        )
+                        with gr.Accordion("📊 Summary Statistics", open=True):
+                            model_domain_table = gr.Dataframe(
+                                value=create_summary_table(model_domain_filtered, "Model"),
+                                label="Average Domain Scores"
+                            )
+                        # Update chart when slider changes
+                        model_domain_top_n.change(
+                            fn=lambda n: create_radar_chart_from_dict(
+                                model_domain_filtered,
+                                "Model Performance Across Domains",
+                                top_n=int(n)
+                            ),
+                            inputs=[model_domain_top_n],
+                            outputs=[model_domain_chart]
+                        )
+                    # Capability Sub-tab (Bar Chart)
+                    with gr.Tab("⚡ Capability Performance"):
+                        gr.Markdown("""
+                        **Bar charts** showing model performance for each capability.
+                        Each subplot represents one capability with comparative performance across all models.
+                        """)
+                        with gr.Row():
+                            model_capability_top_n = gr.Slider(
+                                minimum=1,
+                                maximum=10,
+                                value=8,
+                                step=1,
+                                label="Show Top N Models",
+                                info="Select how many top models to display per capability (1-10)"
+                            )
+                        model_capability_chart = gr.Plot(
+                            value=create_capability_subplots(
+                                MODEL_CAPABILITY,
+                                "Model Performance by Capability",
+                                top_n=8
+                            )
+                        )
+                        with gr.Accordion("📊 Summary Statistics", open=True):
+                            model_capability_table = gr.Dataframe(
+                                value=create_summary_table(MODEL_CAPABILITY, "Model"),
+                                label="Average Capability Scores"
+                            )
+                        # Update chart when slider changes
+                        model_capability_top_n.change(
+                            fn=lambda n: create_capability_subplots(
+                                MODEL_CAPABILITY,
+                                "Model Performance by Capability",
+                                top_n=int(n)
+                            ),
+                            inputs=[model_capability_top_n],
+                            outputs=[model_capability_chart]
+                        )
+            # ============================================================
+            # Tab 3: Submit
+            # ============================================================
+            with gr.Tab("📤 Submit"):
+                gr.Markdown("""
+                ### Submit Your Model/Agent for Evaluation
+                Submit your model or agent predictions to be evaluated on AMA-Bench.
+                Your results will be automatically scored and added to the leaderboard.
                 """)
                 with gr.Row():
+                    with gr.Column():
+                        model_name_textbox = gr.Textbox(
+                            label="Model/Agent Name",
+                            placeholder="e.g., GPT-4 or MyAgent-v2"
+                        )
+                        submission_type = gr.Radio(
+                            choices=["Model", "Agent"],
+                            label="Submission Type",
+                            value="Model"
+                        )
+                        url_textbox = gr.Textbox(
+                            label="URL to Model/Agent Information",
+                            placeholder="https://..."
+                        )
+                    with gr.Column():
+                        organisation = gr.Textbox(
+                            label="Organisation",
+                            placeholder="e.g., OpenAI, Anthropic"
+                        )
+                        model_family_textbox = gr.Textbox(
+                            label="Model Family",
+                            placeholder="e.g., GPT-4, Claude-3, Qwen3-32B"
+                        )
+                        mail = gr.Textbox(
+                            label="Contact Email",
+                            placeholder="your.email@example.com"
+                        )
+                        file_upload = gr.File(
+                            label="Submission File (JSONL format)",
+                            file_types=[".jsonl"]
+                        )
+                gr.Markdown("""
+                **Submission Format:**
+                Your JSONL file should contain one prediction per line:
+                ```json
+                {"episode_id": "ep_001", "question": "What is X?", "answer": "A"}
+                {"episode_id": "ep_002", "question": "What is Y?", "answer": "BC"}
+                ```
+                **Required fields:**
+                - `episode_id`: Episode identifier
+                - `question`: The question text
+                - `answer`: Your model's answer (uppercase letters: A, B, AB, etc.)
+                """)
                 with gr.Row():
+                    submit_button = gr.Button("Submit", variant="primary", size="lg")
+                submission_result = gr.Markdown()
+                submit_button.click(
+                    add_new_submission,
+                    [
+                        model_name_textbox,
+                        submission_type,
+                        url_textbox,
+                        file_upload,
+                        organisation,
+                        mail,
+                        model_family_textbox
+                    ],
+                    submission_result,
                 )
             # ============================================================
+            # Tab 4: About
             # ============================================================
+            with gr.Tab("ℹ️ About"):
                 gr.Markdown("""
 ## AMA-Bench: Agent Memory Assessment Benchmark
 AMA-Bench evaluates memory capabilities of LLMs and memory-augmented agents across four cognitive dimensions:
+**Recall** (retrieving stored info), **Causal Inference** (cause-and-effect reasoning),
+**State Updating** (tracking evolving states), and **State Abstraction** (forming higher-level representations).
+### Benchmarks
+We evaluate on two complementary subsets:
+1. **Real-world Subset:** 2,496 QA pairs from real agent environment streams
+2. **Synthetic Subset:** 1,200 QA pairs stratified across five trajectory lengths (8K, 16K, 32K, 64K, and 128K tokens)
+### Leaderboard Tabs
+- **Agent Performance**: Compares RAG and Agent Memory methods
+  - Domain Performance: Radar charts across 6 domains (Game, Embodied AI, Web, Text2SQL, Openworld QA, Software Engineer)
+  - Capability Performance: Bar charts showing performance on 4 capabilities
+  - **Top N Selection**: Choose to display top 1-10 performers
+- **Model Performance**: Compares LLMs directly
+  - Domain Performance: Radar charts showing performance across different application domains
+  - Capability Performance: Bar charts showing performance on each cognitive capability
+  - **Top N Selection**: Choose to display top 1-10 performers
+### Metrics
+Results are reported as **Accuracy** and **F1 Score**:
+- Charts display **Accuracy** only for clarity
+- Summary statistics tables show both **Avg Accuracy** and **Avg F1**
+- Tables include **Rank** with 🥇🥈🥉 medals for top 3 performers
+### Visualization Features
+- **Interactive Charts**: Click a legend item to isolate that entry, double-click to toggle its visibility
+- **Color Scheme**: Distinct color palette for optimal differentiation between entries
+- **Top N Filter**: Dynamic slider to select how many top performers to display (1-10)
+- **Hover Details**: Hover over data points for detailed performance information
+- **Zoom & Pan**: Use chart controls to explore data interactively
 ---
+**Paper:** [https://arxiv.org/abs/2602.22769](https://arxiv.org/abs/2602.22769)
 *For questions or submissions, please open a discussion in the Community tab.*
                 """)

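Note on ranking: both `create_summary_table` and `create_radar_chart_from_dict` order entries by a weighted mean of per-category accuracy. `get_category_weights` is not shown in this part of the diff, so the sketch below assumes uniform weights; `weighted_accuracy` and the sample data are hypothetical names used only for illustration and are not part of the commit.

```python
def weighted_accuracy(data_dict, item, weights):
    """Weighted mean accuracy of `item` over the categories where it appears."""
    total = 0.0
    weight_sum = 0.0
    for category, entries in data_dict.items():
        if item not in entries:
            continue  # skip categories with no result, as create_summary_table does
        entry = entries[item]
        # Entries may be dicts ({"accuracy": ..., "f1": ...}) or bare numbers.
        accuracy = entry.get("accuracy", 0) if isinstance(entry, dict) else entry
        weight = weights.get(category, 0.0)
        total += accuracy * weight
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else 0.0


# Illustrative data only; not taken from the leaderboard files.
sample = {
    "Recall": {"A": {"accuracy": 0.6, "f1": 0.4}, "B": {"accuracy": 0.3, "f1": 0.2}},
    "State Updating": {"A": {"accuracy": 0.4, "f1": 0.3}},
}
equal_weights = {category: 1.0 for category in sample}  # assumption: uniform weights

scores = {name: weighted_accuracy(sample, name, equal_weights) for name in ("A", "B")}
ranking = sorted(scores, key=scores.get, reverse=True)
```

One difference worth knowing: the summary table skips categories where an entry has no result, while the radar chart helper counts a missing category as accuracy 0 and still adds its weight. For entries with partial coverage the table order and the chart's top N can therefore differ.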
assets/model_colors.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "comment": "Color scheme for AMA-Bench leaderboard visualizations",
+  "models": {
+    "Claude Haiku 3.5": "#4A90E2",
+    "GPT-5-mini": "#00BFA5",
+    "GPT 5.2": "#00796B",
+    "Gemini 2.5 Flash": "#FF4081",
+    "Qwen2.5-14B-1M": "#FFC107",
+    "Qwen3-32B": "#FFB300",
+    "Qwen3-14B": "#FFA000",
+    "Qwen3-8B": "#FF8F00"
+  },
+  "methods": {
+    "BM25": "#9E9E9E",
+    "Qwen3-Emb-4B": "#FFA726",
+    "GraphRAG": "#FF7043",
+    "HippoRAG2": "#FF5722",
+    "MemAgent": "#7E57C2",
+    "Mem1": "#5E35B1",
+    "Amem": "#673AB7",
+    "Mem0": "#512DA8",
+    "MemoRAG": "#4527A0",
+    "MemGPT": "#311B92",
+    "Mem-alpha": "#6A1B9A",
+    "MemoryBank": "#8E24AA",
+    "Simple Mem": "#9C27B0",
+    "AMA Agent": "#00897B"
+  },
+  "fallback": "#808080"
+}
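
Note on using this file: how `app.py` consumes `model_colors.json` is not visible in this part of the diff, so the following is only a sketch of one plausible lookup. `color_for` is a hypothetical helper. It matches names case-insensitively because the keys here (`GraphRAG`, `HippoRAG2`, `MemAgent`) differ in case from the keys in `data/agent_capability.json` (`GRAPHRAG`, `Hipporag2`, `Memagent`); a case-sensitive lookup would send those entries to the fallback color.

```python
import json

# Inline copy of a few entries so the sketch runs on its own; the full
# mapping lives in assets/model_colors.json.
COLOR_CONFIG = json.loads("""
{
  "models": {"Qwen3-32B": "#FFB300"},
  "methods": {"BM25": "#9E9E9E", "GraphRAG": "#FF7043"},
  "fallback": "#808080"
}
""")


def color_for(name, config=COLOR_CONFIG):
    """Return the hex color for a model or method name, else the fallback."""
    wanted = name.casefold()
    for section in ("models", "methods"):
        for key, color in config.get(section, {}).items():
            if key.casefold() == wanted:
                return color
    return config.get("fallback", "#808080")
```

Names that differ by more than case (for example `Qwen3-Embedding-4B` in the data versus `Qwen3-Emb-4B` here) would still fall back under this sketch.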

content.py ADDED

	@@ -0,0 +1,56 @@

+TITLE = """<h1 align="center" id="space-title">AMA-Bench Leaderboard</h1>"""
+INTRODUCTION_TEXT = """
+AMA-Bench evaluates the memory capabilities of LLMs and memory-augmented agents across four cognitive dimensions:
+**Recall** (retrieving stored information), **Causal Inference** (cause-and-effect reasoning), **State Updating** (tracking evolving states), and **State Abstraction** (forming higher-level representations).
+## Leaderboard
+Our leaderboard presents results for the multiple-choice subset, which provides objective and easier-to-score evaluation.
+See below for submission details.
+"""
+SUBMISSION_TEXT = """
+## Submissions
+Results can be submitted for evaluation. Each submission should contain answers for all questions in the benchmark.
+We expect submissions to be JSON Lines files with the following format:
+```
+{"episode_id": "traj_id_1", "answer_list": ["(A)", "(B)(C)", "(D)"], "reasoning_trace": "optional"}
+```
+**Required fields:**
+- `episode_id`: The episode identifier
+- `answer_list`: Your model's answer list for the questions in the episode (a list of strings, e.g., ["(A)", "(B)(C)", "(D)"])
+- `reasoning_trace`: (Optional) The reasoning process or explanation for the answers
+"""
+CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
+CITATION_BUTTON_TEXT = r"""@misc{ama-bench,
+      title={AMA-Bench: Agent Memory Assessment Benchmark},
+      author={AMA-Bench Team},
+      year={2024}
+}"""
+def format_error(msg):
+    """Format error message with red styling."""
+    return f"<p style='color: red; font-size: 20px; text-align: center;'>{msg}</p>"
+def format_warning(msg):
+    """Format warning message with orange styling."""
+    return f"<p style='color: orange; font-size: 20px; text-align: center;'>{msg}</p>"
+def format_log(msg):
+    """Format success message with green styling."""
+    return f"<p style='color: green; font-size: 20px; text-align: center;'>{msg}</p>"
+def model_hyperlink(link, model_name):
+    """Create a hyperlink to the model information."""
+    if not link or link.strip() == "":
+        return model_name
+    return f'<a target="_blank" href="{link}" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">{model_name}</a>'
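
Note on the submission format: `SUBMISSION_TEXT` above describes records with `episode_id` and `answer_list`, whereas the Submit tab in `app.py` documents `episode_id`, `question`, and `answer`. The sketch below checks one line against the `SUBMISSION_TEXT` shape only. `check_submission_line` is a hypothetical helper written for this note; it is not the commit's `validate_jsonl.py`, which is not shown here.

```python
import json


def check_submission_line(line):
    """Return a list of problems found in one JSONL record (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    if not isinstance(record.get("episode_id"), str):
        problems.append("episode_id must be a string")
    answers = record.get("answer_list")
    if not isinstance(answers, list) or not all(isinstance(a, str) for a in answers):
        problems.append("answer_list must be a list of strings")
    # reasoning_trace is optional, so it is not checked here.
    return problems


good = '{"episode_id": "traj_id_1", "answer_list": ["(A)", "(B)(C)", "(D)"]}'
bad = '{"episode_id": "traj_id_1", "answer": "A"}'
```

A record in the `app.py` style (`bad` above) fails this check, which is one way to surface the mismatch between the two format descriptions before a file is scored.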

data/agent_capability.json ADDED

	@@ -0,0 +1,270 @@

+{
+  "Recall": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.47196666666666665,
+      "model_family": "Qwen3-32B",
+      "f1": 0.14795
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.31029999999999996,
+      "model_family": "Qwen3-32B",
+      "f1": 0.28025
+    },
+    "Hipporag2": {
+      "accuracy": 0.4413833333333333,
+      "model_family": "Qwen3-32B",
+      "f1": 0.23165
+    },
+    "Memagent": {
+      "accuracy": 0.2511333333333334,
+      "model_family": "Qwen3-32B",
+      "f1": 0.13931666666666667
+    },
+    "Mem1": {
+      "accuracy": 0.12108333333333333,
+      "model_family": "Qwen3-32B",
+      "f1": 0.18071666666666666
+    },
+    "Amem": {
+      "accuracy": 0.29723333333333335,
+      "model_family": "Qwen3-32B",
+      "f1": 0.26671666666666666
+    },
+    "Mem0": {
+      "accuracy": 0.20451666666666668,
+      "model_family": "Qwen3-32B",
+      "f1": 0.24041666666666664
+    },
+    "Memorag": {
+      "accuracy": 0.44153333333333333,
+      "model_family": "Qwen3-32B",
+      "f1": 0.16653333333333334
+    },
+    "Memgpt": {
+      "accuracy": 0.32865,
+      "model_family": "Qwen3-32B",
+      "f1": 0.12778333333333333
+    },
+    "Mem-alpha": {
+      "accuracy": 0.28221666666666667,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2279
+    },
+    "Memorybank": {
+      "accuracy": 0.32088333333333335,
+      "model_family": "Qwen3-32B",
+      "f1": 0.31371666666666664
+    },
+    "Simple mem": {
+      "accuracy": 0.18241666666666667,
+      "model_family": "Qwen3-32B",
+      "f1": 0.20383333333333334
+    },
+    "Long context": {
+      "accuracy": 0.6036833333333333,
+      "model_family": "Qwen3-32B",
+      "f1": 0.4152833333333333
+    }
+  },
+  "Casual Inference": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.48618333333333336,
+      "model_family": "Qwen3-32B",
+      "f1": 0.14101666666666665
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.4079333333333333,
+      "model_family": "Qwen3-32B",
+      "f1": 0.27426666666666666
+    },
+    "Hipporag2": {
+      "accuracy": 0.4965,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1859666666666667
+    },
+    "Memagent": {
+      "accuracy": 0.33666666666666667,
+      "model_family": "Qwen3-32B",
+      "f1": 0.14706666666666665
+    },
+    "Mem1": {
+      "accuracy": 0.1495,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1698666666666667
+    },
+    "Amem": {
+      "accuracy": 0.37051666666666666,
+      "model_family": "Qwen3-32B",
+      "f1": 0.27376666666666666
+    },
+    "Mem0": {
+      "accuracy": 0.27725,
+      "model_family": "Qwen3-32B",
+      "f1": 0.24518333333333334
+    },
+    "Memorag": {
+      "accuracy": 0.5261,
+      "model_family": "Qwen3-32B",
+      "f1": 0.16540000000000002
+    },
+    "Memgpt": {
+      "accuracy": 0.4437333333333333,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1383
+    },
+    "Mem-alpha": {
+      "accuracy": 0.4193166666666667,
+      "model_family": "Qwen3-32B",
+      "f1": 0.19181666666666666
+    },
+    "Memorybank": {
+      "accuracy": 0.42110000000000003,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2900333333333333
+    },
+    "Simple mem": {
+      "accuracy": 0.18955,
+      "model_family": "Qwen3-32B",
+      "f1": 0.16668333333333332
+    },
+    "Long context": {
+      "accuracy": 0.5399999999999999,
+      "model_family": "Qwen3-32B",
+      "f1": 0.34326666666666666
+    }
+  },
+  "State Updating": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.3541,
+      "model_family": "Qwen3-32B",
+      "f1": 0.12335
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.31843333333333335,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2622666666666667
+    },
+    "Hipporag2": {
+      "accuracy": 0.43685,
+      "model_family": "Qwen3-32B",
+      "f1": 0.18171666666666667
+    },
+    "Memagent": {
+      "accuracy": 0.27918333333333334,
+      "model_family": "Qwen3-32B",
+      "f1": 0.13036666666666666
+    },
+    "Mem1": {
+      "accuracy": 0.12353333333333333,
+      "model_family": "Qwen3-32B",
+      "f1": 0.16081666666666666
+    },
+    "Amem": {
+      "accuracy": 0.30775,
+      "model_family": "Qwen3-32B",
+      "f1": 0.24678333333333335
+    },
+    "Mem0": {
+      "accuracy": 0.21891666666666665,
+      "model_family": "Qwen3-32B",
+      "f1": 0.22273333333333334
+    },
+    "Memorag": {
+      "accuracy": 0.4015666666666666,
+      "model_family": "Qwen3-32B",
+      "f1": 0.15636666666666668
+    },
+    "Memgpt": {
+      "accuracy": 0.291,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1203
+    },
+    "Mem-alpha": {
+      "accuracy": 0.2964333333333333,
+      "model_family": "Qwen3-32B",
+      "f1": 0.19146666666666667
+    },
+    "Memorybank": {
+      "accuracy": 0.30411666666666665,
+      "model_family": "Qwen3-32B",
+      "f1": 0.26855
+    },
+    "Simple mem": {
+      "accuracy": 0.17581666666666665,
+      "model_family": "Qwen3-32B",
+      "f1": 0.16231666666666666
+    },
+    "Long context": {
+      "accuracy": 0.48335,
+      "model_family": "Qwen3-32B",
+      "f1": 0.3447166666666666
+    }
+  },
+  "State abstraction": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.3022666666666667,
+      "model_family": "Qwen3-32B",
+      "f1": 0.15885
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.30451666666666666,
+      "model_family": "Qwen3-32B",
+      "f1": 0.25921666666666665
+    },
+    "Hipporag2": {
+      "accuracy": 0.36443333333333333,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1758333333333333
+    },
+    "Memagent": {
+      "accuracy": 0.22045,
+      "model_family": "Qwen3-32B",
+      "f1": 0.16438333333333333
+    },
+    "Mem1": {
+      "accuracy": 0.11385,
+      "model_family": "Qwen3-32B",
+      "f1": 0.21061666666666667
+    },
+    "Amem": {
+      "accuracy": 0.29383333333333334,
+      "model_family": "Qwen3-32B",
+      "f1": 0.297
+    },
+    "Mem0": {
+      "accuracy": 0.15946666666666667,
+      "model_family": "Qwen3-32B",
+      "f1": 0.22685
+    },
+    "Memorag": {
+      "accuracy": 0.3564333333333334,
+      "model_family": "Qwen3-32B",
+      "f1": 0.205
+    },
+    "Memgpt": {
+      "accuracy": 0.2680166666666667,
+      "model_family": "Qwen3-32B",
+      "f1": 0.14603333333333332
+    },
+    "Mem-alpha": {
+      "accuracy": 0.22561666666666666,
+      "model_family": "Qwen3-32B",
+      "f1": 0.21555
+    },
+    "Memorybank": {
+      "accuracy": 0.3507166666666666,
+      "model_family": "Qwen3-32B",
+      "f1": 0.30448333333333333
+    },
+    "Simple mem": {
+      "accuracy": 0.14003333333333332,
+      "model_family": "Qwen3-32B",
+      "f1": 0.16598333333333334
+    },
+    "Long context": {
+      "accuracy": 0.37979999999999997,
+      "model_family": "Qwen3-32B",
+      "f1": 0.3152333333333333
+    }
+  }
+}
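
The file above nests results as capability, then agent name, then a metrics object with `accuracy`, `model_family`, and `f1`. As a rough sketch of how such a structure could be ranked for a Top N chart, the example below sorts agents within one capability by a chosen metric. The function `rank_by_metric` and the `SAMPLE` excerpt are illustrative assumptions and are not taken from `app.py` or `visualization.py`; the three sample entries are copied from the "Recall" block above.

```python
import json

# Excerpt of data/agent_capability.json (three of the "Recall" entries).
SAMPLE = json.loads("""
{
  "Recall": {
    "Qwen3-Embedding-4B": {"accuracy": 0.47196666666666665, "model_family": "Qwen3-32B", "f1": 0.14795},
    "GRAPHRAG": {"accuracy": 0.31029999999999996, "model_family": "Qwen3-32B", "f1": 0.28025},
    "Long context": {"accuracy": 0.6036833333333333, "model_family": "Qwen3-32B", "f1": 0.4152833333333333}
  }
}
""")


def rank_by_metric(data, category, metric="accuracy", top_n=None):
    """Return agent names in one category, sorted by the metric in descending order."""
    entries = data[category]
    ranked = sorted(entries.items(), key=lambda item: item[1][metric], reverse=True)
    names = [name for name, _ in ranked]
    # top_n=None keeps every entry; otherwise keep only the first top_n names.
    return names if top_n is None else names[:top_n]


print(rank_by_metric(SAMPLE, "Recall"))
print(rank_by_metric(SAMPLE, "Recall", metric="f1", top_n=2))
```

Ranking by `accuracy` and by `f1` can give different orders, so any Top N selection depends on which metric is used.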

data/agent_domain.json ADDED

	@@ -0,0 +1,404 @@

+{
+  "GAMING": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.5157,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2195
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.5595249999999999,
+      "model_family": "Qwen3-32B",
+      "f1": 0.288175
+    },
+    "Hipporag2": {
+      "accuracy": 0.60555,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2273
+    },
+    "Memagent": {
+      "accuracy": 0.31775,
+      "model_family": "Qwen3-32B",
+      "f1": 0.22945
+    },
+    "Mem1": {
+      "accuracy": 0.225875,
+      "model_family": "Qwen3-32B",
+      "f1": 0.18155
+    },
+    "Amem": {
+      "accuracy": 0.4247,
+      "model_family": "Qwen3-32B",
+      "f1": 0.343125
+    },
+    "Mem0": {
+      "accuracy": 0.39085000000000003,
+      "model_family": "Qwen3-32B",
+      "f1": 0.346
+    },
+    "Memorag": {
+      "accuracy": 0.557625,
+      "model_family": "Qwen3-32B",
+      "f1": 0.257875
+    },
+    "Memgpt": {
+      "accuracy": 0.435425,
+      "model_family": "Qwen3-32B",
+      "f1": 0.318475
+    },
+    "Mem-alpha": {
+      "accuracy": 0.43895,
+      "model_family": "Qwen3-32B",
+      "f1": 0.319875
+    },
+    "Memorybank": {
+      "accuracy": 0.43885,
+      "model_family": "Qwen3-32B",
+      "f1": 0.325325
+    },
+    "Simple mem": {
+      "accuracy": 0.288775,
+      "model_family": "Qwen3-32B",
+      "f1": 0.163
+    },
+    "Long context": {
+      "accuracy": 0.5355,
+      "model_family": "Qwen3-32B",
+      "f1": 0.321775
+    }
+  },
+  "EMBODIED_AI": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.204325,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1353
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.1476,
+      "model_family": "Qwen3-32B",
+      "f1": 0.3799
+    },
+    "Hipporag2": {
+      "accuracy": 0.17627500000000002,
+      "model_family": "Qwen3-32B",
+      "f1": 0.181875
+    },
+    "Memagent": {
+      "accuracy": 0.10617499999999999,
+      "model_family": "Qwen3-32B",
+      "f1": 0.144975
+    },
+    "Mem1": {
+      "accuracy": 0.03355,
+      "model_family": "Qwen3-32B",
+      "f1": 0.22445
+    },
+    "Amem": {
+      "accuracy": 0.183975,
+      "model_family": "Qwen3-32B",
+      "f1": 0.3524
+    },
+    "Mem0": {
+      "accuracy": 0.11109999999999999,
+      "model_family": "Qwen3-32B",
+      "f1": 0.27005
+    },
+    "Memorag": {
+      "accuracy": 0.085425,
+      "model_family": "Qwen3-32B",
+      "f1": 0.17677500000000002
+    },
+    "Memgpt": {
+      "accuracy": 0.1122,
+      "model_family": "Qwen3-32B",
+      "f1": 0.10405
+    },
+    "Mem-alpha": {
+      "accuracy": 0.15515,
+      "model_family": "Qwen3-32B",
+      "f1": 0.23735
+    },
+    "Memorybank": {
+      "accuracy": 0.16025,
+      "model_family": "Qwen3-32B",
+      "f1": 0.426475
+    },
+    "Simple mem": {
+      "accuracy": 0.045975,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2284
+    },
+    "Long context": {
+      "accuracy": 0.48185,
+      "model_family": "Qwen3-32B",
+      "f1": 0.56
+    }
+  },
+  "WEB": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.2872,
+      "model_family": "Qwen3-32B",
+      "f1": 0.08535000000000001
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.420675,
+      "model_family": "Qwen3-32B",
+      "f1": 0.268075
+    },
+    "Hipporag2": {
+      "accuracy": 0.3761,
+      "model_family": "Qwen3-32B",
+      "f1": 0.120125
+    },
+    "Memagent": {
+      "accuracy": 0.263975,
+      "model_family": "Qwen3-32B",
+      "f1": 0.09065
+    },
+    "Mem1": {
+      "accuracy": 0.131275,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1518
+    },
+    "Amem": {
+      "accuracy": 0.391525,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2294
+    },
+    "Mem0": {
+      "accuracy": 0.2705,
+      "model_family": "Qwen3-32B",
+      "f1": 0.21675
+    },
+    "Memorag": {
+      "accuracy": 0.364975,
+      "model_family": "Qwen3-32B",
+      "f1": 0.108075
+    },
+    "Memgpt": {
+      "accuracy": 0.327975,
+      "model_family": "Qwen3-32B",
+      "f1": 0.07105
+    },
+    "Mem-alpha": {
+      "accuracy": 0.362925,
+      "model_family": "Qwen3-32B",
+      "f1": 0.15944999999999998
+    },
+    "Memorybank": {
+      "accuracy": 0.401775,
+      "model_family": "Qwen3-32B",
+      "f1": 0.23704999999999998
+    },
+    "Simple mem": {
+      "accuracy": 0.13974999999999999,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1679
+    },
+    "Long context": {
+      "accuracy": 0.554275,
+      "model_family": "Qwen3-32B",
+      "f1": 0.348075
+    }
+  },
+  "TEXT2SQL": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.4164,
+      "model_family": "Qwen3-32B",
+      "f1": 0.249325
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.21665,
+      "model_family": "Qwen3-32B",
+      "f1": 0.221675
+    },
+    "Hipporag2": {
+      "accuracy": 0.46267499999999995,
+      "model_family": "Qwen3-32B",
+      "f1": 0.26935
+    },
+    "Memagent": {
+      "accuracy": 0.245375,
+      "model_family": "Qwen3-32B",
+      "f1": 0.245375
+    },
+    "Mem1": {
+      "accuracy": 0.06465,
+      "model_family": "Qwen3-32B",
+      "f1": 0.19990000000000002
+    },
+    "Amem": {
+      "accuracy": 0.31405,
+      "model_family": "Qwen3-32B",
+      "f1": 0.289625
+    },
+    "Mem0": {
+      "accuracy": 0.1192,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2326
+    },
+    "Memorag": {
+      "accuracy": 0.619,
+      "model_family": "Qwen3-32B",
+      "f1": 0.296475
+    },
+    "Memgpt": {
+      "accuracy": 0.206875,
+      "model_family": "Qwen3-32B",
+      "f1": 0.178975
+    },
+    "Mem-alpha": {
+      "accuracy": 0.30065,
+      "model_family": "Qwen3-32B",
+      "f1": 0.26505
+    },
+    "Memorybank": {
+      "accuracy": 0.23855,
+      "model_family": "Qwen3-32B",
+      "f1": 0.28355
+    },
+    "Simple mem": {
+      "accuracy": 0.192575,
+      "model_family": "Qwen3-32B",
+      "f1": 0.157225
+    },
+    "Long context": {
+      "accuracy": 0.456075,
+      "model_family": "Qwen3-32B",
+      "f1": 0.295275
+    }
+  },
+  "OPENWORLD_QA": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.399125,
+      "model_family": "Qwen3-32B",
+      "f1": 0.0837
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.31845,
+      "model_family": "Qwen3-32B",
+      "f1": 0.22635
+    },
+    "Hipporag2": {
+      "accuracy": 0.45825,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2362
+    },
+    "Memagent": {
+      "accuracy": 0.158225,
+      "model_family": "Qwen3-32B",
+      "f1": 0.0704
+    },
+    "Mem1": {
+      "accuracy": 0.12065000000000001,
+      "model_family": "Qwen3-32B",
+      "f1": 0.15005
+    },
+    "Amem": {
+      "accuracy": 0.29359999999999997,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2079
+    },
+    "Mem0": {
+      "accuracy": 0.16197499999999998,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1604
+    },
+    "Memorag": {
+      "accuracy": 0.411375,
+      "model_family": "Qwen3-32B",
+      "f1": 0.093675
+    },
+    "Memgpt": {
+      "accuracy": 0.3155,
+      "model_family": "Qwen3-32B",
+      "f1": 0.0595
+    },
+    "Mem-alpha": {
+      "accuracy": 0.2301,
+      "model_family": "Qwen3-32B",
+      "f1": 0.13345
+    },
+    "Memorybank": {
+      "accuracy": 0.3486,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2519
+    },
+    "Simple mem": {
+      "accuracy": 0.12154999999999999,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1312
+    },
+    "Long context": {
+      "accuracy": 0.49785,
+      "model_family": "Qwen3-32B",
+      "f1": 0.3349
+    }
+  },
+  "SOFTWARE": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.599025,
+      "model_family": "Qwen3-32B",
+      "f1": 0.083575
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.348875,
+      "model_family": "Qwen3-32B",
+      "f1": 0.229825
+    },
+    "Hipporag2": {
+      "accuracy": 0.5299,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1279
+    },
+    "Memagent": {
+      "accuracy": 0.53965,
+      "model_family": "Qwen3-32B",
+      "f1": 0.09085
+    },
+    "Mem1": {
+      "accuracy": 0.18595,
+      "model_family": "Qwen3-32B",
+      "f1": 0.17527500000000001
+    },
+    "Amem": {
+      "accuracy": 0.29615,
+      "model_family": "Qwen3-32B",
+      "f1": 0.20395
+    },
+    "Mem0": {
+      "accuracy": 0.2366,
+      "model_family": "Qwen3-32B",
+      "f1": 0.176975
+    },
+    "Memorag": {
+      "accuracy": 0.55005,
+      "model_family": "Qwen3-32B",
+      "f1": 0.10707499999999999
+    },
+    "Memgpt": {
+      "accuracy": 0.599125,
+      "model_family": "Qwen3-32B",
+      "f1": 0.066575
+    },
+    "Mem-alpha": {
+      "accuracy": 0.3476,
+      "model_family": "Qwen3-32B",
+      "f1": 0.12492500000000001
+    },
+    "Memorybank": {
+      "accuracy": 0.5072,
+      "model_family": "Qwen3-32B",
+      "f1": 0.240875
+    },
+    "Simple mem": {
+      "accuracy": 0.2431,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2005
+    },
+    "Long context": {
+      "accuracy": 0.4847,
+      "model_family": "Qwen3-32B",
+      "f1": 0.267725
+    }
+  }
+}

data/method_data.json DELETED

@@ -1,160 +0,0 @@
-{
-  "title": "Performance comparison of Agent Memory and RAG methods (base model: Qwen-32B) on real-world subset",
-  "metrics": ["Recall", "Causal Inference", "State Updating", "State Abstraction", "Average"],
-  "entries": [
-    {
-      "method": "BM25",
-      "category": "RAG",
-      "scores": {
-        "Recall": {"accuracy": 0.3301, "f1": 0.1465},
-        "Causal Inference": {"accuracy": 0.4264, "f1": 0.1549},
-        "State Updating": {"accuracy": 0.3450, "f1": 0.1325},
-        "State Abstraction": {"accuracy": 0.2498, "f1": 0.1623},
-        "Average": {"accuracy": 0.3436, "f1": 0.1475}
-      }
-    },
-    {
-      "method": "Qwen3-Emb-4B",
-      "category": "RAG",
-      "scores": {
-        "Recall": {"accuracy": 0.4843, "f1": 0.1590},
-        "Causal Inference": {"accuracy": 0.4974, "f1": 0.1549},
-        "State Updating": {"accuracy": 0.3520, "f1": 0.1353},
-        "State Abstraction": {"accuracy": 0.3011, "f1": 0.1610},
-        "Average": {"accuracy": 0.4227, "f1": 0.1522}
-      }
-    },
-    {
-      "method": "GraphRAG",
-      "category": "RAG",
-      "scores": {
-        "Recall": {"accuracy": 0.3077, "f1": 0.2769},
-        "Causal Inference": {"accuracy": 0.3905, "f1": 0.2634},
-        "State Updating": {"accuracy": 0.3140, "f1": 0.2551},
-        "State Abstraction": {"accuracy": 0.2879, "f1": 0.2588},
-        "Average": {"accuracy": 0.3258, "f1": 0.2650}
-      }
-    },
-    {
-      "method": "HippoRAG2",
-      "category": "RAG",
-      "scores": {
-        "Recall": {"accuracy": 0.4579, "f1": 0.2356},
-        "Causal Inference": {"accuracy": 0.5080, "f1": 0.1966},
-        "State Updating": {"accuracy": 0.4403, "f1": 0.1892},
-        "State Abstraction": {"accuracy": 0.3538, "f1": 0.1785},
-        "Average": {"accuracy": 0.4480, "f1": 0.2048}
-      }
-    },
-    {
-      "method": "MemAgent",
-      "category": "Agent Memory",
-      "scores": {
-        "Recall": {"accuracy": 0.2550, "f1": 0.1489},
-        "Causal Inference": {"accuracy": 0.3380, "f1": 0.1606},
-        "State Updating": {"accuracy": 0.2849, "f1": 0.1432},
-        "State Abstraction": {"accuracy": 0.2202, "f1": 0.1655},
-        "Average": {"accuracy": 0.2768, "f1": 0.1530}
-      }
-    },
-    {
-      "method": "Mem1",
-      "category": "Agent Memory",
-      "scores": {
-        "Recall": {"accuracy": 0.1180, "f1": 0.1857},
-        "Causal Inference": {"accuracy": 0.1427, "f1": 0.1732},
-        "State Updating": {"accuracy": 0.1205, "f1": 0.1659},
-        "State Abstraction": {"accuracy": 0.1080, "f1": 0.2042},
-        "Average": {"accuracy": 0.1229, "f1": 0.1807}
-      }
-    },
-    {
-      "method": "Amem",
-      "category": "Agent Memory",
-      "scores": {
-        "Recall": {"accuracy": 0.3084, "f1": 0.2707},
-        "Causal Inference": {"accuracy": 0.3653, "f1": 0.2731},
-        "State Updating": {"accuracy": 0.3088, "f1": 0.2480},
-        "State Abstraction": {"accuracy": 0.2873, "f1": 0.2953},
-        "Average": {"accuracy": 0.3186, "f1": 0.2695}
-      }
-    },
-    {
-      "method": "Mem0",
-      "category": "Agent Memory",
-      "scores": {
-        "Recall": {"accuracy": 0.2011, "f1": 0.2413},
-        "Causal Inference": {"accuracy": 0.2645, "f1": 0.2443},
-        "State Updating": {"accuracy": 0.2101, "f1": 0.2225},
-        "State Abstraction": {"accuracy": 0.1516, "f1": 0.2241},
-        "Average": {"accuracy": 0.2104, "f1": 0.2343}
-      }
-    },
-    {
-      "method": "MemoRAG",
-      "category": "Agent Memory",
-      "scores": {
-        "Recall": {"accuracy": 0.4708, "f1": 0.1789},
-        "Causal Inference": {"accuracy": 0.5497, "f1": 0.1811},
-        "State Updating": {"accuracy": 0.4257, "f1": 0.1713},
-        "State Abstraction": {"accuracy": 0.3659, "f1": 0.2073},
-        "Average": {"accuracy": 0.4606, "f1": 0.1822}
-      }
-    },
-    {
-      "method": "MemGPT",
-      "category": "Agent Memory",
-      "scores": {
-        "Recall": {"accuracy": 0.3289, "f1": 0.1318},
-        "Causal Inference": {"accuracy": 0.4404, "f1": 0.1475},
-        "State Updating": {"accuracy": 0.2809, "f1": 0.1259},
-        "State Abstraction": {"accuracy": 0.2526, "f1": 0.1431},
-        "Average": {"accuracy": 0.3304, "f1": 0.1359}
-      }
-    },
-    {
-      "method": "Mem-alpha",
-      "category": "Agent Memory",
-      "scores": {
-        "Recall": {"accuracy": 0.2876, "f1": 0.2325},
-        "Causal Inference": {"accuracy": 0.4172, "f1": 0.1993},
-        "State Updating": {"accuracy": 0.3064, "f1": 0.2000},
-        "State Abstraction": {"accuracy": 0.2171, "f1": 0.2135},
-        "Average": {"accuracy": 0.3117, "f1": 0.2130}
-      }
-    },
-    {
-      "method": "MemoryBank",
-      "category": "Agent Memory",
-      "scores": {
-        "Recall": {"accuracy": 0.3231, "f1": 0.3128},
-        "Causal Inference": {"accuracy": 0.4100, "f1": 0.2861},
-        "State Updating": {"accuracy": 0.3006, "f1": 0.2678},
-        "State Abstraction": {"accuracy": 0.3332, "f1": 0.3011},
-        "Average": {"accuracy": 0.3397, "f1": 0.2928}
-      }
-    },
-    {
-      "method": "Simple Mem",
-      "category": "Agent Memory",
-      "scores": {
-        "Recall": {"accuracy": 0.2012, "f1": 0.2039},
-        "Causal Inference": {"accuracy": 0.1884, "f1": 0.1612},
-        "State Updating": {"accuracy": 0.1764, "f1": 0.1594},
-        "State Abstraction": {"accuracy": 0.1373, "f1": 0.1689},
-        "Average": {"accuracy": 0.1811, "f1": 0.1764}
-      }
-    },
-    {
-      "method": "AMA Agent",
-      "category": "Agent Memory",
-      "scores": {
-        "Recall": {"accuracy": 0.6238, "f1": 0.3280},
-        "Causal Inference": {"accuracy": 0.6145, "f1": 0.3103},
-        "State Updating": {"accuracy": 0.5305, "f1": 0.2625},
-        "State Abstraction": {"accuracy": 0.4719, "f1": 0.2825},
-        "Average": {"accuracy": 0.5722, "f1": 0.2992}
-      }
-    }
-  ]
-}

data/model_capability.json ADDED

	@@ -0,0 +1,586 @@

+{
+  "Recall": {
+    "Claude Haiku 3.5": {
+      "accuracy": 0.48456666666666665,
+      "f1": 0.35600000000000004
+    },
+    "OpenAI GPT-5.1 mini": {
+      "accuracy": 0.6773166666666667,
+      "f1": 0.397
+    },
+    "gpt 5.2": {
+      "accuracy": 0.7655,
+      "f1": 0.4805333333333333
+    },
+    "Gemini 2.5 flash": {
+      "accuracy": 0.5763333333333334,
+      "f1": 0.3706
+    },
+    "Qwen2.5-14B-Instruct-1M": {
+      "accuracy": 0.5497833333333334,
+      "f1": 0.41873333333333335
+    },
+    "Qwen3-32B": {
+      "accuracy": 0.6036833333333333,
+      "f1": 0.4152833333333333
+    },
+    "Qwen3-14B": {
+      "accuracy": 0.5599999999999999,
+      "f1": 0.37024999999999997
+    },
+    "Qwen3-8B": {
+      "accuracy": 0.49710000000000004,
+      "f1": 0.3894333333333333
+    },
+    "BM25 (32B)": {
+      "accuracy": 0.3209,
+      "f1": 0.13673333333333335
+    },
+    "Qwen3-Embedding-4B (32B)": {
+      "accuracy": 0.47196666666666665,
+      "f1": 0.14795
+    },
+    "GRAPHRAG (32B)": {
+      "accuracy": 0.31029999999999996,
+      "f1": 0.28025
+    },
+    "Hipporag2 (32B)": {
+      "accuracy": 0.4413833333333333,
+      "f1": 0.23165
+    },
+    "Memagent (32B)": {
+      "accuracy": 0.2511333333333334,
+      "f1": 0.13931666666666667
+    },
+    "Mem1 (32B)": {
+      "accuracy": 0.12108333333333333,
+      "f1": 0.18071666666666666
+    },
+    "Amem (32B)": {
+      "accuracy": 0.29723333333333335,
+      "f1": 0.26671666666666666
+    },
+    "Mem0 (32B)": {
+      "accuracy": 0.20451666666666668,
+      "f1": 0.24041666666666664
+    },
+    "Memorag (32B)": {
+      "accuracy": 0.44153333333333333,
+      "f1": 0.16653333333333334
+    },
+    "Memgpt (32B)": {
+      "accuracy": 0.32865,
+      "f1": 0.12778333333333333
+    },
+    "Mem-alpha (32B)": {
+      "accuracy": 0.28221666666666667,
+      "f1": 0.2279
+    },
+    "Memorybank (32B)": {
+      "accuracy": 0.32088333333333335,
+      "f1": 0.31371666666666664
+    },
+    "Simple mem (32B)": {
+      "accuracy": 0.18241666666666667,
+      "f1": 0.20383333333333334
+    },
+    "AMA-agent (Ours) (32B)": {
+      "accuracy": 0.6319833333333333,
+      "f1": 0.32741666666666663
+    },
+    "BM25 (8B)": {
+      "accuracy": 0.3297666666666667,
+      "f1": 0.12873333333333334
+    },
+    "Qwen3-Embedding-4B (8B)": {
+      "accuracy": 0.4556166666666666,
+      "f1": 0.13745
+    },
+    "GRAPHRAG (8B)": {
+      "accuracy": 0.239,
+      "f1": 0.23536666666666664
+    },
+    "Hipporag2 (8B)": {
+      "accuracy": 0.34790000000000004,
+      "f1": 0.20298333333333332
+    },
+    "Memagent (8B)": {
+      "accuracy": 0.18251666666666666,
+      "f1": 0.13096666666666668
+    },
+    "Mem1 (8B)": {
+      "accuracy": 0.14309999999999998,
+      "f1": 0.14278333333333335
+    },
+    "Amem (8B)": {
+      "accuracy": 0.3001,
+      "f1": 0.25503333333333333
+    },
+    "Mem0 (8B)": {
+      "accuracy": 0.2809,
+      "f1": 0.23186666666666667
+    },
+    "Memgpt (8B)": {
+      "accuracy": 0.28455,
+      "f1": 0.11388333333333334
+    },
+    "Mem-alpha (8B)": {
+      "accuracy": 0.20241666666666666,
+      "f1": 0.21398333333333333
+    },
+    "Memorag (8B)": {
+      "accuracy": 0.37543333333333334,
+      "f1": 0.1662
+    },
+    "Memorybank (8B)": {
+      "accuracy": 0.23948333333333335,
+      "f1": 0.28055
+    },
+    "Simple mem (8B)": {
+      "accuracy": 0.17913333333333334,
+      "f1": 0.1920833333333333
+    },
+    "AMA-agent (Ours) (8B)": {
+      "accuracy": 0.60195,
+      "f1": 0.3065
+    }
+  },
+  "Casual Inference": {
+    "Claude Haiku 3.5": {
+      "accuracy": 0.4799333333333333,
+      "f1": 0.29278333333333334
+    },
+    "OpenAI GPT-5.1 mini": {
+      "accuracy": 0.7091333333333334,
+      "f1": 0.3001666666666667
+    },
+    "gpt 5.2": {
+      "accuracy": 0.7995166666666668,
+      "f1": 0.35365
+    },
+    "Gemini 2.5 flash": {
+      "accuracy": 0.49951666666666666,
+      "f1": 0.26445
+    },
+    "Qwen2.5-14B-Instruct-1M": {
+      "accuracy": 0.4305666666666667,
+      "f1": 0.3269
+    },
+    "Qwen3-32B": {
+      "accuracy": 0.5399999999999999,
+      "f1": 0.34326666666666666
+    },
+    "Qwen3-14B": {
+      "accuracy": 0.46735,
+      "f1": 0.3073
+    },
+    "Qwen3-8B": {
+      "accuracy": 0.39735000000000004,
+      "f1": 0.29578333333333334
+    },
+    "BM25 (32B)": {
+      "accuracy": 0.42081666666666667,
+      "f1": 0.14131666666666667
+    },
+    "Qwen3-Embedding-4B (32B)": {
+      "accuracy": 0.48618333333333336,
+      "f1": 0.14101666666666665
+    },
+    "GRAPHRAG (32B)": {
+      "accuracy": 0.4079333333333333,
+      "f1": 0.27426666666666666
+    },
+    "Hipporag2 (32B)": {
+      "accuracy": 0.4965,
+      "f1": 0.1859666666666667
+    },
+    "Memagent (32B)": {
+      "accuracy": 0.33666666666666667,
+      "f1": 0.14706666666666665
+    },
+    "Mem1 (32B)": {
+      "accuracy": 0.1495,
+      "f1": 0.1698666666666667
+    },
+    "Amem (32B)": {
+      "accuracy": 0.37051666666666666,
+      "f1": 0.27376666666666666
+    },
+    "Mem0 (32B)": {
+      "accuracy": 0.27725,
+      "f1": 0.24518333333333334
+    },
+    "Memorag (32B)": {
+      "accuracy": 0.5261,
+      "f1": 0.16540000000000002
+    },
+    "Memgpt (32B)": {
+      "accuracy": 0.4437333333333333,
+      "f1": 0.1383
+    },
+    "Mem-alpha (32B)": {
+      "accuracy": 0.4193166666666667,
+      "f1": 0.19181666666666666
+    },
+    "Memorybank (32B)": {
+      "accuracy": 0.42110000000000003,
+      "f1": 0.2900333333333333
+    },
+    "Simple mem (32B)": {
+      "accuracy": 0.18955,
+      "f1": 0.16668333333333332
+    },
+    "AMA-agent (Ours) (32B)": {
+      "accuracy": 0.6169833333333333,
+      "f1": 0.30663333333333337
+    },
+    "BM25 (8B)": {
+      "accuracy": 0.43721666666666664,
+      "f1": 0.13381666666666667
+    },
+    "Qwen3-Embedding-4B (8B)": {
+      "accuracy": 0.42788333333333334,
+      "f1": 0.13291666666666666
+    },
+    "GRAPHRAG (8B)": {
+      "accuracy": 0.26385,
+      "f1": 0.2061333333333333
+    },
+    "Hipporag2 (8B)": {
+      "accuracy": 0.44411666666666666,
+      "f1": 0.18869999999999998
+    },
+    "Memagent (8B)": {
+      "accuracy": 0.29035,
+      "f1": 0.13751666666666668
+    },
+    "Mem1 (8B)": {
+      "accuracy": 0.19256666666666666,
+      "f1": 0.14903333333333332
+    },
+    "Amem (8B)": {
+      "accuracy": 0.4492833333333333,
+      "f1": 0.26935
+    },
+    "Mem0 (8B)": {
+      "accuracy": 0.34385,
+      "f1": 0.22716666666666666
+    },
+    "Memgpt (8B)": {
+      "accuracy": 0.3446833333333333,
+      "f1": 0.12268333333333332
+    },
+    "Mem-alpha (8B)": {
+      "accuracy": 0.30363333333333337,
+      "f1": 0.18689999999999998
+    },
+    "Memorag (8B)": {
+      "accuracy": 0.46485,
+      "f1": 0.16515
+    },
+    "Memorybank (8B)": {
+      "accuracy": 0.32225,
+      "f1": 0.2800833333333333
+    },
+    "Simple mem (8B)": {
+      "accuracy": 0.22571666666666668,
+      "f1": 0.17606666666666668
+    },
+    "AMA-agent (Ours) (8B)": {
+      "accuracy": 0.4806166666666667,
+      "f1": 0.23224999999999998
+    }
+  },
+  "State Updating": {
+    "Claude Haiku 3.5": {
+      "accuracy": 0.4325666666666667,
+      "f1": 0.31329999999999997
+    },
+    "OpenAI GPT-5.1 mini": {
+      "accuracy": 0.6369,
+      "f1": 0.32348333333333334
+    },
+    "gpt 5.2": {
+      "accuracy": 0.6355666666666667,
+      "f1": 0.3697333333333333
+    },
+    "Gemini 2.5 flash": {
+      "accuracy": 0.4866,
+      "f1": 0.23691666666666666
+    },
+    "Qwen2.5-14B-Instruct-1M": {
+      "accuracy": 0.4663833333333333,
+      "f1": 0.33735
+    },
+    "Qwen3-32B": {
+      "accuracy": 0.48335,
+      "f1": 0.3447166666666666
+    },
+    "Qwen3-14B": {
+      "accuracy": 0.4473666666666667,
+      "f1": 0.33188333333333336
+    },
+    "Qwen3-8B": {
+      "accuracy": 0.39466666666666667,
+      "f1": 0.32993333333333336
+    },
+    "BM25 (32B)": {
+      "accuracy": 0.33854999999999996,
+      "f1": 0.12065
+    },
+    "Qwen3-Embedding-4B (32B)": {
+      "accuracy": 0.3541,
+      "f1": 0.12335
+    },
+    "GRAPHRAG (32B)": {
+      "accuracy": 0.31843333333333335,
+      "f1": 0.2622666666666667
+    },
+    "Hipporag2 (32B)": {
+      "accuracy": 0.43685,
+      "f1": 0.18171666666666667
+    },
+    "Memagent (32B)": {
+      "accuracy": 0.27918333333333334,
+      "f1": 0.13036666666666666
+    },
+    "Mem1 (32B)": {
+      "accuracy": 0.12353333333333333,
+      "f1": 0.16081666666666666
+    },
+    "Amem (32B)": {
+      "accuracy": 0.30775,
+      "f1": 0.24678333333333335
+    },
+    "Mem0 (32B)": {
+      "accuracy": 0.21891666666666665,
+      "f1": 0.22273333333333334
+    },
+    "Memorag (32B)": {
+      "accuracy": 0.4015666666666666,
+      "f1": 0.15636666666666668
+    },
+    "Memgpt (32B)": {
+      "accuracy": 0.291,
+      "f1": 0.1203
+    },
+    "Mem-alpha (32B)": {
+      "accuracy": 0.2964333333333333,
+      "f1": 0.19146666666666667
+    },
+    "Memorybank (32B)": {
+      "accuracy": 0.30411666666666665,
+      "f1": 0.26855
+    },
+    "Simple mem (32B)": {
+      "accuracy": 0.17581666666666665,
+      "f1": 0.16231666666666666
+    },
+    "AMA-agent (Ours) (32B)": {
+      "accuracy": 0.5138666666666667,
+      "f1": 0.25103333333333333
+    },
+    "BM25 (8B)": {
+      "accuracy": 0.3229666666666667,
+      "f1": 0.11235
+    },
+    "Qwen3-Embedding-4B (8B)": {
+      "accuracy": 0.34371666666666667,
+      "f1": 0.11576666666666667
+    },
+    "GRAPHRAG (8B)": {
+      "accuracy": 0.23753333333333335,
+      "f1": 0.22826666666666665
+    },
+    "Hipporag2 (8B)": {
+      "accuracy": 0.40763333333333335,
+      "f1": 0.18355
+    },
+    "Memagent (8B)": {
+      "accuracy": 0.2063,
+      "f1": 0.1215
+    },
+    "Mem1 (8B)": {
+      "accuracy": 0.12731666666666666,
+      "f1": 0.13308333333333333
+    },
+    "Amem (8B)": {
+      "accuracy": 0.3300666666666667,
+      "f1": 0.23895
+    },
+    "Mem0 (8B)": {
+      "accuracy": 0.24305,
+      "f1": 0.20679999999999998
+    },
+    "Memgpt (8B)": {
+      "accuracy": 0.24914999999999998,
+      "f1": 0.11001666666666667
+    },
+    "Mem-alpha (8B)": {
+      "accuracy": 0.2172666666666667,
+      "f1": 0.18433333333333332
+    },
+    "Memorag (8B)": {
+      "accuracy": 0.3682666666666667,
+      "f1": 0.14901666666666666
+    },
+    "Memorybank (8B)": {
+      "accuracy": 0.22931666666666664,
+      "f1": 0.25906666666666667
+    },
+    "Simple mem (8B)": {
+      "accuracy": 0.17063333333333333,
+      "f1": 0.17784999999999998
+    },
+    "AMA-agent (Ours) (8B)": {
+      "accuracy": 0.43645,
+      "f1": 0.21893333333333334
+    }
+  },
+  "State abstraction": {
+    "Claude Haiku 3.5": {
+      "accuracy": 0.32758333333333334,
+      "f1": 0.2684166666666667
+    },
+    "OpenAI GPT-5.1 mini": {
+      "accuracy": 0.6024333333333333,
+      "f1": 0.31545
+    },
+    "gpt 5.2": {
+      "accuracy": 0.59255,
+      "f1": 0.34695000000000004
+    },
+    "Gemini 2.5 flash": {
+      "accuracy": 0.40641666666666665,
+      "f1": 0.2329
+    },
+    "Qwen2.5-14B-Instruct-1M": {
+      "accuracy": 0.3559666666666667,
+      "f1": 0.3595
+    },
+    "Qwen3-32B": {
+      "accuracy": 0.37979999999999997,
+      "f1": 0.3152333333333333
+    },
+    "Qwen3-14B": {
+      "accuracy": 0.33476666666666666,
+      "f1": 0.2716
+    },
+    "Qwen3-8B": {
+      "accuracy": 0.3063166666666667,
+      "f1": 0.27915
+    },
+    "BM25 (32B)": {
+      "accuracy": 0.25508333333333333,
+      "f1": 0.16045
+    },
+    "Qwen3-Embedding-4B (32B)": {
+      "accuracy": 0.3022666666666667,
+      "f1": 0.15885
+    },
+    "GRAPHRAG (32B)": {
+      "accuracy": 0.30451666666666666,
+      "f1": 0.25921666666666665
+    },
+    "Hipporag2 (32B)": {
+      "accuracy": 0.36443333333333333,
+      "f1": 0.1758333333333333
+    },
+    "Memagent (32B)": {
+      "accuracy": 0.22045,
+      "f1": 0.16438333333333333
+    },
+    "Mem1 (32B)": {
+      "accuracy": 0.11385,
+      "f1": 0.21061666666666667
+    },
+    "Amem (32B)": {
+      "accuracy": 0.29383333333333334,
+      "f1": 0.297
+    },
+    "Mem0 (32B)": {
+      "accuracy": 0.15946666666666667,
+      "f1": 0.22685
+    },
+    "Memorag (32B)": {
+      "accuracy": 0.3564333333333334,
+      "f1": 0.205
+    },
+    "Memgpt (32B)": {
+      "accuracy": 0.2680166666666667,
+      "f1": 0.14603333333333332
+    },
+    "Mem-alpha (32B)": {
+      "accuracy": 0.22561666666666666,
+      "f1": 0.21555
+    },
+    "Memorybank (32B)": {
+      "accuracy": 0.3507166666666666,
+      "f1": 0.30448333333333333
+    },
+    "Simple mem (32B)": {
+      "accuracy": 0.14003333333333332,
+      "f1": 0.16598333333333334
+    },
+    "AMA-agent (Ours) (32B)": {
+      "accuracy": 0.4688666666666667,
+      "f1": 0.2747
+    },
+    "BM25 (8B)": {
+      "accuracy": 0.27895,
+      "f1": 0.14856666666666665
+    },
+    "Qwen3-Embedding-4B (8B)": {
+      "accuracy": 0.2748333333333333,
+      "f1": 0.14676666666666668
+    },
+    "GRAPHRAG (8B)": {
+      "accuracy": 0.22055,
+      "f1": 0.19723333333333334
+    },
+    "Hipporag2 (8B)": {
+      "accuracy": 0.292,
+      "f1": 0.17103333333333334
+    },
+    "Memagent (8B)": {
+      "accuracy": 0.14305,
+      "f1": 0.15775
+    },
+    "Mem1 (8B)": {
+      "accuracy": 0.1189,
+      "f1": 0.1691666666666667
+    },
+    "Amem (8B)": {
+      "accuracy": 0.31046666666666667,
+      "f1": 0.25876666666666664
+    },
+    "Mem0 (8B)": {
+      "accuracy": 0.2598,
+      "f1": 0.19686666666666666
+    },
+    "Memgpt (8B)": {
+      "accuracy": 0.24563333333333334,
+      "f1": 0.11535
+    },
+    "Mem-alpha (8B)": {
+      "accuracy": 0.20698333333333332,
+      "f1": 0.21046666666666666
+    },
+    "Memorag (8B)": {
+      "accuracy": 0.32411666666666666,
+      "f1": 0.1984
+    },
+    "Memorybank (8B)": {
+      "accuracy": 0.32095,
+      "f1": 0.28145
+    },
+    "Simple mem (8B)": {
+      "accuracy": 0.17876666666666666,
+      "f1": 0.15215
+    },
+    "AMA-agent (Ours) (8B)": {
+      "accuracy": 0.37873333333333337,
+      "f1": 0.21493333333333334
+    }
+  }
+}
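The file above nests scores as capability, then system name, then an `accuracy`/`f1` pair. A minimal sketch of how a "Top N" view could be derived from that layout is shown below. It assumes only the two-level structure visible in the diff; the helper name `top_n` and the `SAMPLE` subset are illustrative and are not taken from `app.py` or `visualization.py`.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, Dict[str, Dict[str, float]]]

# A few "State abstraction" entries copied from the JSON above (illustrative subset).
SAMPLE: Scores = {
    "State abstraction": {
        "OpenAI GPT-5.1 mini": {"accuracy": 0.6024333333333333, "f1": 0.31545},
        "gpt 5.2": {"accuracy": 0.59255, "f1": 0.34695000000000004},
        "Qwen3-32B": {"accuracy": 0.37979999999999997, "f1": 0.3152333333333333},
        "AMA-agent (Ours) (32B)": {"accuracy": 0.4688666666666667, "f1": 0.2747},
        "Mem1 (32B)": {"accuracy": 0.11385, "f1": 0.21061666666666667},
    }
}


def top_n(data: Scores, capability: str, n: int = 8,
          metric: str = "accuracy") -> List[Tuple[str, float]]:
    """Return the n best (name, score) pairs for one capability, best first.

    Hypothetical helper: the leaderboard's real selection logic is not shown here.
    """
    entries = data.get(capability, {})
    ranked = sorted(entries.items(), key=lambda kv: kv[1][metric], reverse=True)
    return [(name, scores[metric]) for name, scores in ranked[:n]]


if __name__ == "__main__":
    for name, acc in top_n(SAMPLE, "State abstraction", n=3):
        print(f"{name:<28} {acc:.4f}")
```

Sorting on a single metric keeps the ranking easy to reason about; passing `metric="f1"` reorders the same entries without touching the data.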

data/model_data.json DELETED Viewed

@@ -1,94 +0,0 @@
-{
-  "title": "Performance of different models on real-world subset",
-  "metrics": ["Recall", "Causal Inference", "State Updating", "State Abstraction", "Average"],
-  "entries": [
-    {
-      "method": "Claude Haiku 3.5",
-      "category": null,
-      "scores": {
-        "Recall": {"accuracy": 0.4943, "f1": 0.3510},
-        "Causal Inference": {"accuracy": 0.4507, "f1": 0.2792},
-        "State Updating": {"accuracy": 0.4287, "f1": 0.3015},
-        "State Abstraction": {"accuracy": 0.3090, "f1": 0.2648},
-        "Average": {"accuracy": 0.4361, "f1": 0.3067}
-      }
-    },
-    {
-      "method": "GPT-5-mini",
-      "category": null,
-      "scores": {
-        "Recall": {"accuracy": 0.6951, "f1": 0.4010},
-        "Causal Inference": {"accuracy": 0.7157, "f1": 0.3027},
-        "State Updating": {"accuracy": 0.6575, "f1": 0.3288},
-        "State Abstraction": {"accuracy": 0.6235, "f1": 0.3262},
-        "Average": {"accuracy": 0.6784, "f1": 0.3464}
-      }
-    },
-    {
-      "method": "GPT 5.2",
-      "category": null,
-      "scores": {
-        "Recall": {"accuracy": 0.7741, "f1": 0.4758},
-        "Causal Inference": {"accuracy": 0.8047, "f1": 0.3512},
-        "State Updating": {"accuracy": 0.6563, "f1": 0.3686},
-        "State Abstraction": {"accuracy": 0.6037, "f1": 0.3582},
-        "Average": {"accuracy": 0.7226, "f1": 0.3988}
-      }
-    },
-    {
-      "method": "Gemini 2.5 Flash",
-      "category": null,
-      "scores": {
-        "Recall": {"accuracy": 0.5834, "f1": 0.3682},
-        "Causal Inference": {"accuracy": 0.5087, "f1": 0.2628},
-        "State Updating": {"accuracy": 0.5000, "f1": 0.2395},
-        "State Abstraction": {"accuracy": 0.4196, "f1": 0.2361},
-        "Average": {"accuracy": 0.5168, "f1": 0.2878}
-      }
-    },
-    {
-      "method": "Qwen2.5-14B-1M",
-      "category": null,
-      "scores": {
-        "Recall": {"accuracy": 0.5570, "f1": 0.4157},
-        "Causal Inference": {"accuracy": 0.4111, "f1": 0.3209},
-        "State Updating": {"accuracy": 0.4728, "f1": 0.3348},
-        "State Abstraction": {"accuracy": 0.3368, "f1": 0.3560},
-        "Average": {"accuracy": 0.4638, "f1": 0.3622}
-      }
-    },
-    {
-      "method": "Qwen3-32B",
-      "category": null,
-      "scores": {
-        "Recall": {"accuracy": 0.6149, "f1": 0.4074},
-        "Causal Inference": {"accuracy": 0.5178, "f1": 0.3289},
-        "State Updating": {"accuracy": 0.4903, "f1": 0.3334},
-        "State Abstraction": {"accuracy": 0.3657, "f1": 0.3172},
-        "Average": {"accuracy": 0.5181, "f1": 0.3545}
-      }
-    },
-    {
-      "method": "Qwen3-14B",
-      "category": null,
-      "scores": {
-        "Recall": {"accuracy": 0.5675, "f1": 0.3636},
-        "Causal Inference": {"accuracy": 0.4430, "f1": 0.2931},
-        "State Updating": {"accuracy": 0.4502, "f1": 0.3204},
-        "State Abstraction": {"accuracy": 0.3176, "f1": 0.2716},
-        "Average": {"accuracy": 0.4659, "f1": 0.3203}
-      }
-    },
-    {
-      "method": "Qwen3-8B",
-      "category": null,
-      "scores": {
-        "Recall": {"accuracy": 0.5024, "f1": 0.3801},
-        "Causal Inference": {"accuracy": 0.3776, "f1": 0.2830},
-        "State Updating": {"accuracy": 0.3987, "f1": 0.3177},
-        "State Abstraction": {"accuracy": 0.2923, "f1": 0.2792},
-        "Average": {"accuracy": 0.4109, "f1": 0.3240}
-      }
-    }
-  ]
-}

data/model_domain.json ADDED Viewed

	@@ -0,0 +1,404 @@

+{
+  "GAMING": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.5157,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2195
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.5595249999999999,
+      "model_family": "Qwen3-32B",
+      "f1": 0.288175
+    },
+    "Hipporag2": {
+      "accuracy": 0.60555,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2273
+    },
+    "Memagent": {
+      "accuracy": 0.31775,
+      "model_family": "Qwen3-32B",
+      "f1": 0.22945
+    },
+    "Mem1": {
+      "accuracy": 0.225875,
+      "model_family": "Qwen3-32B",
+      "f1": 0.18155
+    },
+    "Amem": {
+      "accuracy": 0.4247,
+      "model_family": "Qwen3-32B",
+      "f1": 0.343125
+    },
+    "Mem0": {
+      "accuracy": 0.39085000000000003,
+      "model_family": "Qwen3-32B",
+      "f1": 0.346
+    },
+    "Memorag": {
+      "accuracy": 0.557625,
+      "model_family": "Qwen3-32B",
+      "f1": 0.257875
+    },
+    "Memgpt": {
+      "accuracy": 0.435425,
+      "model_family": "Qwen3-32B",
+      "f1": 0.318475
+    },
+    "Mem-alpha": {
+      "accuracy": 0.43895,
+      "model_family": "Qwen3-32B",
+      "f1": 0.319875
+    },
+    "Memorybank": {
+      "accuracy": 0.43885,
+      "model_family": "Qwen3-32B",
+      "f1": 0.325325
+    },
+    "Simple mem": {
+      "accuracy": 0.288775,
+      "model_family": "Qwen3-32B",
+      "f1": 0.163
+    },
+    "Long context": {
+      "accuracy": 0.5355,
+      "model_family": "Qwen3-32B",
+      "f1": 0.321775
+    }
+  },
+  "EMBODIED_AI": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.204325,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1353
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.1476,
+      "model_family": "Qwen3-32B",
+      "f1": 0.3799
+    },
+    "Hipporag2": {
+      "accuracy": 0.17627500000000002,
+      "model_family": "Qwen3-32B",
+      "f1": 0.181875
+    },
+    "Memagent": {
+      "accuracy": 0.10617499999999999,
+      "model_family": "Qwen3-32B",
+      "f1": 0.144975
+    },
+    "Mem1": {
+      "accuracy": 0.03355,
+      "model_family": "Qwen3-32B",
+      "f1": 0.22445
+    },
+    "Amem": {
+      "accuracy": 0.183975,
+      "model_family": "Qwen3-32B",
+      "f1": 0.3524
+    },
+    "Mem0": {
+      "accuracy": 0.11109999999999999,
+      "model_family": "Qwen3-32B",
+      "f1": 0.27005
+    },
+    "Memorag": {
+      "accuracy": 0.085425,
+      "model_family": "Qwen3-32B",
+      "f1": 0.17677500000000002
+    },
+    "Memgpt": {
+      "accuracy": 0.1122,
+      "model_family": "Qwen3-32B",
+      "f1": 0.10405
+    },
+    "Mem-alpha": {
+      "accuracy": 0.15515,
+      "model_family": "Qwen3-32B",
+      "f1": 0.23735
+    },
+    "Memorybank": {
+      "accuracy": 0.16025,
+      "model_family": "Qwen3-32B",
+      "f1": 0.426475
+    },
+    "Simple mem": {
+      "accuracy": 0.045975,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2284
+    },
+    "Long context": {
+      "accuracy": 0.48185,
+      "model_family": "Qwen3-32B",
+      "f1": 0.56
+    }
+  },
+  "WEB": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.2872,
+      "model_family": "Qwen3-32B",
+      "f1": 0.08535000000000001
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.420675,
+      "model_family": "Qwen3-32B",
+      "f1": 0.268075
+    },
+    "Hipporag2": {
+      "accuracy": 0.3761,
+      "model_family": "Qwen3-32B",
+      "f1": 0.120125
+    },
+    "Memagent": {
+      "accuracy": 0.263975,
+      "model_family": "Qwen3-32B",
+      "f1": 0.09065
+    },
+    "Mem1": {
+      "accuracy": 0.131275,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1518
+    },
+    "Amem": {
+      "accuracy": 0.391525,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2294
+    },
+    "Mem0": {
+      "accuracy": 0.2705,
+      "model_family": "Qwen3-32B",
+      "f1": 0.21675
+    },
+    "Memorag": {
+      "accuracy": 0.364975,
+      "model_family": "Qwen3-32B",
+      "f1": 0.108075
+    },
+    "Memgpt": {
+      "accuracy": 0.327975,
+      "model_family": "Qwen3-32B",
+      "f1": 0.07105
+    },
+    "Mem-alpha": {
+      "accuracy": 0.362925,
+      "model_family": "Qwen3-32B",
+      "f1": 0.15944999999999998
+    },
+    "Memorybank": {
+      "accuracy": 0.401775,
+      "model_family": "Qwen3-32B",
+      "f1": 0.23704999999999998
+    },
+    "Simple mem": {
+      "accuracy": 0.13974999999999999,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1679
+    },
+    "Long context": {
+      "accuracy": 0.554275,
+      "model_family": "Qwen3-32B",
+      "f1": 0.348075
+    }
+  },
+  "TEXT2SQL": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.4164,
+      "model_family": "Qwen3-32B",
+      "f1": 0.249325
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.21665,
+      "model_family": "Qwen3-32B",
+      "f1": 0.221675
+    },
+    "Hipporag2": {
+      "accuracy": 0.46267499999999995,
+      "model_family": "Qwen3-32B",
+      "f1": 0.26935
+    },
+    "Memagent": {
+      "accuracy": 0.245375,
+      "model_family": "Qwen3-32B",
+      "f1": 0.245375
+    },
+    "Mem1": {
+      "accuracy": 0.06465,
+      "model_family": "Qwen3-32B",
+      "f1": 0.19990000000000002
+    },
+    "Amem": {
+      "accuracy": 0.31405,
+      "model_family": "Qwen3-32B",
+      "f1": 0.289625
+    },
+    "Mem0": {
+      "accuracy": 0.1192,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2326
+    },
+    "Memorag": {
+      "accuracy": 0.619,
+      "model_family": "Qwen3-32B",
+      "f1": 0.296475
+    },
+    "Memgpt": {
+      "accuracy": 0.206875,
+      "model_family": "Qwen3-32B",
+      "f1": 0.178975
+    },
+    "Mem-alpha": {
+      "accuracy": 0.30065,
+      "model_family": "Qwen3-32B",
+      "f1": 0.26505
+    },
+    "Memorybank": {
+      "accuracy": 0.23855,
+      "model_family": "Qwen3-32B",
+      "f1": 0.28355
+    },
+    "Simple mem": {
+      "accuracy": 0.192575,
+      "model_family": "Qwen3-32B",
+      "f1": 0.157225
+    },
+    "Long context": {
+      "accuracy": 0.456075,
+      "model_family": "Qwen3-32B",
+      "f1": 0.295275
+    }
+  },
+  "OPENWORLD_QA": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.399125,
+      "model_family": "Qwen3-32B",
+      "f1": 0.0837
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.31845,
+      "model_family": "Qwen3-32B",
+      "f1": 0.22635
+    },
+    "Hipporag2": {
+      "accuracy": 0.45825,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2362
+    },
+    "Memagent": {
+      "accuracy": 0.158225,
+      "model_family": "Qwen3-32B",
+      "f1": 0.0704
+    },
+    "Mem1": {
+      "accuracy": 0.12065000000000001,
+      "model_family": "Qwen3-32B",
+      "f1": 0.15005
+    },
+    "Amem": {
+      "accuracy": 0.29359999999999997,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2079
+    },
+    "Mem0": {
+      "accuracy": 0.16197499999999998,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1604
+    },
+    "Memorag": {
+      "accuracy": 0.411375,
+      "model_family": "Qwen3-32B",
+      "f1": 0.093675
+    },
+    "Memgpt": {
+      "accuracy": 0.3155,
+      "model_family": "Qwen3-32B",
+      "f1": 0.0595
+    },
+    "Mem-alpha": {
+      "accuracy": 0.2301,
+      "model_family": "Qwen3-32B",
+      "f1": 0.13345
+    },
+    "Memorybank": {
+      "accuracy": 0.3486,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2519
+    },
+    "Simple mem": {
+      "accuracy": 0.12154999999999999,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1312
+    },
+    "Long context": {
+      "accuracy": 0.49785,
+      "model_family": "Qwen3-32B",
+      "f1": 0.3349
+    }
+  },
+  "SOFTWARE": {
+    "Qwen3-Embedding-4B": {
+      "accuracy": 0.599025,
+      "model_family": "Qwen3-32B",
+      "f1": 0.083575
+    },
+    "GRAPHRAG": {
+      "accuracy": 0.348875,
+      "model_family": "Qwen3-32B",
+      "f1": 0.229825
+    },
+    "Hipporag2": {
+      "accuracy": 0.5299,
+      "model_family": "Qwen3-32B",
+      "f1": 0.1279
+    },
+    "Memagent": {
+      "accuracy": 0.53965,
+      "model_family": "Qwen3-32B",
+      "f1": 0.09085
+    },
+    "Mem1": {
+      "accuracy": 0.18595,
+      "model_family": "Qwen3-32B",
+      "f1": 0.17527500000000001
+    },
+    "Amem": {
+      "accuracy": 0.29615,
+      "model_family": "Qwen3-32B",
+      "f1": 0.20395
+    },
+    "Mem0": {
+      "accuracy": 0.2366,
+      "model_family": "Qwen3-32B",
+      "f1": 0.176975
+    },
+    "Memorag": {
+      "accuracy": 0.55005,
+      "model_family": "Qwen3-32B",
+      "f1": 0.10707499999999999
+    },
+    "Memgpt": {
+      "accuracy": 0.599125,
+      "model_family": "Qwen3-32B",
+      "f1": 0.066575
+    },
+    "Mem-alpha": {
+      "accuracy": 0.3476,
+      "model_family": "Qwen3-32B",
+      "f1": 0.12492500000000001
+    },
+    "Memorybank": {
+      "accuracy": 0.5072,
+      "model_family": "Qwen3-32B",
+      "f1": 0.240875
+    },
+    "Simple mem": {
+      "accuracy": 0.2431,
+      "model_family": "Qwen3-32B",
+      "f1": 0.2005
+    },
+    "Long context": {
+      "accuracy": 0.4847,
+      "model_family": "Qwen3-32B",
+      "f1": 0.267725
+    }
+  }
+}
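The domain file above stores one `accuracy`/`f1` pair per method and domain, plus a `model_family` tag. One way a per-method summary row could be computed from it is sketched below. The unweighted mean across domains is an assumption made for illustration; the aggregation the leaderboard itself uses lives in `app.py`, which is not shown in this section. The `SAMPLE` subset and the name `average_over_domains` are likewise illustrative.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Two domains and two methods copied from the JSON above (illustrative subset).
SAMPLE = {
    "GAMING": {
        "Long context": {"accuracy": 0.5355, "model_family": "Qwen3-32B", "f1": 0.321775},
        "Mem1": {"accuracy": 0.225875, "model_family": "Qwen3-32B", "f1": 0.18155},
    },
    "WEB": {
        "Long context": {"accuracy": 0.554275, "model_family": "Qwen3-32B", "f1": 0.348075},
        "Mem1": {"accuracy": 0.131275, "model_family": "Qwen3-32B", "f1": 0.1518},
    },
}


def average_over_domains(data: Dict[str, Dict[str, Dict]]) -> List[Dict]:
    """Average accuracy and F1 per method over all domains (unweighted, by assumption)."""
    collected = defaultdict(lambda: {"accuracy": [], "f1": []})
    for domain_scores in data.values():
        for method, scores in domain_scores.items():
            collected[method]["accuracy"].append(scores["accuracy"])
            collected[method]["f1"].append(scores["f1"])
    rows = [
        {
            "method": method,
            "avg_accuracy": mean(values["accuracy"]),
            "avg_f1": mean(values["f1"]),
        }
        for method, values in collected.items()
    ]
    # Best average accuracy first, matching a rank-ordered summary table.
    rows.sort(key=lambda row: row["avg_accuracy"], reverse=True)
    return rows


if __name__ == "__main__":
    for rank, row in enumerate(average_over_domains(SAMPLE), start=1):
        print(f"{rank}  {row['method']:<14} {row['avg_accuracy']:.2%}  {row['avg_f1']:.2%}")
```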

gaia-leaderboard ADDED Viewed

	@@ -0,0 +1 @@


+Subproject commit d34b929801f4ff3f73aaa392d5ca593eba0766e7

lmgame_bench ADDED Viewed

	@@ -0,0 +1 @@


+Subproject commit aa854e662254e5454fea0705a6525b02620bcceb

requirements.txt CHANGED Viewed

@@ -1,4 +1,7 @@
-gradio==5.23.3
 pandas>=2.0.0
 plotly>=5.15.0
 numpy>=1.24.0

+gradio>=5.0.0
 pandas>=2.0.0
 plotly>=5.15.0
 numpy>=1.24.0
+datasets>=2.10.0
+huggingface_hub>=0.16.0
+requests>=2.28.0

scorer.py ADDED Viewed

	@@ -0,0 +1,166 @@

+"""
+Scoring functions for AMA-Bench submissions.
+This module implements evaluation logic for multiple-choice questions,
+calculating accuracy by comparing uppercase letters in answers.
+"""
+from typing import List, Dict
+def extract_uppercase_letters(text: str) -> str:
+    """
+    Extract all uppercase letters from text.
+    Used for multiple-choice answer comparison where answers are like
+    "A", "B", "AB", "ACD", etc.
+    Args:
+        text: Input text containing answer choices
+    Returns:
+        String of the distinct uppercase letters, sorted alphabetically
+    """
+    if not isinstance(text, str):
+        text = str(text)
+    # Collect every uppercase letter, including those in ordinary words
+    # (e.g. the "T" in "The"), so answers should contain only option letters
+    letters = [c for c in text if c.isupper() and c.isalpha()]
+    # Deduplicate and sort to ensure consistent ordering
+    return ''.join(sorted(set(letters)))
+def multiple_choice_accuracy(prediction: str, reference: str) -> float:
+    """
+    Calculate accuracy for multiple-choice answers.
+    Compares uppercase letters extracted from both prediction and reference.
+    Returns 1.0 if they match exactly, 0.0 otherwise.
+    Args:
+        prediction: Model's predicted answer
+        reference: Ground truth reference answer
+    Returns:
+        1.0 if exact match, 0.0 otherwise
+    """
+    pred_letters = extract_uppercase_letters(prediction)
+    ref_letters = extract_uppercase_letters(reference)
+    return 1.0 if pred_letters == ref_letters else 0.0
+def calculate_accuracy(scores: List[float]) -> Dict[str, float]:
+    """
+    Calculate accuracy metric from individual question scores.
+    Args:
+        scores: List of question scores (0.0 or 1.0)
+    Returns:
+        Dictionary with "accuracy", "count", and "correct" keys
+    """
+    if not scores:
+        return {"accuracy": 0.0, "count": 0, "correct": 0}
+    import numpy as np
+    return {
+        "accuracy": float(np.mean(scores)),
+        "count": len(scores),
+        "correct": int(sum(scores)),
+    }
+def score_submission(
+    submissions: List[Dict],
+    groundtruth: Dict[str, Dict],
+    metrics_mapping: Dict[str, str] = None
+) -> Dict:
+    """
+    Score a complete submission against ground truth.
+    Args:
+        submissions: List of submission dicts with episode_id, question, answer
+        groundtruth: Dict mapping the key "{episode_id}_{question}" to ground truth info
+        metrics_mapping: Optional dict mapping question types to metric categories
+    Returns:
+        Dictionary with overall and per-metric scores
+    """
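+    # Default metric mapping based on question type.
+    # Keys are matched below as case-insensitive substrings in insertion order,
+    # so "Abstraction" must come before "State"; otherwise a type such as
+    # "State Abstraction" would be counted under "State Updating".
+    if metrics_mapping is None:
+        metrics_mapping = {
+            "Recall": "Recall",
+            "Causal": "Causal Inference",
+            "Abstraction": "State Abstraction",
+            "State": "State Updating",
+        }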
+    # Initialize scores by metric
+    scores_by_metric = {
+        "Recall": [],
+        "Causal Inference": [],
+        "State Updating": [],
+        "State Abstraction": [],
+    }
+    all_scores = []
+    scored_submissions = []
+    for submission in submissions:
+        episode_id = submission.get("episode_id", "")
+        question = submission.get("question", "")
+        answer = submission.get("answer", "")
+        # Look up ground truth
+        key = f"{episode_id}_{question}"
+        gt_info = groundtruth.get(key)
+        if gt_info is None:
+            # Question not found in ground truth
+            score = 0.0
+            reference = ""
+            qa_type = "Unknown"
+        else:
+            reference = gt_info["answer"]
+            qa_type = gt_info.get("type", "Recall")
+            # Calculate accuracy
+            score = multiple_choice_accuracy(answer, reference)
+        # Map question type to metric category
+        metric_category = "Recall"  # default
+        for key_term, metric in metrics_mapping.items():
+            if key_term.lower() in qa_type.lower():
+                metric_category = metric
+                break
+        # Add to appropriate metric bucket
+        if metric_category in scores_by_metric:
+            scores_by_metric[metric_category].append(score)
+        all_scores.append(score)
+        # Store scored submission
+        scored_submissions.append({
+            **submission,
+            "score": score,
+            "reference_answer": reference,
+            "metric_category": metric_category,
+        })
+    # Calculate metrics for each category
+    results = {}
+    for metric_name, metric_scores in scores_by_metric.items():
+        results[metric_name] = calculate_accuracy(metric_scores)
+    # Calculate overall average
+    results["Average"] = calculate_accuracy(all_scores)
+    return {
+        "scores": results,
+        "scored_submissions": scored_submissions,
+    }
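Because `multiple_choice_accuracy` compares every uppercase letter found in the two strings, the form of the submitted answer matters. The sketch below re-states the two scoring functions from `scorer.py` in compact form (same logic, condensed for illustration) and runs them on a few hypothetical answers to show which forms match.

```python
def extract_uppercase_letters(text: str) -> str:
    """Condensed restatement of scorer.py: distinct uppercase letters, sorted."""
    if not isinstance(text, str):
        text = str(text)
    return "".join(sorted({c for c in text if c.isupper() and c.isalpha()}))


def multiple_choice_accuracy(prediction: str, reference: str) -> float:
    """1.0 when both strings yield the same letter set, else 0.0."""
    same = extract_uppercase_letters(prediction) == extract_uppercase_letters(reference)
    return 1.0 if same else 0.0


# Hypothetical (prediction, reference) pairs, chosen to illustrate edge cases.
EXAMPLES = [
    ("B", "B"),                # exact letter
    ("C, A", "AC"),            # order and separators are ignored
    ("b", "B"),                # lowercase letters are not extracted
    ("The answer is B", "B"),  # the capital "T" is extracted as well
]

if __name__ == "__main__":
    for prediction, reference in EXAMPLES:
        extracted = extract_uppercase_letters(prediction)
        score = multiple_choice_accuracy(prediction, reference)
        print(f"{prediction!r:<20} -> {extracted!r:<6} score={score}")
```

The last two cases score 0.0, which suggests that submissions should contain only the option letters, in uppercase, rather than a full sentence.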

utils.py ADDED Viewed

	@@ -0,0 +1,224 @@

+"""
+Utility functions for AMA-Bench Leaderboard.
+This module contains helper functions for:
+- DataFrame building and manipulation
+- Chart generation
+- Data validation
+"""
+import pandas as pd
+import plotly.graph_objects as go
+from typing import List, Dict
+# Metrics configuration
+METRICS = ["Recall", "Causal Inference", "State Updating", "State Abstraction"]
+ALL_METRICS = METRICS + ["Average"]
+# Chart colors moved to visualization.py
+def build_dataframe(data: Dict) -> pd.DataFrame:
+    """
+    Build a pandas DataFrame showing Accuracy for each metric.
+    Args:
+        data: Dictionary with 'entries' key containing list of results
+    Returns:
+        DataFrame with Method and metric columns
+    """
+    rows = []
+    for entry in data["entries"]:
+        row = {"Method": entry["method"]}
+        if entry.get("category"):
+            row["Category"] = entry["category"]
+        for m in ALL_METRICS:
+            accuracy = entry["scores"][m]["accuracy"]
+            row[m] = f"{accuracy:.4f}"
+        # Store raw average accuracy for sorting
+        row["_sort_avg"] = entry["scores"]["Average"]["accuracy"]
+        rows.append(row)
+    df = pd.DataFrame(rows)
+    df = df.sort_values("_sort_avg", ascending=False).reset_index(drop=True)
+    df = df.drop(columns=["_sort_avg"])
+    return df
+def add_medals(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Add medal emojis to the top-3 Method names.
+    Args:
+        df: DataFrame with 'Method' column
+    Returns:
+        DataFrame with medals added to top 3 methods
+    """
+    df = df.copy()
+    medals = ["\U0001f947", "\U0001f948", "\U0001f949"]  # 🥇 🥈 🥉
+    for i in range(min(3, len(df))):
+        df.loc[i, "Method"] = f"{medals[i]} {df.loc[i, 'Method']}"
+    return df
+def load_groundtruth(dataset_name: str, token: str = None) -> Dict[str, Dict[str, str]]:
+    """
+    Load ground truth Q&A pairs from HuggingFace dataset.
+    Expected schema in the dataset:
+    {
+      "episode_id": "string",
+      "qa_pairs": [
+        {
+          "question": "string",
+          "answer": "string",
+          "type": "string",
+          "sub_type": "string"
+        }
+      ]
+    }
+    Args:
+        dataset_name: HuggingFace dataset name (e.g., "Pettingllms/AMA-bench")
+        token: Optional HuggingFace token for private datasets
+    Returns:
+        Dictionary mapping the key "{episode_id}_{question}" to a dict with
+        "answer", "type", and "sub_type"
+    """
+    groundtruth = {}
+    try:
+        from datasets import load_dataset, VerificationMode
+        # Try loading from HuggingFace dataset
+        try:
+            dataset = load_dataset(
+                dataset_name,
+                split="test",
+                token=token,
+                verification_mode=VerificationMode.NO_CHECKS,
+                trust_remote_code=True
+            )
+            print(f"Loaded dataset from HuggingFace: {dataset_name}")
+            for row in dataset:
+                episode_id = row.get("episode_id", "")
+                qa_pairs = row.get("qa_pairs", [])
+                for qa in qa_pairs:
+                    question = qa.get("question", "")
+                    answer = qa.get("answer", "")
+                    qa_type = qa.get("type", "")
+                    # Create unique key for this Q&A pair
+                    key = f"{episode_id}_{question}"
+                    groundtruth[key] = {
+                        "answer": answer,
+                        "type": qa_type,
+                        "sub_type": qa.get("sub_type", "")
+                    }
+        except Exception as hf_error:
+            print(f"Warning: Could not load from HuggingFace ({hf_error})")
+            print("Trying local file test/test.jsonl...")
+            # Fallback to local file
+            import json
+            local_path = "test/test.jsonl"
+            try:
+                with open(local_path, 'r', encoding='utf-8') as f:
+                    for line in f:
+                        line = line.strip()
+                        if not line:
+                            continue
+                        data = json.loads(line)
+                        episode_id = data.get("episode_id", "")
+                        qa_pairs = data.get("qa_pairs", [])
+                        for qa in qa_pairs:
+                            question = qa.get("question", "")
+                            answer = qa.get("answer", "")
+                            qa_type = qa.get("type", "")
+                            # Create unique key for this Q&A pair
+                            key = f"{episode_id}_{question}"
+                            groundtruth[key] = {
+                                "answer": answer,
+                                "type": qa_type,
+                                "sub_type": qa.get("sub_type", "")
+                            }
+                print(f"Loaded from local file: {local_path}")
+            except FileNotFoundError:
+                print(f"Warning: Local ground truth file not found: {local_path}")
+            except Exception as e:
+                print(f"Warning: Error loading local ground truth: {e}")
+    except ImportError:
+        print("Warning: datasets library not available, cannot load ground truth")
+    return groundtruth
+def validate_submission_file(file_path: str) -> tuple:
+    """
+    Validate submission file format.
+    Expected format:
+    {"episode_id": "...", "question": "...", "answer": "...", ...}
+    Args:
+        file_path: Path to submission JSONL file
+    Returns:
+        Tuple of (is_valid, error_message, submissions_list)
+    """
+    import json
+    submissions = []
+    seen_pairs = set()
+    try:
+        with open(file_path, 'r', encoding='utf-8') as f:
+            for ix, line in enumerate(f):
+                line = line.strip()
+                if not line:
+                    continue
+                try:
+                    task = json.loads(line)
+                except json.JSONDecodeError:
+                    return False, f"Line {ix+1} is incorrectly formatted JSON.", []
+                # Check required fields
+                required_fields = ["episode_id", "question", "answer"]
+                for field in required_fields:
+                    if field not in task:
+                        return False, f"Line {ix+1} is missing required field '{field}'.", []
+                episode_id = task["episode_id"]
+                question = task["question"]
+                pair_key = (episode_id, question)
+                if pair_key in seen_pairs:
+                    return False, f"Line {ix+1} contains duplicate episode_id/question pair.", []
+                seen_pairs.add(pair_key)
+                submissions.append(task)
+        if len(submissions) == 0:
+            return False, "No valid submissions found in the file.", []
+        return True, "", submissions
+    except FileNotFoundError:
+        return False, "File not found.", []
+    except Exception as e:
+        return False, f"Error reading file: {str(e)}", []

validate_jsonl.py ADDED Viewed

	@@ -0,0 +1,205 @@

+#!/usr/bin/env python3
+"""
+Validate the processed JSONL file and generate statistics.
+"""
+import json
+from collections import Counter, defaultdict
+from pathlib import Path
+def validate_jsonl(file_path: Path):
+    """
+    Validate JSONL file and generate comprehensive statistics.
+    """
+    print("=" * 80)
+    print(f"Validating: {file_path}")
+    print("=" * 80)
+    print()
+    # Statistics
+    task_types = Counter()
+    domains = Counter()
+    qa_type_counts = Counter()
+    qa_subtype_counts = Counter()
+    total_qa_pairs = 0
+    success_count = 0
+    total_count = 0
+    total_turns = 0
+    total_tokens = 0
+    # Per task type statistics
+    task_type_stats = defaultdict(lambda: {
+        'count': 0,
+        'success': 0,
+        'qa_pairs': 0,
+        'total_turns': 0,
+        'total_tokens': 0
+    })
+    # Per domain statistics
+    domain_stats = defaultdict(lambda: {
+        'count': 0,
+        'success': 0,
+        'qa_pairs': 0,
+        'total_turns': 0,
+        'total_tokens': 0
+    })
+    errors = []
+    line_num = 0
+    with open(file_path, 'r', encoding='utf-8') as f:
+        for line in f:
+            line_num += 1
+            try:
+                data = json.loads(line)
+                # Validate required fields
+                required_fields = ["episode_id", "task", "task_type", "domain",
+                                 "success", "num_turns", "total_tokens",
+                                 "trajectory", "qa_pairs"]
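+                missing = [field for field in required_fields if field not in data]
+                if missing:
+                    for field in missing:
+                        errors.append(f"Line {line_num}: Missing field '{field}'")
+                    # Skip this record; the counters below need every field
+                    continue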
+                # Update counters
+                task_type = data["task_type"]
+                domain = data["domain"]
+                task_types[task_type] += 1
+                domains[domain] += 1
+                total_count += 1
+                if data["success"]:
+                    success_count += 1
+                    task_type_stats[task_type]['success'] += 1
+                    domain_stats[domain]['success'] += 1
+                num_qa = len(data["qa_pairs"])
+                total_qa_pairs += num_qa
+                task_type_stats[task_type]['qa_pairs'] += num_qa
+                task_type_stats[task_type]['count'] += 1
+                domain_stats[domain]['qa_pairs'] += num_qa
+                domain_stats[domain]['count'] += 1
+                total_turns += data["num_turns"]
+                total_tokens += data["total_tokens"]
+                task_type_stats[task_type]['total_turns'] += data["num_turns"]
+                task_type_stats[task_type]['total_tokens'] += data["total_tokens"]
+                domain_stats[domain]['total_turns'] += data["num_turns"]
+                domain_stats[domain]['total_tokens'] += data["total_tokens"]
+                # QA pairs type distribution
+                for qa in data["qa_pairs"]:
+                    qa_type = qa.get("type", "unknown")
+                    qa_type_counts[qa_type] += 1
+                    if "sub_type" in qa:
+                        qa_subtype_counts[qa["sub_type"]] += 1
+            except json.JSONDecodeError as e:
+                errors.append(f"Line {line_num}: JSON decode error - {e}")
+            except Exception as e:
+                errors.append(f"Line {line_num}: Error - {e}")
+    # Print validation results
+    if errors:
+        print("VALIDATION ERRORS:")
+        print("-" * 80)
+        for error in errors[:10]:  # Show first 10 errors
+            print(f"  {error}")
+        if len(errors) > 10:
+            print(f"  ... and {len(errors) - 10} more errors")
+        print()
+    else:
+        print("✓ No validation errors found!")
+        print()
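+    # Guard against division by zero when no record could be counted
+    if total_count == 0:
+        print("No valid records found; skipping statistics.")
+        return
+    # Print overall statistics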
+    print("OVERALL STATISTICS")
+    print("-" * 80)
+    print(f"Total records:        {total_count:>6d}")
+    print(f"Total QA pairs:       {total_qa_pairs:>6d}")
+    print(f"Successful episodes:  {success_count:>6d} ({success_count/total_count*100:>5.1f}%)")
+    print(f"Failed episodes:      {total_count - success_count:>6d} ({(total_count - success_count)/total_count*100:>5.1f}%)")
+    print(f"Total turns:          {total_turns:>6d} (avg: {total_turns/total_count:.1f})")
+    print(f"Total tokens:         {total_tokens:>6d} (avg: {total_tokens/total_count:.1f})")
+    print()
+    # Print domain distribution
+    print("DOMAIN DISTRIBUTION")
+    print("-" * 80)
+    print(f"{'Domain':<20} {'Count':>6} {'Success':>7} {'QA Pairs':>9} {'Avg Turns':>10} {'Avg Tokens':>11}")
+    print("-" * 80)
+    for domain in sorted(domains.keys()):
+        count = domain_stats[domain]['count']
+        success = domain_stats[domain]['success']
+        success_pct = (success / count * 100) if count > 0 else 0
+        qa_pairs = domain_stats[domain]['qa_pairs']
+        avg_turns = domain_stats[domain]['total_turns'] / count if count > 0 else 0
+        avg_tokens = domain_stats[domain]['total_tokens'] / count if count > 0 else 0
+        print(f"{domain:<20} {count:>6} {success_pct:>6.1f}% {qa_pairs:>9} {avg_turns:>10.1f} {avg_tokens:>11.1f}")
+    print()
+    # Print task type distribution
+    print("TASK TYPE DISTRIBUTION")
+    print("-" * 80)
+    print(f"{'Task Type':<40} {'Count':>6} {'Success':>7} {'QA Pairs':>9} {'Avg Turns':>10} {'Avg Tokens':>11}")
+    print("-" * 80)
+    for task_type in sorted(task_types.keys()):
+        count = task_type_stats[task_type]['count']
+        success = task_type_stats[task_type]['success']
+        success_pct = (success / count * 100) if count > 0 else 0
+        qa_pairs = task_type_stats[task_type]['qa_pairs']
+        avg_turns = task_type_stats[task_type]['total_turns'] / count if count > 0 else 0
+        avg_tokens = task_type_stats[task_type]['total_tokens'] / count if count > 0 else 0
+        print(f"{task_type:<40} {count:>6} {success_pct:>6.1f}% {qa_pairs:>9} {avg_turns:>10.1f} {avg_tokens:>11.1f}")
+    print()
+    # Print QA type distribution
+    print("QA TYPE DISTRIBUTION")
+    print("-" * 80)
+    print(f"{'Type':<20} {'Count':>10} {'Percentage':>12}")
+    print("-" * 80)
+    for qa_type, count in sorted(qa_type_counts.items()):
+        percentage = count / total_qa_pairs * 100 if total_qa_pairs > 0 else 0
+        print(f"{qa_type:<20} {count:>10} {percentage:>11.1f}%")
+    print()
+    # Print QA subtype distribution
+    if qa_subtype_counts:
+        print("QA SUBTYPE DISTRIBUTION")
+        print("-" * 80)
+        print(f"{'Subtype':<20} {'Count':>10} {'Percentage':>12}")
+        print("-" * 80)
+        for subtype in sorted(qa_subtype_counts.keys()):
+            count = qa_subtype_counts[subtype]
+            percentage = count / total_qa_pairs * 100 if total_qa_pairs > 0 else 0
+            print(f"{subtype:<20} {count:>10} {percentage:>11.1f}%")
+        print()
+    print("=" * 80)
+    print("Validation complete!")
+    print("=" * 80)
+if __name__ == "__main__":
+    jsonl_file = Path(__file__).parent / "processed_open_end.jsonl"
+    if not jsonl_file.exists():
+        print(f"Error: {jsonl_file} not found!")
+        print("Please run process_open_end.py first.")
+        raise SystemExit(1)
+    validate_jsonl(jsonl_file)
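
Reviewer sketch for the required-field check in validate_jsonl.py: a minimal, self-contained version of the same idea, assuming a record with any missing field should be reported once per field and then skipped entirely. The helper name `check_lines` and the blank-line tolerance are assumptions for illustration and are not part of the commit.

```python
import json
from typing import List, Tuple

# Same field list as the validator in this diff.
REQUIRED_FIELDS = ["episode_id", "task", "task_type", "domain",
                   "success", "num_turns", "total_tokens",
                   "trajectory", "qa_pairs"]


def check_lines(lines: List[str]) -> Tuple[int, List[str]]:
    """Return (number of valid records, list of error messages)."""
    valid = 0
    errors: List[str] = []
    for line_num, line in enumerate(lines, 1):
        if not line.strip():
            continue  # assumption: blank lines are ignored rather than reported
        try:
            data = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append(f"Line {line_num}: JSON decode error - {e}")
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in data]
        if missing:
            errors.extend(f"Line {line_num}: Missing field '{f}'" for f in missing)
            continue  # skip the record so later counters never see partial data
        valid += 1
    return valid, errors


good = json.dumps({f: 0 for f in REQUIRED_FIELDS})
bad = json.dumps({"episode_id": 1})
valid, errors = check_lines([good, bad, "{not json"])
print(valid, len(errors))
```

Collecting the missing fields first keeps the skip decision at the record level, which a `continue` placed inside the per-field loop cannot do.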

view_samples.py ADDED

	@@ -0,0 +1,181 @@

+#!/usr/bin/env python3
+"""
+View sample records from the processed JSONL file.
+"""
+import json
+import sys
+from pathlib import Path
+def print_record(data, show_full=False):
+    """
+    Print a single record in a readable format.
+    """
+    print("=" * 80)
+    print(f"Episode ID: {data['episode_id']}")
+    print(f"Task Type:  {data['task_type']}")
+    print(f"Domain:     {data['domain']}")
+    print(f"Success:    {data['success']}")
+    print(f"Turns:      {data['num_turns']}")
+    print(f"Tokens:     {data['total_tokens']}")
+    if data['task']:
+        task_preview = data['task'][:150]
+        print(f"\nTask:\n{task_preview}..." if len(data['task']) > 150 else f"\nTask:\n{task_preview}")
+    print(f"\nQA Pairs: {len(data['qa_pairs'])}")
+    if show_full:
+        print("\nAll QA Pairs:")
+        print("-" * 80)
+        for i, qa in enumerate(data['qa_pairs'], 1):
+            print(f"\n[{i}] Type: {qa['type']}", end="")
+            if 'sub_type' in qa:
+                print(f" / Subtype: {qa['sub_type']}")
+            else:
+                print()
+            print(f"Q: {qa['question'][:120]}...")
+            print(f"A: {qa['answer'][:120]}...")
+    else:
+        # Show first 2 QA pairs as preview
+        print("\nSample QA Pairs (first 2):")
+        print("-" * 80)
+        for i, qa in enumerate(data['qa_pairs'][:2], 1):
+            print(f"\n[{i}] Type: {qa['type']}", end="")
+            if 'sub_type' in qa:
+                print(f" / Subtype: {qa['sub_type']}")
+            else:
+                print()
+            print(f"Q: {qa['question'][:120]}...")
+            print(f"A: {qa['answer'][:120]}...")
+    if data['trajectory']:
+        print(f"\nTrajectory: {len(data['trajectory'])} turns")
+        if show_full and len(data['trajectory']) > 0:
+            print("\nFirst 3 turns:")
+            print("-" * 80)
+            for turn in data['trajectory'][:3]:
+                print(f"\nTurn {turn['turn_idx']}:")
+                action = str(turn['action'])[:100] if turn['action'] else "None"
+                observation = str(turn['observation'])[:100] if turn['observation'] else "None"
+                print(f"  Action: {action}...")
+                print(f"  Observation: {observation}...")
+    print("=" * 80)
+    print()
+def view_by_task_type(file_path: Path, task_type: str, count: int = 3):
+    """
+    View samples of a specific task type.
+    """
+    print(f"\nShowing {count} samples for task type: {task_type}\n")
+    shown = 0
+    with open(file_path, 'r', encoding='utf-8') as f:
+        for line in f:
+            data = json.loads(line)
+            if data['task_type'] == task_type:
+                print_record(data, show_full=False)
+                shown += 1
+                if shown >= count:
+                    break
+    if shown == 0:
+        print(f"No records found for task type: {task_type}")
+def view_by_index(file_path: Path, index: int):
+    """
+    View a specific record by index (0-based).
+    """
+    with open(file_path, 'r', encoding='utf-8') as f:
+        for i, line in enumerate(f):
+            if i == index:
+                data = json.loads(line)
+                print_record(data, show_full=True)
+                return
+    print(f"Index {index} not found (file has fewer records)")
+def list_task_types(file_path: Path):
+    """
+    List all unique task types in the file.
+    """
+    task_types = set()
+    with open(file_path, 'r', encoding='utf-8') as f:
+        for line in f:
+            data = json.loads(line)
+            task_types.add(data['task_type'])
+    print("\nAvailable task types:")
+    print("-" * 80)
+    for i, task_type in enumerate(sorted(task_types), 1):
+        print(f"  {i:2d}. {task_type}")
+    print()
+def main():
+    jsonl_file = Path(__file__).parent / "processed_open_end.jsonl"
+    if not jsonl_file.exists():
+        print(f"Error: {jsonl_file} not found!")
+        print("Please run process_open_end.py first.")
+        sys.exit(1)
+    # Command line interface
+    if len(sys.argv) < 2:
+        print("Usage:")
+        print("  python3 view_samples.py list                    # List all task types")
+        print("  python3 view_samples.py index <n>               # View record at index n")
+        print("  python3 view_samples.py type <task_type> [n]    # View n samples of task type (default 3)")
+        print("\nExamples:")
+        print("  python3 view_samples.py list")
+        print("  python3 view_samples.py index 0")
+        print("  python3 view_samples.py type text2sql/spider2 5")
+        return
+    command = sys.argv[1]
+    if command == "list":
+        list_task_types(jsonl_file)
+    elif command == "index":
+        if len(sys.argv) < 3:
+            print("Error: Please specify an index")
+            return
+        try:
+            index = int(sys.argv[2])
+            view_by_index(jsonl_file, index)
+        except ValueError:
+            print("Error: Index must be an integer")
+    elif command == "type":
+        if len(sys.argv) < 3:
+            print("Error: Please specify a task type")
+            return
+        task_type = sys.argv[2]
+        count = 3
+        if len(sys.argv) >= 4:
+            try:
+                count = int(sys.argv[3])
+            except ValueError:
+                print("Error: Count must be an integer")
+                return
+        view_by_task_type(jsonl_file, task_type, count)
+    else:
+        print(f"Unknown command: {command}")
+        print("Use: list, index, or type")
+if __name__ == "__main__":
+    main()
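
Reviewer sketch for the previews in view_samples.py: `print_record` appends "..." to every question, answer, action and observation, even when nothing was cut. One possible helper, assuming an ellipsis should appear only after real truncation; the name `preview` and its defaults are hypothetical and not part of the commit.

```python
def preview(text, limit: int = 120) -> str:
    """Return str(text) cut to `limit` characters, with '...' only if it was cut."""
    s = "None" if text is None else str(text)
    return s if len(s) <= limit else s[:limit] + "..."


print(preview("short answer"))
print(preview("x" * 130, limit=10))
```

Used in place of the inline slices, this would keep short answers free of a misleading trailing ellipsis.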

visualization.py ADDED

	@@ -0,0 +1,664 @@

+"""
+Visualization module for AMA-Bench leaderboard
+Adapted from lmgame_bench patterns with AMA-specific customizations
+"""
+import plotly.graph_objects as go
+import numpy as np
+import pandas as pd
+import json
+import os
+from typing import Dict, List, Optional, Tuple
+# Constants
+METRICS = ["Recall", "Causal Inference", "State Updating", "State Abstraction"]
+ALL_METRICS = METRICS + ["Average"]
+def load_model_colors(filepath: str = "assets/model_colors.json") -> Tuple[Dict[str, str], str]:
+    """
+    Load color scheme for models and methods from JSON file.
+    Args:
+        filepath: Path to color configuration JSON
+    Returns:
+        Tuple of (dictionary mapping model/method names to hex colors, fallback hex color)
+    """
+    try:
+        with open(filepath, 'r', encoding='utf-8') as f:
+            color_data = json.load(f)
+        # Merge models and methods into single dictionary
+        colors = {}
+        if 'models' in color_data:
+            colors.update(color_data['models'])
+        if 'methods' in color_data:
+            colors.update(color_data['methods'])
+        # Store fallback color
+        fallback = color_data.get('fallback', '#808080')
+        return colors, fallback
+    except Exception as e:
+        print(f"Warning: Could not load colors from {filepath}: {e}")
+        return {}, '#808080'
+def normalize_scores(values: List[float], mean: float, std: float) -> List[float]:
+    """
+    Normalize scores using z-score and scale to 0-100 range.
+    Adapted from lmgame_bench's normalize_values() function.
+    Args:
+        values: List of accuracy values (0-1 range)
+        mean: Mean value for normalization
+        std: Standard deviation for normalization
+    Returns:
+        List of normalized scores (0-100 range)
+    Formula:
+        z_score = (value - mean) / std
+        normalized = clamp((z_score * 30) + 35, 0, 100)
+    """
+    # Handle zero std case (all values are the same)
+    if std < 0.05:  # Minimum std threshold to prevent extreme values
+        std = 0.05
+    normalized = []
+    for v in values:
+        z_score = (v - mean) / std
+        scaled = (z_score * 30) + 35
+        clamped = max(0, min(100, scaled))
+        normalized.append(clamped)
+    return normalized
+def filter_by_category(data: Dict, category: str) -> Dict:
+    """
+    Filter method data by category.
+    Args:
+        data: Full dataset with entries
+        category: "All", "RAG", or "Agent Memory"
+    Returns:
+        Filtered data dictionary
+    """
+    if category == "All":
+        return data
+    filtered_data = data.copy()
+    filtered_data['entries'] = [
+        entry for entry in data['entries']
+        if entry.get('category') == category
+    ]
+    return filtered_data
+def prepare_dataframe_for_visualization(
+    data: Dict,
+    top_n: Optional[int] = None,
+    category_filter: str = "All",
+    selected_metrics: Optional[List[str]] = None
+) -> pd.DataFrame:
+    """
+    Build DataFrame with both raw and normalized scores.
+    Args:
+        data: Raw data from model_data.json or method_data.json
+        top_n: Number of top entries to include (None = all)
+        category_filter: "All", "RAG", or "Agent Memory" (for methods only)
+        selected_metrics: List of metrics to include (None = all)
+    Returns:
+        DataFrame with columns:
+        - Method/Model (name)
+        - Category (if applicable)
+        - {Metric} (raw accuracy 0-1) for each metric
+        - norm_{Metric} (normalized 0-100) for each metric
+        - Avg Normalized Score (mean of normalized scores)
+    """
+    # Filter by category first
+    if category_filter != "All":
+        data = filter_by_category(data, category_filter)
+    if not data['entries']:
+        # Return empty DataFrame if no entries
+        return pd.DataFrame()
+    # Use all metrics if none specified
+    if selected_metrics is None:
+        selected_metrics = METRICS
+    # Build basic DataFrame
+    rows = []
+    for entry in data['entries']:
+        row = {
+            'Name': entry['method'],
+        }
+        # Add category if present
+        if entry.get('category') is not None:
+            row['Category'] = entry['category']
+        # Add raw scores
+        for metric in selected_metrics:
+            score_data = entry['scores'].get(metric, {})
+            row[metric] = score_data.get('accuracy', 0.0)
+        # Add average
+        row['Average'] = entry['scores'].get('Average', {}).get('accuracy', 0.0)
+        rows.append(row)
+    df = pd.DataFrame(rows)
+    # Sort by average accuracy (descending)
+    df = df.sort_values(by='Average', ascending=False)
+    # Calculate normalization parameters from FULL dataset (before limiting)
+    norm_params = {}
+    for metric in selected_metrics:
+        values = df[metric].values
+        mean = values.mean()
+        std = values.std()
+        norm_params[metric] = (mean, std)
+    # Apply top_n limit if specified (copy so the column assignments below act on an independent frame)
+    if top_n is not None and top_n > 0:
+        df = df.head(top_n).copy()
+    # Add normalized scores
+    for metric in selected_metrics:
+        mean, std = norm_params[metric]
+        values = df[metric].values
+        df[f'norm_{metric}'] = normalize_scores(values.tolist(), mean, std)
+    # Calculate average normalized score
+    norm_cols = [f'norm_{metric}' for metric in selected_metrics]
+    df['Avg Normalized Score'] = df[norm_cols].mean(axis=1)
+    # Reset index
+    df = df.reset_index(drop=True)
+    return df
+def hex_to_rgba(hex_color: str, alpha: float = 0.2) -> str:
+    """
+    Convert hex color to RGBA with specified alpha.
+    Args:
+        hex_color: Hex color code (e.g., "#FF0000")
+        alpha: Alpha value (0-1)
+    Returns:
+        RGBA color string
+    """
+    hex_color = hex_color.lstrip('#')
+    r = int(hex_color[0:2], 16)
+    g = int(hex_color[2:4], 16)
+    b = int(hex_color[4:6], 16)
+    return f'rgba({r}, {g}, {b}, {alpha})'
+def create_radar_chart(
+    df: pd.DataFrame,
+    selected_metrics: List[str],
+    title: str = "Performance Across Metrics",
+    color_map: Optional[Dict[str, str]] = None
+) -> go.Figure:
+    """
+    Create radar chart with normalized scores.
+    Adapted from lmgame_bench's create_single_radar_chart().
+    Args:
+        df: DataFrame from prepare_dataframe_for_visualization()
+        selected_metrics: List of metric names to include as axes
+        title: Chart title
+        color_map: Dictionary mapping names to colors
+    Returns:
+        Plotly Figure with radar chart
+    Features:
+        - Each axis = one metric
+        - Each trace = one model/method
+        - Range: 0-100 (normalized)
+        - Interactive legend (click to isolate, double-click to toggle)
+    """
+    if df.empty:
+        fig = go.Figure()
+        fig.update_layout(title="No data available")
+        return fig
+    # Load colors if not provided
+    if color_map is None:
+        color_map, fallback_color = load_model_colors()
+    else:
+        fallback_color = '#808080'
+    # Check if we have normalized columns
+    norm_cols = [f'norm_{metric}' for metric in selected_metrics]
+    if not all(col in df.columns for col in norm_cols):
+        fig = go.Figure()
+        fig.update_layout(title="Missing normalized data")
+        return fig
+    fig = go.Figure()
+    # Add trace for each model/method
+    for _, row in df.iterrows():
+        name = row['Name']
+        # Get normalized values for selected metrics
+        r = [row[f'norm_{metric}'] for metric in selected_metrics]
+        # Get color
+        color = color_map.get(name, fallback_color)
+        fillcolor = hex_to_rgba(color, 0.2)
+        # Add trace
+        fig.add_trace(go.Scatterpolar(
+            r=r + [r[0]],  # Close the polygon
+            theta=selected_metrics + [selected_metrics[0]],
+            mode='lines+markers',
+            fill='toself',
+            name=name.lower(),  # Lowercase for legend
+            line=dict(color=color, width=2),
+            marker=dict(color=color, size=6),
+            fillcolor=fillcolor,
+            opacity=0.7,
+            hovertemplate='<b>%{fullData.name}</b><br>%{theta}: %{r:.1f}<extra></extra>'
+        ))
+    # Update layout
+    fig.update_layout(
+        title=dict(
+            text=title,
+            x=0.5,
+            xanchor='center',
+            font=dict(size=18)
+        ),
+        polar=dict(
+            radialaxis=dict(
+                visible=True,
+                range=[0, 100],
+                tickfont=dict(size=11),
+                gridcolor='lightgray',
+                gridwidth=1
+            ),
+            angularaxis=dict(
+                tickfont=dict(size=12, weight='bold')
+            )
+        ),
+        legend=dict(
+            font=dict(size=11),
+            title=dict(text="Models/Methods 💡", font=dict(size=12)),
+            itemsizing='trace',
+            x=1.05,
+            y=1,
+            xanchor='left',
+            yanchor='top',
+            bgcolor='rgba(255,255,255,0.6)',
+            bordercolor='gray',
+            borderwidth=1,
+            itemclick="toggleothers",
+            itemdoubleclick="toggle"
+        ),
+        height=550,
+        margin=dict(l=80, r=200, t=80, b=80)
+    )
+    return fig
+def create_group_bar_chart(
+    df: pd.DataFrame,
+    selected_metrics: List[str],
+    top_n: int = 5,
+    color_map: Optional[Dict[str, str]] = None
+) -> go.Figure:
+    """
+    Create grouped bar chart showing top N performers per metric.
+    Adapted from lmgame_bench's create_group_bar_chart().
+    Args:
+        df: DataFrame with normalized scores
+        selected_metrics: List of metrics to display
+        top_n: Number of top performers to show per metric
+        color_map: Dictionary mapping names to colors
+    Returns:
+        Plotly Figure with grouped bar chart
+    Structure:
+        - X-axis: Metrics with rank positions (e.g., "Recall #1", "Recall #2")
+        - Y-axis: Normalized score (0-100)
+        - Bars: Grouped by model/method
+    """
+    if df.empty:
+        fig = go.Figure()
+        fig.update_layout(title="No data available")
+        return fig
+    # Load colors if not provided
+    if color_map is None:
+        color_map, fallback_color = load_model_colors()
+    else:
+        fallback_color = '#808080'
+    # Check for normalized columns
+    norm_cols = [f'norm_{metric}' for metric in selected_metrics]
+    if not all(col in df.columns for col in norm_cols):
+        fig = go.Figure()
+        fig.update_layout(title="Missing normalized data")
+        return fig
+    # Build x-axis categories and data structure
+    all_x_categories = []
+    all_names = set()
+    metric_rankings = {}
+    for metric in selected_metrics:
+        norm_col = f'norm_{metric}'
+        # Get top N for this metric
+        metric_df = df[df[norm_col].notna()].copy()
+        metric_df = metric_df.sort_values(by=norm_col, ascending=False).head(top_n)
+        metric_rankings[metric] = []
+        for rank, (_, row) in enumerate(metric_df.iterrows(), 1):
+            name = row['Name']
+            score = row[norm_col]
+            x_category = f"{metric}<br>#{rank}"
+            metric_rankings[metric].append({
+                'name': name,
+                'score': score,
+                'x_category': x_category,
+                'rank': rank
+            })
+            all_x_categories.append(x_category)
+            all_names.add(name)
+    # Create traces for each model/method
+    fig = go.Figure()
+    for name in sorted(all_names):
+        x_vals = []
+        y_vals = []
+        for metric in selected_metrics:
+            # Find this model/method's data for this metric
+            for data in metric_rankings[metric]:
+                if data['name'] == name:
+                    x_vals.append(data['x_category'])
+                    y_vals.append(data['score'])
+                    break
+        if x_vals:  # Only add if has data
+            color = color_map.get(name, fallback_color)
+            fig.add_trace(go.Bar(
+                name=name,
+                x=x_vals,
+                y=y_vals,
+                marker_color=color,
+                hovertemplate="<b>%{fullData.name}</b><br>Score: %{y:.1f}<extra></extra>"
+            ))
+    # Update layout
+    fig.update_layout(
+        title=dict(
+            text=f"Top {top_n} Performers by Metric",
+            x=0.5,
+            xanchor='center',
+            font=dict(size=18)
+        ),
+        xaxis_title="Metrics (Ranked by Performance)",
+        yaxis_title="Normalized Score",
+        xaxis=dict(
+            categoryorder='array',
+            categoryarray=all_x_categories,
+            tickangle=0
+        ),
+        yaxis=dict(range=[0, 100]),
+        barmode='group',
+        bargap=0.15,
+        bargroupgap=0.1,
+        height=550,
+        margin=dict(l=60, r=200, t=80, b=80),
+        legend=dict(
+            font=dict(size=11),
+            title=dict(text="Models/Methods 💡", font=dict(size=12)),
+            itemsizing='trace',
+            x=1.05,
+            y=1,
+            xanchor='left',
+            yanchor='top',
+            bgcolor='rgba(255,255,255,0.6)',
+            bordercolor='gray',
+            borderwidth=1
+        )
+    )
+    return fig
+def create_horizontal_bar_chart(
+    df: pd.DataFrame,
+    metric: str,
+    color_map: Optional[Dict[str, str]] = None
+) -> go.Figure:
+    """
+    Create horizontal bar chart for single metric details view.
+    Adapted from lmgame_bench's create_horizontal_bar_chart().
+    Args:
+        df: DataFrame with scores
+        metric: Metric name (e.g., "Recall")
+        color_map: Dictionary mapping names to colors
+    Returns:
+        Plotly Figure with horizontal bar chart
+    Features:
+        - Y-axis: Model/method names (sorted by score, descending)
+        - X-axis: Raw accuracy score (0-1 range)
+        - Uses raw scores, not normalized
+    """
+    if df.empty or metric not in df.columns:
+        fig = go.Figure()
+        fig.update_layout(title=f"No data available for {metric}")
+        return fig
+    # Load colors if not provided
+    if color_map is None:
+        color_map, fallback_color = load_model_colors()
+    else:
+        fallback_color = '#808080'
+    # Filter and sort
+    metric_df = df[df[metric].notna()].copy()
+    metric_df = metric_df.sort_values(by=metric, ascending=True)  # Plotly draws the first row at the bottom, so the highest score ends up on top
+    if metric_df.empty:
+        fig = go.Figure()
+        fig.update_layout(title=f"No valid data for {metric}")
+        return fig
+    # Create bar chart
+    colors = [color_map.get(name, fallback_color) for name in metric_df['Name']]
+    fig = go.Figure(
+        go.Bar(
+            y=metric_df['Name'],
+            x=metric_df[metric],
+            orientation='h',
+            marker=dict(
+                color=colors,
+                line=dict(color='#2c3e50', width=1)
+            ),
+            hovertemplate='%{y}<br>Accuracy: %{x:.4f}<extra></extra>'
+        )
+    )
+    # Update layout
+    fig.update_layout(
+        title=dict(
+            text=f'{metric} - Detailed Rankings',
+            x=0.5,
+            xanchor='center',
+            font=dict(size=18)
+        ),
+        xaxis_title="Accuracy",
+        yaxis_title="Model/Method",
+        xaxis=dict(
+            range=[0, 1],
+            gridcolor='#e0e0e0'
+        ),
+        plot_bgcolor='rgba(0,0,0,0)',
+        paper_bgcolor='rgba(0,0,0,0)',
+        font=dict(color='#2c3e50'),
+        height=max(400, len(metric_df) * 30),  # Dynamic height based on entries
+        margin=dict(l=200, r=40, t=80, b=60),
+        showlegend=False
+    )
+    return fig
+def create_multi_metric_bar_chart(
+    df: pd.DataFrame,
+    selected_metrics: List[str],
+    color_map: Optional[Dict[str, str]] = None
+) -> go.Figure:
+    """
+    Create grouped horizontal bar chart showing multiple metrics for each model/method.
+    Args:
+        df: DataFrame with scores
+        selected_metrics: List of metrics to display (e.g., ["Recall", "Causal Inference"])
+        color_map: Dictionary mapping names to colors
+    Returns:
+        Plotly Figure with grouped horizontal bar chart
+    Features:
+        - Y-axis: Model/method names
+        - X-axis: Raw accuracy score (0-1 range)
+        - Multiple bars per model/method (one per selected metric)
+        - Sorted by average score across selected metrics
+    """
+    if df.empty or not selected_metrics:
+        fig = go.Figure()
+        fig.update_layout(title="No data available")
+        return fig
+    # Check if all selected metrics exist
+    missing_metrics = [m for m in selected_metrics if m not in df.columns]
+    if missing_metrics:
+        fig = go.Figure()
+        fig.update_layout(title=f"Missing metrics: {', '.join(missing_metrics)}")
+        return fig
+    # Filter to entries that have at least one selected metric
+    metric_df = df.copy()
+    metric_df = metric_df[metric_df[selected_metrics].notna().any(axis=1)]
+    if metric_df.empty:
+        fig = go.Figure()
+        fig.update_layout(title="No valid data for selected metrics")
+        return fig
+    # Calculate average score across selected metrics for sorting
+    metric_df['avg_score'] = metric_df[selected_metrics].mean(axis=1)
+    metric_df = metric_df.sort_values(by='avg_score', ascending=True)  # Plotly draws the first row at the bottom, so the highest score ends up on top
+    # Use single base color with gradient based on capability
+    base_color = "#636EFA"  # Blue color
+    # Normalize avg_score to create gradient (0.3 to 1.0 range for visibility)
+    min_score = metric_df['avg_score'].min()
+    max_score = metric_df['avg_score'].max()
+    score_range = max_score - min_score if max_score > min_score else 1
+    # Create color gradient based on model capability (higher score = deeper color)
+    def get_gradient_color(score, min_val, max_val, score_range):
+        """Generate color with gradient based on score"""
+        # Normalize to 0-1 range, then scale to 0.3-1.0 for better visibility
+        normalized = (score - min_val) / score_range if score_range > 0 else 0.5
+        intensity = 0.3 + (normalized * 0.7)  # Range: 0.3 (light) to 1.0 (deep)
+        # Convert base color to RGB and apply intensity with 50% opacity
+        hex_color = base_color.lstrip('#')
+        r = int(hex_color[0:2], 16)
+        g = int(hex_color[2:4], 16)
+        b = int(hex_color[4:6], 16)
+        # Apply intensity to RGB values
+        r = int(255 - (255 - r) * intensity)
+        g = int(255 - (255 - g) * intensity)
+        b = int(255 - (255 - b) * intensity)
+        return f'rgba({r}, {g}, {b}, 0.5)'  # 50% transparency
+    # Create grouped bar chart
+    fig = go.Figure()
+    for metric in selected_metrics:
+        # Create color array for each model based on their avg_score
+        colors = [
+            get_gradient_color(row['avg_score'], min_score, max_score, score_range)
+            for _, row in metric_df.iterrows()
+        ]
+        fig.add_trace(go.Bar(
+            name=metric,
+            y=metric_df['Name'],
+            x=metric_df[metric],
+            orientation='h',
+            marker=dict(
+                color=colors,
+                line=dict(color='#2c3e50', width=0.5)
+            ),
+            hovertemplate=f'<b>%{{y}}</b><br>{metric}: %{{x:.4f}}<extra></extra>'
+        ))
+    # Update layout
+    fig.update_layout(
+        title=dict(
+            text=f'Detailed Comparison - {", ".join(selected_metrics)}',
+            x=0.5,
+            xanchor='center',
+            font=dict(size=18)
+        ),
+        xaxis_title="Accuracy",
+        yaxis_title="Model/Method",
+        xaxis=dict(
+            range=[0, 1],
+            gridcolor='#e0e0e0'
+        ),
+        barmode='group',
+        plot_bgcolor='rgba(0,0,0,0)',
+        paper_bgcolor='rgba(0,0,0,0)',
+        font=dict(color='#2c3e50'),
+        height=max(500, len(metric_df) * 40),  # Dynamic height
+        margin=dict(l=200, r=40, t=80, b=80),
+        legend=dict(
+            orientation="h",
+            yanchor="bottom",
+            y=1.02,
+            xanchor="center",
+            x=0.5,
+            font=dict(size=12)
+        )
+    )
+    return fig
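
Reviewer sketch for the normalization in visualization.py: a small worked example of the formula from the `normalize_scores` docstring, `clamp(z * 30 + 35, 0, 100)` with a minimum std of 0.05. It uses the population standard deviation, on the assumption that this matches the NumPy default behind `values.std()` in `prepare_dataframe_for_visualization`. The sample accuracies are made up for illustration.

```python
import statistics
from typing import List


def normalize_scores(values: List[float], mean: float, std: float) -> List[float]:
    """Z-score each value, rescale with (z * 30) + 35, and clamp to [0, 100]."""
    std = max(std, 0.05)  # same minimum std threshold as the module
    return [max(0.0, min(100.0, (v - mean) / std * 30 + 35)) for v in values]


# Made-up accuracies in the 0-1 range
scores = [0.2, 0.4, 0.6]
mean = statistics.fmean(scores)
std = statistics.pstdev(scores)  # population std (ddof=0)
out = normalize_scores(scores, mean, std)
print([round(x, 2) for x in out])

# When every entry has the same score, the std floor keeps the result at the midpoint offset
flat = normalize_scores([0.5, 0.5], 0.5, 0.0)
print(flat)  # [35.0, 35.0]
```

An entry at the mean lands at 35, and anything more than about 1.17 standard deviations below the mean is clamped to 0, so the radar chart compresses the low end far more than the high end.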