Commit d8b2e03 by NorahYujieZhao · 1 Parent(s): e839e6a

the new version

UPDATES_v2.md ADDED
@@ -0,0 +1,275 @@
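+ # AMA-Bench Leaderboard Updates v2.0
+
+ ## ✅ Completed Updates
+
+ ### 1. **Summary Table Improvements**
+ - ✅ **New Rank column**: shows the rank as the first column
+ - ✅ **Medal markers**: the top three automatically get 🥇🥈🥉 medals
+ - ✅ **Categories column removed**: simplifies the table, keeping only key information
+ - ✅ **Column layout**: Rank | Agent/Model | Avg Accuracy | Avg F1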
+
+ ### 2. **Color Scheme Upgrade**
+ Updated to a more easily distinguishable color scheme, based on the original figure:
+ ```python
+ COLORS = [
+     'rgba(135, 160, 220, 0.5)',  # Light Blue
+     'rgba(230, 150, 120, 0.5)',  # Orange
+     'rgba(180, 180, 180, 0.5)',  # Gray
+     'rgba(255, 215, 100, 0.5)',  # Yellow
+     'rgba(140, 180, 220, 0.5)',  # Sky Blue
+     'rgba(140, 200, 150, 0.5)',  # Green
+     'rgba(200, 160, 140, 0.5)',  # Brown
+     'rgba(130, 140, 200, 0.5)',  # Purple-Blue
+     'rgba(255, 180, 150, 0.5)',  # Coral
+     'rgba(150, 220, 180, 0.5)',  # Mint Green
+ ]
+ ```
+
+ **Highlights**:
+ - 10 clearly distinct colors
+ - Better visual separation
+ - Suitable for radar charts and bar charts
+
+ ### 3. **Dynamic Top N Selection**
+ Every chart now has a slider control:
+ - **Range**: 1-10
+ - **Default**: 8
+ - **Live update**: dragging the slider refreshes the chart immediately
+ - **Applies to**:
+   - Agent Domain Performance (radar chart)
+   - Agent Capability Performance (2x2 bar chart)
+   - Model Domain Performance (radar chart)
+   - Model Capability Performance (2x2 bar chart)
+
+ ## 📊 New Feature Showcase
+
+ ### Summary Table Example
+ ```
+ Rank    Agent           Avg Accuracy    Avg F1
+ 🥇 1    Long context    54.21%          34.61%
+ 🥈 2    Hipporag2       44.86%          20.32%
+ 🥉 3    GRAPHRAG        34.63%          27.58%
+ 4       Memorybank      35.64%          28.59%
+ 5       Amem            33.14%          26.31%
+ ```
+
+ ### Top N Slider
+ ```
+ ┌────────────────────────────────┐
+ │ Show Top N Agents              │
+ │ ┣━━━━━━━●━━━━┫ 8               │
+ │ Select how many top agents     │
+ │ to display (1-10)              │
+ └────────────────────────────────┘
+ ```
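+
+ ## 🎨 Visual Improvements
+
+ ### Radar Chart
+ - ✅ Shows the top N best-performing entries
+ - ✅ Uses the new color scheme for easier differentiation
+ - ✅ Display count can be changed dynamically
+ - ✅ Interactivity preserved (click legend entries to toggle)
+
+ ### 2x2 Bar Chart
+ - ✅ Each subplot shows the top N entries
+ - ✅ Sorted by accuracy in descending order
+ - ✅ Uses the new color scheme
+ - ✅ Display count adjusts dynamically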
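+
+ ## 🚀 Usage
+
+ ### 1. Launch the app
+ ```bash
+ python3 app.py
+ ```
+
+ ### 2. Choose Top N
+ 1. Open any chart page
+ 2. Use the slider to choose how many entries to display (1-10)
+ 3. The chart updates automatically
+
+ ### 3. View rankings
+ 1. Open the Summary Statistics accordion
+ 2. Check the Rank column; the top three carry medal markers
+ 3. The table is sorted by Avg Accuracy in descending order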
+
+ ## 📝 Technical Details
+
+ ### Rank Calculation
+ ```python
+ # Sort by average accuracy
+ df = df.sort_values(by="_acc_sort", ascending=False)
+
+ # Add ranks and medals
+ medals = ["🥇", "🥈", "🥉"]
+ ranks = []
+ for i in range(len(df)):
+     if i < 3:
+         ranks.append(f"{medals[i]} {i+1}")
+     else:
+         ranks.append(str(i+1))
+ ```
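
The ranking step above can be tried in isolation. The following is a minimal sketch, assuming plain dictionaries instead of the pandas DataFrame used in the app; the `add_ranks` helper, the `avg_accuracy` key, and the sample rows are hypothetical and only illustrate the sort-then-label idea.

```python
def add_ranks(rows, key="avg_accuracy"):
    """Sort rows by `key` (descending) and attach a Rank label, with medals for the top 3."""
    medals = ["🥇", "🥈", "🥉"]
    ordered = sorted(rows, key=lambda r: r[key], reverse=True)
    ranked = []
    for i, row in enumerate(ordered):
        # The first three positions get a medal prefix; the rest get a bare number.
        label = f"{medals[i]} {i + 1}" if i < len(medals) else str(i + 1)
        ranked.append({"Rank": label, **row})
    return ranked


# Hypothetical sample rows, deliberately unsorted.
rows = [
    {"name": "B", "avg_accuracy": 0.40},
    {"name": "A", "avg_accuracy": 0.55},
    {"name": "D", "avg_accuracy": 0.10},
    {"name": "C", "avg_accuracy": 0.25},
]
ranked = add_ranks(rows)
for row in ranked:
    print(row["Rank"], row["name"], row["avg_accuracy"])
```

Because the labels are derived from the position after sorting, any change to the sort key changes the medals with it.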
+
+ ### Top N Filtering
+ ```python
+ # Compute the average score for each item
+ item_avg_scores = {}
+ for item in all_items:
+     scores = [...]
+     item_avg_scores[item] = np.mean(scores)
+
+ # Get the top N
+ sorted_items = sorted(item_avg_scores.items(),
+                       key=lambda x: x[1],
+                       reverse=True)
+ top_items = [item[0] for item in sorted_items[:top_n]]
+ ```
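
The snippet above leaves out how `scores` is collected. One possible reading, shown here as a hedged sketch, is an unweighted mean of each item's accuracy across all categories of the nested `{category: {item: {...}}}` data. The `top_n_items` helper and the sample values are hypothetical, and the real app may weight categories differently.

```python
from statistics import fmean


def top_n_items(data, top_n):
    """Return the names of the top_n items ranked by mean accuracy across categories.

    data has the shape {category: {item: {"accuracy": float, ...}}}.
    """
    per_item = {}
    for category_scores in data.values():
        for item, metrics in category_scores.items():
            per_item.setdefault(item, []).append(metrics["accuracy"])
    # Assumption: a plain (unweighted) mean over the categories an item appears in.
    averages = {item: fmean(values) for item, values in per_item.items()}
    ordered = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ordered[:top_n]]


# Hypothetical sample data.
sample = {
    "Game": {"A": {"accuracy": 0.50}, "B": {"accuracy": 0.60}, "C": {"accuracy": 0.20}},
    "Web": {"A": {"accuracy": 0.70}, "B": {"accuracy": 0.30}, "C": {"accuracy": 0.10}},
}
top_two = top_n_items(sample, 2)
print(top_two)
```

Asking for more items than exist simply returns all of them, since slicing past the end of a list is safe in Python.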
+
+ ### Dynamic Updates
+ ```python
+ # Update the chart when the slider changes
+ agent_domain_top_n.change(
+     fn=lambda n: create_radar_chart_from_dict(
+         AGENT_DOMAIN,
+         "Agent Performance Across Domains",
+         top_n=int(n)
+     ),
+     inputs=[agent_domain_top_n],
+     outputs=[agent_domain_chart]
+ )
+ ```
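
Since four sliders follow the same pattern, the callback could also be produced by a small factory rather than a separate lambda per chart. This is only a sketch of that idea without Gradio: `make_chart_updater` and `fake_build_chart` are hypothetical names, and the stub stands in for a real builder such as `create_radar_chart_from_dict`.

```python
def make_chart_updater(build_chart, data, title):
    """Return a callback that rebuilds one chart for a given slider value."""
    def update(n):
        # Slider values may arrive as floats, so coerce to int as the snippet above does.
        return build_chart(data, title, top_n=int(n))
    return update


def fake_build_chart(data, title, top_n=10):
    """Hypothetical stand-in for a chart builder; returns a dict instead of a figure."""
    return {"title": title, "shown": sorted(data)[:top_n]}


update_agent_domain = make_chart_updater(fake_build_chart, {"b": 1, "a": 2, "c": 3}, "Agents")
result = update_agent_domain(2.0)
print(result)
```

The returned function takes a single argument, so it has the same shape as the lambda passed to `fn=` in the snippet above.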
+
+ ## 🎯 Interface Structure
+
+ ```
+ 🤖 Agent Performance
+ ├── 🎯 Domain Performance
+ │   ├── Slider: Show Top N Agents (1-10)
+ │   ├── Radar Chart (dynamically shows Top N)
+ │   └── 📊 Summary Statistics (with Rank and medals)
+ └── ⚡ Capability Performance
+     ├── Slider: Show Top N Agents (1-10)
+     ├── 2x2 Bar Chart (Top N per subplot)
+     └── 📊 Summary Statistics (with Rank and medals)
+
+ 🔬 Model Performance
+ ├── 🎯 Domain Performance
+ │   ├── Slider: Show Top N Models (1-10)
+ │   ├── Radar Chart (dynamically shows Top N)
+ │   └── 📊 Summary Statistics (with Rank and medals)
+ └── ⚡ Capability Performance
+     ├── Slider: Show Top N Models (1-10)
+     ├── 2x2 Bar Chart (Top N per subplot)
+     └── 📊 Summary Statistics (with Rank and medals)
+
+ ℹ️ About
+ └── Full documentation
+ ```
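+
+ ## ✨ Key Features
+
+ ### 1. Smart ranking system
+ - Average scores are computed automatically
+ - Sorted by accuracy in descending order
+ - Top three get special markers (medals)
+ - Clear numeric ranks
+
+ ### 2. Flexible display control
+ - Adjustable range of 1-10
+ - Real-time response
+ - Each chart is controlled independently
+ - Top 8 shown by default
+
+ ### 3. Optimized color palette
+ - 10 clearly distinguishable colors
+ - 50% opacity (lines/markers)
+ - 15% opacity (fill areas)
+ - Follows visual design guidelines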
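
The two opacity levels listed above suggest that a fill color is derived from the same palette entry used for the line. A minimal sketch of that derivation follows, assuming the palette strings keep the `rgba(r, g, b, a)` form shown in `COLORS`; the `with_alpha` helper is hypothetical and not taken from app.py.

```python
import re


def with_alpha(rgba, alpha):
    """Return the same rgba() color string with its alpha channel replaced."""
    match = re.fullmatch(
        r"rgba\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*[\d.]+\s*\)", rgba
    )
    if match is None:
        raise ValueError(f"not an rgba() color: {rgba!r}")
    r, g, b = match.groups()
    return f"rgba({r}, {g}, {b}, {alpha})"


line_color = 'rgba(135, 160, 220, 0.5)'    # first palette entry, 50% opacity
fill_color = with_alpha(line_color, 0.15)  # same hue at 15% opacity for filled areas
print(fill_color)
```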
+
+ ### 4. Full interactivity
+ - Click legend entries to toggle visibility
+ - Double-click to isolate a single entry
+ - Hover to see detailed values
+ - Zoom and pan
+
+ ## 📈 Data Examples
+
+ ### Agent Domain JSON
+ ```json
+ {
+   "Game": {
+     "Long context": {
+       "accuracy": 0.5321,
+       "f1": 0.3285
+     },
+     "Hipporag2": {
+       "accuracy": 0.5934,
+       "f1": 0.2289
+     }
+   }
+ }
+ ```
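
To connect this JSON shape to the summary table, the sketch below averages accuracy and F1 per agent across categories and formats them as percentages. It assumes an unweighted mean, and `summary_rows` and the sample numbers are hypothetical; the app itself may apply category weights.

```python
from statistics import fmean


def summary_rows(data):
    """Build summary rows from {category: {name: {"accuracy": float, "f1": float}}}.

    Rows are sorted by average accuracy, highest first.
    """
    acc, f1 = {}, {}
    for entries in data.values():
        for name, metrics in entries.items():
            acc.setdefault(name, []).append(metrics["accuracy"])
            f1.setdefault(name, []).append(metrics["f1"])
    order = sorted(acc, key=lambda name: fmean(acc[name]), reverse=True)
    return [
        {
            "Agent": name,
            "Avg Accuracy": f"{fmean(acc[name]):.2%}",
            "Avg F1": f"{fmean(f1[name]):.2%}",
        }
        for name in order
    ]


# Hypothetical sample data with two categories.
sample = {
    "Game": {"X": {"accuracy": 0.50, "f1": 0.30}, "Y": {"accuracy": 0.60, "f1": 0.20}},
    "Web": {"X": {"accuracy": 0.70, "f1": 0.50}, "Y": {"accuracy": 0.20, "f1": 0.40}},
}
rows = summary_rows(sample)
for row in rows:
    print(row)
```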
+
+ ### Summary Table Output
+ | Rank | Agent | Avg Accuracy | Avg F1 |
+ |------|-------|--------------|--------|
+ | 🥇 1 | Long context | 54.21% | 34.61% |
+ | 🥈 2 | Hipporag2 | 44.86% | 20.32% |
+ | 🥉 3 | GRAPHRAG | 34.63% | 27.58% |
+
+ ## 🔍 What Changed
+
+ ### Old version
+ ```
+ Table columns: Agent | Avg Accuracy | Avg F1 | Categories
+ Colors: 15 similar blue-green shades
+ Display: all entries, no filtering
+ ```
+
+ ### New version
+ ```
+ Table columns: Rank | Agent | Avg Accuracy | Avg F1
+ Colors: 10 clearly distinct colors
+ Display: selectable Top 1-10, adjusts dynamically
+ Medals: 🥇🥈🥉 for top 3
+ ```
+
+ ## 💡 Usage Tips
+
+ 1. **Compare a few top performers**: set Top 3-5
+ 2. **Get a full view of performance**: set Top 8-10
+ 3. **Focus on the winner**: set Top 1
+ 4. **See detailed rankings**: expand Summary Statistics
+
+ ## 📦 Files
+
+ - **app.py** - main application file (fully rewritten)
+ - **data/agent_capability.json** - agent capability data
+ - **data/agent_domain.json** - agent domain data
+ - **data/model_capability.json** - model capability data
+ - **data/model_domain.json** - model domain data
+
+ ## 🎓 Code Highlights
+
+ ### Highly modular
+ - `create_radar_chart_from_dict()` - radar chart generation
+ - `create_capability_subplots()` - 2x2 bar chart generation
+ - `create_summary_table()` - table generation
+ - All functions accept a `top_n` parameter
+
+ ### Smart sorting
+ - Averages computed automatically
+ - Multi-dimensional sorting
+ - Medals assigned automatically
+
+ ### Responsive design
+ - Slider updates in real time
+ - No page refresh needed
+ - Smooth user experience
+
+ ---
+
+ **Version**: v2.0
+ **Last updated**: 2026-03-02
+ **Status**: ✅ All features implemented and tested
app.py CHANGED
@@ -1,324 +1,1098 @@
  import gradio as gr
  import pandas as pd
  import json
- import numpy as np
  import plotly.graph_objects as go

  # ---------------------------------------------------------------------------
  # Data loading
  # ---------------------------------------------------------------------------

- def load_data(path):
      with open(path, "r", encoding="utf-8") as f:
          return json.load(f)

- MODEL_DATA = load_data("data/model_data.json")
- METHOD_DATA = load_data("data/method_data.json")

  METRICS = ["Recall", "Causal Inference", "State Updating", "State Abstraction"]
- ALL_METRICS = METRICS + ["Average"]

  # ---------------------------------------------------------------------------
- # DataFrame helpers
  # ---------------------------------------------------------------------------

- def build_dataframe(data):
-     """Build a pandas DataFrame showing Accuracy (F1) for each metric."""
-     rows = []
-     for entry in data["entries"]:
-         row = {"Method": entry["method"]}
-         if entry.get("category"):
-             row["Category"] = entry["category"]
-         for m in ALL_METRICS:
-             acc = entry["scores"][m]["accuracy"]
-             f1 = entry["scores"][m]["f1"]
-             row[m] = f"{acc:.4f} ({f1:.4f})"
-         # Store raw average accuracy for sorting
-         row["_sort_avg"] = entry["scores"]["Average"]["accuracy"]
-         rows.append(row)

-     df = pd.DataFrame(rows)
-     df = df.sort_values("_sort_avg", ascending=False).reset_index(drop=True)
-     df = df.drop(columns=["_sort_avg"])
-     return df


- def build_chart_dataframe(data):
-     """Build a DataFrame with raw numeric Accuracy values for charting."""
-     rows = []
-     for entry in data["entries"]:
-         row = {"Method": entry["method"]}
-         for m in ALL_METRICS:
-             row[f"{m} (Acc)"] = entry["scores"][m]["accuracy"]
-         row["_sort_avg"] = entry["scores"]["Average"]["accuracy"]
-         rows.append(row)

-     df = pd.DataFrame(rows)
-     df = df.sort_values("_sort_avg", ascending=False).reset_index(drop=True)
-     df = df.drop(columns=["_sort_avg"])
-     return df


- def add_medals(df):
-     """Add medal emojis to the top-3 Method names."""
-     df = df.copy()
-     medals = ["\U0001f947", "\U0001f948", "\U0001f949"]
-     for i in range(min(3, len(df))):
-         df.loc[i, "Method"] = f"{medals[i]} {df.loc[i, 'Method']}"
-     return df


  # ---------------------------------------------------------------------------
- # Chart helpers
  # ---------------------------------------------------------------------------

- BAR_COLORS = ["#636EFA", "#EF553B", "#00CC96", "#AB63FA"]


- def make_bar_chart(chart_df, title=""):
-     """Create a grouped vertical bar chart showing Accuracy per metric."""
      fig = go.Figure()

-     for i, m in enumerate(METRICS):
-         fig.add_trace(go.Bar(
-             x=chart_df["Method"],
-             y=chart_df[f"{m} (Acc)"],
-             name=m,
-             marker_color=BAR_COLORS[i % len(BAR_COLORS)],
-         ))

-     # Wrap long titles to 2 lines
-     if len(title) > 60:
-         mid = len(title) // 2
-         space_pos = title.find(" ", mid)
-         if space_pos == -1:
-             space_pos = title.rfind(" ", 0, mid)
-         if space_pos != -1:
-             title = title[:space_pos] + "<br>" + title[space_pos + 1:]

      fig.update_layout(
-         barmode="group",
-         title=dict(text=title, x=0.5, font=dict(size=14)),
-         yaxis=dict(title="Accuracy", range=[0, 1]),
-         xaxis=dict(tickangle=-45),
-         height=500,
-         margin=dict(l=60, r=40, t=100, b=140),
          legend=dict(
-             orientation="h", yanchor="bottom", y=1.02,
-             xanchor="center", x=0.5, font=dict(size=12),
          ),
-         bargap=0.2,
-         bargroupgap=0.05,
      )
      return fig


- # ---------------------------------------------------------------------------
- # Update functions
- # ---------------------------------------------------------------------------

- def update_leaderboard(data, top_n):
-     """Return (display_df, bar_fig) for a given data source."""
-     df = build_dataframe(data)
-     chart_df = build_chart_dataframe(data)

-     df = df.head(int(top_n))
-     chart_df = chart_df.head(int(top_n))

-     display_df = add_medals(df)

-     title = data.get("title", "Score Breakdown")
-     bar = make_bar_chart(chart_df, title)

-     return display_df, bar


- def update_model_leaderboard(top_n):
-     return update_leaderboard(MODEL_DATA, top_n)


- def update_method_leaderboard(top_n):
-     return update_leaderboard(METHOD_DATA, top_n)


  # ---------------------------------------------------------------------------
- # App
  # ---------------------------------------------------------------------------

- CSS = """
- html, body {
-     overflow-y: auto !important;
-     width: 100% !important;
- }
- .gradio-container {
-     max-width: 1200px !important;
-     margin: auto !important;
- }
- .header-banner {
-     background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
-     color: white;
-     padding: 24px 32px;
-     border-radius: 12px;
-     margin-bottom: 16px;
-     text-align: center;
- }
- .header-banner h1 { margin: 0 0 8px 0; font-size: 2em; }
- .header-banner p { margin: 0; font-size: 1.1em; opacity: 0.9; }
- .dark .header-banner {
-     background: linear-gradient(135deg, #434190 0%, #553c6b 100%);
- }
- .table-container {
-     border-radius: 8px;
-     box-shadow: 0 2px 10px rgba(0,0,0,0.08);
- }
- .tip-text {
-     font-size: 13px; color: #666; font-style: italic; margin-top: 4px;
- }
- .dark .tip-text { color: #aaa; }
- .metric-note {
-     background: #f0f4ff; padding: 10px 16px; border-radius: 8px;
-     border-left: 4px solid #667eea; margin-bottom: 12px; font-size: 14px;
- }
- .dark .metric-note {
-     background: #2d2d44; border-left-color: #764ba2;
- }
- """


- def build_app():
-     with gr.Blocks(css=CSS, title="AMA-Bench Leaderboard") as demo:

          # Header
          gr.HTML("""
-         <div class="header-banner">
-             <h1>AMA-Bench Leaderboard</h1>
-             <p>Agent Memory Assessment Benchmark &mdash; Evaluating LLMs and Memory Methods on Cognitive Tasks</p>
          </div>
          """)

          with gr.Tabs():
              # ============================================================
-             # Tab 1: Model Leaderboard
              # ============================================================
-             with gr.Tab("Model Leaderboard"):
                  gr.Markdown("""
-                 <div class="metric-note">
-                 Comparing <strong>LLM models</strong> across 4 cognitive tasks: Recall, Causal Inference, State Updating, and State Abstraction.
-                 Results are reported as <strong>Accuracy (F1)</strong>. Sorted by Average Accuracy.
-                 </div>
                  """)

-                 with gr.Row():
-                     model_top_n = gr.Slider(
-                         minimum=1,
-                         maximum=len(MODEL_DATA["entries"]),
-                         step=1,
-                         value=len(MODEL_DATA["entries"]),
-                         label="Number of models to display",
-                     )
-
-                 # Chart
-                 with gr.Row():
-                     gr.Markdown("### Data Visualization")
-                 model_bar = gr.Plot(label="Score Breakdown")
-                 gr.Markdown("*Click a legend entry to isolate that metric. Double-click to add more for comparison.*", elem_classes="tip-text")

-                 # Table
-                 with gr.Row():
-                     gr.Markdown("### Detailed Results")
-                 init_model_df, _ = update_model_leaderboard(len(MODEL_DATA["entries"]))
-                 model_table = gr.DataFrame(
-                     value=init_model_df,
-                     elem_classes="table-container",
-                     show_row_numbers=True,
-                     show_fullscreen_button=True,
-                     show_search="search",
-                     interactive=False,
-                 )

-                 # Wire events
-                 model_top_n.change(
-                     update_model_leaderboard,
-                     inputs=[model_top_n],
-                     outputs=[model_table, model_bar],
-                 )

-                 demo.load(
-                     update_model_leaderboard,
-                     inputs=[model_top_n],
-                     outputs=[model_table, model_bar],
-                 )

              # ============================================================
-             # Tab 2: Method Leaderboard
              # ============================================================
-             with gr.Tab("Method Leaderboard"):
                  gr.Markdown("""
-                 <div class="metric-note">
-                 Comparing <strong>RAG &amp; Agent Memory methods</strong> (base model: Qwen-32B) across 4 cognitive tasks.
-                 Results are reported as <strong>Accuracy (F1)</strong>. Sorted by Average Accuracy.
-                 </div>
                  """)

                  with gr.Row():
-                     method_top_n = gr.Slider(
-                         minimum=1,
-                         maximum=len(METHOD_DATA["entries"]),
-                         step=1,
-                         value=len(METHOD_DATA["entries"]),
-                         label="Number of methods to display",
-                     )
-
-                 # Chart
-                 with gr.Row():
-                     gr.Markdown("### Data Visualization")
-                 method_bar = gr.Plot(label="Score Breakdown")
-                 gr.Markdown("*Click a legend entry to isolate that metric. Double-click to add more for comparison.*", elem_classes="tip-text")

-                 # Table
                  with gr.Row():
-                     gr.Markdown("### Detailed Results")
-                 init_method_df, _ = update_method_leaderboard(len(METHOD_DATA["entries"]))
-                 method_table = gr.DataFrame(
-                     value=init_method_df,
-                     elem_classes="table-container",
-                     show_row_numbers=True,
-                     show_fullscreen_button=True,
-                     show_search="search",
-                     interactive=False,
-                 )

-                 # Wire events
-                 method_top_n.change(
-                     update_method_leaderboard,
-                     inputs=[method_top_n],
-                     outputs=[method_table, method_bar],
-                 )

-                 demo.load(
-                     update_method_leaderboard,
-                     inputs=[method_top_n],
-                     outputs=[method_table, method_bar],
                  )

              # ============================================================
-             # Tab 3: About
              # ============================================================
-             with gr.Tab("About"):
                  gr.Markdown("""
                  ## AMA-Bench: Agent Memory Assessment Benchmark

                  AMA-Bench evaluates memory capabilities of LLMs and memory-augmented agents across four cognitive dimensions:
-                 **Recall** (retrieving stored info), **Causal Inference** (cause-and-effect reasoning), **State Updating** (tracking evolving states), and **State Abstraction** (forming higher-level representations).

-                 **Benchmarks** &mdash; We evaluate on two complementary subsets:
-                 (1) **Real-world Subset:** 2,496 QA pairs.
-                 (2) **Synthetic Subset:** 1,200 QA pairs stratified across five trajectory lengths (8K, 16K, 32K, 64K, and 128K tokens), with 240 samples per interval.

-                 **Leaderboard Tabs** &mdash; *Model Leaderboard* compares LLM models directly; *Method Leaderboard* compares RAG and Agent Memory methods using Qwen-32B as the base model.

-                 **Metrics** &mdash; Results are reported as **Accuracy (F1)**.
                  ---
                  *For questions or submissions, please open a discussion in the Community tab.*
                  """)

  import gradio as gr
  import pandas as pd
  import json
  import plotly.graph_objects as go
+ from plotly.subplots import make_subplots
+ import numpy as np
+ import os
+ import datetime
+ from email.utils import parseaddr
+
+ # Optional imports with fallbacks
+ try:
+     from content import format_error, format_warning, format_log
+ except ImportError:
+     def format_error(msg): return f"❌ **Error:** {msg}"
+     def format_warning(msg): return f"⚠️ **Warning:** {msg}"
+     def format_log(msg): return f"✅ {msg}"
+
+ try:
+     from scorer import score_submission, extract_uppercase_letters
+ except ImportError:
+     score_submission = None
+     extract_uppercase_letters = None
+
+ try:
+     from utils import load_groundtruth, validate_submission_file
+ except ImportError:
+     load_groundtruth = None
+     validate_submission_file = None
+
+ # Configuration
+ TOKEN = os.environ.get("TOKEN", None)
+ OWNER = "Pettingllms"
+ GROUNDTRUTH_PATH = f"{OWNER}/AMA-bench"
+ LOCAL_DEBUG = True

  # ---------------------------------------------------------------------------
  # Data loading
  # ---------------------------------------------------------------------------

+ def load_json_data(path):
+     """Load JSON data from file."""
      with open(path, "r", encoding="utf-8") as f:
          return json.load(f)

+ # Load all data files
+ AGENT_CAPABILITY = load_json_data("data/agent_capability.json")
+ AGENT_DOMAIN = load_json_data("data/agent_domain.json")
+ MODEL_CAPABILITY = load_json_data("data/model_capability.json")
+ MODEL_DOMAIN = load_json_data("data/model_domain.json")

  METRICS = ["Recall", "Causal Inference", "State Updating", "State Abstraction"]
+
+ # Weighted ratios (from benchmark data distribution)
+ # Exact ratios from counts
+ # Domain counts total = 2463
+ DOMAIN_RATIO = {
+     "TEXT2SQL": 612 / 2463,
+     "SOFTWARE_ENGINEER": 432 / 2463,
+     "WEB": 372 / 2463,
+     "EMBODIED_AI": 360 / 2463,
+     "OPENWORLD_QA": 360 / 2463,
+     "GAME": 327 / 2463,
+ }
+
+ # Problem-type counts total = 2462
+ # Type A/B/C/D -> Recall/Causal Inference/State Updating/State Abstraction
+ PROBLEM_TYPE_RATIO = {
+     "RECALL": 835 / 2462,  # Type A
+     "CAUSAL_INFERENCE": 578 / 2462,  # Type B
+     "STATE_UPDATING": 635 / 2462,  # Type C
+     "STATE_ABSTRACTION": 414 / 2462,  # Type D
+ }
+
+ DOMAIN_ALIASES = {
+     "TEXT2SQL": "TEXT2SQL",
+     "SOFTWARE": "SOFTWARE_ENGINEER",
+     "SOFTWARE_ENGINEER": "SOFTWARE_ENGINEER",
+     "WEB": "WEB",
+     "EMBODIED_AI": "EMBODIED_AI",
+     "OPENWORLD_QA": "OPENWORLD_QA",
+     "GAME": "GAME",
+     "GAMING": "GAME",
+ }
+
+ PROBLEM_TYPE_ALIASES = {
+     "TYPE_A": "RECALL",
+     "TYPE_B": "CAUSAL_INFERENCE",
+     "TYPE_C": "STATE_UPDATING",
+     "TYPE_D": "STATE_ABSTRACTION",
+     "RECALL": "RECALL",
+     "CAUSAL": "CAUSAL_INFERENCE",
+     "CAUSAL_INFERENCE": "CAUSAL_INFERENCE",
+     "STATE": "STATE_UPDATING",
+     "STATE_UPDATING": "STATE_UPDATING",
+     "ABSTRACTION": "STATE_ABSTRACTION",
+     "STATE_ABSTRACTION": "STATE_ABSTRACTION",
+ }
+
+
+ def _normalize_category_key(name: str) -> str:
+     """Normalize category key to uppercase snake-style for robust matching."""
+     return str(name).strip().upper().replace(" ", "_").replace("-", "_")
+
+
+ def get_category_weights(categories):
+     """Return normalized per-category weights based on configured ratios."""
+     if not categories:
+         return {}
+
+     normalized = [_normalize_category_key(c) for c in categories]
+     domain_hits = sum(1 for c in normalized if c in DOMAIN_ALIASES)
+     type_hits = sum(1 for c in normalized if c in PROBLEM_TYPE_ALIASES)
+
+     # Detect whether current dict is domain-based or capability/problem-type-based
+     use_domain = domain_hits >= type_hits
+
+     weights = {}
+     for original in categories:
+         key = _normalize_category_key(original)
+         if use_domain:
+             canonical = DOMAIN_ALIASES.get(key, "")
+             weight = DOMAIN_RATIO.get(canonical, 0.0)
+         else:
+             canonical = PROBLEM_TYPE_ALIASES.get(key, "")
+             weight = PROBLEM_TYPE_RATIO.get(canonical, 0.0)
+         weights[original] = weight
+
+     total = sum(weights.values())
+     if total <= 0:
+         equal_weight = 1.0 / len(categories)
+         return {c: equal_weight for c in categories}
+
+     return {c: w / total for c, w in weights.items()}
+
+
+ def filter_data_by_items(data_dict, allowed_items):
+     """Filter nested score dict to only keep specified items for each category."""
+     allowed_set = set(allowed_items)
+     filtered = {}
+     for category, category_data in data_dict.items():
+         filtered[category] = {
+             item: item_data
+             for item, item_data in category_data.items()
+             if item in allowed_set
+         }
+     return filtered
+
+ # Color palette: Distinct colors for better differentiation
+ COLORS = [
+     'rgba(135, 160, 220, 0.5)',  # Light Blue
+     'rgba(230, 150, 120, 0.5)',  # Orange
+     'rgba(180, 180, 180, 0.5)',  # Gray
+     'rgba(255, 215, 100, 0.5)',  # Yellow
+     'rgba(140, 180, 220, 0.5)',  # Sky Blue
+     'rgba(140, 200, 150, 0.5)',  # Green
+     'rgba(200, 160, 140, 0.5)',  # Brown
+     'rgba(130, 140, 200, 0.5)',  # Purple-Blue
+     'rgba(255, 180, 150, 0.5)',  # Coral
+     'rgba(150, 220, 180, 0.5)',  # Mint Green
+ ]

  # ---------------------------------------------------------------------------
+ # Submission processing functions
  # ---------------------------------------------------------------------------

+ def calculate_f1_score(predictions, references):
+     """Calculate F1 score for multi-label classification."""
+     if not predictions or not references:
+         return 0.0

+     if extract_uppercase_letters is None:
+         # Fallback implementation
+         def extract_letters(text):
+             return ''.join(sorted(set(c for c in str(text) if c.isupper() and c.isalpha())))
+         extract_fn = extract_letters
+     else:
+         extract_fn = extract_uppercase_letters

+     total_precision = 0.0
+     total_recall = 0.0
+     count = 0

+     for pred, ref in zip(predictions, references):
+         pred_set = set(extract_fn(pred))
+         ref_set = set(extract_fn(ref))

+         if not pred_set and not ref_set:
+             total_precision += 1.0
+             total_recall += 1.0
+             count += 1
+         elif not pred_set or not ref_set:
+             count += 1
+         else:
+             intersection = len(pred_set & ref_set)
+             precision = intersection / len(pred_set) if pred_set else 0
+             recall = intersection / len(ref_set) if ref_set else 0
+             total_precision += precision
+             total_recall += recall
+             count += 1

+     if count == 0:
+         return 0.0

+     avg_precision = total_precision / count
+     avg_recall = total_recall / count
+
+     if avg_precision + avg_recall == 0:
+         return 0.0
+
+     f1 = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall)
+     return f1
+
+
+ def update_json_with_submission(model_name, scores_by_metric, scored_submissions, is_agent=False, model_family=""):
+     """Update JSON files with new submission data."""
+     try:
+         if is_agent:
+             capability_file = "data/agent_capability.json"
+             domain_file = "data/agent_domain.json"
+         else:
+             capability_file = "data/model_capability.json"
+             domain_file = "data/model_domain.json"
+
+         # Load existing data
+         with open(capability_file, 'r', encoding='utf-8') as f:
+             capability_data = json.load(f)
+
+         # Update capability data
+         for capability in METRICS:
+             if capability in scores_by_metric and capability in capability_data:
+                 metric_data = scores_by_metric[capability]
+
+                 # Get submissions for this capability
+                 capability_submissions = [
+                     s for s in scored_submissions
+                     if s.get('metric_category') == capability
+                 ]
+
+                 # Calculate F1
+                 if capability_submissions:
+                     predictions = [s.get('answer', '') for s in capability_submissions]
+                     references = [s.get('reference_answer', '') for s in capability_submissions]
+                     f1 = calculate_f1_score(predictions, references)
+                 else:
+                     f1 = 0.0
+
+                 capability_data[capability][model_name] = {
+                     "accuracy": metric_data['accuracy'],
+                     "model_family": model_family,
+                     "f1": f1
+                 }
+
+         # Save updated data
+         with open(capability_file, 'w', encoding='utf-8') as f:
+             json.dump(capability_data, f, indent=2, ensure_ascii=False)
+
+         print(f"✓ Updated {capability_file}")
+         return True
+
+     except Exception as e:
+         print(f"Error updating JSON files: {e}")
+         import traceback
+         traceback.print_exc()
+         return False
+
+
+ def add_new_submission(model, submission_type, url, file, organisation, mail, model_family=""):
+     """Process and evaluate a new model/agent submission."""
+     try:
+         # Validate inputs
+         if file is None:
+             return format_warning("Please attach a file.")
+
+         _, parsed_mail = parseaddr(mail)
+         if "@" not in parsed_mail:
+             return format_warning("Please provide a valid email address.")
+
+         if not model or not submission_type or not organisation:
+             return format_warning("Please fill in all required fields.")
+
+         print(f"Processing submission from {organisation}/{model}")
+
+         # Check if functions are available
+         if validate_submission_file is None or score_submission is None or load_groundtruth is None:
+             return format_warning(
+                 "Submission processing modules are not fully available. "
+                 "Please ensure scorer.py and utils.py are present."
+             )
+
+         # Validate file
+         is_valid, error_msg, submissions = validate_submission_file(file.name)
+         if not is_valid:
+             return format_error(error_msg)
+
+         print(f"✓ Validated {len(submissions)} submissions")
+
+         # Load ground truth
+         groundtruth = load_groundtruth(GROUNDTRUTH_PATH, TOKEN)
+         if not groundtruth:
+             return format_warning(
+                 "Ground truth data could not be loaded. "
+                 "Submission received but cannot be scored automatically."
+             )
+
+         print(f"✓ Loaded {len(groundtruth)} ground truth Q&A pairs")
+
+         # Score submissions
+         result = score_submission(submissions, groundtruth)
+         scores_by_metric = result["scores"]
+         scored_submissions = result["scored_submissions"]
+
+         average_accuracy = scores_by_metric["Average"]["accuracy"]
+
+         print(f"✓ Overall accuracy: {average_accuracy:.4f}")
+         for metric_name, metric_data in scores_by_metric.items():
+             if metric_name != "Average":
+                 print(f" {metric_name}: {metric_data['accuracy']:.4f} ({metric_data['correct']}/{metric_data['count']})")
+
320
+ # Save locally
321
+ submission_dir = f"submissions/{organisation}_{model}"
322
+ os.makedirs(submission_dir, exist_ok=True)
323
+
324
+ timestamp = datetime.datetime.today().strftime('%Y%m%d_%H%M%S')
325
+
326
+ # Save files
327
+ scored_file = f"{submission_dir}/submission_scored_{timestamp}.jsonl"
328
+ with open(scored_file, 'w', encoding='utf-8') as f:
329
+ for submission in scored_submissions:
330
+ f.write(json.dumps(submission, ensure_ascii=False) + "\n")
331
+
332
+ metadata = {
333
+ "model": model,
334
+ "submission_type": submission_type,
335
+ "url": url,
336
+ "organisation": organisation,
337
+ "timestamp": timestamp,
338
+ "overall_accuracy": float(average_accuracy),
339
+ "scores_by_metric": {
340
+ metric_name: {
341
+ "accuracy": float(metric_data["accuracy"]),
342
+ "count": int(metric_data["count"]),
343
+ "correct": int(metric_data["correct"])
344
+ }
345
+ for metric_name, metric_data in scores_by_metric.items()
346
+ }
347
+ }
348
+
349
+ metadata_file = f"{submission_dir}/metadata_{timestamp}.json"
350
+ with open(metadata_file, 'w', encoding='utf-8') as f:
351
+ json.dump(metadata, f, indent=2, ensure_ascii=False)
352
+
353
+ print(f"✓ Saved results to {submission_dir}")
354
+
355
+ # Update JSON files
356
+ is_agent = (submission_type.lower() == "agent")
357
+ update_success = update_json_with_submission(
358
+ model, scores_by_metric, scored_submissions, is_agent=is_agent, model_family=model_family
359
+ )
360
+
361
+ if update_success:
362
+ print("✓ Updated leaderboard JSON files")
363
+ # Reload data
364
+ global AGENT_CAPABILITY, AGENT_DOMAIN, MODEL_CAPABILITY, MODEL_DOMAIN
365
+ if is_agent:
366
+ AGENT_CAPABILITY = load_json_data("data/agent_capability.json")
367
+ AGENT_DOMAIN = load_json_data("data/agent_domain.json")
368
+ else:
369
+ MODEL_CAPABILITY = load_json_data("data/model_capability.json")
370
+ MODEL_DOMAIN = load_json_data("data/model_domain.json")
371
+
372
+ # Format message
373
+ message = "✅ **Submission successful!**\n\n"
374
+ message += f"**{'Agent' if is_agent else 'Model'}:** {model}\n"
375
+ message += f"**Organisation:** {organisation}\n"
376
+ message += f"**Overall Accuracy:** {average_accuracy:.4f}\n\n"
377
+ message += "**Scores by Capability:**\n"
378
+ for metric_name in METRICS:
379
+ if metric_name in scores_by_metric:
380
+ metric_data = scores_by_metric[metric_name]
381
+ message += f"- **{metric_name}:** {metric_data['accuracy']:.4f} ({metric_data['correct']}/{metric_data['count']})\n"
382
+
383
+ message += f"\n**Submission ID:** {timestamp}\n"
384
+ if update_success:
385
+ message += "\n*The leaderboard has been updated. Refresh the page to see changes.*"
386
+
387
+ return format_log(message)
388
+
389
+ except Exception as e:
390
+ import traceback
391
+ traceback.print_exc()
392
+ return format_error(f"An error occurred: {str(e)}")
393
 
394
 
395
  # ---------------------------------------------------------------------------
396
+ # Visualization functions
397
  # ---------------------------------------------------------------------------
398
 
399
+ def create_radar_chart_from_dict(data_dict, title="Performance Radar Chart", top_n=10):
400
+ """
401
+ Create radar chart from dictionary data showing top N entries.
402
+
403
+ Args:
404
+ data_dict: Dictionary with structure {category: {item_name: {accuracy: x, f1: y}}}
405
+ title: Chart title
406
+ top_n: Number of top entries to display (default 10)
407
+
408
+ Returns:
409
+ Plotly Figure with radar chart (showing only accuracy)
410
+ """
411
+ if not data_dict:
412
+ fig = go.Figure()
413
+ fig.update_layout(title="No data available")
414
+ return fig
415
 
416
+ # Extract categories and items
417
+ categories = list(data_dict.keys())
418
+ all_items = set()
419
+ for category_data in data_dict.values():
420
+ all_items.update(category_data.keys())
421
+
422
+ # Calculate weighted average accuracy for each item to determine top N
423
+ category_weights = get_category_weights(categories)
424
+ item_avg_scores = {}
425
+ for item in all_items:
426
+ weighted_sum = 0.0
427
+ weight_sum = 0.0
428
+ for category in categories:
429
+ item_data = data_dict[category].get(item, {})
430
+ accuracy = item_data.get('accuracy', 0) if isinstance(item_data, dict) else item_data
431
+ weight = category_weights.get(category, 0.0)
432
+ weighted_sum += accuracy * weight
433
+ weight_sum += weight
434
+ item_avg_scores[item] = (weighted_sum / weight_sum) if weight_sum > 0 else 0
435
+
436
+ # Get top N items by average accuracy
437
+ sorted_items = sorted(item_avg_scores.items(), key=lambda x: x[1], reverse=True)
438
+ top_items = [item[0] for item in sorted_items[:top_n]]
439
 
 
 
440
  fig = go.Figure()
441
 
442
+ # Add trace for each top item
443
+ for idx, item in enumerate(top_items):
444
+ values = []
445
+ for category in categories:
446
+ item_data = data_dict[category].get(item, {})
447
+ # Extract accuracy value only
448
+ accuracy = item_data.get('accuracy', 0) if isinstance(item_data, dict) else item_data
449
+ values.append(accuracy * 100) # Convert to percentage
450
 
451
+ # Close the polygon
452
+ values_closed = values + [values[0]]
453
+ categories_closed = categories + [categories[0]]
454
 
455
+ color = COLORS[idx % len(COLORS)]
456
+
457
+ fig.add_trace(go.Scatterpolar(
458
+ r=values_closed,
459
+ theta=categories_closed,
460
+ mode='lines+markers',
461
+ fill='toself',
462
+ name=item,
463
+ line=dict(color=color, width=2),
464
+ marker=dict(color=color, size=8),
465
+ fillcolor=color.replace('0.5', '0.15'),
466
+ hovertemplate='<b>%{fullData.name}</b><br>%{theta}: %{r:.2f}%<extra></extra>'
467
+ ))
468
+
469
+ # Update layout
470
  fig.update_layout(
471
+ title=dict(
472
+ text=title,
473
+ x=0.5,
474
+ xanchor='center',
475
+ font=dict(size=20, color='#2c3e50')
476
+ ),
477
+ polar=dict(
478
+ radialaxis=dict(
479
+ visible=True,
480
+ range=[0, 100],
481
+ ticksuffix='%',
482
+ tickfont=dict(size=11),
483
+ gridcolor='rgba(200, 200, 200, 0.3)',
484
+ gridwidth=1
485
+ ),
486
+ angularaxis=dict(
487
+ tickfont=dict(size=13, weight='bold', color='#2c3e50')
488
+ ),
489
+ bgcolor='rgba(245, 245, 245, 0.5)'
490
+ ),
491
  legend=dict(
492
+ font=dict(size=11),
493
+ title=dict(text="Items", font=dict(size=13)),
494
+ x=1.02,
495
+ y=1,
496
+ xanchor='left',
497
+ yanchor='top',
498
+ bgcolor='rgba(255,255,255,0.8)',
499
+ bordercolor='rgba(100,100,100,0.3)',
500
+ borderwidth=1,
501
+ itemclick="toggleothers",
502
+ itemdoubleclick="toggle"
503
  ),
504
+ height=600,
505
+ margin=dict(l=80, r=250, t=100, b=80),
506
+ paper_bgcolor='white',
507
+ font=dict(color='#2c3e50')
508
  )
509
+
510
  return fig
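
Review note: the radar traces derive their fill color with `color.replace('0.5', '0.15')`, which only works while every palette entry ends in an alpha of exactly `0.5`. A minimal sketch of a stricter helper follows, assuming colors are always written as `rgb(...)` or `rgba(...)` strings with integer channels. The name `with_alpha` is hypothetical and does not exist in this commit.

```python
import re

# Matches "rgb(r, g, b)" or "rgba(r, g, b, a)" with integer channels.
_RGBA_PATTERN = re.compile(
    r"rgba?\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*(?:,\s*[\d.]+\s*)?\)"
)


def with_alpha(color: str, alpha: float) -> str:
    """Return the same color with its alpha channel replaced by `alpha`."""
    match = _RGBA_PATTERN.fullmatch(color.strip())
    if match is None:
        raise ValueError(f"not an rgb()/rgba() color: {color!r}")
    red, green, blue = match.groups()
    return f"rgba({red}, {green}, {blue}, {alpha})"


# Illustrative call with one of the palette entries.
fill = with_alpha("rgba(135, 160, 220, 0.5)", 0.15)
```

With a helper like this, a palette entry such as `rgba(10, 20, 30, 0.8)` would still get the lighter fill, whereas the string replacement would leave it unchanged.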
511
 
512
 
513
+ def create_capability_subplots(data_dict, title="Capability Performance", top_n=10):
514
+ """
515
+ Create 2x2 subplot layout with one bar chart per capability, showing top N entries.
516
+ Optimized for responsive sizing with equal spacing across all subplots.
517
+
518
+ Args:
519
+ data_dict: Dictionary with structure {capability: {item_name: {accuracy: x, f1: y}}}
520
+ title: Overall chart title
521
+ top_n: Number of top entries to display per subplot (default 10)
522
 
523
+ Returns:
524
+ Plotly Figure with 2x2 subplots (showing only accuracy)
525
+ """
526
+ if not data_dict:
527
+ fig = go.Figure()
528
+ fig.update_layout(title="No data available")
529
+ return fig
530
 
531
+ # Extract capabilities
532
+ capabilities = list(data_dict.keys())
533
 
534
+ # Create 2x2 subplot with optimized spacing for full window coverage
535
+ fig = make_subplots(
536
+ rows=2, cols=2,
537
+ subplot_titles=capabilities[:4],
538
+ vertical_spacing=0.15, # Increased for better separation
539
+ horizontal_spacing=0.12, # Balanced horizontal spacing
540
+ specs=[[{"secondary_y": False}, {"secondary_y": False}],
541
+ [{"secondary_y": False}, {"secondary_y": False}]]
542
+ )
543
 
544
+ # Position mapping for 2x2 grid
545
+ positions = [(1, 1), (1, 2), (2, 1), (2, 2)]
546
 
547
+ # Get all unique items across all capabilities for consistent coloring
548
+ all_items = set()
549
+ for capability_data in data_dict.values():
550
+ all_items.update(capability_data.keys())
551
+ all_items = sorted(list(all_items))
552
 
553
+ # Create a bar chart for each capability
554
+ for idx, capability in enumerate(capabilities[:4]):
555
+ row, col = positions[idx]
556
+ capability_data = data_dict[capability]
557
 
558
+ # Sort items by accuracy score for this capability and get top N
559
+ sorted_items = sorted(
560
+ capability_data.items(),
561
+ key=lambda x: x[1].get('accuracy', 0) if isinstance(x[1], dict) else x[1],
562
+ reverse=True
563
+ )[:top_n]
564
 
565
+ item_names = [item[0] for item in sorted_items]
566
+ item_scores = [
567
+ (item[1].get('accuracy', 0) if isinstance(item[1], dict) else item[1]) * 100
568
+ for item in sorted_items
569
+ ]
570
 
571
+ # Assign colors based on global item index
572
+ colors = [COLORS[all_items.index(name) % len(COLORS)] for name in item_names]
573
+
574
+ fig.add_trace(
575
+ go.Bar(
576
+ x=item_names,
577
+ y=item_scores,
578
+ marker=dict(
579
+ color=colors,
580
+ line=dict(color='rgba(50, 50, 50, 0.5)', width=1)
581
+ ),
582
+ showlegend=False,
583
+ hovertemplate='<b>%{x}</b><br>Score: %{y:.2f}%<extra></extra>',
584
+ width=0.7
585
+ ),
586
+ row=row, col=col
587
+ )
588
+
589
+ # Update axes with consistent styling
590
+ fig.update_xaxes(
591
+ tickangle=-45,
592
+ tickfont=dict(size=9),
593
+ tickmode='linear',
594
+ row=row, col=col,
595
+ showgrid=False,
596
+ showline=True,
597
+ linewidth=1,
598
+ linecolor='rgba(200, 200, 200, 0.5)'
599
+ )
600
+ fig.update_yaxes(
601
+ range=[0, 100],
602
+ title_text="Performance (%)",
603
+ title_font=dict(size=12),
604
+ tickfont=dict(size=10),
605
+ gridcolor='rgba(200, 200, 200, 0.3)',
606
+ row=row, col=col,
607
+ showline=True,
608
+ linewidth=1,
609
+ linecolor='rgba(200, 200, 200, 0.5)'
610
+ )
611
+
612
+ # Update overall layout with fully responsive sizing
613
+ fig.update_layout(
614
+ title=dict(
615
+ text=title,
616
+ x=0.5,
617
+ xanchor='center',
618
+ font=dict(size=20, color='#2c3e50')
619
+ ),
620
+ height=900, # Increased height for better proportions
621
+ autosize=True,
622
+ showlegend=False,
623
+ plot_bgcolor='rgba(245, 245, 245, 0.5)',
624
+ paper_bgcolor='white',
625
+ font=dict(color='#2c3e50', family="Arial, sans-serif"),
626
+ margin=dict(l=80, r=80, t=100, b=120), # Increased margins for better spacing
627
+ hovermode='closest'
628
+ )
629
+
630
+ # Update subplot titles styling
631
+ for annotation in fig['layout']['annotations']:
632
+ annotation['font'] = dict(size=14, color='#2c3e50')
633
+ annotation['xanchor'] = 'center'
634
+ annotation['showarrow'] = False
635
+
636
+ return fig
637
+
638
+
639
+ def create_summary_table(data_dict, type_name="Agent"):
640
+ """
641
+ Create summary table showing rank, average accuracy and F1 scores.
642
+
643
+ Args:
644
+ data_dict: Dictionary with structure {category: {item_name: {accuracy: x, f1: y}}}
645
+ type_name: "Agent" or "Model"
646
+
647
+ Returns:
648
+ pandas DataFrame with rank, accuracy and F1 columns
649
+ """
650
+ if not data_dict:
651
+ return pd.DataFrame()
652
+
653
+ # Calculate average scores for each item
654
+ items = set()
655
+ for category_data in data_dict.values():
656
+ items.update(category_data.keys())
657
+
658
+ categories = list(data_dict.keys())
659
+ category_weights = get_category_weights(categories)
660
+
661
+ rows = []
662
+ for item in sorted(items):
663
+ weighted_accuracy_sum = 0.0
664
+ weighted_f1_sum = 0.0
665
+ used_weight_sum = 0.0
666
+ model_family = ""
667
+ for category, category_data in data_dict.items():
668
+ if item in category_data:
669
+ item_data = category_data[item]
670
+ weight = category_weights.get(category, 0.0)
671
+ if isinstance(item_data, dict):
672
+ weighted_accuracy_sum += item_data.get('accuracy', 0) * weight
673
+ weighted_f1_sum += item_data.get('f1', 0) * weight
674
+ used_weight_sum += weight
675
+ if not model_family:
676
+ model_family = item_data.get('model_family', '')
677
+ else:
678
+ weighted_accuracy_sum += item_data * weight
679
+ used_weight_sum += weight
680
+
681
+ avg_accuracy = (weighted_accuracy_sum / used_weight_sum) if used_weight_sum > 0 else 0
682
+ avg_f1 = (weighted_f1_sum / used_weight_sum) if used_weight_sum > 0 else 0
683
+
684
+ rows.append({
685
+ type_name: item,
686
+ "Model Family": model_family,
687
+ "Avg Accuracy": avg_accuracy,
688
+ "Avg F1": avg_f1,
689
+ "_acc_sort": avg_accuracy
690
+ })
691
+
692
+ df = pd.DataFrame(rows)
693
+ df = df.sort_values(by="_acc_sort", ascending=False).reset_index(drop=True)
694
+
695
+ # Add rank column with medals for top 3
696
+ medals = ["🥇", "🥈", "🥉"]
697
+ ranks = []
698
+ for i in range(len(df)):
699
+ if i < 3:
700
+ ranks.append(f"{medals[i]} {i+1}")
701
+ else:
702
+ ranks.append(str(i+1))
703
+
704
+ df.insert(0, "Rank", ranks)
705
+
706
+ # Format accuracy and F1 as percentages
707
+ df["Avg Accuracy"] = df["Avg Accuracy"].apply(lambda x: f"{x * 100:.2f}%")
708
+ df["Avg F1"] = df["Avg F1"].apply(lambda x: f"{x * 100:.2f}%")
709
+
710
+ # Drop sorting column
711
+ df = df.drop(columns=["_acc_sort"])
712
+
713
+ return df
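
Review note: `create_radar_chart_from_dict` and `create_summary_table` both rank by a weighted average, but they appear to treat missing categories differently. The radar code counts a missing entry as accuracy 0 and still adds its weight, while the table skips the category entirely. The sketch below shows both conventions side by side. It assumes every entry is a dict with an `accuracy` key, and the weights are made-up values, since `get_category_weights` is not visible in this diff.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, Dict[str, Dict[str, float]]]


def weighted_accuracy(
    data: Scores, weights: Dict[str, float], item: str, skip_missing: bool
) -> float:
    """Weighted mean accuracy of `item` over the categories in `data`."""
    weighted_sum = 0.0
    weight_sum = 0.0
    for category, per_category in data.items():
        if item not in per_category:
            if skip_missing:
                continue  # summary-table convention: ignore the category
            accuracy = 0.0  # radar-chart convention: count it as zero
        else:
            accuracy = per_category[item].get("accuracy", 0.0)
        weight = weights.get(category, 0.0)
        weighted_sum += accuracy * weight
        weight_sum += weight
    return weighted_sum / weight_sum if weight_sum > 0 else 0.0


def rank_items(
    data: Scores, weights: Dict[str, float], top_n: int, skip_missing: bool
) -> List[Tuple[str, float]]:
    """Return the top_n (item, score) pairs, best first, ties broken by name."""
    items = {name for per_category in data.values() for name in per_category}
    scored = [
        (name, weighted_accuracy(data, weights, name, skip_missing))
        for name in items
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical data: item "B" has no "Web" entry.
example = {
    "Game": {"A": {"accuracy": 0.5}, "B": {"accuracy": 0.3}},
    "Web": {"A": {"accuracy": 0.2}},
}
example_weights = {"Game": 2.0, "Web": 1.0}
```

If the two functions are meant to agree, sharing one helper like this would keep the chart order and the table order consistent for items that lack results in some category.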
714
 
715
 
716
  # ---------------------------------------------------------------------------
717
+ # Build Gradio interface
718
  # ---------------------------------------------------------------------------
719
 
720
+ def build_app():
721
+ """Build the Gradio application."""
722
 
723
+ CSS = """
724
+ .markdown-text {
725
+ font-size: 16px !important;
726
+ }
727
+ .intro-box {
728
+ background: linear-gradient(135deg, rgba(26, 188, 156, 0.1) 0%, rgba(52, 152, 219, 0.1) 100%);
729
+ padding: 25px;
730
+ border-radius: 10px;
731
+ margin: 20px 0;
732
+ border-left: 4px solid #1abc9c;
733
+ }
734
+ """
735
 
736
+ # Keep Model Domain view strictly model-only (prevents accidental agent entries)
737
+ model_items = set()
738
+ for capability_data in MODEL_CAPABILITY.values():
739
+ model_items.update(capability_data.keys())
740
+ model_domain_filtered = filter_data_by_items(MODEL_DOMAIN, model_items)
741
+ if not any(len(category_data) > 0 for category_data in model_domain_filtered.values()):
742
+ # If model_domain.json is polluted with non-model entries, avoid showing wrong (agent) curves
743
+ model_domain_filtered = {}
744
+
745
+ with gr.Blocks(css=CSS, title="AMA-Bench Leaderboard", theme=gr.themes.Soft()) as demo:
746
 
747
  # Header
748
  gr.HTML("""
749
+ <div style="text-align: center; padding: 10px 20px; margin-bottom: 20px;">
750
+ <h1 style="margin: 0; font-size: 48px; font-weight: 700; color: #1a1a2e;">
751
+ 🤖 AMA-Bench: Leaderboard
752
+ </h1>
753
+ <p style="font-size: 18px; color: #666; margin-top: 10px;">
754
+ Agent Memory Assessment Benchmark - Performance Visualization
755
+ </p>
756
+ </div>
757
+ """)
758
+
759
+ # Welcome Banner
760
+ gr.HTML("""
761
+ <div class="intro-box">
762
+ <h3 style="margin: 0 0 15px 0; color: #1abc9c; font-size: 24px;">
763
+ 🎯 Welcome to AMA-Bench!
764
+ </h3>
765
+ <p style="margin: 15px 0; color: #2c3e50; font-size: 22px; font-weight: 700; line-height: 1.6;">
766
+ Evaluate agent memory itself, not just dialogue.
767
+ </p>
768
+ <p style="margin: 10px 0; color: #2c3e50; font-size: 16px; line-height: 1.6;">
769
+ Built from real agent environment streams and scalable long-horizon trajectories across
770
+ representative domains, AMA-Bench tests whether LLM agents can <strong>recall</strong>,
771
+ perform <strong>causal inference</strong>, <strong>update state</strong>, and
772
+ <strong>abstract</strong> state information over long runs.
773
+ </p>
774
+ <p style="margin: 10px 0; color: #34495e; font-size: 14px;">
775
+ 📄 Paper: <a href="https://arxiv.org/abs/2602.22769" style="color: #3498db;">https://arxiv.org/abs/2602.22769</a>
776
+ </p>
777
  </div>
778
  """)
779
 
780
  with gr.Tabs():
781
+
782
  # ============================================================
783
+ # Tab 1: Agent Performance
784
  # ============================================================
785
+ with gr.Tab("🤖 Agent Performance"):
786
  gr.Markdown("""
787
+ ### Agent Performance Analysis
788
+ Explore agent performance across different domains and capabilities.
 
 
789
  """)
790
 
791
+ with gr.Tabs():
792
+ # Domain Sub-tab (Radar Chart)
793
+ with gr.Tab("🎯 Domain Performance"):
794
+ gr.Markdown("""
795
+ **Radar chart** showing agent performance across different domains.
796
+ Click legend items to isolate specific agents.
797
+ """)
798
 
799
+ with gr.Row():
800
+ agent_domain_top_n = gr.Slider(
801
+ minimum=1,
802
+ maximum=10,
803
+ value=8,
804
+ step=1,
805
+ label="Show Top N Agents",
806
+ info="Select how many top agents to display (1-10)"
807
+ )
 
 
 
808
 
809
+ agent_domain_chart = gr.Plot(
810
+ value=create_radar_chart_from_dict(
811
+ AGENT_DOMAIN,
812
+ "Agent Performance Across Domains",
813
+ top_n=8
814
+ )
815
+ )
816
 
817
+ with gr.Accordion("📊 Summary Statistics", open=True):
818
+ agent_domain_table = gr.Dataframe(
819
+ value=create_summary_table(AGENT_DOMAIN, "Agent"),
820
+ label="Average Domain Scores"
821
+ )
822
+
823
+ # Update chart when slider changes
824
+ agent_domain_top_n.change(
825
+ fn=lambda n: create_radar_chart_from_dict(
826
+ AGENT_DOMAIN,
827
+ "Agent Performance Across Domains",
828
+ top_n=int(n)
829
+ ),
830
+ inputs=[agent_domain_top_n],
831
+ outputs=[agent_domain_chart]
832
+ )
833
+
834
+ # Capability Sub-tab (Bar Chart)
835
+ with gr.Tab("⚡ Capability Performance"):
836
+ gr.Markdown("""
837
+ **Bar charts** showing agent performance for each capability.
838
+ Each subplot covers one capability and compares the top N agents.
839
+ """)
840
+
841
+ with gr.Row():
842
+ agent_capability_top_n = gr.Slider(
843
+ minimum=1,
844
+ maximum=10,
845
+ value=8,
846
+ step=1,
847
+ label="Show Top N Agents",
848
+ info="Select how many top agents to display per capability (1-10)"
849
+ )
850
+
851
+ agent_capability_chart = gr.Plot(
852
+ value=create_capability_subplots(
853
+ AGENT_CAPABILITY,
854
+ "Agent Performance by Capability",
855
+ top_n=8
856
+ )
857
+ )
858
+
859
+ with gr.Accordion("📊 Summary Statistics", open=True):
860
+ agent_capability_table = gr.Dataframe(
861
+ value=create_summary_table(AGENT_CAPABILITY, "Agent"),
862
+ label="Average Capability Scores"
863
+ )
864
+
865
+ # Update chart when slider changes
866
+ agent_capability_top_n.change(
867
+ fn=lambda n: create_capability_subplots(
868
+ AGENT_CAPABILITY,
869
+ "Agent Performance by Capability",
870
+ top_n=int(n)
871
+ ),
872
+ inputs=[agent_capability_top_n],
873
+ outputs=[agent_capability_chart]
874
+ )
875
 
876
  # ============================================================
877
+ # Tab 2: Model Performance
878
  # ============================================================
879
+ with gr.Tab("🔬 Model Performance"):
880
  gr.Markdown("""
881
+ ### Model Performance Analysis
882
+ Explore model performance across different domains and capabilities.
883
+ """)
884
+
885
+ with gr.Tabs():
886
+ # Domain Sub-tab (Radar Chart)
887
+ with gr.Tab("🎯 Domain Performance"):
888
+ gr.Markdown("""
889
+ **Radar chart** showing model performance across different domains.
890
+ Click legend items to isolate specific models.
891
+ """)
892
+
893
+ with gr.Row():
894
+ model_domain_top_n = gr.Slider(
895
+ minimum=1,
896
+ maximum=10,
897
+ value=8,
898
+ step=1,
899
+ label="Show Top N Models",
900
+ info="Select how many top models to display (1-10)"
901
+ )
902
+
903
+ model_domain_chart = gr.Plot(
904
+ value=create_radar_chart_from_dict(
905
+ model_domain_filtered,
906
+ "Model Performance Across Domains",
907
+ top_n=8
908
+ )
909
+ )
910
+
911
+ with gr.Accordion("📊 Summary Statistics", open=True):
912
+ model_domain_table = gr.Dataframe(
913
+ value=create_summary_table(model_domain_filtered, "Model"),
914
+ label="Average Domain Scores"
915
+ )
916
+
917
+ # Update chart when slider changes
918
+ model_domain_top_n.change(
919
+ fn=lambda n: create_radar_chart_from_dict(
920
+ model_domain_filtered,
921
+ "Model Performance Across Domains",
922
+ top_n=int(n)
923
+ ),
924
+ inputs=[model_domain_top_n],
925
+ outputs=[model_domain_chart]
926
+ )
927
+
928
+ # Capability Sub-tab (Bar Chart)
929
+ with gr.Tab("⚡ Capability Performance"):
930
+ gr.Markdown("""
931
+ **Bar charts** showing model performance for each capability.
932
+ Each subplot covers one capability and compares the top N models.
933
+ """)
934
+
935
+ with gr.Row():
936
+ model_capability_top_n = gr.Slider(
937
+ minimum=1,
938
+ maximum=10,
939
+ value=8,
940
+ step=1,
941
+ label="Show Top N Models",
942
+ info="Select how many top models to display per capability (1-10)"
943
+ )
944
+
945
+ model_capability_chart = gr.Plot(
946
+ value=create_capability_subplots(
947
+ MODEL_CAPABILITY,
948
+ "Model Performance by Capability",
949
+ top_n=8
950
+ )
951
+ )
952
+
953
+ with gr.Accordion("📊 Summary Statistics", open=True):
954
+ model_capability_table = gr.Dataframe(
955
+ value=create_summary_table(MODEL_CAPABILITY, "Model"),
956
+ label="Average Capability Scores"
957
+ )
958
+
959
+ # Update chart when slider changes
960
+ model_capability_top_n.change(
961
+ fn=lambda n: create_capability_subplots(
962
+ MODEL_CAPABILITY,
963
+ "Model Performance by Capability",
964
+ top_n=int(n)
965
+ ),
966
+ inputs=[model_capability_top_n],
967
+ outputs=[model_capability_chart]
968
+ )
969
+
970
+ # ============================================================
971
+ # Tab 3: Submit
972
+ # ============================================================
973
+ with gr.Tab("📤 Submit"):
974
+ gr.Markdown("""
975
+ ### Submit Your Model/Agent for Evaluation
976
+
977
+ Submit your model or agent predictions to be evaluated on AMA-Bench.
978
+ Your results will be automatically scored and added to the leaderboard.
979
  """)
980
 
981
  with gr.Row():
982
+ with gr.Column():
983
+ model_name_textbox = gr.Textbox(
984
+ label="Model/Agent Name",
985
+ placeholder="e.g., GPT-4 or MyAgent-v2"
986
+ )
987
+ submission_type = gr.Radio(
988
+ choices=["Model", "Agent"],
989
+ label="Submission Type",
990
+ value="Model"
991
+ )
992
+ url_textbox = gr.Textbox(
993
+ label="URL to Model/Agent Information",
994
+ placeholder="https://..."
995
+ )
996
+ with gr.Column():
997
+ organisation = gr.Textbox(
998
+ label="Organisation",
999
+ placeholder="e.g., OpenAI, Anthropic"
1000
+ )
1001
+ model_family_textbox = gr.Textbox(
1002
+ label="Model Family",
1003
+ placeholder="e.g., GPT-4, Claude-3, Qwen3-32B"
1004
+ )
1005
+ mail = gr.Textbox(
1006
+ label="Contact Email",
1007
+ placeholder="your.email@example.com"
1008
+ )
1009
+ file_upload = gr.File(
1010
+ label="Submission File (JSONL format)",
1011
+ file_types=[".jsonl"]
1012
+ )
1013
+
1014
+ gr.Markdown("""
1015
+ **Submission Format:**
1016
+
1017
+ Your JSONL file should contain one prediction per line:
1018
+ ```json
1019
+ {"episode_id": "ep_001", "question": "What is X?", "answer": "A"}
1020
+ {"episode_id": "ep_002", "question": "What is Y?", "answer": "BC"}
1021
+ ```
1022
+
1023
+ **Required fields:**
1024
+ - `episode_id`: Episode identifier
1025
+ - `question`: The question text
1026
+ - `answer`: Your model's answer (uppercase letters: A, B, AB, etc.)
1027
+ """)
1028
 
 
1029
  with gr.Row():
1030
+ submit_button = gr.Button("Submit", variant="primary", size="lg")
1031
 
1032
+ submission_result = gr.Markdown()
1033
 
1034
+ submit_button.click(
1035
+ add_new_submission,
1036
+ [
1037
+ model_name_textbox,
1038
+ submission_type,
1039
+ url_textbox,
1040
+ file_upload,
1041
+ organisation,
1042
+ mail,
1043
+ model_family_textbox
1044
+ ],
1045
+ submission_result,
1046
  )
1047
 
1048
  # ============================================================
1049
+ # Tab 4: About
1050
  # ============================================================
1051
+ with gr.Tab("ℹ️ About"):
1052
  gr.Markdown("""
1053
  ## AMA-Bench: Agent Memory Assessment Benchmark
1054
 
1055
  AMA-Bench evaluates memory capabilities of LLMs and memory-augmented agents across four cognitive dimensions:
1056
+ **Recall** (retrieving stored info), **Causal Inference** (cause-and-effect reasoning),
1057
+ **State Updating** (tracking evolving states), and **State Abstraction** (forming higher-level representations).
1058
+
1059
+ ### Benchmarks
1060
+
1061
+ We evaluate on two complementary subsets:
1062
+ 1. **Real-world Subset:** 2,496 QA pairs from real agent environment streams
1063
+ 2. **Synthetic Subset:** 1,200 QA pairs stratified across five trajectory lengths (8K, 16K, 32K, 64K, and 128K tokens)
1064
+
1065
+ ### Leaderboard Tabs
1066
+
1067
+ - **Agent Performance**: Compares RAG and Agent Memory methods
1068
+ - Domain Performance: Radar charts across 6 domains (Game, Embodied AI, Web, Text2SQL, Openworld QA, Software Engineer)
1069
+ - Capability Performance: Bar charts showing performance on the 4 capabilities
1070
+ - **Top N Selection**: Choose to display top 1-10 performers
1071
 
1072
+ - **Model Performance**: Compares LLMs directly
1073
+ - Domain Performance: Radar charts showing performance across different application domains
1074
+ - Capability Performance: Bar charts showing performance on each cognitive capability
1075
+ - **Top N Selection**: Choose to display top 1-10 performers
1076
 
1077
+ ### Metrics
1078
+
1079
+ Results are reported as **Accuracy** and **F1 Score**:
1080
+ - Charts display **Accuracy** only for clarity
1081
+ - Summary statistics tables show both **Avg Accuracy** and **Avg F1**
1082
+ - Tables include **Rank** with 🥇🥈🥉 medals for top 3 performers
1083
+
1084
+ ### Visualization Features
1085
+
1086
+ - **Interactive Charts**: Click a legend item to isolate that entry, double-click to toggle its visibility
1087
+ - **Color Scheme**: Distinct color palette for optimal differentiation between entries
1088
+ - **Top N Filter**: Dynamic slider to select how many top performers to display (1-10)
1089
+ - **Hover Details**: Hover over data points for detailed performance information
1090
+ - **Zoom & Pan**: Use chart controls to explore data interactively
1091
 
 
1092
  ---
1093
+
1094
+ **Paper:** [https://arxiv.org/abs/2602.22769](https://arxiv.org/abs/2602.22769)
1095
+
1096
  *For questions or submissions, please open a discussion in the Community tab.*
1097
  """)
1098
 
assets/model_colors.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "comment": "Color scheme for AMA-Bench leaderboard visualizations",
3
+ "models": {
4
+ "Claude Haiku 3.5": "#4A90E2",
5
+ "GPT-5-mini": "#00BFA5",
6
+ "GPT 5.2": "#00796B",
7
+ "Gemini 2.5 Flash": "#FF4081",
8
+ "Qwen2.5-14B-1M": "#FFC107",
9
+ "Qwen3-32B": "#FFB300",
10
+ "Qwen3-14B": "#FFA000",
11
+ "Qwen3-8B": "#FF8F00"
12
+ },
13
+ "methods": {
14
+ "BM25": "#9E9E9E",
15
+ "Qwen3-Emb-4B": "#FFA726",
16
+ "GraphRAG": "#FF7043",
17
+ "HippoRAG2": "#FF5722",
18
+ "MemAgent": "#7E57C2",
19
+ "Mem1": "#5E35B1",
20
+ "Amem": "#673AB7",
21
+ "Mem0": "#512DA8",
22
+ "MemoRAG": "#4527A0",
23
+ "MemGPT": "#311B92",
24
+ "Mem-alpha": "#6A1B9A",
25
+ "MemoryBank": "#8E24AA",
26
+ "Simple Mem": "#9C27B0",
27
+ "AMA Agent": "#00897B"
28
+ },
29
+ "fallback": "#808080"
30
+ }
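
Review note: the names in `assets/model_colors.json` ("GraphRAG", "HippoRAG2", "MemoryBank") differ in case from the keys used in `data/agent_capability.json` ("GRAPHRAG", "Hipporag2", "Memorybank"), and this diff does not show where the file is read. A minimal sketch of a case-insensitive lookup with the `fallback` color follows. The helper names are hypothetical, and the inline JSON is a trimmed copy of the file above.

```python
import json

# Trimmed copy of assets/model_colors.json, inlined to keep the sketch runnable.
PALETTE_JSON = """
{
  "models": {"Qwen3-32B": "#FFB300"},
  "methods": {"GraphRAG": "#FF7043", "HippoRAG2": "#FF5722"},
  "fallback": "#808080"
}
"""


def build_color_index(palette: dict) -> dict:
    """Map lower-cased names from both sections to their hex colors."""
    index = {}
    for section in ("models", "methods"):
        for name, hex_color in palette.get(section, {}).items():
            index[name.lower()] = hex_color
    return index


def hex_to_rgba(hex_color: str, alpha: float) -> str:
    """Convert '#RRGGBB' into the 'rgba(r, g, b, a)' form used by the charts."""
    digits = hex_color.lstrip("#")
    red, green, blue = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({red}, {green}, {blue}, {alpha})"


def color_for(name: str, palette: dict, alpha: float = 0.5) -> str:
    """Look up a color by name, ignoring case, else use the fallback."""
    index = build_color_index(palette)
    return hex_to_rgba(index.get(name.lower(), palette["fallback"]), alpha)


palette = json.loads(PALETTE_JSON)
graph_rag_color = color_for("GRAPHRAG", palette)
```

Names that differ by more than case, such as "Qwen3-Embedding-4B" in the data versus "Qwen3-Emb-4B" in the palette, would still fall through to the fallback color under this approach.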
content.py ADDED
@@ -0,0 +1,56 @@
1
+ TITLE = """<h1 align="center" id="space-title">AMA-Bench Leaderboard</h1>"""
2
+
3
+ INTRODUCTION_TEXT = """
4
+ AMA-Bench evaluates the memory capabilities of LLMs and memory-augmented agents across four cognitive dimensions:
5
+ **Recall** (retrieving stored information), **Causal Inference** (cause-and-effect reasoning), **State Updating** (tracking evolving states), and **State Abstraction** (forming higher-level representations).
6
+
7
+
8
+ ## Leaderboard
9
+ Our leaderboard reports results on the multiple-choice subset, which allows objective, easy-to-score evaluation.
10
+ See below for submission details.
11
+ """
12
+
13
+ SUBMISSION_TEXT = """
14
+ ## Submissions
15
+ Results can be submitted for evaluation. Each submission should contain answers for all questions in the benchmark.
16
+
17
+ We expect submissions to be JSON Lines files with the following format:
18
+ ```
19
+ {"episode_id": "traj_id_1", "answer_list": ["(A)", "(B)(C)", "(D)"], "reasoning_trace": "optional"}
20
+ ```
21
+
22
+ **Required fields:**
23
+ - `episode_id`: The episode identifier
24
+ - `answer_list`: Your model's answer list for the questions in the episode (a list of strings, e.g., ["(A)", "(B)(C)", "(D)"])
25
+ - `reasoning_trace`: (Optional) The reasoning process or explanation for the answers
26
+ """
27
+
28
+ CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
29
+ CITATION_BUTTON_TEXT = r"""@misc{ama-bench,
30
+ title={AMA-Bench: Agent Memory Assessment Benchmark},
31
+ author={AMA-Bench Team},
32
+ year={2024}
33
+ }"""
34
+
35
+
36
+ def format_error(msg):
37
+ """Format error message with red styling."""
38
+ return f"<p style='color: red; font-size: 20px; text-align: center;'>{msg}</p>"
39
+
40
+
41
+ def format_warning(msg):
42
+ """Format warning message with orange styling."""
43
+ return f"<p style='color: orange; font-size: 20px; text-align: center;'>{msg}</p>"
44
+
45
+
46
+ def format_log(msg):
47
+ """Format success message with green styling."""
48
+ return f"<p style='color: green; font-size: 20px; text-align: center;'>{msg}</p>"
49
+
50
+
51
+ def model_hyperlink(link, model_name):
52
+ """Create a hyperlink to the model information."""
53
+ if not link or link.strip() == "":
54
+ return model_name
55
+ return f'<a target="_blank" href="{link}" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">{model_name}</a>'
56
+
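
Review note: `SUBMISSION_TEXT` describes one JSON object per episode with an `answer_list` of strings such as `"(A)"` or `"(B)(C)"`, while the Submit tab in `app.py` documents a per-question `answer` field. Which schema the scorer expects is not visible in this diff. The sketch below validates a single line against the `answer_list` format only. The function name, the strict pattern, and the returned shape are assumptions for illustration.

```python
import json
import re

# One or more parenthesised uppercase letters, e.g. "(A)" or "(B)(C)".
ANSWER_PATTERN = re.compile(r"(?:\([A-Z]\))+")


def parse_submission_line(line: str) -> dict:
    """Validate one JSONL line and split each answer into its choice letters."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")

    episode_id = record.get("episode_id")
    if not isinstance(episode_id, str) or not episode_id:
        raise ValueError("episode_id must be a non-empty string")

    answers = record.get("answer_list")
    if not isinstance(answers, list):
        raise ValueError("answer_list must be a list of strings")

    choices = []
    for answer in answers:
        if not isinstance(answer, str) or not ANSWER_PATTERN.fullmatch(answer):
            raise ValueError(f"unexpected answer format: {answer!r}")
        choices.append(re.findall(r"[A-Z]", answer))

    return {
        "episode_id": episode_id,
        "choices": choices,
        # reasoning_trace is optional in the documented format.
        "reasoning_trace": record.get("reasoning_trace"),
    }


sample = '{"episode_id": "traj_id_1", "answer_list": ["(A)", "(B)(C)", "(D)"]}'
parsed = parse_submission_line(sample)
```

Rejecting malformed lines before scoring would let the app return a specific message through `format_error` instead of a generic exception string.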
data/agent_capability.json ADDED
@@ -0,0 +1,270 @@
1
+ {
2
+ "Recall": {
3
+ "Qwen3-Embedding-4B": {
4
+ "accuracy": 0.47196666666666665,
5
+ "model_family": "Qwen3-32B",
6
+ "f1": 0.14795
7
+ },
8
+ "GRAPHRAG": {
9
+ "accuracy": 0.31029999999999996,
10
+ "model_family": "Qwen3-32B",
11
+ "f1": 0.28025
12
+ },
13
+ "Hipporag2": {
14
+ "accuracy": 0.4413833333333333,
15
+ "model_family": "Qwen3-32B",
16
+ "f1": 0.23165
17
+ },
18
+ "Memagent": {
19
+ "accuracy": 0.2511333333333334,
20
+ "model_family": "Qwen3-32B",
21
+ "f1": 0.13931666666666667
22
+ },
23
+ "Mem1": {
24
+ "accuracy": 0.12108333333333333,
25
+ "model_family": "Qwen3-32B",
26
+ "f1": 0.18071666666666666
27
+ },
28
+ "Amem": {
29
+ "accuracy": 0.29723333333333335,
30
+ "model_family": "Qwen3-32B",
31
+ "f1": 0.26671666666666666
32
+ },
33
+ "Mem0": {
34
+ "accuracy": 0.20451666666666668,
35
+ "model_family": "Qwen3-32B",
36
+ "f1": 0.24041666666666664
37
+ },
38
+ "Memorag": {
39
+ "accuracy": 0.44153333333333333,
40
+ "model_family": "Qwen3-32B",
41
+ "f1": 0.16653333333333334
42
+ },
43
+ "Memgpt": {
44
+ "accuracy": 0.32865,
45
+ "model_family": "Qwen3-32B",
46
+ "f1": 0.12778333333333333
47
+ },
48
+ "Mem-alpha": {
49
+ "accuracy": 0.28221666666666667,
50
+ "model_family": "Qwen3-32B",
51
+ "f1": 0.2279
52
+ },
53
+ "Memorybank": {
54
+ "accuracy": 0.32088333333333335,
55
+ "model_family": "Qwen3-32B",
56
+ "f1": 0.31371666666666664
57
+ },
58
+ "Simple mem": {
59
+ "accuracy": 0.18241666666666667,
60
+ "model_family": "Qwen3-32B",
61
+ "f1": 0.20383333333333334
62
+ },
63
+ "Long context": {
64
+ "accuracy": 0.6036833333333333,
65
+ "model_family": "Qwen3-32B",
66
+ "f1": 0.4152833333333333
67
+ }
68
+ },
69
+ "Casual Inference": {
70
+ "Qwen3-Embedding-4B": {
71
+ "accuracy": 0.48618333333333336,
72
+ "model_family": "Qwen3-32B",
73
+ "f1": 0.14101666666666665
74
+ },
75
+ "GRAPHRAG": {
76
+ "accuracy": 0.4079333333333333,
77
+ "model_family": "Qwen3-32B",
78
+ "f1": 0.27426666666666666
79
+ },
80
+ "Hipporag2": {
81
+ "accuracy": 0.4965,
82
+ "model_family": "Qwen3-32B",
83
+ "f1": 0.1859666666666667
84
+ },
85
+ "Memagent": {
86
+ "accuracy": 0.33666666666666667,
87
+ "model_family": "Qwen3-32B",
88
+ "f1": 0.14706666666666665
89
+ },
90
+ "Mem1": {
91
+ "accuracy": 0.1495,
92
+ "model_family": "Qwen3-32B",
93
+ "f1": 0.1698666666666667
94
+ },
95
+ "Amem": {
96
+ "accuracy": 0.37051666666666666,
97
+ "model_family": "Qwen3-32B",
98
+ "f1": 0.27376666666666666
99
+ },
100
+ "Mem0": {
101
+ "accuracy": 0.27725,
102
+ "model_family": "Qwen3-32B",
103
+ "f1": 0.24518333333333334
104
+ },
105
+ "Memorag": {
106
+ "accuracy": 0.5261,
107
+ "model_family": "Qwen3-32B",
108
+ "f1": 0.16540000000000002
109
+ },
110
+ "Memgpt": {
111
+ "accuracy": 0.4437333333333333,
112
+ "model_family": "Qwen3-32B",
113
+ "f1": 0.1383
114
+ },
115
+ "Mem-alpha": {
116
+ "accuracy": 0.4193166666666667,
117
+ "model_family": "Qwen3-32B",
118
+ "f1": 0.19181666666666666
119
+ },
120
+ "Memorybank": {
121
+ "accuracy": 0.42110000000000003,
122
+ "model_family": "Qwen3-32B",
123
+ "f1": 0.2900333333333333
124
+ },
125
+ "Simple mem": {
126
+ "accuracy": 0.18955,
127
+ "model_family": "Qwen3-32B",
128
+ "f1": 0.16668333333333332
129
+ },
130
+ "Long context": {
131
+ "accuracy": 0.5399999999999999,
132
+ "model_family": "Qwen3-32B",
133
+ "f1": 0.34326666666666666
134
+ }
135
+ },
136
+ "State Updating": {
137
+ "Qwen3-Embedding-4B": {
138
+ "accuracy": 0.3541,
139
+ "model_family": "Qwen3-32B",
140
+ "f1": 0.12335
141
+ },
142
+ "GRAPHRAG": {
143
+ "accuracy": 0.31843333333333335,
144
+ "model_family": "Qwen3-32B",
145
+ "f1": 0.2622666666666667
146
+ },
147
+ "Hipporag2": {
148
+ "accuracy": 0.43685,
149
+ "model_family": "Qwen3-32B",
150
+ "f1": 0.18171666666666667
151
+ },
152
+ "Memagent": {
153
+ "accuracy": 0.27918333333333334,
154
+ "model_family": "Qwen3-32B",
155
+ "f1": 0.13036666666666666
156
+ },
157
+ "Mem1": {
158
+ "accuracy": 0.12353333333333333,
159
+ "model_family": "Qwen3-32B",
160
+ "f1": 0.16081666666666666
161
+ },
162
+ "Amem": {
163
+ "accuracy": 0.30775,
164
+ "model_family": "Qwen3-32B",
165
+ "f1": 0.24678333333333335
166
+ },
167
+ "Mem0": {
168
+ "accuracy": 0.21891666666666665,
169
+ "model_family": "Qwen3-32B",
170
+ "f1": 0.22273333333333334
171
+ },
172
+ "Memorag": {
173
+ "accuracy": 0.4015666666666666,
174
+ "model_family": "Qwen3-32B",
175
+ "f1": 0.15636666666666668
176
+ },
177
+ "Memgpt": {
178
+ "accuracy": 0.291,
179
+ "model_family": "Qwen3-32B",
180
+ "f1": 0.1203
181
+ },
182
+ "Mem-alpha": {
183
+ "accuracy": 0.2964333333333333,
184
+ "model_family": "Qwen3-32B",
185
+ "f1": 0.19146666666666667
186
+ },
187
+ "Memorybank": {
188
+ "accuracy": 0.30411666666666665,
189
+ "model_family": "Qwen3-32B",
190
+ "f1": 0.26855
191
+ },
192
+ "Simple mem": {
193
+ "accuracy": 0.17581666666666665,
194
+ "model_family": "Qwen3-32B",
195
+ "f1": 0.16231666666666666
196
+ },
197
+ "Long context": {
198
+ "accuracy": 0.48335,
199
+ "model_family": "Qwen3-32B",
200
+ "f1": 0.3447166666666666
201
+ }
202
+ },
203
+ "State abstraction": {
204
+ "Qwen3-Embedding-4B": {
205
+ "accuracy": 0.3022666666666667,
206
+ "model_family": "Qwen3-32B",
207
+ "f1": 0.15885
208
+ },
209
+ "GRAPHRAG": {
210
+ "accuracy": 0.30451666666666666,
211
+ "model_family": "Qwen3-32B",
212
+ "f1": 0.25921666666666665
213
+ },
214
+ "Hipporag2": {
215
+ "accuracy": 0.36443333333333333,
216
+ "model_family": "Qwen3-32B",
217
+ "f1": 0.1758333333333333
218
+ },
219
+ "Memagent": {
220
+ "accuracy": 0.22045,
221
+ "model_family": "Qwen3-32B",
222
+ "f1": 0.16438333333333333
223
+ },
224
+ "Mem1": {
225
+ "accuracy": 0.11385,
226
+ "model_family": "Qwen3-32B",
227
+ "f1": 0.21061666666666667
228
+ },
229
+ "Amem": {
230
+ "accuracy": 0.29383333333333334,
231
+ "model_family": "Qwen3-32B",
232
+ "f1": 0.297
233
+ },
234
+ "Mem0": {
235
+ "accuracy": 0.15946666666666667,
236
+ "model_family": "Qwen3-32B",
237
+ "f1": 0.22685
238
+ },
239
+ "Memorag": {
240
+ "accuracy": 0.3564333333333334,
241
+ "model_family": "Qwen3-32B",
242
+ "f1": 0.205
243
+ },
244
+ "Memgpt": {
245
+ "accuracy": 0.2680166666666667,
246
+ "model_family": "Qwen3-32B",
247
+ "f1": 0.14603333333333332
248
+ },
249
+ "Mem-alpha": {
250
+ "accuracy": 0.22561666666666666,
251
+ "model_family": "Qwen3-32B",
252
+ "f1": 0.21555
253
+ },
254
+ "Memorybank": {
255
+ "accuracy": 0.3507166666666666,
256
+ "model_family": "Qwen3-32B",
257
+ "f1": 0.30448333333333333
258
+ },
259
+ "Simple mem": {
260
+ "accuracy": 0.14003333333333332,
261
+ "model_family": "Qwen3-32B",
262
+ "f1": 0.16598333333333334
263
+ },
264
+ "Long context": {
265
+ "accuracy": 0.37979999999999997,
266
+ "model_family": "Qwen3-32B",
267
+ "f1": 0.3152333333333333
268
+ }
269
+ }
270
+ }
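The capability file above nests scores as category → agent → `{accuracy, model_family, f1}`. A minimal sketch of turning that layout into a ranked summary (Rank | Agent | Avg Accuracy | Avg F1) follows. The function name `summarize`, the unweighted mean over categories, and the inline `sample` (a small subset of the values above) are assumptions for illustration, not the leaderboard's actual implementation.

```python
from statistics import mean

# Medal markers for the top three ranks, as described in UPDATES_v2.md.
MEDALS = {1: "🥇", 2: "🥈", 3: "🥉"}


def summarize(data):
    """Average each agent's accuracy and F1 over all categories, then rank by accuracy."""
    scores_by_agent = {}
    for agents in data.values():
        for name, scores in agents.items():
            entry = scores_by_agent.setdefault(name, {"accuracy": [], "f1": []})
            entry["accuracy"].append(scores["accuracy"])
            entry["f1"].append(scores["f1"])

    rows = [
        {
            "agent": name,
            "avg_accuracy": mean(vals["accuracy"]),
            "avg_f1": mean(vals["f1"]),
        }
        for name, vals in scores_by_agent.items()
    ]
    # Highest average accuracy first.
    rows.sort(key=lambda row: row["avg_accuracy"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
        row["medal"] = MEDALS.get(rank, "")
    return rows


# Hypothetical in-memory subset of data/agent_capability.json (values copied from the file).
sample = {
    "Recall": {
        "Memorag": {"accuracy": 0.44153333333333333, "model_family": "Qwen3-32B", "f1": 0.16653333333333334},
        "Memgpt": {"accuracy": 0.32865, "model_family": "Qwen3-32B", "f1": 0.12778333333333333},
        "Long context": {"accuracy": 0.6036833333333333, "model_family": "Qwen3-32B", "f1": 0.4152833333333333},
    },
    "Casual Inference": {
        "Memorag": {"accuracy": 0.5261, "model_family": "Qwen3-32B", "f1": 0.16540000000000002},
        "Memgpt": {"accuracy": 0.4437333333333333, "model_family": "Qwen3-32B", "f1": 0.1383},
        "Long context": {"accuracy": 0.5399999999999999, "model_family": "Qwen3-32B", "f1": 0.34326666666666666},
    },
}

rows = summarize(sample)
for row in rows:
    print(f'{row["medal"]} {row["rank"]} {row["agent"]} '
          f'{row["avg_accuracy"]:.2%} {row["avg_f1"]:.2%}')
```

To run this against the full file, `sample` could be replaced with `json.load(open("data/agent_capability.json"))`. Whether the published summary uses this exact averaging is not confirmed by the diff.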
data/agent_domain.json ADDED
@@ -0,0 +1,404 @@
1
+ {
2
+ "GAMING": {
3
+ "Qwen3-Embedding-4B": {
4
+ "accuracy": 0.5157,
5
+ "model_family": "Qwen3-32B",
6
+ "f1": 0.2195
7
+ },
8
+ "GRAPHRAG": {
9
+ "accuracy": 0.5595249999999999,
10
+ "model_family": "Qwen3-32B",
11
+ "f1": 0.288175
12
+ },
13
+ "Hipporag2": {
14
+ "accuracy": 0.60555,
15
+ "model_family": "Qwen3-32B",
16
+ "f1": 0.2273
17
+ },
18
+ "Memagent": {
19
+ "accuracy": 0.31775,
20
+ "model_family": "Qwen3-32B",
21
+ "f1": 0.22945
22
+ },
23
+ "Mem1": {
24
+ "accuracy": 0.225875,
25
+ "model_family": "Qwen3-32B",
26
+ "f1": 0.18155
27
+ },
28
+ "Amem": {
29
+ "accuracy": 0.4247,
30
+ "model_family": "Qwen3-32B",
31
+ "f1": 0.343125
32
+ },
33
+ "Mem0": {
34
+ "accuracy": 0.39085000000000003,
35
+ "model_family": "Qwen3-32B",
36
+ "f1": 0.346
37
+ },
38
+ "Memorag": {
39
+ "accuracy": 0.557625,
40
+ "model_family": "Qwen3-32B",
41
+ "f1": 0.257875
42
+ },
43
+ "Memgpt": {
44
+ "accuracy": 0.435425,
45
+ "model_family": "Qwen3-32B",
46
+ "f1": 0.318475
47
+ },
48
+ "Mem-alpha": {
49
+ "accuracy": 0.43895,
50
+ "model_family": "Qwen3-32B",
51
+ "f1": 0.319875
52
+ },
53
+ "Memorybank": {
54
+ "accuracy": 0.43885,
55
+ "model_family": "Qwen3-32B",
56
+ "f1": 0.325325
57
+ },
58
+ "Simple mem": {
59
+ "accuracy": 0.288775,
60
+ "model_family": "Qwen3-32B",
61
+ "f1": 0.163
62
+ },
63
+ "Long context": {
64
+ "accuracy": 0.5355,
65
+ "model_family": "Qwen3-32B",
66
+ "f1": 0.321775
67
+ }
68
+ },
69
+ "EMBODIED_AI": {
70
+ "Qwen3-Embedding-4B": {
71
+ "accuracy": 0.204325,
72
+ "model_family": "Qwen3-32B",
73
+ "f1": 0.1353
74
+ },
75
+ "GRAPHRAG": {
76
+ "accuracy": 0.1476,
77
+ "model_family": "Qwen3-32B",
78
+ "f1": 0.3799
79
+ },
80
+ "Hipporag2": {
81
+ "accuracy": 0.17627500000000002,
82
+ "model_family": "Qwen3-32B",
83
+ "f1": 0.181875
84
+ },
85
+ "Memagent": {
86
+ "accuracy": 0.10617499999999999,
87
+ "model_family": "Qwen3-32B",
88
+ "f1": 0.144975
89
+ },
90
+ "Mem1": {
91
+ "accuracy": 0.03355,
92
+ "model_family": "Qwen3-32B",
93
+ "f1": 0.22445
94
+ },
95
+ "Amem": {
96
+ "accuracy": 0.183975,
97
+ "model_family": "Qwen3-32B",
98
+ "f1": 0.3524
99
+ },
100
+ "Mem0": {
101
+ "accuracy": 0.11109999999999999,
102
+ "model_family": "Qwen3-32B",
103
+ "f1": 0.27005
104
+ },
105
+ "Memorag": {
106
+ "accuracy": 0.085425,
107
+ "model_family": "Qwen3-32B",
108
+ "f1": 0.17677500000000002
109
+ },
110
+ "Memgpt": {
111
+ "accuracy": 0.1122,
112
+ "model_family": "Qwen3-32B",
113
+ "f1": 0.10405
114
+ },
115
+ "Mem-alpha": {
116
+ "accuracy": 0.15515,
117
+ "model_family": "Qwen3-32B",
118
+ "f1": 0.23735
119
+ },
120
+ "Memorybank": {
121
+ "accuracy": 0.16025,
122
+ "model_family": "Qwen3-32B",
123
+ "f1": 0.426475
124
+ },
125
+ "Simple mem": {
126
+ "accuracy": 0.045975,
127
+ "model_family": "Qwen3-32B",
128
+ "f1": 0.2284
129
+ },
130
+ "Long context": {
131
+ "accuracy": 0.48185,
132
+ "model_family": "Qwen3-32B",
133
+ "f1": 0.56
134
+ }
135
+ },
136
+ "WEB": {
137
+ "Qwen3-Embedding-4B": {
138
+ "accuracy": 0.2872,
139
+ "model_family": "Qwen3-32B",
140
+ "f1": 0.08535000000000001
141
+ },
142
+ "GRAPHRAG": {
143
+ "accuracy": 0.420675,
144
+ "model_family": "Qwen3-32B",
145
+ "f1": 0.268075
146
+ },
147
+ "Hipporag2": {
148
+ "accuracy": 0.3761,
149
+ "model_family": "Qwen3-32B",
150
+ "f1": 0.120125
151
+ },
152
+ "Memagent": {
153
+ "accuracy": 0.263975,
154
+ "model_family": "Qwen3-32B",
155
+ "f1": 0.09065
156
+ },
157
+ "Mem1": {
158
+ "accuracy": 0.131275,
159
+ "model_family": "Qwen3-32B",
160
+ "f1": 0.1518
161
+ },
162
+ "Amem": {
163
+ "accuracy": 0.391525,
164
+ "model_family": "Qwen3-32B",
165
+ "f1": 0.2294
166
+ },
167
+ "Mem0": {
168
+ "accuracy": 0.2705,
169
+ "model_family": "Qwen3-32B",
170
+ "f1": 0.21675
171
+ },
172
+ "Memorag": {
173
+ "accuracy": 0.364975,
174
+ "model_family": "Qwen3-32B",
175
+ "f1": 0.108075
176
+ },
177
+ "Memgpt": {
178
+ "accuracy": 0.327975,
179
+ "model_family": "Qwen3-32B",
180
+ "f1": 0.07105
181
+ },
182
+ "Mem-alpha": {
183
+ "accuracy": 0.362925,
184
+ "model_family": "Qwen3-32B",
185
+ "f1": 0.15944999999999998
186
+ },
187
+ "Memorybank": {
188
+ "accuracy": 0.401775,
189
+ "model_family": "Qwen3-32B",
190
+ "f1": 0.23704999999999998
191
+ },
192
+ "Simple mem": {
193
+ "accuracy": 0.13974999999999999,
194
+ "model_family": "Qwen3-32B",
195
+ "f1": 0.1679
196
+ },
197
+ "Long context": {
198
+ "accuracy": 0.554275,
199
+ "model_family": "Qwen3-32B",
200
+ "f1": 0.348075
201
+ }
202
+ },
203
+ "TEXT2SQL": {
204
+ "Qwen3-Embedding-4B": {
205
+ "accuracy": 0.4164,
206
+ "model_family": "Qwen3-32B",
207
+ "f1": 0.249325
208
+ },
209
+ "GRAPHRAG": {
210
+ "accuracy": 0.21665,
211
+ "model_family": "Qwen3-32B",
212
+ "f1": 0.221675
213
+ },
214
+ "Hipporag2": {
215
+ "accuracy": 0.46267499999999995,
216
+ "model_family": "Qwen3-32B",
217
+ "f1": 0.26935
218
+ },
219
+ "Memagent": {
220
+ "accuracy": 0.245375,
221
+ "model_family": "Qwen3-32B",
222
+ "f1": 0.245375
223
+ },
224
+ "Mem1": {
225
+ "accuracy": 0.06465,
226
+ "model_family": "Qwen3-32B",
227
+ "f1": 0.19990000000000002
228
+ },
229
+ "Amem": {
230
+ "accuracy": 0.31405,
231
+ "model_family": "Qwen3-32B",
232
+ "f1": 0.289625
233
+ },
234
+ "Mem0": {
235
+ "accuracy": 0.1192,
236
+ "model_family": "Qwen3-32B",
237
+ "f1": 0.2326
238
+ },
239
+ "Memorag": {
240
+ "accuracy": 0.619,
241
+ "model_family": "Qwen3-32B",
242
+ "f1": 0.296475
243
+ },
244
+ "Memgpt": {
245
+ "accuracy": 0.206875,
246
+ "model_family": "Qwen3-32B",
247
+ "f1": 0.178975
248
+ },
249
+ "Mem-alpha": {
250
+ "accuracy": 0.30065,
251
+ "model_family": "Qwen3-32B",
252
+ "f1": 0.26505
253
+ },
254
+ "Memorybank": {
255
+ "accuracy": 0.23855,
256
+ "model_family": "Qwen3-32B",
257
+ "f1": 0.28355
258
+ },
259
+ "Simple mem": {
260
+ "accuracy": 0.192575,
261
+ "model_family": "Qwen3-32B",
262
+ "f1": 0.157225
263
+ },
264
+ "Long context": {
265
+ "accuracy": 0.456075,
266
+ "model_family": "Qwen3-32B",
267
+ "f1": 0.295275
268
+ }
269
+ },
270
+ "OPENWORLD_QA": {
271
+ "Qwen3-Embedding-4B": {
272
+ "accuracy": 0.399125,
273
+ "model_family": "Qwen3-32B",
274
+ "f1": 0.0837
275
+ },
276
+ "GRAPHRAG": {
277
+ "accuracy": 0.31845,
278
+ "model_family": "Qwen3-32B",
279
+ "f1": 0.22635
280
+ },
281
+ "Hipporag2": {
282
+ "accuracy": 0.45825,
283
+ "model_family": "Qwen3-32B",
284
+ "f1": 0.2362
285
+ },
286
+ "Memagent": {
287
+ "accuracy": 0.158225,
288
+ "model_family": "Qwen3-32B",
289
+ "f1": 0.0704
290
+ },
291
+ "Mem1": {
292
+ "accuracy": 0.12065000000000001,
293
+ "model_family": "Qwen3-32B",
294
+ "f1": 0.15005
295
+ },
296
+ "Amem": {
297
+ "accuracy": 0.29359999999999997,
298
+ "model_family": "Qwen3-32B",
299
+ "f1": 0.2079
300
+ },
301
+ "Mem0": {
302
+ "accuracy": 0.16197499999999998,
303
+ "model_family": "Qwen3-32B",
304
+ "f1": 0.1604
305
+ },
306
+ "Memorag": {
307
+ "accuracy": 0.411375,
308
+ "model_family": "Qwen3-32B",
309
+ "f1": 0.093675
310
+ },
311
+ "Memgpt": {
312
+ "accuracy": 0.3155,
313
+ "model_family": "Qwen3-32B",
314
+ "f1": 0.0595
315
+ },
316
+ "Mem-alpha": {
317
+ "accuracy": 0.2301,
318
+ "model_family": "Qwen3-32B",
319
+ "f1": 0.13345
320
+ },
321
+ "Memorybank": {
322
+ "accuracy": 0.3486,
323
+ "model_family": "Qwen3-32B",
324
+ "f1": 0.2519
325
+ },
326
+ "Simple mem": {
327
+ "accuracy": 0.12154999999999999,
328
+ "model_family": "Qwen3-32B",
329
+ "f1": 0.1312
330
+ },
331
+ "Long context": {
332
+ "accuracy": 0.49785,
333
+ "model_family": "Qwen3-32B",
334
+ "f1": 0.3349
335
+ }
336
+ },
337
+ "SOFTWARE": {
338
+ "Qwen3-Embedding-4B": {
339
+ "accuracy": 0.599025,
340
+ "model_family": "Qwen3-32B",
341
+ "f1": 0.083575
342
+ },
343
+ "GRAPHRAG": {
344
+ "accuracy": 0.348875,
345
+ "model_family": "Qwen3-32B",
346
+ "f1": 0.229825
347
+ },
348
+ "Hipporag2": {
349
+ "accuracy": 0.5299,
350
+ "model_family": "Qwen3-32B",
351
+ "f1": 0.1279
352
+ },
353
+ "Memagent": {
354
+ "accuracy": 0.53965,
355
+ "model_family": "Qwen3-32B",
356
+ "f1": 0.09085
357
+ },
358
+ "Mem1": {
359
+ "accuracy": 0.18595,
360
+ "model_family": "Qwen3-32B",
361
+ "f1": 0.17527500000000001
362
+ },
363
+ "Amem": {
364
+ "accuracy": 0.29615,
365
+ "model_family": "Qwen3-32B",
366
+ "f1": 0.20395
367
+ },
368
+ "Mem0": {
369
+ "accuracy": 0.2366,
370
+ "model_family": "Qwen3-32B",
371
+ "f1": 0.176975
372
+ },
373
+ "Memorag": {
374
+ "accuracy": 0.55005,
375
+ "model_family": "Qwen3-32B",
376
+ "f1": 0.10707499999999999
377
+ },
378
+ "Memgpt": {
379
+ "accuracy": 0.599125,
380
+ "model_family": "Qwen3-32B",
381
+ "f1": 0.066575
382
+ },
383
+ "Mem-alpha": {
384
+ "accuracy": 0.3476,
385
+ "model_family": "Qwen3-32B",
386
+ "f1": 0.12492500000000001
387
+ },
388
+ "Memorybank": {
389
+ "accuracy": 0.5072,
390
+ "model_family": "Qwen3-32B",
391
+ "f1": 0.240875
392
+ },
393
+ "Simple mem": {
394
+ "accuracy": 0.2431,
395
+ "model_family": "Qwen3-32B",
396
+ "f1": 0.2005
397
+ },
398
+ "Long context": {
399
+ "accuracy": 0.4847,
400
+ "model_family": "Qwen3-32B",
401
+ "f1": 0.267725
402
+ }
403
+ }
404
+ }
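The domain file above uses the same nesting (domain → agent → scores), which fits the "Top N" slider described in UPDATES_v2.md (range 1-10, default 8). The sketch below shows one way to pick the top N agents for a single domain. The function name `top_n`, the clamping behaviour, and the inline `sample` (a subset of the `GAMING` entries above) are assumptions for illustration.

```python
def top_n(data, domain, n=8, metric="accuracy"):
    """Return the n best (agent, score) pairs for one domain, best first."""
    # Clamp n to the slider range described in the update notes (1-10).
    n = max(1, min(10, n))
    ranked = sorted(
        data[domain].items(),
        key=lambda item: item[1][metric],
        reverse=True,
    )
    return [(name, scores[metric]) for name, scores in ranked[:n]]


# Hypothetical in-memory subset of data/agent_domain.json (values copied from the file).
sample = {
    "GAMING": {
        "GRAPHRAG": {"accuracy": 0.5595249999999999, "model_family": "Qwen3-32B", "f1": 0.288175},
        "Hipporag2": {"accuracy": 0.60555, "model_family": "Qwen3-32B", "f1": 0.2273},
        "Mem1": {"accuracy": 0.225875, "model_family": "Qwen3-32B", "f1": 0.18155},
        "Memorag": {"accuracy": 0.557625, "model_family": "Qwen3-32B", "f1": 0.257875},
        "Long context": {"accuracy": 0.5355, "model_family": "Qwen3-32B", "f1": 0.321775},
    }
}

top_by_accuracy = top_n(sample, "GAMING", n=3)
top_by_f1 = top_n(sample, "GAMING", n=2, metric="f1")
print(top_by_accuracy)
print(top_by_f1)
```

Sorting per call keeps the helper stateless, so a slider callback can simply re-invoke it with the new `n`.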
data/method_data.json DELETED
@@ -1,160 +0,0 @@
1
- {
2
- "title": "Performance comparison of Agent Memory and RAG methods (base model: Qwen-32B) on real-world subset",
3
- "metrics": ["Recall", "Causal Inference", "State Updating", "State Abstraction", "Average"],
4
- "entries": [
5
- {
6
- "method": "BM25",
7
- "category": "RAG",
8
- "scores": {
9
- "Recall": {"accuracy": 0.3301, "f1": 0.1465},
10
- "Causal Inference": {"accuracy": 0.4264, "f1": 0.1549},
11
- "State Updating": {"accuracy": 0.3450, "f1": 0.1325},
12
- "State Abstraction": {"accuracy": 0.2498, "f1": 0.1623},
13
- "Average": {"accuracy": 0.3436, "f1": 0.1475}
14
- }
15
- },
16
- {
17
- "method": "Qwen3-Emb-4B",
18
- "category": "RAG",
19
- "scores": {
20
- "Recall": {"accuracy": 0.4843, "f1": 0.1590},
21
- "Causal Inference": {"accuracy": 0.4974, "f1": 0.1549},
22
- "State Updating": {"accuracy": 0.3520, "f1": 0.1353},
23
- "State Abstraction": {"accuracy": 0.3011, "f1": 0.1610},
24
- "Average": {"accuracy": 0.4227, "f1": 0.1522}
25
- }
26
- },
27
- {
28
- "method": "GraphRAG",
29
- "category": "RAG",
30
- "scores": {
31
- "Recall": {"accuracy": 0.3077, "f1": 0.2769},
32
- "Causal Inference": {"accuracy": 0.3905, "f1": 0.2634},
33
- "State Updating": {"accuracy": 0.3140, "f1": 0.2551},
34
- "State Abstraction": {"accuracy": 0.2879, "f1": 0.2588},
35
- "Average": {"accuracy": 0.3258, "f1": 0.2650}
36
- }
37
- },
38
- {
39
- "method": "HippoRAG2",
40
- "category": "RAG",
41
- "scores": {
42
- "Recall": {"accuracy": 0.4579, "f1": 0.2356},
43
- "Causal Inference": {"accuracy": 0.5080, "f1": 0.1966},
44
- "State Updating": {"accuracy": 0.4403, "f1": 0.1892},
45
- "State Abstraction": {"accuracy": 0.3538, "f1": 0.1785},
46
- "Average": {"accuracy": 0.4480, "f1": 0.2048}
47
- }
48
- },
49
- {
50
- "method": "MemAgent",
51
- "category": "Agent Memory",
52
- "scores": {
53
- "Recall": {"accuracy": 0.2550, "f1": 0.1489},
54
- "Causal Inference": {"accuracy": 0.3380, "f1": 0.1606},
55
- "State Updating": {"accuracy": 0.2849, "f1": 0.1432},
56
- "State Abstraction": {"accuracy": 0.2202, "f1": 0.1655},
57
- "Average": {"accuracy": 0.2768, "f1": 0.1530}
58
- }
59
- },
60
- {
61
- "method": "Mem1",
62
- "category": "Agent Memory",
63
- "scores": {
64
- "Recall": {"accuracy": 0.1180, "f1": 0.1857},
65
- "Causal Inference": {"accuracy": 0.1427, "f1": 0.1732},
66
- "State Updating": {"accuracy": 0.1205, "f1": 0.1659},
67
- "State Abstraction": {"accuracy": 0.1080, "f1": 0.2042},
68
- "Average": {"accuracy": 0.1229, "f1": 0.1807}
69
- }
70
- },
71
- {
72
- "method": "Amem",
73
- "category": "Agent Memory",
74
- "scores": {
75
- "Recall": {"accuracy": 0.3084, "f1": 0.2707},
76
- "Causal Inference": {"accuracy": 0.3653, "f1": 0.2731},
77
- "State Updating": {"accuracy": 0.3088, "f1": 0.2480},
78
- "State Abstraction": {"accuracy": 0.2873, "f1": 0.2953},
79
- "Average": {"accuracy": 0.3186, "f1": 0.2695}
80
- }
81
- },
82
- {
83
- "method": "Mem0",
84
- "category": "Agent Memory",
85
- "scores": {
86
- "Recall": {"accuracy": 0.2011, "f1": 0.2413},
87
- "Causal Inference": {"accuracy": 0.2645, "f1": 0.2443},
88
- "State Updating": {"accuracy": 0.2101, "f1": 0.2225},
89
- "State Abstraction": {"accuracy": 0.1516, "f1": 0.2241},
90
- "Average": {"accuracy": 0.2104, "f1": 0.2343}
91
- }
92
- },
93
- {
94
- "method": "MemoRAG",
95
- "category": "Agent Memory",
96
- "scores": {
97
- "Recall": {"accuracy": 0.4708, "f1": 0.1789},
98
- "Causal Inference": {"accuracy": 0.5497, "f1": 0.1811},
99
- "State Updating": {"accuracy": 0.4257, "f1": 0.1713},
100
- "State Abstraction": {"accuracy": 0.3659, "f1": 0.2073},
101
- "Average": {"accuracy": 0.4606, "f1": 0.1822}
102
- }
103
- },
104
- {
105
- "method": "MemGPT",
106
- "category": "Agent Memory",
107
- "scores": {
108
- "Recall": {"accuracy": 0.3289, "f1": 0.1318},
109
- "Causal Inference": {"accuracy": 0.4404, "f1": 0.1475},
110
- "State Updating": {"accuracy": 0.2809, "f1": 0.1259},
111
- "State Abstraction": {"accuracy": 0.2526, "f1": 0.1431},
112
- "Average": {"accuracy": 0.3304, "f1": 0.1359}
113
- }
114
- },
115
- {
116
- "method": "Mem-alpha",
117
- "category": "Agent Memory",
118
- "scores": {
119
- "Recall": {"accuracy": 0.2876, "f1": 0.2325},
120
- "Causal Inference": {"accuracy": 0.4172, "f1": 0.1993},
121
- "State Updating": {"accuracy": 0.3064, "f1": 0.2000},
122
- "State Abstraction": {"accuracy": 0.2171, "f1": 0.2135},
123
- "Average": {"accuracy": 0.3117, "f1": 0.2130}
124
- }
125
- },
126
- {
127
- "method": "MemoryBank",
128
- "category": "Agent Memory",
129
- "scores": {
130
- "Recall": {"accuracy": 0.3231, "f1": 0.3128},
131
- "Causal Inference": {"accuracy": 0.4100, "f1": 0.2861},
132
- "State Updating": {"accuracy": 0.3006, "f1": 0.2678},
133
- "State Abstraction": {"accuracy": 0.3332, "f1": 0.3011},
134
- "Average": {"accuracy": 0.3397, "f1": 0.2928}
135
- }
136
- },
137
- {
138
- "method": "Simple Mem",
139
- "category": "Agent Memory",
140
- "scores": {
141
- "Recall": {"accuracy": 0.2012, "f1": 0.2039},
142
- "Causal Inference": {"accuracy": 0.1884, "f1": 0.1612},
143
- "State Updating": {"accuracy": 0.1764, "f1": 0.1594},
144
- "State Abstraction": {"accuracy": 0.1373, "f1": 0.1689},
145
- "Average": {"accuracy": 0.1811, "f1": 0.1764}
146
- }
147
- },
148
- {
149
- "method": "AMA Agent",
150
- "category": "Agent Memory",
151
- "scores": {
152
- "Recall": {"accuracy": 0.6238, "f1": 0.3280},
153
- "Causal Inference": {"accuracy": 0.6145, "f1": 0.3103},
154
- "State Updating": {"accuracy": 0.5305, "f1": 0.2625},
155
- "State Abstraction": {"accuracy": 0.4719, "f1": 0.2825},
156
- "Average": {"accuracy": 0.5722, "f1": 0.2992}
157
- }
158
- }
159
- ]
160
- }
data/model_capability.json ADDED
@@ -0,0 +1,586 @@
1
+ {
2
+ "Recall": {
3
+ "Claude Haiku 3.5": {
4
+ "accuracy": 0.48456666666666665,
5
+ "f1": 0.35600000000000004
6
+ },
7
+ "OpenAI GPT-5.1 mini": {
8
+ "accuracy": 0.6773166666666667,
9
+ "f1": 0.397
10
+ },
11
+ "gpt 5.2": {
12
+ "accuracy": 0.7655,
13
+ "f1": 0.4805333333333333
14
+ },
15
+ "Gemini 2.5 flash": {
16
+ "accuracy": 0.5763333333333334,
17
+ "f1": 0.3706
18
+ },
19
+ "Qwen2.5-14B-Instruct-1M": {
20
+ "accuracy": 0.5497833333333334,
21
+ "f1": 0.41873333333333335
22
+ },
23
+ "Qwen3-32B": {
24
+ "accuracy": 0.6036833333333333,
25
+ "f1": 0.4152833333333333
26
+ },
27
+ "Qwen3-14B": {
28
+ "accuracy": 0.5599999999999999,
29
+ "f1": 0.37024999999999997
30
+ },
31
+ "Qwen3-8B": {
32
+ "accuracy": 0.49710000000000004,
33
+ "f1": 0.3894333333333333
34
+ },
35
+ "BM25 (32B)": {
36
+ "accuracy": 0.3209,
37
+ "f1": 0.13673333333333335
38
+ },
39
+ "Qwen3-Embedding-4B (32B)": {
40
+ "accuracy": 0.47196666666666665,
41
+ "f1": 0.14795
42
+ },
43
+ "GRAPHRAG (32B)": {
44
+ "accuracy": 0.31029999999999996,
45
+ "f1": 0.28025
46
+ },
47
+ "Hipporag2 (32B)": {
48
+ "accuracy": 0.4413833333333333,
49
+ "f1": 0.23165
50
+ },
51
+ "Memagent (32B)": {
52
+ "accuracy": 0.2511333333333334,
53
+ "f1": 0.13931666666666667
54
+ },
55
+ "Mem1 (32B)": {
56
+ "accuracy": 0.12108333333333333,
57
+ "f1": 0.18071666666666666
58
+ },
59
+ "Amem (32B)": {
60
+ "accuracy": 0.29723333333333335,
61
+ "f1": 0.26671666666666666
62
+ },
63
+ "Mem0 (32B)": {
64
+ "accuracy": 0.20451666666666668,
65
+ "f1": 0.24041666666666664
66
+ },
67
+ "Memorag (32B)": {
68
+ "accuracy": 0.44153333333333333,
69
+ "f1": 0.16653333333333334
70
+ },
71
+ "Memgpt (32B)": {
72
+ "accuracy": 0.32865,
73
+ "f1": 0.12778333333333333
74
+ },
75
+ "Mem-alpha (32B)": {
76
+ "accuracy": 0.28221666666666667,
77
+ "f1": 0.2279
78
+ },
79
+ "Memorybank (32B)": {
80
+ "accuracy": 0.32088333333333335,
81
+ "f1": 0.31371666666666664
82
+ },
83
+ "Simple mem (32B)": {
84
+ "accuracy": 0.18241666666666667,
85
+ "f1": 0.20383333333333334
86
+ },
87
+ "AMA-agent (Ours) (32B)": {
88
+ "accuracy": 0.6319833333333333,
89
+ "f1": 0.32741666666666663
90
+ },
91
+ "BM25 (8B)": {
92
+ "accuracy": 0.3297666666666667,
93
+ "f1": 0.12873333333333334
94
+ },
95
+ "Qwen3-Embedding-4B (8B)": {
96
+ "accuracy": 0.4556166666666666,
97
+ "f1": 0.13745
98
+ },
99
+ "GRAPHRAG (8B)": {
100
+ "accuracy": 0.239,
101
+ "f1": 0.23536666666666664
102
+ },
103
+ "Hipporag2 (8B)": {
104
+ "accuracy": 0.34790000000000004,
105
+ "f1": 0.20298333333333332
106
+ },
107
+ "Memagent (8B)": {
108
+ "accuracy": 0.18251666666666666,
109
+ "f1": 0.13096666666666668
110
+ },
111
+ "Mem1 (8B)": {
112
+ "accuracy": 0.14309999999999998,
113
+ "f1": 0.14278333333333335
114
+ },
115
+ "Amem (8B)": {
116
+ "accuracy": 0.3001,
117
+ "f1": 0.25503333333333333
118
+ },
119
+ "Mem0 (8B)": {
120
+ "accuracy": 0.2809,
121
+ "f1": 0.23186666666666667
122
+ },
123
+ "Memgpt (8B)": {
124
+ "accuracy": 0.28455,
125
+ "f1": 0.11388333333333334
126
+ },
127
+ "Mem-alpha (8B)": {
128
+ "accuracy": 0.20241666666666666,
129
+ "f1": 0.21398333333333333
130
+ },
131
+ "Memorag (8B)": {
132
+ "accuracy": 0.37543333333333334,
133
+ "f1": 0.1662
134
+ },
135
+ "Memorybank (8B)": {
136
+ "accuracy": 0.23948333333333335,
137
+ "f1": 0.28055
138
+ },
139
+ "Simple mem (8B)": {
140
+ "accuracy": 0.17913333333333334,
141
+ "f1": 0.1920833333333333
142
+ },
143
+ "AMA-agent (Ours) (8B)": {
144
+ "accuracy": 0.60195,
145
+ "f1": 0.3065
146
+ }
147
+ },
148
+ "Casual Inference": {
149
+ "Claude Haiku 3.5": {
150
+ "accuracy": 0.4799333333333333,
151
+ "f1": 0.29278333333333334
152
+ },
153
+ "OpenAI GPT-5.1 mini": {
154
+ "accuracy": 0.7091333333333334,
155
+ "f1": 0.3001666666666667
156
+ },
157
+ "gpt 5.2": {
158
+ "accuracy": 0.7995166666666668,
159
+ "f1": 0.35365
160
+ },
161
+ "Gemini 2.5 flash": {
162
+ "accuracy": 0.49951666666666666,
163
+ "f1": 0.26445
164
+ },
165
+ "Qwen2.5-14B-Instruct-1M": {
166
+ "accuracy": 0.4305666666666667,
167
+ "f1": 0.3269
168
+ },
169
+ "Qwen3-32B": {
170
+ "accuracy": 0.5399999999999999,
171
+ "f1": 0.34326666666666666
172
+ },
173
+ "Qwen3-14B": {
174
+ "accuracy": 0.46735,
175
+ "f1": 0.3073
176
+ },
177
+ "Qwen3-8B": {
178
+ "accuracy": 0.39735000000000004,
179
+ "f1": 0.29578333333333334
180
+ },
181
+ "BM25 (32B)": {
182
+ "accuracy": 0.42081666666666667,
183
+ "f1": 0.14131666666666667
184
+ },
185
+ "Qwen3-Embedding-4B (32B)": {
186
+ "accuracy": 0.48618333333333336,
187
+ "f1": 0.14101666666666665
188
+ },
189
+ "GRAPHRAG (32B)": {
190
+ "accuracy": 0.4079333333333333,
191
+ "f1": 0.27426666666666666
192
+ },
193
+ "Hipporag2 (32B)": {
194
+ "accuracy": 0.4965,
195
+ "f1": 0.1859666666666667
196
+ },
197
+ "Memagent (32B)": {
198
+ "accuracy": 0.33666666666666667,
199
+ "f1": 0.14706666666666665
200
+ },
201
+ "Mem1 (32B)": {
202
+ "accuracy": 0.1495,
203
+ "f1": 0.1698666666666667
204
+ },
205
+ "Amem (32B)": {
206
+ "accuracy": 0.37051666666666666,
207
+ "f1": 0.27376666666666666
208
+ },
209
+ "Mem0 (32B)": {
210
+ "accuracy": 0.27725,
211
+ "f1": 0.24518333333333334
212
+ },
213
+ "Memorag (32B)": {
214
+ "accuracy": 0.5261,
215
+ "f1": 0.16540000000000002
216
+ },
217
+ "Memgpt (32B)": {
218
+ "accuracy": 0.4437333333333333,
219
+ "f1": 0.1383
220
+ },
221
+ "Mem-alpha (32B)": {
222
+ "accuracy": 0.4193166666666667,
223
+ "f1": 0.19181666666666666
224
+ },
225
+ "Memorybank (32B)": {
226
+ "accuracy": 0.42110000000000003,
227
+ "f1": 0.2900333333333333
228
+ },
229
+ "Simple mem (32B)": {
230
+ "accuracy": 0.18955,
231
+ "f1": 0.16668333333333332
232
+ },
233
+ "AMA-agent (Ours) (32B)": {
234
+ "accuracy": 0.6169833333333333,
235
+ "f1": 0.30663333333333337
236
+ },
237
+ "BM25 (8B)": {
238
+ "accuracy": 0.43721666666666664,
239
+ "f1": 0.13381666666666667
240
+ },
241
+ "Qwen3-Embedding-4B (8B)": {
242
+ "accuracy": 0.42788333333333334,
243
+ "f1": 0.13291666666666666
244
+ },
245
+ "GRAPHRAG (8B)": {
246
+ "accuracy": 0.26385,
247
+ "f1": 0.2061333333333333
248
+ },
249
+ "Hipporag2 (8B)": {
250
+ "accuracy": 0.44411666666666666,
251
+ "f1": 0.18869999999999998
252
+ },
253
+ "Memagent (8B)": {
254
+ "accuracy": 0.29035,
255
+ "f1": 0.13751666666666668
256
+ },
257
+ "Mem1 (8B)": {
258
+ "accuracy": 0.19256666666666666,
259
+ "f1": 0.14903333333333332
260
+ },
261
+ "Amem (8B)": {
262
+ "accuracy": 0.4492833333333333,
263
+ "f1": 0.26935
264
+ },
265
+ "Mem0 (8B)": {
266
+ "accuracy": 0.34385,
267
+ "f1": 0.22716666666666666
268
+ },
269
+ "Memgpt (8B)": {
270
+ "accuracy": 0.3446833333333333,
271
+ "f1": 0.12268333333333332
272
+ },
273
+ "Mem-alpha (8B)": {
274
+ "accuracy": 0.30363333333333337,
275
+ "f1": 0.18689999999999998
276
+ },
277
+ "Memorag (8B)": {
278
+ "accuracy": 0.46485,
279
+ "f1": 0.16515
280
+ },
281
+ "Memorybank (8B)": {
282
+ "accuracy": 0.32225,
283
+ "f1": 0.2800833333333333
284
+ },
285
+ "Simple mem (8B)": {
286
+ "accuracy": 0.22571666666666668,
287
+ "f1": 0.17606666666666668
288
+ },
289
+ "AMA-agent (Ours) (8B)": {
290
+ "accuracy": 0.4806166666666667,
291
+ "f1": 0.23224999999999998
292
+ }
293
+ },
294
+ "State Updating": {
295
+ "Claude Haiku 3.5": {
296
+ "accuracy": 0.4325666666666667,
297
+ "f1": 0.31329999999999997
298
+ },
299
+ "OpenAI GPT-5.1 mini": {
300
+ "accuracy": 0.6369,
301
+ "f1": 0.32348333333333334
302
+ },
303
+ "gpt 5.2": {
304
+ "accuracy": 0.6355666666666667,
305
+ "f1": 0.3697333333333333
306
+ },
307
+ "Gemini 2.5 flash": {
308
+ "accuracy": 0.4866,
309
+ "f1": 0.23691666666666666
310
+ },
311
+ "Qwen2.5-14B-Instruct-1M": {
312
+ "accuracy": 0.4663833333333333,
313
+ "f1": 0.33735
314
+ },
315
+ "Qwen3-32B": {
316
+ "accuracy": 0.48335,
317
+ "f1": 0.3447166666666666
318
+ },
319
+ "Qwen3-14B": {
320
+ "accuracy": 0.4473666666666667,
321
+ "f1": 0.33188333333333336
322
+ },
323
+ "Qwen3-8B": {
324
+ "accuracy": 0.39466666666666667,
325
+ "f1": 0.32993333333333336
326
+ },
327
+ "BM25 (32B)": {
328
+ "accuracy": 0.33854999999999996,
329
+ "f1": 0.12065
330
+ },
331
+ "Qwen3-Embedding-4B (32B)": {
332
+ "accuracy": 0.3541,
333
+ "f1": 0.12335
334
+ },
335
+ "GRAPHRAG (32B)": {
336
+ "accuracy": 0.31843333333333335,
337
+ "f1": 0.2622666666666667
338
+ },
339
+ "Hipporag2 (32B)": {
340
+ "accuracy": 0.43685,
341
+ "f1": 0.18171666666666667
342
+ },
343
+ "Memagent (32B)": {
344
+ "accuracy": 0.27918333333333334,
345
+ "f1": 0.13036666666666666
346
+ },
347
+ "Mem1 (32B)": {
348
+ "accuracy": 0.12353333333333333,
349
+ "f1": 0.16081666666666666
350
+ },
351
+ "Amem (32B)": {
352
+ "accuracy": 0.30775,
353
+ "f1": 0.24678333333333335
354
+ },
355
+ "Mem0 (32B)": {
356
+ "accuracy": 0.21891666666666665,
357
+ "f1": 0.22273333333333334
358
+ },
359
+ "Memorag (32B)": {
360
+ "accuracy": 0.4015666666666666,
361
+ "f1": 0.15636666666666668
362
+ },
363
+ "Memgpt (32B)": {
364
+ "accuracy": 0.291,
365
+ "f1": 0.1203
366
+ },
367
+ "Mem-alpha (32B)": {
368
+ "accuracy": 0.2964333333333333,
369
+ "f1": 0.19146666666666667
370
+ },
371
+ "Memorybank (32B)": {
372
+ "accuracy": 0.30411666666666665,
373
+ "f1": 0.26855
374
+ },
375
+ "Simple mem (32B)": {
376
+ "accuracy": 0.17581666666666665,
377
+ "f1": 0.16231666666666666
378
+ },
379
+ "AMA-agent (Ours) (32B)": {
380
+ "accuracy": 0.5138666666666667,
381
+ "f1": 0.25103333333333333
382
+ },
383
+ "BM25 (8B)": {
384
+ "accuracy": 0.3229666666666667,
385
+ "f1": 0.11235
386
+ },
387
+ "Qwen3-Embedding-4B (8B)": {
388
+ "accuracy": 0.34371666666666667,
389
+ "f1": 0.11576666666666667
390
+ },
391
+ "GRAPHRAG (8B)": {
392
+ "accuracy": 0.23753333333333335,
393
+ "f1": 0.22826666666666665
394
+ },
395
+ "Hipporag2 (8B)": {
396
+ "accuracy": 0.40763333333333335,
397
+ "f1": 0.18355
398
+ },
399
+ "Memagent (8B)": {
400
+ "accuracy": 0.2063,
401
+ "f1": 0.1215
402
+ },
403
+ "Mem1 (8B)": {
404
+ "accuracy": 0.12731666666666666,
405
+ "f1": 0.13308333333333333
406
+ },
407
+ "Amem (8B)": {
408
+ "accuracy": 0.3300666666666667,
409
+ "f1": 0.23895
410
+ },
411
+ "Mem0 (8B)": {
412
+ "accuracy": 0.24305,
413
+ "f1": 0.20679999999999998
414
+ },
415
+ "Memgpt (8B)": {
416
+ "accuracy": 0.24914999999999998,
417
+ "f1": 0.11001666666666667
418
+ },
419
+ "Mem-alpha (8B)": {
420
+ "accuracy": 0.2172666666666667,
421
+ "f1": 0.18433333333333332
422
+ },
423
+ "Memorag (8B)": {
424
+ "accuracy": 0.3682666666666667,
425
+ "f1": 0.14901666666666666
426
+ },
427
+ "Memorybank (8B)": {
428
+ "accuracy": 0.22931666666666664,
429
+ "f1": 0.25906666666666667
430
+ },
431
+ "Simple mem (8B)": {
432
+ "accuracy": 0.17063333333333333,
433
+ "f1": 0.17784999999999998
434
+ },
435
+ "AMA-agent (Ours) (8B)": {
436
+ "accuracy": 0.43645,
437
+ "f1": 0.21893333333333334
438
+ }
439
+ },
440
+ "State abstraction": {
441
+ "Claude Haiku 3.5": {
442
+ "accuracy": 0.32758333333333334,
443
+ "f1": 0.2684166666666667
444
+ },
445
+ "OpenAI GPT-5.1 mini": {
446
+ "accuracy": 0.6024333333333333,
447
+ "f1": 0.31545
448
+ },
449
+ "gpt 5.2": {
450
+ "accuracy": 0.59255,
451
+ "f1": 0.34695000000000004
452
+ },
453
+ "Gemini 2.5 flash": {
454
+ "accuracy": 0.40641666666666665,
455
+ "f1": 0.2329
456
+ },
457
+ "Qwen2.5-14B-Instruct-1M": {
458
+ "accuracy": 0.3559666666666667,
459
+ "f1": 0.3595
460
+ },
461
+ "Qwen3-32B": {
462
+ "accuracy": 0.37979999999999997,
463
+ "f1": 0.3152333333333333
464
+ },
465
+ "Qwen3-14B": {
466
+ "accuracy": 0.33476666666666666,
467
+ "f1": 0.2716
468
+ },
469
+ "Qwen3-8B": {
470
+ "accuracy": 0.3063166666666667,
471
+ "f1": 0.27915
472
+ },
473
+ "BM25 (32B)": {
474
+ "accuracy": 0.25508333333333333,
475
+ "f1": 0.16045
476
+ },
477
+ "Qwen3-Embedding-4B (32B)": {
478
+ "accuracy": 0.3022666666666667,
479
+ "f1": 0.15885
480
+ },
481
+ "GRAPHRAG (32B)": {
482
+ "accuracy": 0.30451666666666666,
483
+ "f1": 0.25921666666666665
484
+ },
485
+ "Hipporag2 (32B)": {
486
+ "accuracy": 0.36443333333333333,
487
+ "f1": 0.1758333333333333
488
+ },
489
+ "Memagent (32B)": {
490
+ "accuracy": 0.22045,
491
+ "f1": 0.16438333333333333
492
+ },
493
+ "Mem1 (32B)": {
494
+ "accuracy": 0.11385,
495
+ "f1": 0.21061666666666667
496
+ },
497
+ "Amem (32B)": {
498
+ "accuracy": 0.29383333333333334,
499
+ "f1": 0.297
500
+ },
501
+ "Mem0 (32B)": {
502
+ "accuracy": 0.15946666666666667,
503
+ "f1": 0.22685
504
+ },
505
+ "Memorag (32B)": {
506
+ "accuracy": 0.3564333333333334,
507
+ "f1": 0.205
508
+ },
509
+ "Memgpt (32B)": {
510
+ "accuracy": 0.2680166666666667,
511
+ "f1": 0.14603333333333332
512
+ },
513
+ "Mem-alpha (32B)": {
514
+ "accuracy": 0.22561666666666666,
515
+ "f1": 0.21555
516
+ },
517
+ "Memorybank (32B)": {
518
+ "accuracy": 0.3507166666666666,
519
+ "f1": 0.30448333333333333
520
+ },
521
+ "Simple mem (32B)": {
522
+ "accuracy": 0.14003333333333332,
523
+ "f1": 0.16598333333333334
524
+ },
525
+ "AMA-agent (Ours) (32B)": {
526
+ "accuracy": 0.4688666666666667,
527
+ "f1": 0.2747
528
+ },
529
+ "BM25 (8B)": {
530
+ "accuracy": 0.27895,
531
+ "f1": 0.14856666666666665
532
+ },
533
+ "Qwen3-Embedding-4B (8B)": {
534
+ "accuracy": 0.2748333333333333,
535
+ "f1": 0.14676666666666668
536
+ },
537
+ "GRAPHRAG (8B)": {
538
+ "accuracy": 0.22055,
539
+ "f1": 0.19723333333333334
540
+ },
541
+ "Hipporag2 (8B)": {
542
+ "accuracy": 0.292,
543
+ "f1": 0.17103333333333334
544
+ },
545
+ "Memagent (8B)": {
546
+ "accuracy": 0.14305,
547
+ "f1": 0.15775
548
+ },
549
+ "Mem1 (8B)": {
550
+ "accuracy": 0.1189,
551
+ "f1": 0.1691666666666667
552
+ },
553
+ "Amem (8B)": {
554
+ "accuracy": 0.31046666666666667,
555
+ "f1": 0.25876666666666664
556
+ },
557
+ "Mem0 (8B)": {
558
+ "accuracy": 0.2598,
559
+ "f1": 0.19686666666666666
560
+ },
561
+ "Memgpt (8B)": {
562
+ "accuracy": 0.24563333333333334,
563
+ "f1": 0.11535
564
+ },
565
+ "Mem-alpha (8B)": {
566
+ "accuracy": 0.20698333333333332,
567
+ "f1": 0.21046666666666666
568
+ },
569
+ "Memorag (8B)": {
570
+ "accuracy": 0.32411666666666666,
571
+ "f1": 0.1984
572
+ },
573
+ "Memorybank (8B)": {
574
+ "accuracy": 0.32095,
575
+ "f1": 0.28145
576
+ },
577
+ "Simple mem (8B)": {
578
+ "accuracy": 0.17876666666666666,
579
+ "f1": 0.15215
580
+ },
581
+ "AMA-agent (Ours) (8B)": {
582
+ "accuracy": 0.37873333333333337,
583
+ "f1": 0.21493333333333334
584
+ }
585
+ }
586
+ }
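
The capability file above nests scores as capability, then method, then `accuracy`/`f1`. The following is a minimal sketch of how a ranked summary could be derived from that shape, assuming an unweighted mean across capabilities. `rank_methods` is a hypothetical helper, not part of this commit; the sample holds three entries copied from the "State abstraction" block.

```python
from statistics import mean


def rank_methods(capability_scores, metric="accuracy"):
    """Average `metric` per method over all capabilities, best first."""
    per_method = {}
    for methods in capability_scores.values():
        for name, scores in methods.items():
            per_method.setdefault(name, []).append(scores[metric])
    averaged = {name: mean(values) for name, values in per_method.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Excerpt in the same shape as the file: {capability: {method: {metric: value}}}
sample = {
    "State abstraction": {
        "gpt 5.2": {"accuracy": 0.59255, "f1": 0.34695000000000004},
        "Qwen3-32B": {"accuracy": 0.37979999999999997, "f1": 0.3152333333333333},
        "AMA-agent (Ours) (32B)": {"accuracy": 0.4688666666666667, "f1": 0.2747},
    }
}

ranking = rank_methods(sample)
for rank, (name, accuracy) in enumerate(ranking, start=1):
    print(f"{rank}  {name}  {accuracy:.2%}")
```

Passing `metric="f1"` ranks by F1 instead; with more capabilities in the input, each method's values are averaged before sorting.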
data/model_data.json DELETED
@@ -1,94 +0,0 @@
1
- {
2
- "title": "Performance of different models on real-world subset",
3
- "metrics": ["Recall", "Causal Inference", "State Updating", "State Abstraction", "Average"],
4
- "entries": [
5
- {
6
- "method": "Claude Haiku 3.5",
7
- "category": null,
8
- "scores": {
9
- "Recall": {"accuracy": 0.4943, "f1": 0.3510},
10
- "Causal Inference": {"accuracy": 0.4507, "f1": 0.2792},
11
- "State Updating": {"accuracy": 0.4287, "f1": 0.3015},
12
- "State Abstraction": {"accuracy": 0.3090, "f1": 0.2648},
13
- "Average": {"accuracy": 0.4361, "f1": 0.3067}
14
- }
15
- },
16
- {
17
- "method": "GPT-5-mini",
18
- "category": null,
19
- "scores": {
20
- "Recall": {"accuracy": 0.6951, "f1": 0.4010},
21
- "Causal Inference": {"accuracy": 0.7157, "f1": 0.3027},
22
- "State Updating": {"accuracy": 0.6575, "f1": 0.3288},
23
- "State Abstraction": {"accuracy": 0.6235, "f1": 0.3262},
24
- "Average": {"accuracy": 0.6784, "f1": 0.3464}
25
- }
26
- },
27
- {
28
- "method": "GPT 5.2",
29
- "category": null,
30
- "scores": {
31
- "Recall": {"accuracy": 0.7741, "f1": 0.4758},
32
- "Causal Inference": {"accuracy": 0.8047, "f1": 0.3512},
33
- "State Updating": {"accuracy": 0.6563, "f1": 0.3686},
34
- "State Abstraction": {"accuracy": 0.6037, "f1": 0.3582},
35
- "Average": {"accuracy": 0.7226, "f1": 0.3988}
36
- }
37
- },
38
- {
39
- "method": "Gemini 2.5 Flash",
40
- "category": null,
41
- "scores": {
42
- "Recall": {"accuracy": 0.5834, "f1": 0.3682},
43
- "Causal Inference": {"accuracy": 0.5087, "f1": 0.2628},
44
- "State Updating": {"accuracy": 0.5000, "f1": 0.2395},
45
- "State Abstraction": {"accuracy": 0.4196, "f1": 0.2361},
46
- "Average": {"accuracy": 0.5168, "f1": 0.2878}
47
- }
48
- },
49
- {
50
- "method": "Qwen2.5-14B-1M",
51
- "category": null,
52
- "scores": {
53
- "Recall": {"accuracy": 0.5570, "f1": 0.4157},
54
- "Causal Inference": {"accuracy": 0.4111, "f1": 0.3209},
55
- "State Updating": {"accuracy": 0.4728, "f1": 0.3348},
56
- "State Abstraction": {"accuracy": 0.3368, "f1": 0.3560},
57
- "Average": {"accuracy": 0.4638, "f1": 0.3622}
58
- }
59
- },
60
- {
61
- "method": "Qwen3-32B",
62
- "category": null,
63
- "scores": {
64
- "Recall": {"accuracy": 0.6149, "f1": 0.4074},
65
- "Causal Inference": {"accuracy": 0.5178, "f1": 0.3289},
66
- "State Updating": {"accuracy": 0.4903, "f1": 0.3334},
67
- "State Abstraction": {"accuracy": 0.3657, "f1": 0.3172},
68
- "Average": {"accuracy": 0.5181, "f1": 0.3545}
69
- }
70
- },
71
- {
72
- "method": "Qwen3-14B",
73
- "category": null,
74
- "scores": {
75
- "Recall": {"accuracy": 0.5675, "f1": 0.3636},
76
- "Causal Inference": {"accuracy": 0.4430, "f1": 0.2931},
77
- "State Updating": {"accuracy": 0.4502, "f1": 0.3204},
78
- "State Abstraction": {"accuracy": 0.3176, "f1": 0.2716},
79
- "Average": {"accuracy": 0.4659, "f1": 0.3203}
80
- }
81
- },
82
- {
83
- "method": "Qwen3-8B",
84
- "category": null,
85
- "scores": {
86
- "Recall": {"accuracy": 0.5024, "f1": 0.3801},
87
- "Causal Inference": {"accuracy": 0.3776, "f1": 0.2830},
88
- "State Updating": {"accuracy": 0.3987, "f1": 0.3177},
89
- "State Abstraction": {"accuracy": 0.2923, "f1": 0.2792},
90
- "Average": {"accuracy": 0.4109, "f1": 0.3240}
91
- }
92
- }
93
- ]
94
- }
data/model_domain.json ADDED
@@ -0,0 +1,404 @@
1
+ {
2
+ "GAMING": {
3
+ "Qwen3-Embedding-4B": {
4
+ "accuracy": 0.5157,
5
+ "model_family": "Qwen3-32B",
6
+ "f1": 0.2195
7
+ },
8
+ "GRAPHRAG": {
9
+ "accuracy": 0.5595249999999999,
10
+ "model_family": "Qwen3-32B",
11
+ "f1": 0.288175
12
+ },
13
+ "Hipporag2": {
14
+ "accuracy": 0.60555,
15
+ "model_family": "Qwen3-32B",
16
+ "f1": 0.2273
17
+ },
18
+ "Memagent": {
19
+ "accuracy": 0.31775,
20
+ "model_family": "Qwen3-32B",
21
+ "f1": 0.22945
22
+ },
23
+ "Mem1": {
24
+ "accuracy": 0.225875,
25
+ "model_family": "Qwen3-32B",
26
+ "f1": 0.18155
27
+ },
28
+ "Amem": {
29
+ "accuracy": 0.4247,
30
+ "model_family": "Qwen3-32B",
31
+ "f1": 0.343125
32
+ },
33
+ "Mem0": {
34
+ "accuracy": 0.39085000000000003,
35
+ "model_family": "Qwen3-32B",
36
+ "f1": 0.346
37
+ },
38
+ "Memorag": {
39
+ "accuracy": 0.557625,
40
+ "model_family": "Qwen3-32B",
41
+ "f1": 0.257875
42
+ },
43
+ "Memgpt": {
44
+ "accuracy": 0.435425,
45
+ "model_family": "Qwen3-32B",
46
+ "f1": 0.318475
47
+ },
48
+ "Mem-alpha": {
49
+ "accuracy": 0.43895,
50
+ "model_family": "Qwen3-32B",
51
+ "f1": 0.319875
52
+ },
53
+ "Memorybank": {
54
+ "accuracy": 0.43885,
55
+ "model_family": "Qwen3-32B",
56
+ "f1": 0.325325
57
+ },
58
+ "Simple mem": {
59
+ "accuracy": 0.288775,
60
+ "model_family": "Qwen3-32B",
61
+ "f1": 0.163
62
+ },
63
+ "Long context": {
64
+ "accuracy": 0.5355,
65
+ "model_family": "Qwen3-32B",
66
+ "f1": 0.321775
67
+ }
68
+ },
69
+ "EMBODIED_AI": {
70
+ "Qwen3-Embedding-4B": {
71
+ "accuracy": 0.204325,
72
+ "model_family": "Qwen3-32B",
73
+ "f1": 0.1353
74
+ },
75
+ "GRAPHRAG": {
76
+ "accuracy": 0.1476,
77
+ "model_family": "Qwen3-32B",
78
+ "f1": 0.3799
79
+ },
80
+ "Hipporag2": {
81
+ "accuracy": 0.17627500000000002,
82
+ "model_family": "Qwen3-32B",
83
+ "f1": 0.181875
84
+ },
85
+ "Memagent": {
86
+ "accuracy": 0.10617499999999999,
87
+ "model_family": "Qwen3-32B",
88
+ "f1": 0.144975
89
+ },
90
+ "Mem1": {
91
+ "accuracy": 0.03355,
92
+ "model_family": "Qwen3-32B",
93
+ "f1": 0.22445
94
+ },
95
+ "Amem": {
96
+ "accuracy": 0.183975,
97
+ "model_family": "Qwen3-32B",
98
+ "f1": 0.3524
99
+ },
100
+ "Mem0": {
101
+ "accuracy": 0.11109999999999999,
102
+ "model_family": "Qwen3-32B",
103
+ "f1": 0.27005
104
+ },
105
+ "Memorag": {
106
+ "accuracy": 0.085425,
107
+ "model_family": "Qwen3-32B",
108
+ "f1": 0.17677500000000002
109
+ },
110
+ "Memgpt": {
111
+ "accuracy": 0.1122,
112
+ "model_family": "Qwen3-32B",
113
+ "f1": 0.10405
114
+ },
115
+ "Mem-alpha": {
116
+ "accuracy": 0.15515,
117
+ "model_family": "Qwen3-32B",
118
+ "f1": 0.23735
119
+ },
120
+ "Memorybank": {
121
+ "accuracy": 0.16025,
122
+ "model_family": "Qwen3-32B",
123
+ "f1": 0.426475
124
+ },
125
+ "Simple mem": {
126
+ "accuracy": 0.045975,
127
+ "model_family": "Qwen3-32B",
128
+ "f1": 0.2284
129
+ },
130
+ "Long context": {
131
+ "accuracy": 0.48185,
132
+ "model_family": "Qwen3-32B",
133
+ "f1": 0.56
134
+ }
135
+ },
136
+ "WEB": {
137
+ "Qwen3-Embedding-4B": {
138
+ "accuracy": 0.2872,
139
+ "model_family": "Qwen3-32B",
140
+ "f1": 0.08535000000000001
141
+ },
142
+ "GRAPHRAG": {
143
+ "accuracy": 0.420675,
144
+ "model_family": "Qwen3-32B",
145
+ "f1": 0.268075
146
+ },
147
+ "Hipporag2": {
148
+ "accuracy": 0.3761,
149
+ "model_family": "Qwen3-32B",
150
+ "f1": 0.120125
151
+ },
152
+ "Memagent": {
153
+ "accuracy": 0.263975,
154
+ "model_family": "Qwen3-32B",
155
+ "f1": 0.09065
156
+ },
157
+ "Mem1": {
158
+ "accuracy": 0.131275,
159
+ "model_family": "Qwen3-32B",
160
+ "f1": 0.1518
161
+ },
162
+ "Amem": {
163
+ "accuracy": 0.391525,
164
+ "model_family": "Qwen3-32B",
165
+ "f1": 0.2294
166
+ },
167
+ "Mem0": {
168
+ "accuracy": 0.2705,
169
+ "model_family": "Qwen3-32B",
170
+ "f1": 0.21675
171
+ },
172
+ "Memorag": {
173
+ "accuracy": 0.364975,
174
+ "model_family": "Qwen3-32B",
175
+ "f1": 0.108075
176
+ },
177
+ "Memgpt": {
178
+ "accuracy": 0.327975,
179
+ "model_family": "Qwen3-32B",
180
+ "f1": 0.07105
181
+ },
182
+ "Mem-alpha": {
183
+ "accuracy": 0.362925,
184
+ "model_family": "Qwen3-32B",
185
+ "f1": 0.15944999999999998
186
+ },
187
+ "Memorybank": {
188
+ "accuracy": 0.401775,
189
+ "model_family": "Qwen3-32B",
190
+ "f1": 0.23704999999999998
191
+ },
192
+ "Simple mem": {
193
+ "accuracy": 0.13974999999999999,
194
+ "model_family": "Qwen3-32B",
195
+ "f1": 0.1679
196
+ },
197
+ "Long context": {
198
+ "accuracy": 0.554275,
199
+ "model_family": "Qwen3-32B",
200
+ "f1": 0.348075
201
+ }
202
+ },
203
+ "TEXT2SQL": {
204
+ "Qwen3-Embedding-4B": {
205
+ "accuracy": 0.4164,
206
+ "model_family": "Qwen3-32B",
207
+ "f1": 0.249325
208
+ },
209
+ "GRAPHRAG": {
210
+ "accuracy": 0.21665,
211
+ "model_family": "Qwen3-32B",
212
+ "f1": 0.221675
213
+ },
214
+ "Hipporag2": {
215
+ "accuracy": 0.46267499999999995,
216
+ "model_family": "Qwen3-32B",
217
+ "f1": 0.26935
218
+ },
219
+ "Memagent": {
220
+ "accuracy": 0.245375,
221
+ "model_family": "Qwen3-32B",
222
+ "f1": 0.245375
223
+ },
224
+ "Mem1": {
225
+ "accuracy": 0.06465,
226
+ "model_family": "Qwen3-32B",
227
+ "f1": 0.19990000000000002
228
+ },
229
+ "Amem": {
230
+ "accuracy": 0.31405,
231
+ "model_family": "Qwen3-32B",
232
+ "f1": 0.289625
233
+ },
234
+ "Mem0": {
235
+ "accuracy": 0.1192,
236
+ "model_family": "Qwen3-32B",
237
+ "f1": 0.2326
238
+ },
239
+ "Memorag": {
240
+ "accuracy": 0.619,
241
+ "model_family": "Qwen3-32B",
242
+ "f1": 0.296475
243
+ },
244
+ "Memgpt": {
245
+ "accuracy": 0.206875,
246
+ "model_family": "Qwen3-32B",
247
+ "f1": 0.178975
248
+ },
249
+ "Mem-alpha": {
250
+ "accuracy": 0.30065,
251
+ "model_family": "Qwen3-32B",
252
+ "f1": 0.26505
253
+ },
254
+ "Memorybank": {
255
+ "accuracy": 0.23855,
256
+ "model_family": "Qwen3-32B",
257
+ "f1": 0.28355
258
+ },
259
+ "Simple mem": {
260
+ "accuracy": 0.192575,
261
+ "model_family": "Qwen3-32B",
262
+ "f1": 0.157225
263
+ },
264
+ "Long context": {
265
+ "accuracy": 0.456075,
266
+ "model_family": "Qwen3-32B",
267
+ "f1": 0.295275
268
+ }
269
+ },
270
+ "OPENWORLD_QA": {
271
+ "Qwen3-Embedding-4B": {
272
+ "accuracy": 0.399125,
273
+ "model_family": "Qwen3-32B",
274
+ "f1": 0.0837
275
+ },
276
+ "GRAPHRAG": {
277
+ "accuracy": 0.31845,
278
+ "model_family": "Qwen3-32B",
279
+ "f1": 0.22635
280
+ },
281
+ "Hipporag2": {
282
+ "accuracy": 0.45825,
283
+ "model_family": "Qwen3-32B",
284
+ "f1": 0.2362
285
+ },
286
+ "Memagent": {
287
+ "accuracy": 0.158225,
288
+ "model_family": "Qwen3-32B",
289
+ "f1": 0.0704
290
+ },
291
+ "Mem1": {
292
+ "accuracy": 0.12065000000000001,
293
+ "model_family": "Qwen3-32B",
294
+ "f1": 0.15005
295
+ },
296
+ "Amem": {
297
+ "accuracy": 0.29359999999999997,
298
+ "model_family": "Qwen3-32B",
299
+ "f1": 0.2079
300
+ },
301
+ "Mem0": {
302
+ "accuracy": 0.16197499999999998,
303
+ "model_family": "Qwen3-32B",
304
+ "f1": 0.1604
305
+ },
306
+ "Memorag": {
307
+ "accuracy": 0.411375,
308
+ "model_family": "Qwen3-32B",
309
+ "f1": 0.093675
310
+ },
311
+ "Memgpt": {
312
+ "accuracy": 0.3155,
313
+ "model_family": "Qwen3-32B",
314
+ "f1": 0.0595
315
+ },
316
+ "Mem-alpha": {
317
+ "accuracy": 0.2301,
318
+ "model_family": "Qwen3-32B",
319
+ "f1": 0.13345
320
+ },
321
+ "Memorybank": {
322
+ "accuracy": 0.3486,
323
+ "model_family": "Qwen3-32B",
324
+ "f1": 0.2519
325
+ },
326
+ "Simple mem": {
327
+ "accuracy": 0.12154999999999999,
328
+ "model_family": "Qwen3-32B",
329
+ "f1": 0.1312
330
+ },
331
+ "Long context": {
332
+ "accuracy": 0.49785,
333
+ "model_family": "Qwen3-32B",
334
+ "f1": 0.3349
335
+ }
336
+ },
337
+ "SOFTWARE": {
338
+ "Qwen3-Embedding-4B": {
339
+ "accuracy": 0.599025,
340
+ "model_family": "Qwen3-32B",
341
+ "f1": 0.083575
342
+ },
343
+ "GRAPHRAG": {
344
+ "accuracy": 0.348875,
345
+ "model_family": "Qwen3-32B",
346
+ "f1": 0.229825
347
+ },
348
+ "Hipporag2": {
349
+ "accuracy": 0.5299,
350
+ "model_family": "Qwen3-32B",
351
+ "f1": 0.1279
352
+ },
353
+ "Memagent": {
354
+ "accuracy": 0.53965,
355
+ "model_family": "Qwen3-32B",
356
+ "f1": 0.09085
357
+ },
358
+ "Mem1": {
359
+ "accuracy": 0.18595,
360
+ "model_family": "Qwen3-32B",
361
+ "f1": 0.17527500000000001
362
+ },
363
+ "Amem": {
364
+ "accuracy": 0.29615,
365
+ "model_family": "Qwen3-32B",
366
+ "f1": 0.20395
367
+ },
368
+ "Mem0": {
369
+ "accuracy": 0.2366,
370
+ "model_family": "Qwen3-32B",
371
+ "f1": 0.176975
372
+ },
373
+ "Memorag": {
374
+ "accuracy": 0.55005,
375
+ "model_family": "Qwen3-32B",
376
+ "f1": 0.10707499999999999
377
+ },
378
+ "Memgpt": {
379
+ "accuracy": 0.599125,
380
+ "model_family": "Qwen3-32B",
381
+ "f1": 0.066575
382
+ },
383
+ "Mem-alpha": {
384
+ "accuracy": 0.3476,
385
+ "model_family": "Qwen3-32B",
386
+ "f1": 0.12492500000000001
387
+ },
388
+ "Memorybank": {
389
+ "accuracy": 0.5072,
390
+ "model_family": "Qwen3-32B",
391
+ "f1": 0.240875
392
+ },
393
+ "Simple mem": {
394
+ "accuracy": 0.2431,
395
+ "model_family": "Qwen3-32B",
396
+ "f1": 0.2005
397
+ },
398
+ "Long context": {
399
+ "accuracy": 0.4847,
400
+ "model_family": "Qwen3-32B",
401
+ "f1": 0.267725
402
+ }
403
+ }
404
+ }
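
The domain file above nests scores as domain, then method. A radar chart needs the opposite view: one list of values per method, ordered by domain. The sketch below shows one way to pivot the data and keep only the top N methods, assuming ranking by mean accuracy across domains. `radar_series` is a hypothetical helper, not part of this commit; the default of 8 mirrors the slider default described in UPDATES_v2.md, and the sample values are copied from the GAMING and WEB blocks.

```python
from statistics import mean


def radar_series(domain_scores, top_n=8, metric="accuracy"):
    """Return (domains, {method: [value per domain]}) for the top_n methods."""
    domains = list(domain_scores)
    per_method = {}
    for domain in domains:
        for name, scores in domain_scores[domain].items():
            per_method.setdefault(name, {})[domain] = scores[metric]
    # Rank methods by their mean over the domains in which they appear.
    ranked = sorted(
        per_method,
        key=lambda name: mean(list(per_method[name].values())),
        reverse=True,
    )
    series = {
        name: [per_method[name].get(domain) for domain in domains]
        for name in ranked[:top_n]
    }
    return domains, series


sample = {
    "GAMING": {
        "Hipporag2": {"accuracy": 0.60555, "f1": 0.2273},
        "Mem1": {"accuracy": 0.225875, "f1": 0.18155},
        "Long context": {"accuracy": 0.5355, "f1": 0.321775},
    },
    "WEB": {
        "Hipporag2": {"accuracy": 0.3761, "f1": 0.120125},
        "Mem1": {"accuracy": 0.131275, "f1": 0.1518},
        "Long context": {"accuracy": 0.554275, "f1": 0.348075},
    },
}

domains, series = radar_series(sample, top_n=2)
print(domains)
print(series)
```

Each list in `series` lines up with `domains`, so it can be passed as the radial values of a trace with `domains` as the angular axis.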
gaia-leaderboard ADDED
@@ -0,0 +1 @@
+ Subproject commit d34b929801f4ff3f73aaa392d5ca593eba0766e7
lmgame_bench ADDED
@@ -0,0 +1 @@
+ Subproject commit aa854e662254e5454fea0705a6525b02620bcceb
requirements.txt CHANGED
@@ -1,4 +1,7 @@
- gradio==5.23.3
+ gradio>=5.0.0
  pandas>=2.0.0
  plotly>=5.15.0
  numpy>=1.24.0
+ datasets>=2.10.0
+ huggingface_hub>=0.16.0
+ requests>=2.28.0
scorer.py ADDED
@@ -0,0 +1,166 @@
+ """
+ Scoring functions for AMA-Bench submissions.
+
+ This module implements evaluation logic for multiple-choice questions,
+ calculating accuracy by comparing uppercase letters in answers.
+ """
+
+ import re
+ from typing import Union, List, Dict
+
+
+ def extract_uppercase_letters(text: str) -> str:
+     """
+     Extract the distinct uppercase letters from text.
+
+     Used for multiple-choice answer comparison where answers are like
+     "A", "B", "AB", "ACD", etc.
+
+     Args:
+         text: Input text containing answer choices
+
+     Returns:
+         String of distinct uppercase letters, sorted alphabetically
+     """
+     if not isinstance(text, str):
+         text = str(text)
+
+     # Extract all uppercase letters
+     letters = [c for c in text if c.isupper() and c.isalpha()]
+
+     # Deduplicate, sort, and join to ensure consistent ordering
+     return ''.join(sorted(set(letters)))
+
+
+ def multiple_choice_accuracy(prediction: str, reference: str) -> float:
+     """
+     Calculate accuracy for multiple-choice answers.
+
+     Compares uppercase letters extracted from both prediction and reference.
+     Returns 1.0 if they match exactly, 0.0 otherwise.
+
+     Args:
+         prediction: Model's predicted answer
+         reference: Ground truth reference answer
+
+     Returns:
+         1.0 if exact match, 0.0 otherwise
+     """
+     pred_letters = extract_uppercase_letters(prediction)
+     ref_letters = extract_uppercase_letters(reference)
+
+     return 1.0 if pred_letters == ref_letters else 0.0
+
+
+ def calculate_accuracy(scores: List[float]) -> Dict[str, float]:
+     """
+     Calculate accuracy metric from individual question scores.
+
+     Args:
+         scores: List of question scores (0.0 or 1.0)
+
+     Returns:
+         Dictionary with accuracy, count, and number of correct answers
+     """
+     if not scores:
+         # Return the same keys as the non-empty case
+         return {"accuracy": 0.0, "count": 0, "correct": 0}
+
+     import numpy as np
+
+     return {
+         "accuracy": float(np.mean(scores)),
+         "count": len(scores),
+         "correct": int(sum(scores)),
+     }
+
+
+ def score_submission(
+     submissions: List[Dict],
+     groundtruth: Dict[str, Dict],
+     metrics_mapping: Union[Dict[str, str], None] = None
+ ) -> Dict:
+     """
+     Score a complete submission against ground truth.
+
+     Args:
+         submissions: List of submission dicts with episode_id, question, answer
+         groundtruth: Dict mapping "{episode_id}_{question}" keys to ground truth info
+         metrics_mapping: Optional dict mapping question-type key terms to metric
+             categories; key terms are tried in insertion order
+
+     Returns:
+         Dictionary with overall and per-metric scores
+     """
+     # Default metric mapping based on question type.
+     # "Abstraction" is listed before "State" so that a "State Abstraction"
+     # question is not captured by the broader "State" key term.
+     if metrics_mapping is None:
+         metrics_mapping = {
+             "Recall": "Recall",
+             "Causal": "Causal Inference",
+             "Abstraction": "State Abstraction",
+             "State": "State Updating",
+         }
+
+     # Initialize scores by metric
+     scores_by_metric = {
+         "Recall": [],
+         "Causal Inference": [],
+         "State Updating": [],
+         "State Abstraction": [],
+     }
+
+     all_scores = []
+     scored_submissions = []
+
+     for submission in submissions:
+         episode_id = submission.get("episode_id", "")
+         question = submission.get("question", "")
+         answer = submission.get("answer", "")
+
+         # Look up ground truth
+         key = f"{episode_id}_{question}"
+         gt_info = groundtruth.get(key)
+
+         if gt_info is None:
+             # Question not found in ground truth
+             score = 0.0
+             reference = ""
+             qa_type = "Unknown"
+         else:
+             reference = gt_info["answer"]
+             qa_type = gt_info.get("type", "Recall")
+
+             # Calculate accuracy
+             score = multiple_choice_accuracy(answer, reference)
+
+         # Map question type to metric category
+         metric_category = "Recall"  # default
+         for key_term, metric in metrics_mapping.items():
+             if key_term.lower() in qa_type.lower():
+                 metric_category = metric
+                 break
+
+         # Add to appropriate metric bucket
+         if metric_category in scores_by_metric:
+             scores_by_metric[metric_category].append(score)
+
+         all_scores.append(score)
+
+         # Store scored submission
+         scored_submissions.append({
+             **submission,
+             "score": score,
+             "reference_answer": reference,
+             "metric_category": metric_category,
+         })
+
+     # Calculate metrics for each category
+     results = {}
+     for metric_name, metric_scores in scores_by_metric.items():
+         results[metric_name] = calculate_accuracy(metric_scores)
+
+     # Calculate overall average
+     results["Average"] = calculate_accuracy(all_scores)
+
+     return {
+         "scores": results,
+         "scored_submissions": scored_submissions,
+     }
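
Because the scorer compares every uppercase letter found in the answer, the `answer` field should probably contain only the choice letters. The sketch below re-implements the extraction rule from `scorer.py` in a self-contained form to show how a few inputs behave; the example answers are illustrative and are not taken from the benchmark.

```python
def extract_uppercase_letters(text):
    """Same rule as scorer.py: distinct uppercase letters, sorted."""
    return "".join(sorted({c for c in str(text) if c.isupper() and c.isalpha()}))


def multiple_choice_accuracy(prediction, reference):
    """1.0 when both sides reduce to the same letter set, else 0.0."""
    same = extract_uppercase_letters(prediction) == extract_uppercase_letters(reference)
    return 1.0 if same else 0.0


# (prediction, reference) pairs, illustrative only
cases = [
    ("A", "A"),                # exact match
    ("CA", "AC"),              # order does not matter
    ("b", "B"),                # lowercase letters are ignored entirely
    ("The answer is B", "B"),  # the capital "T" is counted as a choice
]

results = [multiple_choice_accuracy(p, r) for p, r in cases]
for (prediction, reference), score in zip(cases, results):
    print(f"{prediction!r} vs {reference!r} -> {score}")
```

The last case scores 0.0 because the extracted set is `"BT"`, so any explanatory text around the letter needs to be stripped before submission.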
utils.py ADDED
@@ -0,0 +1,224 @@
+ """
+ Utility functions for AMA-Bench Leaderboard.
+
+ This module contains helper functions for:
+ - DataFrame building and manipulation
+ - Chart generation
+ - Data validation
+ """
+
+ import pandas as pd
+ import plotly.graph_objects as go
+ from typing import List, Dict
+
+
+ # Metrics configuration
+ METRICS = ["Recall", "Causal Inference", "State Updating", "State Abstraction"]
+ ALL_METRICS = METRICS + ["Average"]
+
+ # Chart colors moved to visualization.py
+
+
+ def build_dataframe(data: Dict) -> pd.DataFrame:
+     """
+     Build a pandas DataFrame showing Accuracy for each metric.
+
+     Args:
+         data: Dictionary with 'entries' key containing list of results
+
+     Returns:
+         DataFrame with Method and metric columns
+     """
+     rows = []
+     for entry in data["entries"]:
+         row = {"Method": entry["method"]}
+         if entry.get("category"):
+             row["Category"] = entry["category"]
+         for m in ALL_METRICS:
+             accuracy = entry["scores"][m]["accuracy"]
+             row[m] = f"{accuracy:.4f}"
+         # Store raw average accuracy for sorting
+         row["_sort_avg"] = entry["scores"]["Average"]["accuracy"]
+         rows.append(row)
+
+     df = pd.DataFrame(rows)
+     df = df.sort_values("_sort_avg", ascending=False).reset_index(drop=True)
+     df = df.drop(columns=["_sort_avg"])
+     return df
+
+
+ def add_medals(df: pd.DataFrame) -> pd.DataFrame:
+     """
+     Add medal emojis to the top-3 Method names.
+
+     Args:
+         df: DataFrame with 'Method' column
+
+     Returns:
+         DataFrame with medals added to top 3 methods
+     """
+     df = df.copy()
+     medals = ["\U0001f947", "\U0001f948", "\U0001f949"]  # 🥇 🥈 🥉
+     for i in range(min(3, len(df))):
+         df.loc[i, "Method"] = f"{medals[i]} {df.loc[i, 'Method']}"
+     return df
+
+
+ def load_groundtruth(dataset_name: str, token: str = None) -> Dict[str, Dict[str, str]]:
+     """
+     Load ground truth Q&A pairs from HuggingFace dataset.
+
+     Expected schema in the dataset:
+     {
+         "episode_id": "string",
+         "qa_pairs": [
+             {
+                 "question": "string",
+                 "answer": "string",
+                 "type": "string",
+                 "sub_type": "string"
+             }
+         ]
+     }
+
+     Args:
+         dataset_name: HuggingFace dataset name (e.g., "Pettingllms/AMA-bench")
+         token: Optional HuggingFace token for private datasets
+
+     Returns:
+         Dictionary mapping "{episode_id}_{question}" keys to answer info
+         (answer, type, sub_type)
+     """
+     groundtruth = {}
+
+     try:
+         from datasets import load_dataset, VerificationMode
+
+         # Try loading from HuggingFace dataset
+         try:
+             dataset = load_dataset(
+                 dataset_name,
+                 split="test",
+                 token=token,
+                 verification_mode=VerificationMode.NO_CHECKS,
+                 trust_remote_code=True
+             )
+
+             print(f"Loaded dataset from HuggingFace: {dataset_name}")
+
+             for row in dataset:
+                 episode_id = row.get("episode_id", "")
+                 qa_pairs = row.get("qa_pairs", [])
+
+                 for qa in qa_pairs:
+                     question = qa.get("question", "")
+                     answer = qa.get("answer", "")
+                     qa_type = qa.get("type", "")
+
+                     # Create unique key for this Q&A pair
+                     key = f"{episode_id}_{question}"
+                     groundtruth[key] = {
+                         "answer": answer,
+                         "type": qa_type,
+                         "sub_type": qa.get("sub_type", "")
+                     }
+
+         except Exception as hf_error:
+             print(f"Warning: Could not load from HuggingFace ({hf_error})")
+             print("Trying local file test/test.jsonl...")
+
+             # Fallback to local file
+             import json
+             local_path = "test/test.jsonl"
+
+             try:
+                 with open(local_path, 'r', encoding='utf-8') as f:
+                     for line in f:
+                         line = line.strip()
+                         if not line:
+                             continue
+
+                         data = json.loads(line)
+                         episode_id = data.get("episode_id", "")
+                         qa_pairs = data.get("qa_pairs", [])
+
+                         for qa in qa_pairs:
+                             question = qa.get("question", "")
+                             answer = qa.get("answer", "")
+                             qa_type = qa.get("type", "")
+
+                             # Create unique key for this Q&A pair
+                             key = f"{episode_id}_{question}"
+                             groundtruth[key] = {
+                                 "answer": answer,
+                                 "type": qa_type,
+                                 "sub_type": qa.get("sub_type", "")
+                             }
+
+                 print(f"Loaded from local file: {local_path}")
+
+             except FileNotFoundError:
+                 print(f"Warning: Local ground truth file not found: {local_path}")
+             except Exception as e:
+                 print(f"Warning: Error loading local ground truth: {e}")
+
+     except ImportError:
+         print("Warning: datasets library not available, cannot load ground truth")
+
+     return groundtruth
+
+
+ def validate_submission_file(file_path: str) -> tuple:
+     """
+     Validate submission file format.
+
+     Expected format:
+     {"episode_id": "...", "question": "...", "answer": "...", ...}
+
+     Args:
+         file_path: Path to submission JSONL file
+
+     Returns:
+         Tuple of (is_valid, error_message, submissions_list)
+     """
+     import json
+
+     submissions = []
+     seen_pairs = set()
+
+     try:
+         with open(file_path, 'r', encoding='utf-8') as f:
+             for ix, line in enumerate(f):
+                 line = line.strip()
+                 if not line:
+                     continue
+
+                 try:
+                     task = json.loads(line)
+                 except json.JSONDecodeError:
+                     return False, f"Line {ix+1} is incorrectly formatted JSON.", []
+
+                 # Check required fields
+                 required_fields = ["episode_id", "question", "answer"]
+                 for field in required_fields:
+                     if field not in task:
+                         return False, f"Line {ix+1} is missing required field '{field}'.", []
+
+                 episode_id = task["episode_id"]
+                 question = task["question"]
+                 pair_key = (episode_id, question)
+
+                 if pair_key in seen_pairs:
+                     return False, f"Line {ix+1} contains duplicate episode_id/question pair.", []
+
+                 seen_pairs.add(pair_key)
+                 submissions.append(task)
+
+         if len(submissions) == 0:
+             return False, "No valid submissions found in the file.", []
+
+         return True, "", submissions
+
+     except FileNotFoundError:
+         return False, "File not found.", []
+     except Exception as e:
+         return False, f"Error reading file: {str(e)}", []
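
A submission file is JSON Lines: one object per line with `episode_id`, `question`, and `answer`, and no repeated `episode_id`/`question` pair. The sketch below applies the same three checks to an in-memory list of lines so the format can be tried without a file on disk. `check_submission_lines` is a hypothetical helper, and the episode IDs and questions are made up for illustration.

```python
import json

REQUIRED_FIELDS = ("episode_id", "question", "answer")


def check_submission_lines(lines):
    """Return (is_valid, error_message) for an iterable of JSONL lines."""
    seen_pairs = set()
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped, as in validate_submission_file
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            return False, f"Line {number} is not valid JSON."
        if not isinstance(row, dict):
            return False, f"Line {number} is not a JSON object."
        for field in REQUIRED_FIELDS:
            if field not in row:
                return False, f"Line {number} is missing required field '{field}'."
        pair = (row["episode_id"], row["question"])
        if pair in seen_pairs:
            return False, f"Line {number} repeats an episode_id/question pair."
        seen_pairs.add(pair)
    if not seen_pairs:
        return False, "No submissions found."
    return True, ""


# Hypothetical submission rows, for illustration only
good = [
    json.dumps({"episode_id": "ep-001", "question": "Which key was used first?", "answer": "B"}),
    json.dumps({"episode_id": "ep-001", "question": "Which room was visited last?", "answer": "AC"}),
]

print(check_submission_lines(good))
print(check_submission_lines(good + [good[0]]))
```

Extra fields on a row are allowed; only the three required keys and the uniqueness of the pair are checked.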
validate_jsonl.py ADDED
@@ -0,0 +1,205 @@
+ #!/usr/bin/env python3
+ """
+ Validate the processed JSONL file and generate statistics.
+ """
+
+ import json
+ from collections import Counter, defaultdict
+ from pathlib import Path
+
+
+ def validate_jsonl(file_path: Path):
+     """
+     Validate JSONL file and generate comprehensive statistics.
+     """
+     print("=" * 80)
+     print(f"Validating: {file_path}")
+     print("=" * 80)
+     print()
+
+     # Statistics
+     task_types = Counter()
+     domains = Counter()
+     qa_type_counts = Counter()
+     qa_subtype_counts = Counter()
+     total_qa_pairs = 0
+     success_count = 0
+     total_count = 0
+     total_turns = 0
+     total_tokens = 0
+
+     # Per task type statistics
+     task_type_stats = defaultdict(lambda: {
+         'count': 0,
+         'success': 0,
+         'qa_pairs': 0,
+         'total_turns': 0,
+         'total_tokens': 0
+     })
+
+     # Per domain statistics
+     domain_stats = defaultdict(lambda: {
+         'count': 0,
+         'success': 0,
+         'qa_pairs': 0,
+         'total_turns': 0,
+         'total_tokens': 0
+     })
+
+     errors = []
+     line_num = 0
+
+     with open(file_path, 'r', encoding='utf-8') as f:
+         for line in f:
+             line_num += 1
+             try:
+                 data = json.loads(line)
+
+                 # Validate required fields
+                 required_fields = ["episode_id", "task", "task_type", "domain",
+                                    "success", "num_turns", "total_tokens",
+                                    "trajectory", "qa_pairs"]
+
+                 missing = [field for field in required_fields if field not in data]
+                 if missing:
+                     for field in missing:
+                         errors.append(f"Line {line_num}: Missing field '{field}'")
+                     # Skip the whole record so incomplete rows are not counted
+                     continue
+
+                 # Update counters
+                 task_type = data["task_type"]
+                 domain = data["domain"]
+                 task_types[task_type] += 1
+                 domains[domain] += 1
+                 total_count += 1
+
+                 if data["success"]:
+                     success_count += 1
+                     task_type_stats[task_type]['success'] += 1
+                     domain_stats[domain]['success'] += 1
+
+                 num_qa = len(data["qa_pairs"])
+                 total_qa_pairs += num_qa
+                 task_type_stats[task_type]['qa_pairs'] += num_qa
+                 task_type_stats[task_type]['count'] += 1
+                 domain_stats[domain]['qa_pairs'] += num_qa
+                 domain_stats[domain]['count'] += 1
+
+                 total_turns += data["num_turns"]
+                 total_tokens += data["total_tokens"]
+                 task_type_stats[task_type]['total_turns'] += data["num_turns"]
+                 task_type_stats[task_type]['total_tokens'] += data["total_tokens"]
+                 domain_stats[domain]['total_turns'] += data["num_turns"]
+                 domain_stats[domain]['total_tokens'] += data["total_tokens"]
+
+                 # QA pairs type distribution
+                 for qa in data["qa_pairs"]:
+                     qa_type = qa.get("type", "unknown")
+                     qa_type_counts[qa_type] += 1
+
+                     if "sub_type" in qa:
+                         qa_subtype_counts[qa["sub_type"]] += 1
+
+             except json.JSONDecodeError as e:
+                 errors.append(f"Line {line_num}: JSON decode error - {e}")
+             except Exception as e:
+                 errors.append(f"Line {line_num}: Error - {e}")
+
107
+ # Print validation results
108
+ if errors:
109
+ print("VALIDATION ERRORS:")
110
+ print("-" * 80)
111
+ for error in errors[:10]: # Show first 10 errors
112
+ print(f" {error}")
113
+ if len(errors) > 10:
114
+ print(f" ... and {len(errors) - 10} more errors")
115
+ print()
116
+ else:
117
+ print("✓ No validation errors found!")
118
+ print()
119
+
120
+ # Print overall statistics
121
+ print("OVERALL STATISTICS")
122
+ print("-" * 80)
123
+ print(f"Total records: {total_count:>6d}")
124
+ print(f"Total QA pairs: {total_qa_pairs:>6d}")
125
+ print(f"Successful episodes: {success_count:>6d} ({success_count/max(total_count, 1)*100:>5.1f}%)")
126
+ print(f"Failed episodes: {total_count - success_count:>6d} ({(total_count - success_count)/max(total_count, 1)*100:>5.1f}%)")
127
+ print(f"Total turns: {total_turns:>6d} (avg: {total_turns/max(total_count, 1):.1f})")
128
+ print(f"Total tokens: {total_tokens:>6d} (avg: {total_tokens/max(total_count, 1):.1f})")
129
+ print()
130
+
131
+ # Print domain distribution
132
+ print("DOMAIN DISTRIBUTION")
133
+ print("-" * 80)
134
+ print(f"{'Domain':<20} {'Count':>6} {'Success':>7} {'QA Pairs':>9} {'Avg Turns':>10} {'Avg Tokens':>11}")
135
+ print("-" * 80)
136
+
137
+ for domain in sorted(domains.keys()):
138
+ count = domain_stats[domain]['count']
139
+ success = domain_stats[domain]['success']
140
+ success_pct = (success / count * 100) if count > 0 else 0
141
+ qa_pairs = domain_stats[domain]['qa_pairs']
142
+ avg_turns = domain_stats[domain]['total_turns'] / count if count > 0 else 0
143
+ avg_tokens = domain_stats[domain]['total_tokens'] / count if count > 0 else 0
144
+
145
+ print(f"{domain:<20} {count:>6} {success_pct:>6.1f}% {qa_pairs:>9} {avg_turns:>10.1f} {avg_tokens:>11.1f}")
146
+
147
+ print()
148
+
149
+ # Print task type distribution
150
+ print("TASK TYPE DISTRIBUTION")
151
+ print("-" * 80)
152
+ print(f"{'Task Type':<40} {'Count':>6} {'Success':>7} {'QA Pairs':>9} {'Avg Turns':>10} {'Avg Tokens':>11}")
153
+ print("-" * 80)
154
+
155
+ for task_type in sorted(task_types.keys()):
156
+ count = task_type_stats[task_type]['count']
157
+ success = task_type_stats[task_type]['success']
158
+ qa_pairs = task_type_stats[task_type]['qa_pairs']
159
+ avg_turns = task_type_stats[task_type]['total_turns'] / count if count > 0 else 0
160
+ avg_tokens = task_type_stats[task_type]['total_tokens'] / count if count > 0 else 0
161
+
162
+ print(f"{task_type:<40} {count:>6} {(success / count * 100 if count > 0 else 0):>6.1f}% {qa_pairs:>9} {avg_turns:>10.1f} {avg_tokens:>11.1f}")
163
+
164
+ print()
165
+
166
+ # Print QA type distribution
167
+ print("QA TYPE DISTRIBUTION")
168
+ print("-" * 80)
169
+ print(f"{'Type':<20} {'Count':>10} {'Percentage':>12}")
170
+ print("-" * 80)
171
+
172
+ for qa_type, count in sorted(qa_type_counts.items()):
173
+ percentage = count / total_qa_pairs * 100 if total_qa_pairs > 0 else 0
174
+ print(f"{qa_type:<20} {count:>10} {percentage:>11.1f}%")
175
+
176
+ print()
177
+
178
+ # Print QA subtype distribution
179
+ if qa_subtype_counts:
180
+ print("QA SUBTYPE DISTRIBUTION")
181
+ print("-" * 80)
182
+ print(f"{'Subtype':<20} {'Count':>10} {'Percentage':>12}")
183
+ print("-" * 80)
184
+
185
+ for subtype in sorted(qa_subtype_counts.keys()):
186
+ count = qa_subtype_counts[subtype]
187
+ percentage = count / total_qa_pairs * 100 if total_qa_pairs > 0 else 0
188
+ print(f"{subtype:<20} {count:>10} {percentage:>11.1f}%")
189
+
190
+ print()
191
+
192
+ print("=" * 80)
193
+ print("Validation complete!")
194
+ print("=" * 80)
195
+
196
+
197
+ if __name__ == "__main__":
198
+ jsonl_file = Path(__file__).parent / "processed_open_end.jsonl"
199
+
200
+ if not jsonl_file.exists():
201
+ print(f"Error: {jsonl_file} not found!")
202
+ print("Please run process_open_end.py first.")
203
+ exit(1)
204
+
205
+ validate_jsonl(jsonl_file)
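The required-field check in `validate_jsonl` can be sketched in isolation as follows. This is a minimal sketch rather than the script's own code: the helper names `find_missing_fields` and `validate_lines` are hypothetical, and the field list is copied from the diff above. It assumes every line decodes to a JSON object.

```python
import json

# Field list copied from validate_jsonl in the diff above.
REQUIRED_FIELDS = ["episode_id", "task", "task_type", "domain",
                   "success", "num_turns", "total_tokens",
                   "trajectory", "qa_pairs"]


def find_missing_fields(record: dict) -> list:
    """Return the required fields absent from one parsed JSONL record."""
    return [field for field in REQUIRED_FIELDS if field not in record]


def validate_lines(lines):
    """Return (valid_records, errors) for an iterable of JSONL lines."""
    valid, errors = [], []
    for line_num, line in enumerate(lines, 1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append(f"Line {line_num}: JSON decode error - {e}")
            continue
        missing = find_missing_fields(record)
        if missing:
            errors.append(f"Line {line_num}: Missing fields {missing}")
            continue  # skip the whole record so it is not counted
        valid.append(record)
    return valid, errors
```

Collecting all missing fields first means a record produces one error entry and is excluded from the statistics, instead of being partially counted.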
view_samples.py ADDED
@@ -0,0 +1,181 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ View sample records from the processed JSONL file.
4
+ """
5
+
6
+ import json
7
+ import sys
8
+ from pathlib import Path
9
+
10
+
11
+ def print_record(data, show_full=False):
12
+ """
13
+ Print a single record in a readable format.
14
+ """
15
+ print("=" * 80)
16
+ print(f"Episode ID: {data['episode_id']}")
17
+ print(f"Task Type: {data['task_type']}")
18
+ print(f"Domain: {data['domain']}")
19
+ print(f"Success: {data['success']}")
20
+ print(f"Turns: {data['num_turns']}")
21
+ print(f"Tokens: {data['total_tokens']}")
22
+
23
+ if data['task']:
24
+ task_preview = data['task'][:150]
25
+ print(f"\nTask:\n{task_preview}..." if len(data['task']) > 150 else f"\nTask:\n{task_preview}")
26
+
27
+ print(f"\nQA Pairs: {len(data['qa_pairs'])}")
28
+
29
+ if show_full:
30
+ print("\nAll QA Pairs:")
31
+ print("-" * 80)
32
+ for i, qa in enumerate(data['qa_pairs'], 1):
33
+ print(f"\n[{i}] Type: {qa['type']}", end="")
34
+ if 'sub_type' in qa:
35
+ print(f" / Subtype: {qa['sub_type']}")
36
+ else:
37
+ print()
38
+
39
+ print(f"Q: {qa['question'][:120]}...")
40
+ print(f"A: {qa['answer'][:120]}...")
41
+ else:
42
+ # Show first 2 QA pairs as preview
43
+ print("\nSample QA Pairs (first 2):")
44
+ print("-" * 80)
45
+ for i, qa in enumerate(data['qa_pairs'][:2], 1):
46
+ print(f"\n[{i}] Type: {qa['type']}", end="")
47
+ if 'sub_type' in qa:
48
+ print(f" / Subtype: {qa['sub_type']}")
49
+ else:
50
+ print()
51
+
52
+ print(f"Q: {qa['question'][:120]}...")
53
+ print(f"A: {qa['answer'][:120]}...")
54
+
55
+ if data['trajectory']:
56
+ print(f"\nTrajectory: {len(data['trajectory'])} turns")
57
+ if show_full and len(data['trajectory']) > 0:
58
+ print("\nFirst 3 turns:")
59
+ print("-" * 80)
60
+ for turn in data['trajectory'][:3]:
61
+ print(f"\nTurn {turn['turn_idx']}:")
62
+ action = str(turn['action'])[:100] if turn['action'] else "None"
63
+ observation = str(turn['observation'])[:100] if turn['observation'] else "None"
64
+ print(f" Action: {action}...")
65
+ print(f" Observation: {observation}...")
66
+
67
+ print("=" * 80)
68
+ print()
69
+
70
+
71
+ def view_by_task_type(file_path: Path, task_type: str, count: int = 3):
72
+ """
73
+ View samples of a specific task type.
74
+ """
75
+ print(f"\nShowing {count} samples for task type: {task_type}\n")
76
+
77
+ shown = 0
78
+ with open(file_path, 'r', encoding='utf-8') as f:
79
+ for line in f:
80
+ data = json.loads(line)
81
+ if data['task_type'] == task_type:
82
+ print_record(data, show_full=False)
83
+ shown += 1
84
+ if shown >= count:
85
+ break
86
+
87
+ if shown == 0:
88
+ print(f"No records found for task type: {task_type}")
89
+
90
+
91
+ def view_by_index(file_path: Path, index: int):
92
+ """
93
+ View a specific record by index (0-based).
94
+ """
95
+ with open(file_path, 'r', encoding='utf-8') as f:
96
+ for i, line in enumerate(f):
97
+ if i == index:
98
+ data = json.loads(line)
99
+ print_record(data, show_full=True)
100
+ return
101
+
102
+ print(f"Index {index} not found (file has fewer records)")
103
+
104
+
105
+ def list_task_types(file_path: Path):
106
+ """
107
+ List all unique task types in the file.
108
+ """
109
+ task_types = set()
110
+
111
+ with open(file_path, 'r', encoding='utf-8') as f:
112
+ for line in f:
113
+ data = json.loads(line)
114
+ task_types.add(data['task_type'])
115
+
116
+ print("\nAvailable task types:")
117
+ print("-" * 80)
118
+ for i, task_type in enumerate(sorted(task_types), 1):
119
+ print(f" {i:2d}. {task_type}")
120
+ print()
121
+
122
+
123
+ def main():
124
+ jsonl_file = Path(__file__).parent / "processed_open_end.jsonl"
125
+
126
+ if not jsonl_file.exists():
127
+ print(f"Error: {jsonl_file} not found!")
128
+ print("Please run process_open_end.py first.")
129
+ sys.exit(1)
130
+
131
+ # Command line interface
132
+ if len(sys.argv) < 2:
133
+ print("Usage:")
134
+ print(" python3 view_samples.py list # List all task types")
135
+ print(" python3 view_samples.py index <n> # View record at index n")
136
+ print(" python3 view_samples.py type <task_type> [n] # View n samples of task type (default 3)")
137
+ print("\nExamples:")
138
+ print(" python3 view_samples.py list")
139
+ print(" python3 view_samples.py index 0")
140
+ print(" python3 view_samples.py type text2sql/spider2 5")
141
+ return
142
+
143
+ command = sys.argv[1]
144
+
145
+ if command == "list":
146
+ list_task_types(jsonl_file)
147
+
148
+ elif command == "index":
149
+ if len(sys.argv) < 3:
150
+ print("Error: Please specify an index")
151
+ return
152
+ try:
153
+ index = int(sys.argv[2])
154
+ view_by_index(jsonl_file, index)
155
+ except ValueError:
156
+ print("Error: Index must be an integer")
157
+
158
+ elif command == "type":
159
+ if len(sys.argv) < 3:
160
+ print("Error: Please specify a task type")
161
+ return
162
+
163
+ task_type = sys.argv[2]
164
+ count = 3
165
+
166
+ if len(sys.argv) >= 4:
167
+ try:
168
+ count = int(sys.argv[3])
169
+ except ValueError:
170
+ print("Error: Count must be an integer")
171
+ return
172
+
173
+ view_by_task_type(jsonl_file, task_type, count)
174
+
175
+ else:
176
+ print(f"Unknown command: {command}")
177
+ print("Use: list, index, or type")
178
+
179
+
180
+ if __name__ == "__main__":
181
+ main()
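`view_by_index` scans the file line by line until it reaches the requested index. An equivalent lookup can be sketched with `itertools.islice`. The helper name `read_record` is hypothetical and not part of `view_samples.py`; it returns the parsed record instead of printing it.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Optional


def read_record(file_path: Path, index: int) -> Optional[dict]:
    """Return the record at a 0-based index, or None if out of range."""
    if index < 0:
        return None
    with open(file_path, "r", encoding="utf-8") as f:
        # islice skips the first `index` lines without loading the file.
        line = next(islice(f, index, index + 1), None)
    return json.loads(line) if line is not None else None
```

Returning `None` for an out-of-range index leaves it to the caller to decide how to report the miss, as `view_by_index` does with its "not found" message.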
visualization.py ADDED
@@ -0,0 +1,664 @@
1
+ """
2
+ Visualization module for AMA-Bench leaderboard
3
+ Adapted from lmgame_bench patterns with AMA-specific customizations
4
+ """
5
+
6
+ import plotly.graph_objects as go
7
+ import numpy as np
8
+ import pandas as pd
9
+ import json
10
+ import os
11
+ from typing import Dict, List, Optional, Tuple
12
+
13
+
14
+ # Constants
15
+ METRICS = ["Recall", "Causal Inference", "State Updating", "State Abstraction"]
16
+ ALL_METRICS = METRICS + ["Average"]
17
+
18
+
19
+ def load_model_colors(filepath: str = "assets/model_colors.json") -> Tuple[Dict[str, str], str]:
20
+ """
21
+ Load color scheme for models and methods from JSON file.
22
+
23
+ Args:
24
+ filepath: Path to color configuration JSON
25
+
26
+ Returns:
27
+ Tuple of (dictionary mapping model/method names to hex colors, fallback hex color)
28
+ """
29
+ try:
30
+ with open(filepath, 'r', encoding='utf-8') as f:
31
+ color_data = json.load(f)
32
+
33
+ # Merge models and methods into single dictionary
34
+ colors = {}
35
+ if 'models' in color_data:
36
+ colors.update(color_data['models'])
37
+ if 'methods' in color_data:
38
+ colors.update(color_data['methods'])
39
+
40
+ # Store fallback color
41
+ fallback = color_data.get('fallback', '#808080')
42
+
43
+ return colors, fallback
44
+ except Exception as e:
45
+ print(f"Warning: Could not load colors from {filepath}: {e}")
46
+ return {}, '#808080'
47
+
48
+
49
+ def normalize_scores(values: List[float], mean: float, std: float) -> List[float]:
50
+ """
51
+ Normalize scores using z-score and scale to 0-100 range.
52
+ Adapted from lmgame_bench's normalize_values() function.
53
+
54
+ Args:
55
+ values: List of accuracy values (0-1 range)
56
+ mean: Mean value for normalization
57
+ std: Standard deviation for normalization
58
+
59
+ Returns:
60
+ List of normalized scores (0-100 range)
61
+
62
+ Formula:
63
+ z_score = (value - mean) / std
64
+ normalized = clamp((z_score * 30) + 35, 0, 100)
65
+ """
66
+ # Handle zero std case (all values are the same)
67
+ if std < 0.05: # Minimum std threshold to prevent extreme values
68
+ std = 0.05
69
+
70
+ normalized = []
71
+ for v in values:
72
+ z_score = (v - mean) / std
73
+ scaled = (z_score * 30) + 35
74
+ clamped = max(0, min(100, scaled))
75
+ normalized.append(clamped)
76
+
77
+ return normalized
78
+
79
+
80
+ def filter_by_category(data: Dict, category: str) -> Dict:
81
+ """
82
+ Filter method data by category.
83
+
84
+ Args:
85
+ data: Full dataset with entries
86
+ category: "All", "RAG", or "Agent Memory"
87
+
88
+ Returns:
89
+ Filtered data dictionary
90
+ """
91
+ if category == "All":
92
+ return data
93
+
94
+ filtered_data = data.copy()
95
+ filtered_data['entries'] = [
96
+ entry for entry in data['entries']
97
+ if entry.get('category') == category
98
+ ]
99
+
100
+ return filtered_data
101
+
102
+
103
+ def prepare_dataframe_for_visualization(
104
+ data: Dict,
105
+ top_n: Optional[int] = None,
106
+ category_filter: str = "All",
107
+ selected_metrics: Optional[List[str]] = None
108
+ ) -> pd.DataFrame:
109
+ """
110
+ Build DataFrame with both raw and normalized scores.
111
+
112
+ Args:
113
+ data: Raw data from model_data.json or method_data.json
114
+ top_n: Number of top entries to include (None = all)
115
+ category_filter: "All", "RAG", or "Agent Memory" (for methods only)
116
+ selected_metrics: List of metrics to include (None = all)
117
+
118
+ Returns:
119
+ DataFrame with columns:
120
+ - Method/Model (name)
121
+ - Category (if applicable)
122
+ - {Metric} (raw accuracy 0-1) for each metric
123
+ - norm_{Metric} (normalized 0-100) for each metric
124
+ - Avg Normalized Score (mean of normalized scores)
125
+ """
126
+ # Filter by category first
127
+ if category_filter != "All":
128
+ data = filter_by_category(data, category_filter)
129
+
130
+ if not data['entries']:
131
+ # Return empty DataFrame if no entries
132
+ return pd.DataFrame()
133
+
134
+ # Use all metrics if none specified
135
+ if selected_metrics is None:
136
+ selected_metrics = METRICS
137
+
138
+ # Build basic DataFrame
139
+ rows = []
140
+ for entry in data['entries']:
141
+ row = {
142
+ 'Name': entry['method'],
143
+ }
144
+
145
+ # Add category if present
146
+ if entry.get('category') is not None:
147
+ row['Category'] = entry['category']
148
+
149
+ # Add raw scores
150
+ for metric in selected_metrics:
151
+ score_data = entry['scores'].get(metric, {})
152
+ row[metric] = score_data.get('accuracy', 0.0)
153
+
154
+ # Add average
155
+ row['Average'] = entry['scores'].get('Average', {}).get('accuracy', 0.0)
156
+
157
+ rows.append(row)
158
+
159
+ df = pd.DataFrame(rows)
160
+
161
+ # Sort by average accuracy (descending)
162
+ df = df.sort_values(by='Average', ascending=False)
163
+
164
+ # Calculate normalization parameters from FULL dataset (before limiting)
165
+ norm_params = {}
166
+ for metric in selected_metrics:
167
+ values = df[metric].values
168
+ mean = values.mean()
169
+ std = values.std()
170
+ norm_params[metric] = (mean, std)
171
+
172
+ # Apply top_n limit if specified
173
+ if top_n is not None and top_n > 0:
174
+ df = df.head(top_n).copy()  # copy so the norm_ columns are not written to a slice
175
+
176
+ # Add normalized scores
177
+ for metric in selected_metrics:
178
+ mean, std = norm_params[metric]
179
+ values = df[metric].values
180
+ df[f'norm_{metric}'] = normalize_scores(values.tolist(), mean, std)
181
+
182
+ # Calculate average normalized score
183
+ norm_cols = [f'norm_{metric}' for metric in selected_metrics]
184
+ df['Avg Normalized Score'] = df[norm_cols].mean(axis=1)
185
+
186
+ # Reset index
187
+ df = df.reset_index(drop=True)
188
+
189
+ return df
190
+
191
+
192
+ def hex_to_rgba(hex_color: str, alpha: float = 0.2) -> str:
193
+ """
194
+ Convert hex color to RGBA with specified alpha.
195
+
196
+ Args:
197
+ hex_color: Hex color code (e.g., "#FF0000")
198
+ alpha: Alpha value (0-1)
199
+
200
+ Returns:
201
+ RGBA color string
202
+ """
203
+ hex_color = hex_color.lstrip('#')
204
+ r = int(hex_color[0:2], 16)
205
+ g = int(hex_color[2:4], 16)
206
+ b = int(hex_color[4:6], 16)
207
+ return f'rgba({r}, {g}, {b}, {alpha})'
208
+
209
+
210
+ def create_radar_chart(
211
+ df: pd.DataFrame,
212
+ selected_metrics: List[str],
213
+ title: str = "Performance Across Metrics",
214
+ color_map: Optional[Dict[str, str]] = None
215
+ ) -> go.Figure:
216
+ """
217
+ Create radar chart with normalized scores.
218
+ Adapted from lmgame_bench's create_single_radar_chart().
219
+
220
+ Args:
221
+ df: DataFrame from prepare_dataframe_for_visualization()
222
+ selected_metrics: List of metric names to include as axes
223
+ title: Chart title
224
+ color_map: Dictionary mapping names to colors
225
+
226
+ Returns:
227
+ Plotly Figure with radar chart
228
+
229
+ Features:
230
+ - Each axis = one metric
231
+ - Each trace = one model/method
232
+ - Range: 0-100 (normalized)
233
+ - Interactive legend (click to isolate, double-click to toggle)
234
+ """
235
+ if df.empty:
236
+ fig = go.Figure()
237
+ fig.update_layout(title="No data available")
238
+ return fig
239
+
240
+ # Load colors if not provided
241
+ if color_map is None:
242
+ color_map, fallback_color = load_model_colors()
243
+ else:
244
+ fallback_color = '#808080'
245
+
246
+ # Check if we have normalized columns
247
+ norm_cols = [f'norm_{metric}' for metric in selected_metrics]
248
+ if not all(col in df.columns for col in norm_cols):
249
+ fig = go.Figure()
250
+ fig.update_layout(title="Missing normalized data")
251
+ return fig
252
+
253
+ fig = go.Figure()
254
+
255
+ # Add trace for each model/method
256
+ for _, row in df.iterrows():
257
+ name = row['Name']
258
+
259
+ # Get normalized values for selected metrics
260
+ r = [row[f'norm_{metric}'] for metric in selected_metrics]
261
+
262
+ # Get color
263
+ color = color_map.get(name, fallback_color)
264
+ fillcolor = hex_to_rgba(color, 0.2)
265
+
266
+ # Add trace
267
+ fig.add_trace(go.Scatterpolar(
268
+ r=r + [r[0]], # Close the polygon
269
+ theta=selected_metrics + [selected_metrics[0]],
270
+ mode='lines+markers',
271
+ fill='toself',
272
+ name=name.lower(), # Lowercase for legend
273
+ line=dict(color=color, width=2),
274
+ marker=dict(color=color, size=6),
275
+ fillcolor=fillcolor,
276
+ opacity=0.7,
277
+ hovertemplate='<b>%{fullData.name}</b><br>%{theta}: %{r:.1f}<extra></extra>'
278
+ ))
279
+
280
+ # Update layout
281
+ fig.update_layout(
282
+ title=dict(
283
+ text=title,
284
+ x=0.5,
285
+ xanchor='center',
286
+ font=dict(size=18)
287
+ ),
288
+ polar=dict(
289
+ radialaxis=dict(
290
+ visible=True,
291
+ range=[0, 100],
292
+ tickfont=dict(size=11),
293
+ gridcolor='lightgray',
294
+ gridwidth=1
295
+ ),
296
+ angularaxis=dict(
297
+ tickfont=dict(size=12, weight='bold')
298
+ )
299
+ ),
300
+ legend=dict(
301
+ font=dict(size=11),
302
+ title=dict(text="Models/Methods 💡", font=dict(size=12)),
303
+ itemsizing='trace',
304
+ x=1.05,
305
+ y=1,
306
+ xanchor='left',
307
+ yanchor='top',
308
+ bgcolor='rgba(255,255,255,0.6)',
309
+ bordercolor='gray',
310
+ borderwidth=1,
311
+ itemclick="toggleothers",
312
+ itemdoubleclick="toggle"
313
+ ),
314
+ height=550,
315
+ margin=dict(l=80, r=200, t=80, b=80)
316
+ )
317
+
318
+ return fig
319
+
320
+
321
+ def create_group_bar_chart(
322
+ df: pd.DataFrame,
323
+ selected_metrics: List[str],
324
+ top_n: int = 5,
325
+ color_map: Optional[Dict[str, str]] = None
326
+ ) -> go.Figure:
327
+ """
328
+ Create grouped bar chart showing top N performers per metric.
329
+ Adapted from lmgame_bench's create_group_bar_chart().
330
+
331
+ Args:
332
+ df: DataFrame with normalized scores
333
+ selected_metrics: List of metrics to display
334
+ top_n: Number of top performers to show per metric
335
+ color_map: Dictionary mapping names to colors
336
+
337
+ Returns:
338
+ Plotly Figure with grouped bar chart
339
+
340
+ Structure:
341
+ - X-axis: Metrics with rank positions (e.g., "Recall #1", "Recall #2")
342
+ - Y-axis: Normalized score (0-100)
343
+ - Bars: Grouped by model/method
344
+ """
345
+ if df.empty:
346
+ fig = go.Figure()
347
+ fig.update_layout(title="No data available")
348
+ return fig
349
+
350
+ # Load colors if not provided
351
+ if color_map is None:
352
+ color_map, fallback_color = load_model_colors()
353
+ else:
354
+ fallback_color = '#808080'
355
+
356
+ # Check for normalized columns
357
+ norm_cols = [f'norm_{metric}' for metric in selected_metrics]
358
+ if not all(col in df.columns for col in norm_cols):
359
+ fig = go.Figure()
360
+ fig.update_layout(title="Missing normalized data")
361
+ return fig
362
+
363
+ # Build x-axis categories and data structure
364
+ all_x_categories = []
365
+ all_names = set()
366
+ metric_rankings = {}
367
+
368
+ for metric in selected_metrics:
369
+ norm_col = f'norm_{metric}'
370
+
371
+ # Get top N for this metric
372
+ metric_df = df[df[norm_col].notna()].copy()
373
+ metric_df = metric_df.sort_values(by=norm_col, ascending=False).head(top_n)
374
+
375
+ metric_rankings[metric] = []
376
+ for rank, (_, row) in enumerate(metric_df.iterrows(), 1):
377
+ name = row['Name']
378
+ score = row[norm_col]
379
+ x_category = f"{metric}<br>#{rank}"
380
+
381
+ metric_rankings[metric].append({
382
+ 'name': name,
383
+ 'score': score,
384
+ 'x_category': x_category,
385
+ 'rank': rank
386
+ })
387
+
388
+ all_x_categories.append(x_category)
389
+ all_names.add(name)
390
+
391
+ # Create traces for each model/method
392
+ fig = go.Figure()
393
+
394
+ for name in sorted(all_names):
395
+ x_vals = []
396
+ y_vals = []
397
+
398
+ for metric in selected_metrics:
399
+ # Find this model/method's data for this metric
400
+ for data in metric_rankings[metric]:
401
+ if data['name'] == name:
402
+ x_vals.append(data['x_category'])
403
+ y_vals.append(data['score'])
404
+ break
405
+
406
+ if x_vals: # Only add if has data
407
+ color = color_map.get(name, fallback_color)
408
+ fig.add_trace(go.Bar(
409
+ name=name,
410
+ x=x_vals,
411
+ y=y_vals,
412
+ marker_color=color,
413
+ hovertemplate="<b>%{fullData.name}</b><br>Score: %{y:.1f}<extra></extra>"
414
+ ))
415
+
416
+ # Update layout
417
+ fig.update_layout(
418
+ title=dict(
419
+ text=f"Top {top_n} Performers by Metric",
420
+ x=0.5,
421
+ xanchor='center',
422
+ font=dict(size=18)
423
+ ),
424
+ xaxis_title="Metrics (Ranked by Performance)",
425
+ yaxis_title="Normalized Score",
426
+ xaxis=dict(
427
+ categoryorder='array',
428
+ categoryarray=all_x_categories,
429
+ tickangle=0
430
+ ),
431
+ yaxis=dict(range=[0, 100]),
432
+ barmode='group',
433
+ bargap=0.15,
434
+ bargroupgap=0.1,
435
+ height=550,
436
+ margin=dict(l=60, r=200, t=80, b=80),
437
+ legend=dict(
438
+ font=dict(size=11),
439
+ title=dict(text="Models/Methods 💡", font=dict(size=12)),
440
+ itemsizing='trace',
441
+ x=1.05,
442
+ y=1,
443
+ xanchor='left',
444
+ yanchor='top',
445
+ bgcolor='rgba(255,255,255,0.6)',
446
+ bordercolor='gray',
447
+ borderwidth=1
448
+ )
449
+ )
450
+
451
+ return fig
452
+
453
+
454
+ def create_horizontal_bar_chart(
455
+ df: pd.DataFrame,
456
+ metric: str,
457
+ color_map: Optional[Dict[str, str]] = None
458
+ ) -> go.Figure:
459
+ """
460
+ Create horizontal bar chart for single metric details view.
461
+ Adapted from lmgame_bench's create_horizontal_bar_chart().
462
+
463
+ Args:
464
+ df: DataFrame with scores
465
+ metric: Metric name (e.g., "Recall")
466
+ color_map: Dictionary mapping names to colors
467
+
468
+ Returns:
469
+ Plotly Figure with horizontal bar chart
470
+
471
+ Features:
472
+ - Y-axis: Model/method names (sorted by score, descending)
473
+ - X-axis: Raw accuracy score (0-1 range)
474
+ - Uses raw scores, not normalized
475
+ """
476
+ if df.empty or metric not in df.columns:
477
+ fig = go.Figure()
478
+ fig.update_layout(title=f"No data available for {metric}")
479
+ return fig
480
+
481
+ # Load colors if not provided
482
+ if color_map is None:
483
+ color_map, fallback_color = load_model_colors()
484
+ else:
485
+ fallback_color = '#808080'
486
+
487
+ # Filter and sort
488
+ metric_df = df[df[metric].notna()].copy()
489
+ metric_df = metric_df.sort_values(by=metric, ascending=True) # Plotly draws the first row at the bottom, so the highest score ends up on top
490
+
491
+ if metric_df.empty:
492
+ fig = go.Figure()
493
+ fig.update_layout(title=f"No valid data for {metric}")
494
+ return fig
495
+
496
+ # Create bar chart
497
+ colors = [color_map.get(name, fallback_color) for name in metric_df['Name']]
498
+
499
+ fig = go.Figure(
500
+ go.Bar(
501
+ y=metric_df['Name'],
502
+ x=metric_df[metric],
503
+ orientation='h',
504
+ marker=dict(
505
+ color=colors,
506
+ line=dict(color='#2c3e50', width=1)
507
+ ),
508
+ hovertemplate='%{y}<br>Accuracy: %{x:.4f}<extra></extra>'
509
+ )
510
+ )
511
+
512
+ # Update layout
513
+ fig.update_layout(
514
+ title=dict(
515
+ text=f'{metric} - Detailed Rankings',
516
+ x=0.5,
517
+ xanchor='center',
518
+ font=dict(size=18)
519
+ ),
520
+ xaxis_title="Accuracy",
521
+ yaxis_title="Model/Method",
522
+ xaxis=dict(
523
+ range=[0, 1],
524
+ gridcolor='#e0e0e0'
525
+ ),
526
+ plot_bgcolor='rgba(0,0,0,0)',
527
+ paper_bgcolor='rgba(0,0,0,0)',
528
+ font=dict(color='#2c3e50'),
529
+ height=max(400, len(metric_df) * 30), # Dynamic height based on entries
530
+ margin=dict(l=200, r=40, t=80, b=60),
531
+ showlegend=False
532
+ )
533
+
534
+ return fig
535
+
536
+
537
+ def create_multi_metric_bar_chart(
538
+ df: pd.DataFrame,
539
+ selected_metrics: List[str],
540
+ color_map: Optional[Dict[str, str]] = None
541
+ ) -> go.Figure:
542
+ """
543
+ Create grouped horizontal bar chart showing multiple metrics for each model/method.
544
+
545
+ Args:
546
+ df: DataFrame with scores
547
+ selected_metrics: List of metrics to display (e.g., ["Recall", "Causal Inference"])
548
+ color_map: Dictionary mapping names to colors
549
+
550
+ Returns:
551
+ Plotly Figure with grouped horizontal bar chart
552
+
553
+ Features:
554
+ - Y-axis: Model/method names
555
+ - X-axis: Raw accuracy score (0-1 range)
556
+ - Multiple bars per model/method (one per selected metric)
557
+ - Sorted by average score across selected metrics
558
+ """
559
+ if df.empty or not selected_metrics:
560
+ fig = go.Figure()
561
+ fig.update_layout(title="No data available")
562
+ return fig
563
+
564
+ # Check if all selected metrics exist
565
+ missing_metrics = [m for m in selected_metrics if m not in df.columns]
566
+ if missing_metrics:
567
+ fig = go.Figure()
568
+ fig.update_layout(title=f"Missing metrics: {', '.join(missing_metrics)}")
569
+ return fig
570
+
571
+ # Filter to entries that have at least one selected metric
572
+ metric_df = df.copy()
573
+ metric_df = metric_df[metric_df[selected_metrics].notna().any(axis=1)]
574
+
575
+ if metric_df.empty:
576
+ fig = go.Figure()
577
+ fig.update_layout(title="No valid data for selected metrics")
578
+ return fig
579
+
580
+ # Calculate average score across selected metrics for sorting
581
+ metric_df['avg_score'] = metric_df[selected_metrics].mean(axis=1)
582
+ metric_df = metric_df.sort_values(by='avg_score', ascending=True) # Plotly draws the first row at the bottom, so the highest score ends up on top
583
+
584
+ # Use single base color with gradient based on capability
585
+ base_color = "#636EFA" # Blue color
586
+
587
+ # Normalize avg_score to create gradient (0.3 to 1.0 range for visibility)
588
+ min_score = metric_df['avg_score'].min()
589
+ max_score = metric_df['avg_score'].max()
590
+ score_range = max_score - min_score if max_score > min_score else 1
591
+
592
+ # Create color gradient based on model capability (higher score = deeper color)
593
+ def get_gradient_color(score, min_val, max_val, score_range):
594
+ """Generate color with gradient based on score"""
595
+ # Normalize to 0-1 range, then scale to 0.3-1.0 for better visibility
596
+ normalized = (score - min_val) / score_range if score_range > 0 else 0.5
597
+ intensity = 0.3 + (normalized * 0.7) # Range: 0.3 (light) to 1.0 (deep)
598
+
599
+ # Convert base color to RGB and apply intensity with 50% opacity
600
+ hex_color = base_color.lstrip('#')
601
+ r = int(hex_color[0:2], 16)
602
+ g = int(hex_color[2:4], 16)
603
+ b = int(hex_color[4:6], 16)
604
+
605
+ # Apply intensity to RGB values
606
+ r = int(255 - (255 - r) * intensity)
607
+ g = int(255 - (255 - g) * intensity)
608
+ b = int(255 - (255 - b) * intensity)
609
+
610
+ return f'rgba({r}, {g}, {b}, 0.5)' # 50% transparency
611
+
612
+ # Create grouped bar chart
613
+ fig = go.Figure()
614
+
615
+ for metric in selected_metrics:
616
+ # Create color array for each model based on their avg_score
617
+ colors = [
618
+ get_gradient_color(row['avg_score'], min_score, max_score, score_range)
619
+ for _, row in metric_df.iterrows()
620
+ ]
621
+
622
+ fig.add_trace(go.Bar(
623
+ name=metric,
624
+ y=metric_df['Name'],
625
+ x=metric_df[metric],
626
+ orientation='h',
627
+ marker=dict(
628
+ color=colors,
629
+ line=dict(color='#2c3e50', width=0.5)
630
+ ),
631
+ hovertemplate=f'<b>%{{y}}</b><br>{metric}: %{{x:.4f}}<extra></extra>'
632
+ ))
633
+
634
+ # Update layout
635
+ fig.update_layout(
636
+ title=dict(
637
+ text=f'Detailed Comparison - {", ".join(selected_metrics)}',
638
+ x=0.5,
639
+ xanchor='center',
640
+ font=dict(size=18)
641
+ ),
642
+ xaxis_title="Accuracy",
643
+ yaxis_title="Model/Method",
644
+ xaxis=dict(
645
+ range=[0, 1],
646
+ gridcolor='#e0e0e0'
647
+ ),
648
+ barmode='group',
649
+ plot_bgcolor='rgba(0,0,0,0)',
650
+ paper_bgcolor='rgba(0,0,0,0)',
651
+ font=dict(color='#2c3e50'),
652
+ height=max(500, len(metric_df) * 40), # Dynamic height
653
+ margin=dict(l=200, r=40, t=80, b=80),
654
+ legend=dict(
655
+ orientation="h",
656
+ yanchor="bottom",
657
+ y=1.02,
658
+ xanchor="center",
659
+ x=0.5,
660
+ font=dict(size=12)
661
+ )
662
+ )
663
+
664
+ return fig
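The formula in the `normalize_scores` docstring, `clamp((z_score * 30) + 35, 0, 100)`, can be checked on a few values. The snippet below is a standalone sketch of that formula for a single value, not the module's function: the name `normalize_one` is hypothetical, and the mean and standard deviation are made-up inputs chosen so the arithmetic is exact.

```python
def normalize_one(value: float, mean: float, std: float) -> float:
    """Z-score one value, scale it, and clamp it to the 0-100 range."""
    std = max(std, 0.05)  # same minimum-std floor as normalize_scores
    z_score = (value - mean) / std
    return max(0.0, min(100.0, z_score * 30 + 35))


# Illustrative inputs only: mean 0.5, std 0.25.
print(normalize_one(0.50, 0.5, 0.25))  # z = 0
print(normalize_one(0.75, 0.5, 0.25))  # z = +1
print(normalize_one(1.00, 0.5, 0.25))  # z = +2
print(normalize_one(0.00, 0.5, 0.25))  # z = -2, clamped at the lower bound
```

Under this formula a score equal to the mean maps to 35 rather than 50, each standard deviation adds or removes 30 points, and anything below roughly 1.17 standard deviations under the mean is clamped to 0.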