Borchmann committed on
Commit 7abb0ba · verified · 1 Parent(s): 86d62d0

Upload folder using huggingface_hub

Files changed (8):
  1. .streamlit/config.toml +11 -0
  2. README.md +37 -119
  3. app.py +1623 -380
  4. eval/README.md +82 -0
  5. eval/evaluate.py +309 -0
  6. eval/metrics.py +209 -0
  7. eval/requirements.txt +5 -0
  8. requirements.txt +7 -14
.streamlit/config.toml ADDED
@@ -0,0 +1,11 @@
+ [theme]
+ # Snowflake Blue as primary color (controls tabs, checkboxes, buttons)
+ primaryColor = "#29B5E8"
+ backgroundColor = "#0e1117"
+ secondaryBackgroundColor = "#1a1a2e"
+ textColor = "#ffffff"
+ font = "sans serif"
+
+ [server]
+ headless = true
+
README.md CHANGED
@@ -1,141 +1,59 @@
  ---
  title: Agentic Document AI Leaderboard
- emoji: 🤖📄
- colorFrom: green
  colorTo: indigo
- sdk: gradio
  app_file: app.py
- pinned: true
- license: apache-2.0
- short_description: Leaderboard for evaluating AI agents
- sdk_version: 5.43.1
- tags:
- - leaderboard
- - document-ai
- - agents
  ---

- # 🤖📄 Agentic Document AI Leaderboard

- A leaderboard for evaluating AI agents on complex document understanding tasks that require multi-step reasoning and evidence gathering across documents.

- ## 📊 Metrics

- The benchmark evaluates models using **ANLS (Average Normalized Levenshtein Similarity)** across four task categories:

- 1. **ANLS (Overall)** - Main score across the entire dataset
- 2. **ANLS (Single Evidence)** - Questions requiring single evidence extraction
- 3. **ANLS (Multi-Evidence, Same Doc)** - Combining evidence within one document
- 4. **ANLS (Multi-Evidence, Multi Doc)** - Synthesizing across multiple documents

- Additionally, we track:
- - **Agent Steps**: Total number of reasoning/action steps
- - **Cost (USD)**: Estimated inference cost
-
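Every headline number in this diff is an ANLS score, so it is worth pinning the definition down. Below is a minimal editorial sketch of the metric's usual form (normalized Levenshtein similarity against the closest gold answer, zeroed below a threshold, conventionally 0.5); the repository's own implementation is `anls_star` in `eval/metrics.py` and may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(prediction: str, golds: list[str], threshold: float = 0.5) -> float:
    """Similarity of a prediction to its closest gold answer, in [0, 1]."""
    best = 0.0
    for gold in golds:
        p, g = prediction.strip().lower(), gold.strip().lower()
        denom = max(len(p), len(g)) or 1  # avoid division by zero on empty strings
        best = max(best, 1.0 - levenshtein(p, g) / denom)
    return best if best >= threshold else 0.0
```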
- ## 🚀 How to Submit
-
- ### 1. Run Your Model
-
- Run your model/agent on the Agentic Document AI benchmark dataset.
-
- ### 2. Prepare Your Predictions File
-
- Create a JSONL file where each line contains one prediction (see `submission_template.jsonl` for examples; a validation sketch follows below):
-
- ```jsonl
- {"question": "What is Dr. McElhaney's position at AMRIC?", "answer": ["Senior Scientist"], "citations": [{"file": "1307326.pdf", "page": 1}], "iterations": 1, "id": "q_4"}
- {"question": "Who is the CEO of the company?", "answer": ["John Smith"], "citations": [{"file": "company_report.pdf", "page": 3}], "iterations": 2, "id": "q_5"}
- {"question": "What was the revenue in 2023?", "answer": ["$5.2 million"], "citations": [{"file": "financial_report.pdf", "page": 12}, {"file": "annual_summary.pdf", "page": 4}], "iterations": 3, "id": "q_6"}
  ```

- **Required fields per line:**
- - `question`: The question text (string)
- - `answer`: List of answer strings
- - `citations`: List of dicts with `"file"` and `"page"` keys
- - `iterations`: Number of agent iterations/steps (integer ≥ 0)
- - `id`: Unique question identifier matching the benchmark (string)
-
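Those field requirements translate directly into a pre-submission check. A hypothetical validator (an editorial sketch, not the space's own validation in `src/submission/submit.py`):

```python
import json

# Required fields and their expected JSON types, per the list above.
REQUIRED = {"question": str, "answer": list, "citations": list, "iterations": int, "id": str}


def validate_predictions(path: str) -> list[str]:
    """Return a list of per-line format errors for a predictions JSONL file."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc})")
                continue
            for field, expected in REQUIRED.items():
                if not isinstance(record.get(field), expected):
                    errors.append(f"line {lineno}: missing or mistyped field '{field}'")
            for citation in record.get("citations", []):
                if not isinstance(citation, dict) or not {"file", "page"} <= citation.keys():
                    errors.append(f"line {lineno}: each citation needs 'file' and 'page'")
            if isinstance(record.get("iterations"), int) and record["iterations"] < 0:
                errors.append(f"line {lineno}: 'iterations' must be >= 0")
    return errors
```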
- ### 3. Submit via the Interface
-
- 1. Go to the "🚀 Submit Results" tab
- 2. Fill in:
-    - **Model Name**: A descriptive name for your system (e.g., "GPT-4-Agent-v1")
-    - **Submitted By**: Your name or organization
-    - **Model Type**: Whether your model is behind an API or uses open weights
-    - **Predictions JSONL File**: Upload your JSONL file
- 3. Click "Submit Evaluation"
- 4. The system will:
-    - Validate your JSONL format
-    - Evaluate against the gold standard
-    - Compute ANLS scores automatically
-    - Display results on the leaderboard
-
- ## ⚙️ Configuration

- Most configuration variables are in:
- - `src/envs.py` - Repository paths and API configuration
- - `src/about.py` - Task definitions and benchmark description
-
- ## 🔬 Implementing the Evaluator
-
- **IMPORTANT:** You need to implement the evaluation logic in `src/evaluation/evaluator.py`.
-
- The evaluator should:
- 1. Load your gold standard dataset with correct answers and metadata
- 2. Compute ANLS (Average Normalized Levenshtein Similarity) for each prediction
- 3. Classify questions by evidence type (single/multi-doc same/multi-doc different)
- 4. Aggregate scores by category
- 5. Calculate agent steps and cost metrics
-
- See `src/evaluation/evaluator.py` for the template and detailed TODOs.
-
- **Current Status:** The system uses placeholder scores (0.50) until you implement the evaluator.
-
- To integrate your evaluator:
- 1. Implement functions in `src/evaluation/evaluator.py`
- 2. Uncomment lines 120-122 in `src/submission/submit.py`
- 3. Test with a sample submission
-
- ## 🗂️ Project Structure

- ```
- ├── app.py                     # Main Gradio application
- ├── src/
- │   ├── about.py               # Benchmark description and tasks
- │   ├── envs.py                # Environment configuration
- │   ├── display/
- │   │   ├── utils.py           # Column definitions and data types
- │   │   ├── formatting.py      # Display formatting utilities
- │   │   └── css_html_js.py     # Custom styling
- │   ├── evaluation/
- │   │   └── evaluator.py       # ⚠️ IMPLEMENT THIS: ANLS evaluation logic
- │   ├── leaderboard/
- │   │   └── read_evals.py      # Result parsing logic
- │   ├── submission/
- │   │   ├── submit.py          # Submission handling & validation
- │   │   └── check_validity.py  # Duplicate checking
- │   └── populate.py            # Dataframe population
- ├── eval-queue/                # Submission requests (auto-generated)
- ├── eval-results/              # Predictions & results (auto-generated)
- ├── submission_template.jsonl  # Template for submissions
- └── ADAPTATION_SUMMARY.md      # Detailed adaptation notes
- ```

- ## 🔧 Troubleshooting

- If you encounter problems with the space:
- - Restart the space to clear cached data
- - Check that `eval-queue` and `eval-results` directories are properly synced with HuggingFace datasets
- - Verify your environment variables in `src/envs.py` are correctly configured

- ## 📝 Code Logic

- For advanced customization:
- - **Column definitions**: `src/display/utils.py`
- - **Result parsing**: `src/leaderboard/read_evals.py`
- - **Submission logic**: `src/submission/submit.py` and `src/submission/check_validity.py`
- - **UI layout**: `app.py`

- ## 📚 Additional Documentation

- See `ADAPTATION_SUMMARY.md` for detailed information about the changes made to adapt this from the HuggingFace leaderboard template.
 
  ---
  title: Agentic Document AI Leaderboard
+ emoji: 📄
+ colorFrom: blue
  colorTo: indigo
+ sdk: streamlit
+ sdk_version: "1.37.0"
  app_file: app.py
+ pinned: false
+ hf_oauth: true
  ---

+ # Agentic Document AI Leaderboard - Streamlit Version

+ This is a Streamlit port of the Agentic Document AI Leaderboard.

+ ## Features

+ - 📊 **Leaderboard**: View and filter model performance rankings
+ - 📈 **Visualizations**: Interactive plots of accuracy versus attribution and effort
+ - 📖 **About**: Information about the benchmark and metrics
+ - 📝 **Submit**: Validate and submit your model results

+ ## Installation

+ ```bash
+ cd streamlit_app
+ pip install -r requirements.txt
  ```

+ ## Running the App

+ ```bash
+ streamlit run app.py
+ ```

+ The app will open in your browser at `http://localhost:8501`.

+ ## Color Palette (Snowflake)

+ - SNOWFLAKE BLUE: #29B5E8
+ - MID-BLUE: #11567F
+ - STAR BLUE: #75CDD7
+ - VALENCIA ORANGE: #FF9F36
+ - FIRST LIGHT: #D45B90
+ - MEDIUM GRAY: #5B5B5B

+ ## Differences from Gradio Version

+ 1. **Native Streamlit components** instead of gradio_leaderboard
+ 2. **Simplified submission flow** - validates but doesn't upload to HuggingFace Hub
+ 3. **Native dataframe display** with column configuration
+ 4. **Streamlit tabs** instead of Gradio tabs

+ ## Data

+ The app reads evaluation results from the `../eval-results/` directory (relative to this app).
+ Make sure the eval-results folder exists with JSON result files.
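For orientation, the Streamlit app's `load_eval_results()` (see the app.py diff below) scans `eval-results/<org>/` for files matching `*_results_*.json`. A sketch of one such file, with hypothetical values throughout; the field names mirror what that function reads:

```python
import json
from pathlib import Path

# Hypothetical example submission; every value here is made up for illustration.
result = {
    "model_name": "example-agent",
    "organization": "example-org",
    "metadata": {"model_type": "api"},
    "tags": ["Agentic", "BM25 Search Tool"],
    "results": {
        "overall": {"anls": 61.2, "page_f1": 48.7, "doc_f1": 70.1, "kuiper": 0.21},
        "single_evidence": {"anls": 72.4},
        "multi_evidence_same_doc": {"anls": 55.0},
        "multi_evidence_multi_doc": {"anls": 41.9},
        "by_domain": {"finance": {"anls": 58.3, "n": 120}},
    },
    "submission_date": "2025-01-01",
    "link": "https://example.org/model-card",
    "description": "Example leaderboard entry",
}

out = Path("eval-results/example-org/example-agent_results_2025-01-01.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(result, indent=2))
```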
app.py CHANGED
@@ -1,5 +1,9 @@
  """
- Agentic Document AI Leaderboard
 
  Color palette: Snowflake colors
  - SNOWFLAKE BLUE: #29B5E8
@@ -12,454 +16,1693 @@ Color palette: Snowflake colors
  - PURPLE MOON: #7254A3
  """
 
  import os
- from typing import Optional
 
- import gradio as gr
  import pandas as pd
  import plotly.graph_objects as go
- from apscheduler.schedulers.background import BackgroundScheduler
- from gradio_leaderboard import Leaderboard, SelectColumns
- from huggingface_hub import snapshot_download
-
- from src.about import LLM_BENCHMARKS_TEXT, TITLE
- from src.display.css_html_js import custom_css
- from src.display.utils import BENCHMARK_COLS, COLS, EVAL_COLS, EVAL_TYPES, AutoEvalColumn, ModelType, Tasks, fields
- from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, QUEUE_REPO, REPO_ID, RESULTS_REPO, TOKEN
- from src.populate import get_evaluation_queue_df, get_leaderboard_df
- from src.submission.submit import add_new_eval
-
- # Set static directory for assets
- # Note: Must be absolute path for gr.set_static_paths
- ASSETS_PATH = os.path.abspath("assets")
- gr.set_static_paths(paths=[ASSETS_PATH])
-
-
- # Load SVG icons
- def load_svg(filename):
-     """Load SVG file and return as string"""
-     svg_path = os.path.join("assets", filename)
42
      try:
-         with open(svg_path, "r") as f:
-             return f.read()
      except Exception:
          return ""
 
 
- # Load tab icons
- ICON_MEDAL = load_svg("snow_medal.svg")
- ICON_PLOT = load_svg("snow_eye.svg")
- ICON_DOC = load_svg("snow_docs.svg")
- ICON_WRITE = load_svg("snow_write.svg")
- ICON_CLOUD = load_svg("snow_cloud2.svg")
- ICON_CODE = load_svg("snow_code.svg")
 
- # Tab brand colors
- LEADERBOARD_COLOR = "--body-text-color"  # Snowflake blue
- VISUALIZATIONS_COLOR = "--body-text-color"  # Valencia orange
- ABOUT_COLOR = "--body-text-color"  # Purple moon
- SUBMIT_COLOR = "--body-text-color"  # First light
 
 
 
- def render_tab_header(title: str, icon_svg: Optional[str] = None, color: str = LEADERBOARD_COLOR) -> str:
-     """Generate HTML string for tab header with optional SVG icon."""
-     icon_style = f'style="--tab-icon-color: {color};"' if icon_svg else ""
-     icon_block = f'<span class="tab-icon" {icon_style}>{icon_svg}</span>' if icon_svg else ""
-     return f'<div class="tab-title">{icon_block}<h1 style="color: {color};">{title}</h1></div>'
 
 
- def restart_space():
-     API.restart_space(repo_id=REPO_ID)
 
 
- def create_plot_df(leaderboard_df):
-     """Extract data for plotting from leaderboard dataframe."""
-     if leaderboard_df is None or leaderboard_df.empty:
          return pd.DataFrame()
 
-     # Get the first task column name (Overall ANLS)
-     first_task_col = list(Tasks)[0].value.col_name
 
-     plot_data = []
-     for _, row in leaderboard_df.iterrows():
-         try:
-             # Extract model name (remove markdown links)
-             model_text = row.get(AutoEvalColumn.model.name, "Unknown")
-             if isinstance(model_text, str):
-                 # Extract text from markdown link [text](url)
-                 import re
 
-                 match = re.search(r"\[([^\]]+)\]", model_text)
-                 model_name = match.group(1) if match else model_text
-             else:
-                 model_name = str(model_text)
-
-             plot_data.append(
-                 {
-                     "model": model_name,
-                     "anls": row.get(first_task_col, 0),
-                     "agent_steps": row.get(AutoEvalColumn.agent_steps.name, 0),
-                     "cost_usd": row.get(AutoEvalColumn.cost_usd.name, 0),
-                     "model_type": row.get(AutoEvalColumn.model_type.name, "unknown"),
-                 }
-             )
-         except Exception as e:
-             print(f"Error processing row: {e}")
-             continue
 
-     return pd.DataFrame(plot_data)
 
 
- def create_anls_vs_steps_plot(leaderboard_df):
-     """Create scatter plot of ANLS vs Agent Steps."""
-     df = create_plot_df(leaderboard_df)
 
      if df.empty:
          fig = go.Figure()
          fig.add_annotation(
-             text="No data available", xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False, font=dict(size=20)
          )
          return fig
-
-     # Snowflake color palette
      color_map = {
-         "api": "#D45B90",  # FIRST LIGHT
-         "open-weight": "#29B5E8",  # SNOWFLAKE BLUE
-         "unknown": "#5B5B5B",  # MEDIUM GRAY
      }
-
      fig = go.Figure()
-
-     for model_type in df["model_type"].unique():
-         df_type = df[df["model_type"] == model_type]
-         fig.add_trace(
-             go.Scatter(
-                 x=df_type["agent_steps"],
-                 y=df_type["anls"],
-                 mode="markers+text",
-                 name=model_type,
-                 text=df_type["model"],
-                 textposition="top center",
-                 textfont=dict(size=9),
-                 marker=dict(size=12, color=color_map.get(model_type, "#95A5A6"), line=dict(width=1, color="white")),
-                 hovertemplate="<b>%{text}</b><br>Agent Steps: %{x}<br>ANLS: %{y:.2f}<extra></extra>",
-             )
-         )
-
      fig.update_layout(
-         title="ANLS Score vs Agent Steps",
-         xaxis_title="Agent Steps",
-         yaxis_title="ANLS Score (%)",
          hovermode="closest",
-         template="plotly_white",
-         height=600,
          showlegend=True,
-         legend=dict(title="Model Type", yanchor="top", y=0.99, xanchor="right", x=0.99),
      )
-
      return fig
 
 
- def create_anls_vs_cost_plot(leaderboard_df):
-     """Create scatter plot of ANLS vs Cost (USD)."""
-     df = create_plot_df(leaderboard_df)
-
      if df.empty:
          fig = go.Figure()
          fig.add_annotation(
-             text="No data available", xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False, font=dict(size=20)
          )
          return fig
-
-     # Filter out models with zero cost for better visualization
-     df_with_cost = df[df["cost_usd"] > 0]
-
-     if df_with_cost.empty:
-         df_with_cost = df  # Fall back to all data if no cost data
-
-     # Snowflake color palette
      color_map = {
-         "api": "#FF9F36",  # VALENCIA ORANGE
-         "open-weight": "#29B5E8",  # SNOWFLAKE BLUE
-         "unknown": "#5B5B5B",  # MEDIUM GRAY
      }
-
      fig = go.Figure()
-
-     for model_type in df_with_cost["model_type"].unique():
-         df_type = df_with_cost[df_with_cost["model_type"] == model_type]
-         fig.add_trace(
-             go.Scatter(
-                 x=df_type["cost_usd"],
-                 y=df_type["anls"],
-                 mode="markers+text",
-                 name=model_type,
-                 text=df_type["model"],
-                 textposition="top center",
-                 textfont=dict(size=9),
-                 marker=dict(size=12, color=color_map.get(model_type, "#95A5A6"), line=dict(width=1, color="white")),
-                 hovertemplate="<b>%{text}</b><br>Cost: $%{x:.2f}<br>ANLS: %{y:.2f}<extra></extra>",
-             )
-         )
-
      fig.update_layout(
-         title="ANLS Score vs Cost (USD)",
-         xaxis_title="Cost (USD)",
-         yaxis_title="ANLS Score (%)",
          hovermode="closest",
-         template="plotly_white",
-         height=600,
          showlegend=True,
-         legend=dict(title="Model Type", yanchor="top", y=0.99, xanchor="right", x=0.99),
      )
-
      return fig
 
 
219
- ### Space initialisation
220
- try:
221
- print(EVAL_REQUESTS_PATH)
222
- snapshot_download(
223
- repo_id=QUEUE_REPO,
224
- local_dir=EVAL_REQUESTS_PATH,
225
- repo_type="dataset",
226
- tqdm_class=None,
227
- etag_timeout=30,
228
- token=TOKEN,
229
- )
230
- except Exception:
231
- restart_space()
232
- try:
233
- print(EVAL_RESULTS_PATH)
234
- snapshot_download(
235
- repo_id=RESULTS_REPO,
236
- local_dir=EVAL_RESULTS_PATH,
237
- repo_type="dataset",
238
- tqdm_class=None,
239
- etag_timeout=30,
240
- token=TOKEN,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
241
  )
242
- except Exception:
243
- restart_space()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
244
 
245
 
246
- LEADERBOARD_DF = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, COLS, BENCHMARK_COLS)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
- (
-     finished_eval_queue_df,
-     running_eval_queue_df,
-     pending_eval_queue_df,
- ) = get_evaluation_queue_df(EVAL_REQUESTS_PATH, EVAL_COLS)
 
 
- def init_leaderboard(dataframe):
-     if dataframe is None or dataframe.empty:
-         raise ValueError("Leaderboard DataFrame is empty or None.")
 
-     # Calculate dynamic filter ranges from actual data
-     max_agent_steps = int(dataframe[AutoEvalColumn.agent_steps.name].max()) if len(dataframe) > 0 else 1000
-     max_cost = float(dataframe[AutoEvalColumn.cost_usd.name].max()) if len(dataframe) > 0 else 10.0
 
-     # Add some headroom to max values
-     max_agent_steps = max(max_agent_steps + 100, 1000)
-     max_cost = max(max_cost + 1.0, 10.0)
 
-     return Leaderboard(
-         value=dataframe,
-         datatype=[c.type for c in fields(AutoEvalColumn)],
-         select_columns=SelectColumns(
-             default_selection=[c.name for c in fields(AutoEvalColumn) if c.displayed_by_default],
-             cant_deselect=[c.name for c in fields(AutoEvalColumn) if c.never_hidden],
-             label="Select columns to display:",
-         ),
-         search_columns=[AutoEvalColumn.model.name, AutoEvalColumn.organization.name],
-         hide_columns=[c.name for c in fields(AutoEvalColumn) if c.hidden] + ["Type"],
-         bool_checkboxgroup_label="Hide models",
-         interactive=False,
      )
 
 
- demo = gr.Blocks(
-     css=custom_css,
-     theme=gr.themes.Default(
-         primary_hue=gr.themes.Color(
-             c50="#E6F7FC",
-             c100="#B3E5F5",
-             c200="#80D3ED",
-             c300="#4DC1E5",
-             c400="#29B5E8",  # SNOWFLAKE BLUE
-             c500="#29B5E8",  # SNOWFLAKE BLUE (primary)
-             c600="#11567F",  # MID-BLUE
-             c700="#0D4464",
-             c800="#093248",
-             c900="#05202D",
-             c950="#021018",
-             name="snowflake_blue",
-         ),
-         secondary_hue=gr.themes.Color(
-             c50="#FFF4E6",
-             c100="#FFE4B3",
-             c200="#FFD480",
-             c300="#FFC44D",
-             c400="#FFB41A",
-             c500="#FF9F36",  # VALENCIA ORANGE
-             c600="#E68A1F",
-             c700="#CC7A1B",
-             c800="#B36A17",
-             c900="#995A13",
-             c950="#804A0F",
-             name="valencia_orange",
-         ),
-         neutral_hue=gr.themes.Color(
-             c50="#F5F5F5",
-             c100="#E0E0E0",
-             c200="#CCCCCC",
-             c300="#B8B8B8",
-             c400="#A3A3A3",
-             c500="#8F8F8F",
-             c600="#7A7A7A",
-             c700="#5B5B5B",  # MEDIUM GRAY
-             c800="#474747",
-             c900="#333333",
-             c950="#1F1F1F",
-             name="medium_gray",
-         ),
-     ),
- )
- with demo:
-     gr.HTML(TITLE)
-
-     with gr.Tabs(elem_classes="tab-buttons") as tabs:
-         with gr.TabItem("Leaderboard", elem_id="llm-benchmark-tab-table", id=0):
-             with gr.Row():
-                 with gr.Column():
-                     gr.HTML(render_tab_header("Leaderboard", ICON_MEDAL, LEADERBOARD_COLOR))
-                     leaderboard = init_leaderboard(LEADERBOARD_DF)
-
-         with gr.TabItem("Visualizations", elem_id="llm-benchmark-tab-viz", id=1):
-             with gr.Row():
-                 gr.HTML(render_tab_header("Visualizations", ICON_PLOT, VISUALIZATIONS_COLOR))
-             gr.Markdown("## Performance vs Cost Analysis", elem_classes="markdown-text")
-             with gr.Row():
-                 with gr.Column():
-                     plot_steps = gr.Plot(value=create_anls_vs_steps_plot(LEADERBOARD_DF))
-                 with gr.Column():
-                     plot_cost = gr.Plot(value=create_anls_vs_cost_plot(LEADERBOARD_DF))
-             gr.Markdown(
-                 """
              **Understanding the plots:**
              - Each point represents a model submission
              - **Orange points**: API-based models
              - **Blue points**: Open-weight models
              - Hover over points to see model details
-             - Upper-left quadrant = better performance with lower cost (optimal)
-             """,
-                 elem_classes="markdown-text",
-             )
-
-         with gr.TabItem("About", elem_id="llm-benchmark-tab-about", id=2):
-             with gr.Row():
-                 gr.HTML(render_tab_header("About", ICON_DOC, ABOUT_COLOR))
-             gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-
-         with gr.TabItem("Submit Results", elem_id="llm-benchmark-tab-submit", id=3):
-             with gr.Row():
-                 gr.HTML(render_tab_header("Submit Results", ICON_WRITE, SUBMIT_COLOR))
-             with gr.Column():
-                 with gr.Column():
-                     with gr.Accordion(
-                         f"✅ Finished Evaluations ({len(finished_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             finished_eval_table = gr.components.Dataframe(
-                                 value=finished_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-                     with gr.Accordion(
-                         f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             running_eval_table = gr.components.Dataframe(
-                                 value=running_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-
-                     with gr.Accordion(
-                         f"⏳ Pending Evaluation Queue ({len(pending_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             pending_eval_table = gr.components.Dataframe(
-                                 value=pending_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-             with gr.Row():
-                 gr.Markdown("# ✉️✨ Submit your results here!", elem_classes="markdown-text")
-
-             with gr.Row():
-                 with gr.Column():
-                     model_name_textbox = gr.Textbox(
-                         label="Model Name", placeholder="e.g., GPT-4-Turbo-Agent, Claude-3-Opus-Agent"
-                     )
-                     organization_textbox = gr.Textbox(
-                         label="Organization", placeholder="e.g., OpenAI, Anthropic, Meta, or your organization name"
-                     )
-                     model_type = gr.Dropdown(
-                         choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
-                         label="Model Type",
-                         multiselect=False,
-                         value=None,
-                         interactive=True,
-                     )
-                     link_textbox = gr.Textbox(
-                         label="Link (Optional)",
-                         placeholder="e.g., https://arxiv.org/abs/... or https://github.com/...",
-                         info="Link to paper, code repository, or model card (optional)"
-                     )
-
-                 with gr.Column():
-                     predictions_file = gr.File(label="Predictions JSONL File", file_types=[".jsonl"], type="filepath")
-                     gr.Markdown(
-                         """
-                         **Expected JSONL format (one prediction per line):**
-                         ```json
-                         {"question": "What is Dr. McElhaney's position?", "answer": ["Senior Scientist"], "citations": [{"file": "1307326.pdf", "page": 1}], "iterations": 1, "id": "q_4"}
-                         {"question": "Who is the CEO?", "answer": ["John Smith"], "citations": [{"file": "report.pdf", "page": 3}], "iterations": 2, "id": "q_5"}
-                         ```
-                         **Required fields per line:**
-                         - `question`: The question text
-                         - `answer`: List of answer strings
-                         - `citations`: List of dicts with "file" and "page"
-                         - `iterations`: Number of agent iterations
-                         - `id`: Unique question identifier
-                         """
-                     )
-
-             submit_button = gr.Button("Submit Evaluation", variant="primary")
-             submission_result = gr.Markdown()
-             submit_button.click(
-                 add_new_eval,
-                 [
-                     model_name_textbox,
-                     organization_textbox,
-                     model_type,
-                     predictions_file,
-                     link_textbox,
-                 ],
-                 submission_result,
-             )
-
- scheduler = BackgroundScheduler()
- scheduler.add_job(restart_space, "interval", seconds=1800)
- scheduler.start()
-
-
- demo.queue(default_concurrency_limit=40).launch(allowed_paths=[ASSETS_PATH])
 
  """
+ Agentic Document VQA Leaderboard - Streamlit Version
+
+ Benchmark for evaluating AI systems on document collection question answering.
+ Based on the paper: "Strategic Navigation or Stochastic Search?
+ How Agents and Humans Handle Large Document Collections"
 
  Color palette: Snowflake colors
  - SNOWFLAKE BLUE: #29B5E8
  - PURPLE MOON: #7254A3
  """
 
+ import base64
+ import json
  import os
+ import sys
+ from datetime import datetime, timezone
+ from pathlib import Path
 
  import pandas as pd
  import plotly.graph_objects as go
+ import streamlit as st
+ from huggingface_hub import snapshot_download, HfApi
+
+ # Add eval module to path
+ sys.path.insert(0, str(Path(__file__).parent / "eval"))
+ try:
+     from metrics import anls_star, citation_f1, kuiper_statistic
+     from datasets import load_dataset
+     EVAL_AVAILABLE = True
+ except ImportError:
+     EVAL_AVAILABLE = False
+
+ # Page configuration
+ st.set_page_config(
+     page_title="Agentic Document VQA",
+     page_icon="📄",
+     layout="wide",
+     initial_sidebar_state="collapsed",
+ )
+
+ # HuggingFace Hub configuration
+ TOKEN = os.environ.get("HF_TOKEN")
+ QUEUE_REPO = "agentic-document-ai/backend-requests"
+ RESULTS_REPO = "agentic-document-ai/backend-results"
+ CACHE_PATH = os.getenv("HF_HOME", ".")
+
+
+ def get_hf_user() -> dict | None:
+     """Get the logged-in HuggingFace user info from OAuth.
+
+     Returns dict with 'username', 'name', 'picture' if logged in, None otherwise.
+     Works on HuggingFace Spaces with hf_oauth: true in README.md.
+     """
+     # Check if running on HF Spaces with OAuth enabled
+     if hasattr(st, 'context') and hasattr(st.context, 'headers'):
+         headers = st.context.headers
+         # HF Spaces passes user info in headers when OAuth is enabled
+         hf_user = headers.get("HF-User")
+         if hf_user:
+             return {
+                 'username': hf_user,
+                 'name': headers.get("HF-User-Name", hf_user),
+                 'picture': headers.get("HF-User-Picture", ""),
+             }
+
+     # Check for st.user (Streamlit 1.37+)
+     if hasattr(st, 'user') and st.user.get('email'):
+         return {
+             'username': st.user.get('email', '').split('@')[0],
+             'name': st.user.get('name', ''),
+             'picture': st.user.get('picture', ''),
+         }
+
+     return None
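A hypothetical call site, to make the return contract concrete (editorial sketch, not part of the commit; the submit tab's actual usage appears further down in app.py):

```python
user = get_hf_user()  # None outside Spaces, or when the visitor is not logged in
if user:
    st.caption(f"Signed in as @{user['username']}")
else:
    st.info("Log in with your HuggingFace account to submit results.")
```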
+
+ # Colors
+ SNOWFLAKE_BLUE = "#29B5E8"
+ MID_BLUE = "#11567F"
+ VALENCIA_ORANGE = "#FF9F36"
+ STAR_BLUE = "#75CDD7"
+ FIRST_LIGHT = "#D45B90"
+ PURPLE_MOON = "#7254A3"
+ MEDIUM_GRAY = "#5B5B5B"
+
+ # Available tags for filtering - can be extended
+ AVAILABLE_TAGS = [
+     "Agentic",
+     "Conventional RAG",
+     "BM25 Search Tool",
+     "Semantic Search Tool",
+     "Vision and Language",
+     "Text-only",
+ ]
+
+ # Tag colors for visual distinction (cycling through Snowflake secondary colors)
+ TAG_COLORS = {
+     "Agentic": MID_BLUE,
+     "Conventional RAG": STAR_BLUE,
+     "BM25 Search Tool": VALENCIA_ORANGE,
+     "Semantic Search Tool": FIRST_LIGHT,
+     "Vision and Language": PURPLE_MOON,
+     "Text-only": SNOWFLAKE_BLUE,
+ }
+
+ # Custom CSS following Snowflake Brand Color Guide
+ # Primary: MID-BLUE (#11567F) for accents/sections, SNOWFLAKE BLUE (#29B5E8) sparingly
+ # Use white text on dark backgrounds per accessibility guidelines
+ st.markdown(f"""
+ <style>
+ /* Dark theme base - using near-black for good contrast */
+ .stApp {{
+     background-color: #0e1117;
+ }}
+
+ /* ===== TAB STYLING ===== */
+ .stTabs [data-baseweb="tab-list"] {{
+     gap: 8px;
+     background-color: transparent;
+     border-bottom: 2px solid {MID_BLUE};
+     padding-bottom: 0;
+ }}
+
+ .stTabs [data-baseweb="tab"] {{
+     height: 50px;
+     padding: 0 28px;
+     background-color: transparent !important;
+     border-radius: 0;
+     font-weight: 500;
+     font-size: 18px;
+     color: {MEDIUM_GRAY} !important;
+     border-bottom: 3px solid transparent !important;
+     margin-bottom: -2px;
+ }}
+
+ .stTabs [aria-selected="true"] {{
+     background-color: transparent !important;
+     color: {SNOWFLAKE_BLUE} !important;
+     border-bottom: 3px solid {SNOWFLAKE_BLUE} !important;
+ }}
+
+ .stTabs [data-baseweb="tab"]:hover {{
+     color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* Tab indicator overrides */
+ .stTabs [data-baseweb="tab-highlight"],
+ div[data-baseweb="tab-highlight"] {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ .stTabs [role="tablist"] > div:last-child {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* ===== CHECKBOX STYLING - Clean, no background highlight ===== */
+ .stCheckbox {{
+     background: transparent !important;
+ }}
+
+ .stCheckbox label {{
+     background: transparent !important;
+     color: white !important;
+ }}
+
+ .stCheckbox label span {{
+     background: transparent !important;
+     color: white !important;
+ }}
+
+ /* Remove any highlight/selection background from checkbox labels */
+ .stCheckbox > label,
+ .stCheckbox label > span,
+ .stCheckbox label > div {{
+     background-color: transparent !important;
+     background: none !important;
+ }}
+
+ /* The checkbox box itself */
+ [data-baseweb="checkbox"] > div:first-child {{
+     border-color: {MEDIUM_GRAY} !important;
+     background-color: transparent !important;
+ }}
+
+ [data-baseweb="checkbox"][aria-checked="true"] > div:first-child {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+     border-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* Checkmark icon */
+ [data-baseweb="checkbox"] svg {{
+     color: white !important;
+ }}
+
+ /* ===== BUTTON STYLING - MID-BLUE primary ===== */
+ .stButton > button {{
+     background-color: {MID_BLUE} !important;
+     color: white !important;
+     border: none !important;
+     border-radius: 6px;
+     font-weight: 500;
+     padding: 0.5rem 1.5rem;
+     transition: all 0.2s ease;
+ }}
+
+ .stButton > button:hover {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ .stButton > button:active, .stButton > button:focus {{
+     background-color: {MID_BLUE} !important;
+     box-shadow: 0 0 0 2px {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* Download button */
+ .stDownloadButton > button {{
+     background-color: {MID_BLUE} !important;
+     color: white !important;
+     border: none !important;
+ }}
+
+ .stDownloadButton > button:hover {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* ===== FORM ELEMENTS ===== */
+ /* Text inputs */
+ .stTextInput > div > div > input {{
+     border-color: {MEDIUM_GRAY} !important;
+     background-color: #1a1a2e !important;
+ }}
+
+ .stTextInput > div > div > input:focus {{
+     border-color: {SNOWFLAKE_BLUE} !important;
+     box-shadow: 0 0 0 1px {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* Select boxes */
+ .stSelectbox [data-baseweb="select"] > div {{
+     border-color: {MEDIUM_GRAY} !important;
+     background-color: #1a1a2e !important;
+ }}
+
+ /* Multiselect chips */
+ .stMultiSelect [data-baseweb="tag"] {{
+     background-color: {MID_BLUE} !important;
+     color: white !important;
+ }}
+
+ /* File uploader */
+ [data-testid="stFileUploader"] {{
+     border: 2px dashed {MEDIUM_GRAY} !important;
+     border-radius: 12px;
+     padding: 2rem 1.5rem !important;
+     background-color: transparent !important;
+     transition: all 0.2s ease;
+ }}
+
+ [data-testid="stFileUploader"]:hover {{
+     border-color: {SNOWFLAKE_BLUE} !important;
+     background-color: rgba(17, 86, 127, 0.08) !important;
+ }}
+
+ [data-testid="stFileUploaderDropzone"] {{
+     background-color: transparent !important;
+ }}
+
+ [data-testid="stFileUploader"] section {{
+     padding: 0 !important;
+ }}
+
+ [data-testid="stFileUploader"] section > div {{
+     padding: 0.5rem 0 !important;
+ }}
+
+ /* ===== LINKS - Snowflake Blue for visibility ===== */
+ a {{
+     color: {SNOWFLAKE_BLUE} !important;
+     text-decoration: none !important;
+ }}
+
+ a:hover {{
+     color: {STAR_BLUE} !important;
+     text-decoration: underline !important;
+ }}
+
+ /* ===== SECTION HEADERS ===== */
+ h3 {{
+     color: white;
+ }}
+
+ /* ===== ALERTS/MESSAGES ===== */
+ .stAlert, [data-testid="stAlert"] {{
+     border-radius: 8px !important;
+     border: none !important;
+ }}
+
+ /* Info messages - Snowflake Blue */
+ .stInfo, [data-testid="stAlert"]:has([data-testid="stMarkdownContainer"]) {{
+     background-color: rgba(41, 181, 232, 0.15) !important;
+     border-left: 4px solid {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* Warning messages - Valencia Orange */
+ .stWarning, [role="alert"]:has(svg[data-testid="stIconWarning"]) {{
+     background-color: rgba(255, 159, 54, 0.15) !important;
+     border-left: 4px solid {VALENCIA_ORANGE} !important;
+ }}
+
+ /* Error messages - First Light (pink/red) */
+ .stError, [role="alert"]:has(svg[data-testid="stIconError"]) {{
+     background-color: rgba(212, 91, 144, 0.15) !important;
+     border-left: 4px solid {FIRST_LIGHT} !important;
+ }}
+
+ /* Success messages - Star Blue */
+ .stSuccess, [role="alert"]:has(svg[data-testid="stIconSuccess"]) {{
+     background-color: rgba(117, 205, 215, 0.15) !important;
+     border-left: 4px solid {STAR_BLUE} !important;
+ }}
+
+ /* Alert text and icon colors */
+ .stAlert p, [data-testid="stAlert"] p {{
+     color: rgba(255, 255, 255, 0.9) !important;
+ }}
+
+ /* Override default alert backgrounds */
+ [data-testid="stNotification"] {{
+     background-color: transparent !important;
+ }}
+
+ div[data-baseweb="notification"] {{
+     background-color: rgba(41, 181, 232, 0.15) !important;
+     border-left: 4px solid {SNOWFLAKE_BLUE} !important;
+     border-radius: 8px !important;
+ }}
+
+ /* ===== SPINNER ===== */
+ .stSpinner > div {{
+     border-top-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* ===== EXPANDER ===== */
+ .streamlit-expanderHeader {{
+     border-left: 3px solid {MID_BLUE};
+     background-color: rgba(17, 86, 127, 0.1) !important;
+ }}
+
+ /* ===== CODE BLOCKS ===== */
+ code {{
+     background-color: rgba(17, 86, 127, 0.2);
+     padding: 0.2em 0.4em;
+     border-radius: 3px;
+     color: {STAR_BLUE};
+ }}
+
+ /* ===== SCROLLBAR ===== */
+ ::-webkit-scrollbar {{
+     width: 8px;
+     height: 8px;
+ }}
+
+ ::-webkit-scrollbar-track {{
+     background: #1a1a2e;
+ }}
+
+ ::-webkit-scrollbar-thumb {{
+     background: {MID_BLUE};
+     border-radius: 4px;
+ }}
+
+ ::-webkit-scrollbar-thumb:hover {{
+     background: {SNOWFLAKE_BLUE};
+ }}
+
+ /* ===== ROOT VARIABLES ===== */
+ :root {{
+     --primary-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* ===== MULTISELECT STYLING ===== */
+ /* Tag filter multiselect - MID_BLUE (gradient start) */
+ div[data-testid="stHorizontalBlock"] > div:first-child .stMultiSelect [data-baseweb="tag"] {{
+     background-color: {MID_BLUE} !important;
+     color: white !important;
+ }}
+
+ /* Column selector multiselect - SNOWFLAKE_BLUE (gradient end) */
+ div[data-testid="stHorizontalBlock"] > div:last-child .stMultiSelect [data-baseweb="tag"] {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+     color: white !important;
+ }}
+
+ /* Default multiselect styling */
+ .stMultiSelect [data-baseweb="tag"] {{
+     border-radius: 12px !important;
+     padding: 2px 10px !important;
+     margin: 2px !important;
+     font-weight: 500 !important;
+ }}
+
+ .stMultiSelect [data-baseweb="tag"] span {{
+     color: inherit !important;
+ }}
+
+ /* Remove button in tag */
+ .stMultiSelect [data-baseweb="tag"] svg {{
+     color: white !important;
+     opacity: 0.8;
+ }}
+
+ .stMultiSelect [data-baseweb="tag"] svg:hover {{
+     opacity: 1;
+ }}
+
+ /* Placeholder text */
+ .stMultiSelect input::placeholder {{
+     color: {MEDIUM_GRAY} !important;
+ }}
+ </style>
+ """, unsafe_allow_html=True)
+
+
+ # Data paths
+ EVAL_RESULTS_PATH = Path(CACHE_PATH) / "eval-results"
+ EVAL_REQUESTS_PATH = Path(CACHE_PATH) / "eval-queue"
+
+
+ @st.cache_data(ttl=300)  # Cache for 5 minutes
+ def download_data():
+     """Download data from HuggingFace Hub."""
      try:
+         snapshot_download(
+             repo_id=QUEUE_REPO,
+             local_dir=str(EVAL_REQUESTS_PATH),
+             repo_type="dataset",
+             tqdm_class=None,
+             etag_timeout=30,
+             token=TOKEN,
+         )
+     except Exception as e:
+         st.warning(f"Could not download queue data: {e}")
+
+     try:
+         snapshot_download(
+             repo_id=RESULTS_REPO,
+             local_dir=str(EVAL_RESULTS_PATH),
+             repo_type="dataset",
+             tqdm_class=None,
+             etag_timeout=30,
+             token=TOKEN,
+         )
+     except Exception as e:
+         st.warning(f"Could not download results data: {e}")
+
+
+ class ModelType:
+     API = "api"
+     OPEN_WEIGHT = "open-weight"
+
+     @staticmethod
+     def get_color(model_type: str) -> str:
+         if model_type == ModelType.API:
+             return VALENCIA_ORANGE
+         elif model_type == ModelType.OPEN_WEIGHT:
+             return STAR_BLUE
+         return MEDIUM_GRAY
+
+
+ # Load SVG icons from local assets folder
+ ASSETS_PATH = Path(__file__).resolve().parent / "assets"
+
+
+ def load_svg_icon(icon_name: str, fill_color: str = None) -> str:
+     """Load SVG icon and return as data URI with optional color replacement.
+
+     This matches the Gradio app's load_svg_data_uri function.
+     """
+     svg_file = ASSETS_PATH / f"{icon_name}.svg"
+     if not svg_file.exists():
+         return ""
+
+     try:
+         with open(svg_file, "r", encoding="utf-8") as f:
+             svg_content = f.read()
+
+         # Replace black fill with specified color for visibility on dark background
+         if fill_color:
+             svg_content = svg_content.replace('fill="black"', f'fill="{fill_color}"')
+             svg_content = svg_content.replace('stroke="black"', f'stroke="{fill_color}"')
+
+         b64 = base64.b64encode(svg_content.encode()).decode()
+         return f"data:image/svg+xml;base64,{b64}"
      except Exception:
          return ""
 
 
+ # Preload icons with Snowflake colors (matching Gradio app)
+ ICON_CLOUD = load_svg_icon("snow_cloud2", VALENCIA_ORANGE)  # Orange cloud for API (same as Gradio)
+ ICON_CODE = load_svg_icon("snow_code", STAR_BLUE)  # Blue code for open-weight (same as Gradio)
+
+ # Tab header icons - use white to match header text color
+ HEADER_ICON_COLOR = "#FFFFFF"
+ ICON_MEDAL = load_svg_icon("snow_medal", HEADER_ICON_COLOR)  # Leaderboard header icon
+ ICON_EYE = load_svg_icon("snow_eye", HEADER_ICON_COLOR)  # Visualizations header icon
+ ICON_DOCS = load_svg_icon("snow_docs", HEADER_ICON_COLOR)  # About header icon
+ ICON_WRITE = load_svg_icon("snow_write", HEADER_ICON_COLOR)  # Submit header icon
+
+
+ def generate_placeholder_description(model_name: str, tags: list, model_type: str) -> str:
+     """Generate a placeholder description based on model metadata."""
+     parts = []
+
+     # Describe model type
+     if model_type == "api":
+         parts.append("API-based")
+     elif model_type == "open-weight":
+         parts.append("Open-weight")
+
+     # Describe approach based on tags
+     if tags:
+         if "Agentic" in tags:
+             parts.append("agentic system")
+         elif "Conventional RAG" in tags:
+             parts.append("RAG pipeline")
+         else:
+             parts.append("model")
+
+         # Add tool/capability info
+         capabilities = []
+         if "BM25 Search Tool" in tags:
+             capabilities.append("BM25 search")
+         if "Semantic Search Tool" in tags:
+             capabilities.append("semantic search")
+         if "Vision and Language" in tags:
+             capabilities.append("vision")
+         if "Text-only" in tags:
+             capabilities.append("text-only")
+
+         if capabilities:
+             parts.append(f"with {', '.join(capabilities)}")
+     else:
+         parts.append("model")
+
+     return " ".join(parts) if parts else ""
+
+
+ def get_model_type_html(model_type: str) -> str:
+     """Get HTML for model type with icon and colored text."""
+     color = ModelType.get_color(model_type)
+     icon_uri = ICON_CLOUD if model_type == ModelType.API else ICON_CODE
+
+     # Fallback text if icon doesn't load (HTML-escaped so "</>" renders literally)
+     fallback_emoji = "☁️" if model_type == ModelType.API else "&lt;/&gt;"
+
+     if icon_uri:
+         return f'''<div style="display: inline-flex; align-items: center; white-space: nowrap;">
+             <img src="{icon_uri}" style="width: 20px; height: 20px; vertical-align: middle;" />
+             <span style="color: {color}; font-weight: 500; margin-left: 6px;">{model_type}</span>
+         </div>'''
+     # Fallback without icon
+     return f'<span style="color: {color}; font-weight: 500;">{fallback_emoji} {model_type}</span>'
 
 
+ @st.cache_data(ttl=300)  # Cache for 5 minutes
+ def load_eval_results() -> pd.DataFrame:
+     """Load evaluation results from JSON files."""
+     results = []
+
+     results_path = Path(EVAL_RESULTS_PATH)
+     if not results_path.exists():
          return pd.DataFrame()
+
+     for org_dir in results_path.iterdir():
+         if org_dir.is_dir() and not org_dir.name.startswith('.'):
+             for result_file in org_dir.glob("*_results_*.json"):
+                 try:
+                     with open(result_file) as f:
+                         data = json.load(f)
+
+                     # Extract data
+                     model_name = data.get("model_name", "Unknown")
+                     metadata = data.get("metadata", {})
+                     result_scores = data.get("results", {})
+
+                     # Get tags - default to ["Agentic"] if not specified
+                     tags = data.get("tags", metadata.get("tags", ["Agentic"]))
+                     if isinstance(tags, str):
+                         tags = [tags]  # Convert single tag to list
+
+                     # Get per-domain scores if available
+                     by_domain = result_scores.get("by_domain", {})
+
+                     results.append({
+                         "Model": model_name,
+                         "Organization": data.get("organization", data.get("submitted_by", org_dir.name)),
+                         "Model Type": metadata.get("model_type", "unknown"),
+                         "Tags": tags,  # Store as list
+                         # Answer correctness metrics (ANLS*)
+                         "Accuracy (ANLS*)": result_scores.get("overall", {}).get("anls", 0.0),
+                         "Acc. Single-Hop": result_scores.get("single_evidence", {}).get("anls", 0.0),
+                         "Acc. Cross-Page": result_scores.get("multi_evidence_same_doc", {}).get("anls", 0.0),
+                         "Acc. Cross-Doc": result_scores.get("multi_evidence_multi_doc", {}).get("anls", 0.0),
+                         # Attribution metrics
+                         "Attribution (Page F1)": result_scores.get("overall", {}).get("page_f1", 0.0),
+                         "Attribution (Doc F1)": result_scores.get("overall", {}).get("doc_f1", 0.0),
+                         # Calibration metric
+                         "Effort (Kuiper)": result_scores.get("overall", {}).get("kuiper", 0.0),
+                         "Submission Date": data.get("submission_date", ""),
+                         "Link": data.get("link", ""),
+                         "Description": data.get("description", metadata.get("description", "")) or
+                             generate_placeholder_description(model_name, tags, metadata.get("model_type", "")),
+                         # Per-domain scores (stored as JSON string for DataFrame compatibility)
+                         "_by_domain": json.dumps(by_domain) if by_domain else "{}",
+                     })
+                 except Exception as e:
+                     st.warning(f"Error loading {result_file}: {e}")
+
+     if not results:
+         return pd.DataFrame()
+
+     df = pd.DataFrame(results)
+     df = df.sort_values("Accuracy (ANLS*)", ascending=False).reset_index(drop=True)
+     return df
 
 
+ def get_all_tags_from_df(df: pd.DataFrame) -> list:
+     """Extract all unique tags from the DataFrame."""
+     all_tags = set()
+     if "Tags" in df.columns:
+         for tags in df["Tags"]:
+             if isinstance(tags, list):
+                 all_tags.update(tags)
+     return sorted(list(all_tags))
 
+ def filter_df_by_tags(df: pd.DataFrame, selected_tags: list) -> pd.DataFrame:
+     """Filter DataFrame to show only rows that have at least one of the selected tags."""
+     if not selected_tags:
+         return df
+
+     def has_any_tag(row_tags):
+         if not isinstance(row_tags, list):
+             return False
+         return any(tag in row_tags for tag in selected_tags)
+
+     return df[df["Tags"].apply(has_any_tag)]
+
+
+ def render_tags_html(tags: list) -> str:
+     """Render tags as styled badges."""
+     if not tags or not isinstance(tags, list):
+         return ""
+
+     badges = []
+     for tag in tags:
+         color = TAG_COLORS.get(tag, MID_BLUE)
+         # Use lighter background with colored border for better readability
+         badge = f'''<span style="
+             display: inline-block;
+             padding: 2px 8px;
+             margin: 2px 3px;
+             border-radius: 12px;
+             font-size: 11px;
+             font-weight: 500;
+             background-color: {color}20;
+             color: {color};
+             border: 1px solid {color};
+             white-space: nowrap;
+         ">{tag}</span>'''
+         badges.append(badge)
+
+     return "".join(badges)
+
+
+ def format_model_name(row) -> str:
+     """Format model name with optional link."""
+     model_name = row["Model"]
+     link = row.get("Link", "")
+     if link and link.strip():
+         return f'<a href="{link}" target="_blank">{model_name}</a>'
+     return model_name
+
+
+ def format_model_type(model_type: str) -> str:
+     """Format model type with colored text (ModelType defines no get_icon; icon rendering is handled by get_model_type_html)."""
+     color = ModelType.get_color(model_type)
+     return f'<span style="color: {color};">{model_type}</span>'
+
+
+ # Metric tooltips for table headers
+ METRIC_TOOLTIPS = {
+     "Accuracy (ANLS*)": "Overall answer accuracy using ANLS* (Average Normalized Levenshtein Similarity). Higher is better.",
+     "Acc. Single-Hop": "Accuracy on questions requiring evidence from a single page.",
+     "Acc. Cross-Page": "Accuracy on multi-hop questions requiring evidence from multiple pages within the same document.",
+     "Acc. Cross-Doc": "Accuracy on multi-hop questions requiring evidence from multiple documents.",
+     "Attribution (Page F1)": "F1 score for page-level attribution. Measures overlap between cited pages and gold evidence. Higher is better.",
+     "Attribution (Doc F1)": "F1 score for document-level attribution. Measures whether the correct documents were identified. Higher is better.",
+     "Effort (Kuiper)": "Effort calibration metric (Kuiper statistic). Measures if effort correlates with problem difficulty. Lower is better.",
+     "Model Type": "API = cloud-based model, open-weight = downloadable weights",
+     "Tags": "Approach characteristics: Agentic, RAG, search tools, vision capabilities, etc.",
+ }
+
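The "Effort (Kuiper)" tooltip compresses a lot. For orientation, the classical two-sample Kuiper statistic is the sum of the largest positive and largest negative gaps between two empirical CDFs; how `kuiper_statistic` in `eval/metrics.py` applies it to effort calibration is defined there, not in this editorial sketch.

```python
import numpy as np


def kuiper_two_sample(x: np.ndarray, y: np.ndarray) -> float:
    """Classical two-sample Kuiper statistic: V = max(F_x - F_y) + max(F_y - F_x),
    evaluated over the pooled sample points."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return float(np.max(cdf_x - cdf_y) + np.max(cdf_y - cdf_x))
```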
+ def render_leaderboard_table(df: pd.DataFrame, columns: list):
+     """Render an HTML table matching the Gradio leaderboard style."""
+     if df.empty:
+         st.warning("No data available")
+         return
+
+     # Build table HTML with tooltips
+     header_cells = []
+     for col in columns:
+         # Add line break before brackets for cleaner display
+         display_col = col.replace(" (", "<br>(") if " (" in col else col
+         tooltip = METRIC_TOOLTIPS.get(col, "")
+         if tooltip:
+             header_cells.append(f'<th title="{tooltip}" style="cursor: help;">{display_col}</th>')
+         else:
+             header_cells.append(f'<th>{display_col}</th>')
+     header_cells = "".join(header_cells)
+
+     rows_html = ""
+     for _, row in df.iterrows():
+         cells = []
+         for col in columns:
+             value = row.get(col, "")
+
+             if col == "Model":
+                 # Model name with optional link and description
+                 link = row.get("Link", "")
+                 description = row.get("Description", "")
+
+                 if link and str(link).strip():
+                     name_html = f'<a href="{link}" target="_blank" style="color: #29B5E8; font-weight: 500;">{value}</a>'
+                 else:
+                     name_html = f'<span style="font-weight: 500;">{value}</span>'
+
+                 if description and str(description).strip():
+                     cell_html = f'{name_html}<br><span style="font-size: 12px; color: {MEDIUM_GRAY}; font-weight: normal;">{description}</span>'
+                 else:
+                     cell_html = name_html
+             elif col == "Model Type":
+                 # Model type with icon
+                 cell_html = get_model_type_html(str(value))
+             elif col == "Tags":
+                 # Render tags as badges
+                 cell_html = render_tags_html(value)
+             elif col == "Accuracy (ANLS*)" or col.startswith("Acc."):
+                 # Format accuracy scores (ANLS*, scale 0-100)
+                 try:
+                     cell_html = f"{float(value):.1f}" if value else "0"
+                 except (ValueError, TypeError):
+                     cell_html = str(value)
+             elif col.startswith("Attribution"):
+                 # Format F1 scores (scale 0-100)
+                 try:
+                     cell_html = f"{float(value):.1f}" if value else "0"
+                 except (ValueError, TypeError):
+                     cell_html = str(value)
+             elif col == "Effort (Kuiper)":
+                 # Format Kuiper statistic (lower is better for calibration)
+                 try:
+                     cell_html = f"{float(value):.3f}" if value else "0"
+                 except (ValueError, TypeError):
+                     cell_html = str(value)
+             else:
+                 cell_html = str(value) if value else ""
+
+             cells.append(f'<td>{cell_html}</td>')
+
+         rows_html += f'<tr>{"".join(cells)}</tr>'
+
+     table_html = f'''
+     <style>
+     .leaderboard-wrapper {{
+         border: 2px solid {MID_BLUE};
+         border-radius: 8px;
+         overflow: hidden;
+         font-size: 0;
+     }}
+     .leaderboard-table {{
+         width: 100%;
+         border-collapse: collapse;
+         border-spacing: 0;
+         font-size: 14px;
+         background-color: #0e1117;
+         margin: 0;
+         padding: 0;
+         border: none;
+     }}
+     .leaderboard-table thead tr {{
+         background: linear-gradient(135deg, {MID_BLUE} 0%, {SNOWFLAKE_BLUE} 100%);
+     }}
+     .leaderboard-table thead th {{
+         background: transparent;
+         color: white;
+         text-align: center;
+         padding: 1.2em 0.75em;
+         font-weight: 500;
+         border: none;
+         text-transform: none;
+     }}
+     .leaderboard-table thead th:not(:last-child) {{
+         border-right: 1px solid rgba(255,255,255,0.15);
+     }}
+     .leaderboard-table tbody td {{
+         padding: 0.75em;
+         border-bottom: 1px solid {MEDIUM_GRAY}40;
+         vertical-align: middle;
+         color: white;
+     }}
+     .leaderboard-table tbody tr:last-child td {{
+         border-bottom: none;
+     }}
+     .leaderboard-table tbody tr:nth-child(even) {{
+         background-color: rgba(17, 86, 127, 0.12);
+     }}
+     .leaderboard-table tbody tr:hover {{
+         background-color: rgba(17, 86, 127, 0.25);
+     }}
+     .leaderboard-table td:first-child {{
+         min-width: 280px;
+         max-width: 350px;
+         word-wrap: break-word;
+     }}
+     /* Links in table use Snowflake Blue */
+     .leaderboard-table a {{
+         color: {SNOWFLAKE_BLUE};
+         text-decoration: none;
+     }}
+     .leaderboard-table a:hover {{
+         color: {STAR_BLUE};
+         text-decoration: underline;
+     }}
+     </style>
+     <div class="leaderboard-wrapper">
+         <table class="leaderboard-table">
+             <thead>
+                 <tr>{header_cells}</tr>
+             </thead>
+             <tbody>
+                 {rows_html}
+             </tbody>
+         </table>
+     </div>
+     '''
+
+     st.markdown(table_html, unsafe_allow_html=True)
 
 
+ def create_accuracy_vs_attribution_plot(df: pd.DataFrame) -> go.Figure:
+     """Create scatter plot of Accuracy vs Attribution."""
      if df.empty:
          fig = go.Figure()
          fig.add_annotation(
+             text="No data available",
+             xref="paper", yref="paper",
+             x=0.5, y=0.5, showarrow=False,
+             font=dict(size=20, color="white")
          )
          return fig
+
      color_map = {
+         "api": VALENCIA_ORANGE,  # Orange for API
+         "open-weight": STAR_BLUE,  # Star Blue for open-weight
      }
+
      fig = go.Figure()
+
+     for model_type in df["Model Type"].unique():
+         df_type = df[df["Model Type"] == model_type]
+         fig.add_trace(go.Scatter(
+             x=df_type["Attribution (Page F1)"],
+             y=df_type["Accuracy (ANLS*)"],
+             mode="markers+text",
+             name=model_type,
+             text=df_type["Model"],
+             textposition="top center",
+             textfont=dict(size=9, color="#ccc"),
+             marker=dict(
+                 size=14,
+                 color=color_map.get(model_type, MEDIUM_GRAY),
+                 line=dict(width=2, color="white")
+             ),
+             hovertemplate="<b>%{text}</b><br>Attribution: %{x:.1f}<br>Accuracy: %{y:.1f}<extra></extra>",
+         ))
+
      fig.update_layout(
+         title=dict(text="Accuracy vs Attribution", font=dict(color="white")),
+         xaxis_title="Attribution (Page F1)",
+         yaxis_title="Accuracy (ANLS*)",
          hovermode="closest",
+         template="plotly_dark",
+         height=500,
          showlegend=True,
+         legend=dict(title="Model Type", yanchor="top", y=0.99, xanchor="right", x=0.99, font=dict(color="#ccc")),
+         paper_bgcolor="rgba(0,0,0,0)",
+         plot_bgcolor="rgba(14,17,23,0.8)",
+         xaxis=dict(gridcolor=MID_BLUE, zerolinecolor=MID_BLUE),
+         yaxis=dict(gridcolor=MID_BLUE, zerolinecolor=MID_BLUE),
      )
+
      return fig
 
 
+ def create_accuracy_vs_effort_plot(df: pd.DataFrame) -> go.Figure:
+     """Create scatter plot of Accuracy vs Effort (Kuiper)."""
      if df.empty:
          fig = go.Figure()
          fig.add_annotation(
+             text="No data available",
+             xref="paper", yref="paper",
+             x=0.5, y=0.5, showarrow=False,
+             font=dict(size=20, color="white")
          )
          return fig
+
      color_map = {
+         "api": VALENCIA_ORANGE,  # Orange for API
+         "open-weight": STAR_BLUE,  # Star Blue for open-weight
      }
+
      fig = go.Figure()
+
+     for model_type in df["Model Type"].unique():
+         df_type = df[df["Model Type"] == model_type]
+         fig.add_trace(go.Scatter(
+             x=df_type["Effort (Kuiper)"],
+             y=df_type["Accuracy (ANLS*)"],
+             mode="markers+text",
+             name=model_type,
+             text=df_type["Model"],
+             textposition="top center",
+             textfont=dict(size=9, color="#ccc"),
+             marker=dict(
+                 size=14,
+                 color=color_map.get(model_type, MEDIUM_GRAY),
+                 line=dict(width=2, color="white")
+             ),
+             hovertemplate="<b>%{text}</b><br>Effort: %{x:.3f}<br>Accuracy: %{y:.1f}<extra></extra>",
+         ))
+
      fig.update_layout(
+         title=dict(text="Accuracy vs Effort", font=dict(color="white")),
+         xaxis_title="Effort (Kuiper) — lower is better",
+         yaxis_title="Accuracy (ANLS*)",
          hovermode="closest",
+         template="plotly_dark",
+         height=500,
          showlegend=True,
+         legend=dict(title="Model Type", yanchor="top", y=0.99, xanchor="right", x=0.99, font=dict(color="#ccc")),
+         paper_bgcolor="rgba(0,0,0,0)",
+         plot_bgcolor="rgba(14,17,23,0.8)",
+         xaxis=dict(gridcolor=MID_BLUE, zerolinecolor=MID_BLUE),
+         yaxis=dict(gridcolor=MID_BLUE, zerolinecolor=MID_BLUE),
      )
+
      return fig
 
 
+ def create_domain_accuracy_chart(by_domain: dict, model_name: str, overall_accuracy: float = 0) -> go.Figure:
970
+ """Create a horizontal bar chart showing accuracy by domain."""
971
+ # Filter out "Other" category
972
+ filtered_domain = {k: v for k, v in by_domain.items() if k.lower() != 'other'}
973
+
974
+ if not filtered_domain:
975
+ fig = go.Figure()
976
+ fig.add_annotation(
977
+ text="No per-domain data available",
978
+ xref="paper", yref="paper",
979
+ x=0.5, y=0.5, showarrow=False,
980
+ font=dict(size=16, color="white")
981
+ )
982
+ fig.update_layout(
983
+ template="plotly_dark",
984
+ paper_bgcolor="rgba(0,0,0,0)",
985
+ plot_bgcolor="rgba(14,17,23,0.8)",
986
+ )
987
+ return fig
988
+
989
+ # Sort domains by accuracy (descending)
990
+ sorted_domains = sorted(filtered_domain.items(), key=lambda x: x[1].get('anls', 0), reverse=True)
991
+
992
+ domains = [d[0] for d in sorted_domains]
993
+ accuracies = [d[1].get('anls', 0) for d in sorted_domains]
994
+ counts = [d[1].get('n', 0) for d in sorted_domains]
995
+
996
+ # Color based on above/below overall accuracy
997
+ colors = [SNOWFLAKE_BLUE if acc >= overall_accuracy else VALENCIA_ORANGE for acc in accuracies]
998
+
999
+ fig = go.Figure()
1000
+
1001
+ fig.add_trace(go.Bar(
1002
+ y=domains,
1003
+ x=accuracies,
1004
+ orientation='h',
1005
+ marker=dict(
1006
+ color=colors,
1007
+ line=dict(width=1, color='white')
1008
+ ),
1009
+ text=[f"{acc:.1f}% (n={n})" for acc, n in zip(accuracies, counts)],
1010
+ textposition='auto',
1011
+ textfont=dict(color='white', size=11),
1012
+ hovertemplate="<b>%{y}</b><br>Accuracy: %{x:.1f}%<extra></extra>",
1013
+ ))
1014
+
1015
+ fig.update_layout(
1016
+ title=dict(
1017
+ text=f"Accuracy by Domain: {model_name}",
1018
+ font=dict(color="white", size=16)
1019
+ ),
1020
+ xaxis_title="Accuracy (ANLS* %)",
1021
+ yaxis_title="",
1022
+ template="plotly_dark",
1023
+ height=max(400, len(domains) * 35), # Dynamic height based on number of domains
1024
+ paper_bgcolor="rgba(0,0,0,0)",
1025
+ plot_bgcolor="rgba(14,17,23,0.8)",
1026
+ xaxis=dict(
1027
+ gridcolor=MID_BLUE,
1028
+ zerolinecolor=MID_BLUE,
1029
+ range=[0, 100]
1030
+ ),
1031
+ yaxis=dict(
1032
+ gridcolor=MID_BLUE,
1033
+ autorange="reversed" # Keep highest at top
1034
+ ),
1035
+ margin=dict(l=150, r=50, t=60, b=50),
1036
  )
1037
+
1038
+ return fig
1039
+
1040
+
1041
+ def show_model_details(model_name: str):
1042
+ """Show detailed per-domain breakdown for a model."""
1043
+ # Load model data from cached DataFrame
1044
+ df = load_eval_results()
1045
+
1046
+ if df.empty:
1047
+ st.warning("No model data available")
1048
+ return
1049
+
1050
+ model_row = df[df["Model"] == model_name]
1051
+ if model_row.empty:
1052
+ st.warning(f"Model '{model_name}' not found")
1053
+ return
1054
+
1055
+ model_data = model_row.iloc[0]
1056
+
1057
+ # Display model info
1058
+ col1, col2, col3 = st.columns(3)
1059
+ with col1:
1060
+ st.metric("Overall Accuracy", f"{model_data['Accuracy (ANLS*)']:.1f}%")
1061
+ with col2:
1062
+ st.metric("Attribution (Page F1)", f"{model_data['Attribution (Page F1)']:.1f}%")
1063
+ with col3:
1064
+ kuiper = model_data.get('Effort (Kuiper)', 0)
1065
+ st.metric("Effort (Kuiper)", f"{kuiper:.2f}" if kuiper else "N/A")
1066
+
1067
+ # Get per-domain data
1068
+ by_domain_str = model_data.get('_by_domain', '{}')
1069
+ try:
1070
+ by_domain = json.loads(by_domain_str) if isinstance(by_domain_str, str) else by_domain_str
1071
+ except (json.JSONDecodeError, TypeError):
1072
+ by_domain = {}
1073
+
1074
+ if by_domain:
1075
+ # Show per-domain chart (use overall accuracy as threshold for coloring)
1076
+ overall_accuracy = model_data.get('Accuracy (ANLS*)', 0)
1077
+ fig = create_domain_accuracy_chart(by_domain, model_name, overall_accuracy)
1078
+ st.plotly_chart(fig, width="stretch")
1079
+ else:
1080
+ st.info("Per-domain breakdown not available for this submission. Newer submissions will include this data.")
1081
 
1082
 
1083
+ def validate_jsonl_submission(file_content: str) -> tuple[bool, str, list]:
1084
+ """Validate JSONL submission format and return parsed predictions."""
1085
+ try:
1086
+ lines = file_content.strip().split("\n")
1087
+ if not lines or (len(lines) == 1 and not lines[0].strip()):
1088
+ return False, "File is empty", []
1089
+
1090
+ predictions = []
1091
+ for line_num, line in enumerate(lines, 1):
1092
+ line = line.strip()
1093
+ if not line:
1094
+ continue
1095
+
1096
+ try:
1097
+ pred = json.loads(line)
1098
+ except json.JSONDecodeError as e:
1099
+ return False, f"Line {line_num}: Invalid JSON - {str(e)}", []
1100
+
1101
+ # Required: question and answer
1102
+ if "question" not in pred:
1103
+ return False, f"Line {line_num}: Missing required field 'question'", []
1104
+ if "answer" not in pred:
1105
+ return False, f"Line {line_num}: Missing required field 'answer'", []
1106
+
1107
+ predictions.append(pred)
1108
+
1109
+ return True, "", predictions
1110
+
1111
+ except Exception as e:
1112
+ return False, f"Error reading file: {str(e)}", []
1113
 
1114
 
1115
+ @st.cache_data(ttl=3600) # Cache for 1 hour
1116
+ def load_gold_standard(dataset_name: str = "agentic-document-ai/dataset-PRIVATE", split: str = "test"):
1117
+ """Load gold standard from HuggingFace dataset.
1118
+
1119
+ Note: Uses dataset-PRIVATE for test split (contains gold answers).
1120
+ """
1121
+ if not EVAL_AVAILABLE:
1122
+ return {}, {}
1123
+
1124
+ try:
1125
+ dataset = load_dataset(dataset_name, split=split)
1126
+
1127
+ by_text = {}
1128
+ by_id = {}
1129
+
1130
+ for ex in dataset:
1131
+ question = ex['question'].strip()
1132
+ qid = ex.get('id', '')
1133
+
1134
+ # Try multiple field names for answers (different splits may use different names)
1135
+ answers = ex.get('answer_variants') or ex.get('answers') or []
1136
+ # If answers is a string, wrap it in a list
1137
+ if isinstance(answers, str):
1138
+ answers = [[answers]]
1139
+ # If answers is a flat list of strings, wrap each in a list
1140
+ elif answers and isinstance(answers[0], str):
1141
+ answers = [answers]
1142
+
1143
+ gold_data = {
1144
+ 'answers': answers,
1145
+ 'evidence': ex.get('evidence', []),
1146
+ 'category': ex.get('document_category', ''),
1147
+ 'domain': ex.get('domain', ''),
1148
+ 'hop_type': ex.get('hop_type', 'single')
1149
+ }
1150
+
1151
+ by_text[question] = gold_data
1152
+ if qid:
1153
+ by_id[qid] = gold_data
1154
+
1155
+ return by_text, by_id
1156
+ except Exception as e:
1157
+ st.error(f"Error loading dataset: {e}")
1158
+ return {}, {}
1159
 
1160
+
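+ # Shape normalization above (illustrative): a gold answer stored as the string
+ # "Paris" becomes [["Paris"]], and a flat list ["Paris", "Lyon"] becomes
+ # [["Paris", "Lyon"]], so the evaluator always receives a list of answer
+ # variants, each variant being a list of strings.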
1161
+ def evaluate_predictions(predictions: list, gold_by_text: dict, gold_by_id: dict) -> dict:
1162
+ """Evaluate predictions against gold standard."""
1163
+ if not EVAL_AVAILABLE:
1164
+ return {"error": "Evaluation module not available"}
1165
 
1166
+ evals = []
1167
+ unmatched = []
 
1168
 
1169
+ for pred in predictions:
1170
+ question = pred.get('question', '').strip()
1171
+ qid = pred.get('id', '')
1172
+
1173
+ # Match to gold
1174
+ if question in gold_by_text:
1175
+ gold_data = gold_by_text[question]
1176
+ elif qid and qid in gold_by_id:
1177
+ gold_data = gold_by_id[qid]
1178
+ else:
1179
+ unmatched.append(question[:50] + "..." if len(question) > 50 else question)
1180
+ continue
1181
+
1182
+ # Get prediction data
1183
+ answer = pred.get('answer', '')
1184
+ citations = pred.get('citations', [])
1185
+ search_history = pred.get('search_history', [])
1186
+ steps = len(search_history) if search_history else pred.get('iterations', 0)
1187
+
1188
+ # Calculate metrics
1189
+ anls = anls_star(answer, gold_data['answers'])
1190
+ correct = anls >= 0.5
1191
+ doc_f1 = citation_f1(citations, gold_data['evidence'], level='document')
1192
+ page_f1 = citation_f1(citations, gold_data['evidence'], level='page')
1193
+
1194
+ evals.append({
1195
+ 'question': question,
1196
+ 'anls': anls,
1197
+ 'correct': correct,
1198
+ 'doc_f1': doc_f1['f1'],
1199
+ 'page_f1': page_f1['f1'],
1200
+ 'steps': steps,
1201
+ 'hop_type': gold_data.get('hop_type', 'single'),
1202
+ 'category': gold_data['category'],
1203
+ 'domain': gold_data['domain']
1204
+ })
1205
 
1206
+ if not evals:
1207
+ return {"error": "No predictions matched the gold standard"}
1208
+
1209
+ # Aggregate overall metrics
1210
+ n = len(evals)
1211
+ accuracy = sum(e['correct'] for e in evals) / n * 100 # Scale to 0-100
1212
+ mean_anls = sum(e['anls'] for e in evals) / n * 100
1213
+ mean_doc_f1 = sum(e['doc_f1'] for e in evals) / n * 100
1214
+ mean_page_f1 = sum(e['page_f1'] for e in evals) / n * 100
1215
+
1216
+ # Kuiper statistic
1217
+ kuiper = kuiper_statistic(evals)
1218
+
1219
+ # By hop type
1220
+ single_hop = [e for e in evals if e['hop_type'] == 'single']
1221
+ cross_page = [e for e in evals if e['hop_type'] == 'cross_page']
1222
+ cross_doc = [e for e in evals if e['hop_type'] == 'cross_doc']
1223
+
1224
+ # By domain
1225
+ from collections import defaultdict
1226
+ by_domain = defaultdict(list)
1227
+ for e in evals:
1228
+ domain = e['domain'] or 'Other'
1229
+ by_domain[domain].append(e)
1230
+
1231
+ domain_scores = {}
1232
+ for domain, domain_evals in sorted(by_domain.items()):
1233
+ domain_scores[domain] = {
1234
+ 'anls': sum(e['anls'] for e in domain_evals) / len(domain_evals) * 100,
1235
+ 'n': len(domain_evals)
1236
+ }
1237
+
1238
+ results = {
1239
+ 'n_evaluated': n,
1240
+ 'n_unmatched': len(unmatched),
1241
+ 'unmatched_samples': unmatched[:5], # Show first 5
1242
+ 'overall': {
1243
+ 'anls': mean_anls,
1244
+ 'accuracy': accuracy,
1245
+ 'doc_f1': mean_doc_f1,
1246
+ 'page_f1': mean_page_f1,
1247
+ 'kuiper': kuiper['kuiper_stat'] if not kuiper.get('degenerate') else None,
1248
+ },
1249
+ 'single_evidence': {
1250
+ 'anls': sum(e['anls'] for e in single_hop) / len(single_hop) * 100 if single_hop else 0,
1251
+ 'n': len(single_hop)
1252
+ },
1253
+ 'multi_evidence_same_doc': {
1254
+ 'anls': sum(e['anls'] for e in cross_page) / len(cross_page) * 100 if cross_page else 0,
1255
+ 'n': len(cross_page)
1256
+ },
1257
+ 'multi_evidence_multi_doc': {
1258
+ 'anls': sum(e['anls'] for e in cross_doc) / len(cross_doc) * 100 if cross_doc else 0,
1259
+ 'n': len(cross_doc)
1260
+ },
1261
+ 'by_domain': domain_scores
1262
+ }
1263
+
1264
+ return results
1265
+
1266
+
1267
+ @st.fragment
1268
+ def submit_results_fragment():
1269
+ """Fragment for file upload and evaluation to prevent full page reruns."""
1270
+ # Check HuggingFace login
1271
+ hf_user = get_hf_user()
1272
+
1273
+ if not hf_user:
1274
+ st.warning("🔐 **Login Required**: Please sign in with your HuggingFace account to submit results.")
1275
+
1276
+ # Show login button (works on HF Spaces with hf_oauth: true)
1277
+ if hasattr(st, 'login_button'):
1278
+ st.login_button("huggingface", use_container_width=True)
1279
+ else:
1280
+ st.info("""
1281
+ To enable login:
1282
+ 1. Deploy this app on HuggingFace Spaces
1283
+ 2. Add `hf_oauth: true` to your Space's README.md metadata
1284
+
1285
+ Or run locally with a test user by setting environment variables.
1286
+ """)
1287
+ return
1288
+
1289
+ # Show logged-in user
1290
+ st.success(f"✅ Logged in as **{hf_user['username']}**")
1291
+
1292
+ # Step 1: Upload and Evaluate
1293
+ st.markdown("### Step 1: Upload Predictions")
1294
+
1295
+ uploaded_file = st.file_uploader(
1296
+ "Upload your predictions JSONL file",
1297
+ type=["jsonl"],
1298
+ help="One prediction per line with 'question' and 'answer' fields",
1299
+ key="predictions_uploader"
1300
  )
1301
+
1302
+ with st.expander("📋 Expected JSONL format"):
1303
+ st.code('''{"question": "What is the total revenue?", "answer": "$1.2M", "citations": [{"file": "report.pdf", "page": 5}], "iterations": 3}
1304
+ {"question": "Who signed the contract?", "answer": ["John Smith", "Jane Doe"], "citations": [{"file": "contract.pdf", "page": 12}], "iterations": 2}''', language="json")
1305
+ st.markdown("""
1306
+ **Required fields:**
1307
+ - `question`: The question text (must match dataset)
1308
+ - `answer`: Predicted answer (string or list)
1309
+
1310
+ **Optional fields (for full metrics):**
1311
+ - `citations`: List of `{"file": "...", "page": N}` for attribution metrics
1312
+ - `iterations` or `search_history`: For effort/calibration metrics
1313
+ - `id`: Question ID (fallback matching)
1314
+ """)
1315
+
1316
+ # Initialize session state for evaluation results
1317
+ if 'eval_results' not in st.session_state:
1318
+ st.session_state.eval_results = None
1319
+ if 'predictions' not in st.session_state:
1320
+ st.session_state.predictions = None
1321
+
1322
+ if uploaded_file is not None:
1323
+ file_content = uploaded_file.read().decode("utf-8")
1324
+ is_valid, error_msg, predictions = validate_jsonl_submission(file_content)
1325
+
1326
+ if not is_valid:
1327
+ st.error(f"❌ Invalid file: {error_msg}")
1328
+ else:
1329
+ st.success(f"✅ Loaded {len(predictions)} predictions")
1330
+ st.session_state.predictions = predictions
1331
+
1332
+ # Evaluate button
1333
+ if st.button("🔬 Run Evaluation", type="primary"):
1334
+ with st.spinner("Loading gold standard and evaluating..."):
1335
+ gold_by_text, gold_by_id = load_gold_standard()
1336
+
1337
+ if not gold_by_text:
1338
+ st.error("Failed to load gold standard dataset")
1339
+ else:
1340
+ results = evaluate_predictions(predictions, gold_by_text, gold_by_id)
1341
+ st.session_state.eval_results = results
1342
+
1343
+ # Show evaluation results
1344
+ if st.session_state.eval_results:
1345
+ results = st.session_state.eval_results
1346
+
1347
+ if 'error' in results:
1348
+ st.error(results['error'])
1349
+ else:
1350
+ st.markdown("### 📊 Evaluation Results")
1351
+
1352
+ # Summary metrics
1353
+ col1, col2, col3, col4 = st.columns(4)
1354
+ with col1:
1355
+ st.metric("Accuracy (ANLS*)", f"{results['overall']['anls']:.1f}")
1356
+ with col2:
1357
+ st.metric("Attribution (Page F1)", f"{results['overall']['page_f1']:.1f}")
1358
+ with col3:
1359
+ kuiper_val = results['overall']['kuiper']
1360
+ st.metric("Effort (Kuiper)", f"{kuiper_val:.3f}" if kuiper_val else "N/A")
1361
+ with col4:
1362
+ st.metric("Evaluated", f"{results['n_evaluated']} / {results['n_evaluated'] + results['n_unmatched']}")
1363
+
1364
+ # Detailed breakdown
1365
+ with st.expander("📈 Detailed Breakdown"):
1366
+ st.markdown(f"""
1367
+ | Metric | Value |
1368
+ |--------|-------|
1369
+ | **Overall ANLS*** | {results['overall']['anls']:.1f} |
1370
+ | **Acc. Single-Hop** (n={results['single_evidence']['n']}) | {results['single_evidence']['anls']:.1f} |
1371
+ | **Acc. Cross-Page** (n={results['multi_evidence_same_doc']['n']}) | {results['multi_evidence_same_doc']['anls']:.1f} |
1372
+ | **Acc. Cross-Doc** (n={results['multi_evidence_multi_doc']['n']}) | {results['multi_evidence_multi_doc']['anls']:.1f} |
1373
+ | **Attribution (Doc F1)** | {results['overall']['doc_f1']:.1f} |
1374
+ | **Attribution (Page F1)** | {results['overall']['page_f1']:.1f} |
1375
+ """)
1376
+
1377
+ if results['n_unmatched'] > 0:
1378
+ with st.expander(f"⚠️ {results['n_unmatched']} unmatched questions"):
1379
+ for q in results['unmatched_samples']:
1380
+ st.text(f"• {q}")
1381
+ if results['n_unmatched'] > 5:
1382
+ st.text(f"... and {results['n_unmatched'] - 5} more")
1383
+
1384
+ # Step 2: Model Information
1385
+ st.markdown("---")
1386
+ st.markdown("### Step 2: Model Information")
1387
+
1388
+ col1, col2 = st.columns(2)
1389
+
1390
+ with col1:
1391
+ model_name = st.text_input("Model Name *", placeholder="e.g., GPT-4o-Agent")
1392
+ organization = st.text_input("Organization *", placeholder="e.g., OpenAI")
1393
+ model_type = st.selectbox("Model Type *", options=["", "api", "open-weight"])
1394
+
1395
+ with col2:
1396
+ description = st.text_area(
1397
+ "Description",
1398
+ placeholder="Brief description of your approach (e.g., 'Vision-language model with BM25 search tool')",
1399
+ height=80
1400
+ )
1401
+ link = st.text_input("Link (Optional)", placeholder="https://arxiv.org/abs/... or https://github.com/...")
1402
+ selected_tags = st.multiselect(
1403
+ "Tags",
1404
+ options=AVAILABLE_TAGS,
1405
+ default=["Agentic"],
1406
+ help="Select tags that describe your approach"
1407
+ )
1408
+
1409
+ # Step 3: Submit
1410
+ st.markdown("---")
1411
+ st.markdown("### Step 3: Submit to Leaderboard")
1412
+
1413
+ if st.button("🚀 Submit to Leaderboard", type="primary", disabled=not (model_name and organization and model_type)):
1414
+ if not model_name or not organization or not model_type:
1415
+ st.error("Please fill in all required fields (Model Name, Organization, Model Type)")
1416
+ else:
1417
+ # Get current user for submission tracking
1418
+ hf_user = get_hf_user()
1419
+
1420
+ # Prepare submission data
1421
+ submission = {
1422
+ "model_name": model_name.strip(),
1423
+ "organization": organization.strip(),
1424
+ "description": description.strip() if description else "",
1425
+ "link": link.strip() if link else "",
1426
+ "tags": selected_tags,
1427
+ "submitted_by": hf_user['username'] if hf_user else "anonymous",
1428
+ "metadata": {
1429
+ "model_type": model_type,
1430
+ },
1431
+ "results": {
1432
+ "overall": {
1433
+ "anls": results['overall']['anls'],
1434
+ "page_f1": results['overall']['page_f1'],
1435
+ "doc_f1": results['overall']['doc_f1'],
1436
+ "kuiper": results['overall']['kuiper'],
1437
+ },
1438
+ "single_evidence": results['single_evidence'],
1439
+ "multi_evidence_same_doc": results['multi_evidence_same_doc'],
1440
+ "multi_evidence_multi_doc": results['multi_evidence_multi_doc'],
1441
+ "by_domain": results.get('by_domain', {}),
1442
+ },
1443
+ "submission_date": datetime.now(timezone.utc).isoformat(),
1444
+ }
1445
+
1446
+ # Upload to HuggingFace Hub
1447
+ with st.spinner("Uploading to leaderboard..."):
1448
+ try:
1449
+ # Create path matching expected structure: {org}/{model}_results_{timestamp}.json
1450
+ safe_org = organization.strip().replace(" ", "_").replace("/", "-")
1451
+ safe_model = model_name.strip().replace(" ", "_").replace("/", "-")
1452
+ timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
1453
+ filename = f"{safe_model}_results_{timestamp}.json"
1454
+ path_in_repo = f"{safe_org}/{filename}"
1455
+
1456
+ # Upload using HfApi
1457
+ api = HfApi()
1458
+ api.upload_file(
1459
+ path_or_fileobj=json.dumps(submission, indent=2).encode("utf-8"),
1460
+ path_in_repo=path_in_repo,
1461
+ repo_id=RESULTS_REPO,
1462
+ repo_type="dataset",
1463
+ token=TOKEN,
1464
+ commit_message=f"Add results for {organization}/{model_name}"
1465
+ )
1466
+
1467
+ st.success(f"✅ Successfully submitted to leaderboard!")
1468
+ st.balloons()
1469
+
1470
+ with st.expander("📄 Submission Details"):
1471
+ st.code(json.dumps(submission, indent=2), language="json")
1472
+
1473
+ # Clear cache to force refresh
1474
+ download_data.clear()
1475
+ load_eval_results.clear()
1476
+
1477
+ st.info("✨ Your submission has been saved! Click below to see it on the leaderboard.")
1478
+ if st.button("🔄 View Updated Leaderboard", type="primary"):
1479
+ st.rerun(scope="app") # Full page rerun, not just fragment
1480
+
1481
+ except Exception as e:
1482
+ st.error(f"❌ Upload failed: {str(e)}")
1483
+ st.warning("Please ensure HF_TOKEN environment variable is set with write access to the repository.")
1484
+
1485
+ with st.expander("📄 Submission JSON (for manual upload)"):
1486
+ st.code(json.dumps(submission, indent=2), language="json")
1487
+
1488
+ st.info(f"""
1489
+ **To submit manually:**
1490
+ 1. Copy the JSON above
1491
+ 2. Save as `{path_in_repo}`
1492
+ 3. Upload to `{RESULTS_REPO}` on HuggingFace Hub
1493
+
1494
+ Or contact lukasz.borchmann@snowflake.com
1495
+ """)
1496
 
1497
 
1498
+ def main():
1499
+ # Download data from HuggingFace Hub
1500
+ with st.spinner("Loading data from HuggingFace Hub..."):
1501
+ download_data()
1502
+
1503
+ # Load data
1504
+ df = load_eval_results()
1505
+
1506
+ # Tabs - matching Gradio style (no emojis)
1507
+ tab1, tab2, tab3, tab4 = st.tabs(["Leaderboard", "Visualizations", "About", "Submit Results"])
1508
+
1509
+ # ===== LEADERBOARD TAB =====
1510
+ with tab1:
1511
+ # Header with icon (fallback to emoji if icon doesn't load)
1512
+ if ICON_MEDAL:
1513
+ icon_html = f'<img src="{ICON_MEDAL}" style="width: 40px; height: 40px; vertical-align: middle; margin-right: 12px;" />'
1514
+ else:
1515
+ icon_html = '<span style="font-size: 36px; margin-right: 12px;">🏆</span>'
1516
+ st.markdown(f'<h3 style="display: flex; align-items: center; margin-top: 1.5rem; margin-bottom: 1.2rem;">{icon_html} Leaderboard</h3>', unsafe_allow_html=True)
1517
+
1518
+ if df.empty:
1519
+ st.warning("No evaluation results found. Submit your results to appear on the leaderboard!")
1520
+ else:
1521
+ # ===== FILTERS SIDE BY SIDE =====
1522
+ filter_col1, filter_col2 = st.columns(2)
1523
+
1524
+ with filter_col1:
1525
+ # TAG FILTER - chips use MID_BLUE (darker, gradient start)
1526
+ tags_in_data = get_all_tags_from_df(df)
1527
+ all_available_tags = sorted(list(set(AVAILABLE_TAGS + tags_in_data)))
1528
+
1529
+ selected_tags = st.multiselect(
1530
+ "Filter by techniques/features:",
1531
+ options=all_available_tags,
1532
+ default=["Agentic"],
1533
+ placeholder="Click to filter by tags...",
1534
+ key="tag_filter",
1535
+ )
1536
+
1537
+ with filter_col2:
1538
+ # COLUMN SELECTOR - chips use SNOWFLAKE_BLUE (lighter, gradient end)
1539
+ # Mapping: short chip name -> full column name
1540
+ COLUMN_CHIP_NAMES = {
1541
+ "Accuracy": "Accuracy (ANLS*)",
1542
+ "Acc. Single-Hop": "Acc. Single-Hop",
1543
+ "Acc. Cross-Page": "Acc. Cross-Page",
1544
+ "Acc. Cross-Doc": "Acc. Cross-Doc",
1545
+ "Attribution": "Attribution (Page F1)",
1546
+ "Attribution (Doc)": "Attribution (Doc F1)",
1547
+ "Effort": "Effort (Kuiper)",
1548
+ "Model Type": "Model Type",
1549
+ "Tags": "Tags",
1550
+ }
1551
+ # Reverse mapping for lookup
1552
+ CHIP_TO_COLUMN = COLUMN_CHIP_NAMES
1553
+ COLUMN_TO_CHIP = {v: k for k, v in COLUMN_CHIP_NAMES.items()}
1554
+
1555
+ all_columns = list(df.columns)
1556
+ # Model and Organization are always visible (not in selector)
1557
+ always_visible = ["Model", "Organization"]
1558
+ # Hidden columns (used internally but not shown as separate columns)
1559
+ hidden_cols = ["Link", "Submission Date", "Description", "_by_domain"]
1560
+ # Full column names that are optional (Tags moved to end)
1561
+ optional_full_cols = [c for c in all_columns if c not in hidden_cols + always_visible and c != "Tags"]
1562
+ optional_full_cols.append("Tags") # Add Tags at the end
1563
+ # Convert to chip names for display
1564
+ optional_chips = [COLUMN_TO_CHIP.get(c, c) for c in optional_full_cols]
1565
+
1566
+ default_chips = ["Model Type", "Tags", "Accuracy", "Attribution", "Effort"]
1567
+ default_selected = [c for c in default_chips if c in optional_chips]
1568
+
1569
+ selected_chips = st.multiselect(
1570
+ "Select columns to display:",
1571
+ options=optional_chips,
1572
+ default=default_selected,
1573
+ key="column_selector",
1574
+ )
1575
+
1576
+ # Convert selected chips back to full column names
1577
+ selected_optional = [CHIP_TO_COLUMN.get(c, c) for c in selected_chips]
1578
+
1579
+ # Apply tag filter
1580
+ filtered_df = filter_df_by_tags(df, selected_tags)
1581
+
1582
+ # Show filter status
1583
+ if selected_tags:
1584
+ st.caption(f"Showing {len(filtered_df)} of {len(df)} models matching selected tags")
1585
+
1586
+ # Model and Organization are always included first
1587
+ selected_columns = ["Model", "Organization"] + [c for c in optional_full_cols if c in selected_optional]
1588
+
1589
+ if selected_columns:
1590
+ # Render HTML table with proper styling
1591
+ render_leaderboard_table(filtered_df, selected_columns)
1592
+
1593
+ # Download button
1594
+ st.markdown("") # Small spacing
1595
+ csv = filtered_df.to_csv(index=False)
1596
+ st.download_button(
1597
+ label="Download as CSV",
1598
+ data=csv,
1599
+ file_name="leaderboard.csv",
1600
+ mime="text/csv",
1601
+ )
1602
+
1603
+ # ===== VISUALIZATIONS TAB =====
1604
+ with tab2:
1605
+ if ICON_EYE:
1606
+ icon_html = f'<img src="{ICON_EYE}" style="width: 40px; height: 40px; vertical-align: middle; margin-right: 12px;" />'
1607
+ else:
1608
+ icon_html = '<span style="font-size: 36px; margin-right: 12px;">📈</span>'
1609
+ st.markdown(f'<h3 style="display: flex; align-items: center; margin-top: 1.5rem; margin-bottom: 1.2rem;">{icon_html} Visualizations</h3>', unsafe_allow_html=True)
1610
+
1611
+ if df.empty:
1612
+ st.warning("No data available for visualization.")
1613
+ else:
1614
+ # Two plots side by side
1615
+ col1, col2 = st.columns(2)
1616
+
1617
+ with col1:
1618
+ fig_attribution = create_accuracy_vs_attribution_plot(df)
1619
+ st.plotly_chart(fig_attribution, width="stretch")
1620
+
1621
+ with col2:
1622
+ fig_effort = create_accuracy_vs_effort_plot(df)
1623
+ st.plotly_chart(fig_effort, width="stretch")
1624
+
1625
+ st.markdown("""
1626
  **Understanding the plots:**
1627
  - Each point represents a model submission
1628
  - **Orange points**: API-based models
1629
  - **Blue points**: Open-weight models
1630
  - Hover over points to see model details
1631
+ - **Left plot**: Upper-right = high accuracy with good attribution (optimal)
1632
+ - **Right plot**: Upper-left = high accuracy with good effort calibration (optimal)
1633
+ """)
1634
+
1635
+ # Model details selector
1636
+ st.markdown("---")
1637
+ st.markdown("### 📊 Model Details")
1638
+
1639
+ model_names = df["Model"].tolist()
1640
+ selected_model = st.selectbox("Select a model to view per-domain breakdown:", model_names)
1641
+
1642
+ if selected_model:
1643
+ show_model_details(selected_model)
1644
+
1645
+ # ===== ABOUT TAB =====
1646
+ with tab3:
1647
+ if ICON_DOCS:
1648
+ icon_html = f'<img src="{ICON_DOCS}" style="width: 40px; height: 40px; vertical-align: middle; margin-right: 12px;" />'
1649
+ else:
1650
+ icon_html = '<span style="font-size: 36px; margin-right: 12px;">📖</span>'
1651
+ st.markdown(f'<h3 style="display: flex; align-items: center; margin-top: 1.5rem; margin-bottom: 1.2rem;">{icon_html} About</h3>', unsafe_allow_html=True)
1652
+
1653
+ st.markdown("""
1654
+ ## Agentic Document VQA Benchmark
1655
+
1656
+ This benchmark evaluates AI systems on **Agentic Document Collection Visual Question Answering** —
1657
+ a task requiring systems to navigate, retrieve, reason over, and aggregate information from
1658
+ heterogeneous document collections.
1659
+
1660
+ ### Dataset
1661
+ - **2,266** human-authored question-answer pairs
1662
+ - **769** multi-page PDF documents from diverse real-world domains
1663
+ - **16,652** total pages with rich visual layouts
1664
+ - **17.3%** multi-hop questions (cross-page and cross-document)
1665
+ - **61** document categories across 13 high-level domains
1666
+
1667
+ ### Task Properties
1668
+ The task is characterized by five formal properties:
1669
+ 1. **Extractive**: Answers are drawn from evidence pages, not generated abstractly
1670
+ 2. **Multi-Hop**: Evidence may span multiple disjoint pages requiring aggregation
1671
+ 3. **Closed-World**: Answers must be derivable solely from the corpus
1672
+ 4. **Grounded Attribution**: Answers must be faithfully attributed to minimal evidence
1673
+ 5. **Agentic**: Requires iterative retrieval and reasoning (planning, navigation, aggregation)
1674
+
1675
+ ## Metrics
1676
+
1677
+ ### Accuracy (ANLS*)
1678
+ - **Accuracy (ANLS*)**: Main score using Average Normalized Levenshtein Similarity with optimal element alignment for lists/sets
1679
+ - **Acc. Single-Hop**: Accuracy on questions requiring a single evidence page
1680
+ - **Acc. Cross-Page**: Accuracy on multi-hop questions within the same document
1681
+ - **Acc. Cross-Doc**: Accuracy on multi-hop questions spanning multiple documents
1682
+
1683
+ ### Attribution (Page F1)
1684
+ - **Attribution (Page F1)**: F1 score measuring overlap between cited pages and gold evidence pages (penalizes both missing and spurious citations)
1685
+ - **Attribution (Doc F1)**: Document-level attribution accuracy (whether the correct documents were identified)
1686
+
1687
+ ### Effort (Kuiper)
1688
+ - **Effort (Kuiper)**: Measures whether computational effort correlates with problem difficulty. Lower values indicate better calibration—the system "knows what it knows" and doesn't waste effort on unsolvable queries
1689
+ """)
1690
+
1691
+ # ===== SUBMIT TAB =====
1692
+ with tab4:
1693
+ if ICON_WRITE:
1694
+ icon_html = f'<img src="{ICON_WRITE}" style="width: 40px; height: 40px; vertical-align: middle; margin-right: 12px;" />'
1695
+ else:
1696
+ icon_html = '<span style="font-size: 36px; margin-right: 12px;">📝</span>'
1697
+ st.markdown(f'<h3 style="display: flex; align-items: center; margin-top: 1.5rem; margin-bottom: 1.2rem;">{icon_html} Submit Results</h3>', unsafe_allow_html=True)
1698
+
1699
+ if not EVAL_AVAILABLE:
1700
+ st.warning("⚠️ Evaluation module not available. Please install dependencies: `pip install anls-star datasets`")
1701
+
1702
+ # Use fragment to prevent tab switch on file upload
1703
+ submit_results_fragment()
1704
+
1705
+
1706
+ if __name__ == "__main__":
1707
+ main()
1708
 
eval/README.md ADDED
@@ -0,0 +1,82 @@
1
+ # Agentic Document AI Evaluation
2
+
3
+ Evaluation library for the [agentic-document-ai/dataset](https://huggingface.co/datasets/agentic-document-ai/dataset) benchmark.
4
+
5
+ ## Installation
6
+
7
+ ```bash
8
+ pip install -r requirements.txt
9
+ ```
10
+
11
+ ## Usage
12
+
13
+ ### Command Line
14
+
15
+ ```bash
16
+ # Basic evaluation
17
+ python evaluate.py results.jsonl
18
+
19
+ # With category/domain breakdown
20
+ python evaluate.py results.jsonl --by-category --by-domain
21
+
22
+ # Compare multiple models
23
+ python evaluate.py model1.jsonl model2.jsonl model3.jsonl --compare
24
+
25
+ # Output as JSON
26
+ python evaluate.py results.jsonl --json
27
+ ```
28
+
29
+ ### Expected Input Format
30
+
31
+ JSONL file with one prediction per line:
32
+
33
+ ```json
34
+ {"id": "test/0", "question": "What is the total revenue?", "answer": "$1.2M", "citations": [{"document": "report.pdf", "page": 5}], "search_history": ["query1", "query2"]}
35
+ ```
36
+
37
+ Required fields:
38
+ - `question`: The question text (used to match with gold standard)
39
+ - `answer`: Predicted answer string
40
+
41
+ Optional fields:
42
+ - `id`: Question ID (fallback if question text doesn't match)
43
+ - `citations`: List of `{document, page}` for citation evaluation
44
+ - `search_history`: List of search queries (for Kuiper effort analysis)
45
+ - `iterations`: Alternative to `search_history` length
46
+
47
+ ### Dataset Splits
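+ A minimal sketch of producing this file (`run_agent` below is a hypothetical placeholder for your own system; only the serialized fields are prescribed):
+ 
+ ```python
+ import json
+ 
+ def run_agent(question: str) -> dict:
+     # Hypothetical stand-in -- replace with your model/agent call.
+     return {
+         "answer": "$1.2M",
+         "citations": [{"document": "report.pdf", "page": 5}],
+         "search_history": ["total revenue", "revenue 2023"],
+     }
+ 
+ questions = [{"id": "test/0", "question": "What is the total revenue?"}]
+ 
+ with open("results.jsonl", "w") as f:
+     for q in questions:
+         pred = run_agent(q["question"])
+         f.write(json.dumps({"id": q["id"], "question": q["question"], **pred}) + "\n")
+ ```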
48
+
49
+ By default, evaluates against the `dev` split. Use `--split test` for test set evaluation.
50
+
51
+ ## Metrics
52
+
53
+ | Metric | Description |
54
+ |--------|-------------|
55
+ | **ANLS\*** | Average Normalized Levenshtein Similarity with optimal element alignment (0-1) |
56
+ | **Accuracy** | Fraction with ANLS* ≥ 0.5 |
57
+ | **Document F1** | Citation accuracy at document level |
58
+ | **Page F1** | Citation accuracy at page level |
59
+ | **Kuiper Statistic** | Effort-accuracy calibration (lower = better) |
60
+ | **Wasted Effort Ratio** | μ_steps(incorrect) / μ_steps(correct) |
61
+
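+ For intuition, Accuracy is simply the fraction of questions whose ANLS\* clears the 0.5 threshold:
+ 
+ ```python
+ scores = [0.92, 0.48, 0.61]  # per-question ANLS* scores
+ accuracy = sum(s >= 0.5 for s in scores) / len(scores)  # 2/3
+ ```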
62
+ ## Python API
63
+
64
+ ```python
65
+ from metrics import anls_star, citation_f1, kuiper_statistic
66
+
67
+ # ANLS* score
68
+ score = anls_star("$1.2 million", [["$1.2M", "1.2 million dollars"]])
69
+
70
+ # Citation F1
71
+ f1 = citation_f1(
72
+ predicted=[{"document": "a.pdf", "page": 1}],
73
+ gold_locations=[{"document": "a.pdf", "page": 1}, {"document": "a.pdf", "page": 2}],
74
+ level='page'
75
+ )
76
+
77
+ # Kuiper statistic
78
+ results = [{"steps": 3, "correct": True}, {"steps": 7, "correct": False}, ...]
79
+ kuiper = kuiper_statistic(results)
80
+ ```
81
+
82
+
eval/evaluate.py ADDED
@@ -0,0 +1,309 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Evaluation CLI for Agentic Document AI.
4
+
5
+ Evaluates model predictions against the agentic-document-ai/dataset benchmark.
6
+
7
+ Usage:
8
+ python evaluate.py results.jsonl [--by-category] [--by-domain]
9
+ python evaluate.py results_*.jsonl --compare
10
+ """
11
+
12
+ import argparse
13
+ import json
14
+ import sys
15
+ from collections import defaultdict
16
+ from pathlib import Path
17
+ from typing import Any, Dict, List, Optional, Tuple
18
+
19
+ from datasets import load_dataset
20
+
21
+ from metrics import anls_star, citation_f1, kuiper_statistic, wasted_effort_ratio
22
+
23
+
24
+ def load_gold_standard(dataset_name: str = "agentic-document-ai/dataset", split: str = "dev"):
25
+ """Load gold standard from HuggingFace dataset.
26
+
27
+ Returns two mappings:
28
+ - by_text: question text -> gold data (primary)
29
+ - by_id: question id -> gold data (fallback)
30
+ """
31
+ print(f"Loading {dataset_name} ({split} split)...")
32
+ dataset = load_dataset(dataset_name, split=split)
33
+
34
+ by_text = {}
35
+ by_id = {}
36
+
37
+ for ex in dataset:
38
+ question = ex['question'].strip()
39
+ qid = ex.get('id', '')
40
+
41
+ gold_data = {
42
+ 'answers': ex.get('answer_variants', []),
43
+ 'evidence': ex.get('evidence', []),
44
+ 'category': ex.get('document_category', ''),
45
+ 'domain': ex.get('domain', '')
46
+ }
47
+
48
+ by_text[question] = gold_data
49
+ if qid:
50
+ by_id[qid] = gold_data
51
+
52
+ print(f"Loaded {len(by_text)} gold examples")
53
+ return by_text, by_id
54
+
55
+
56
+ def load_results(filepath: Path) -> List[Dict]:
57
+ """Load results from JSONL file."""
58
+ results = []
59
+ with open(filepath) as f:
60
+ for line in f:
61
+ if line.strip():
62
+ results.append(json.loads(line))
63
+ return results
64
+
65
+
66
+ def evaluate_single(
67
+ result: Dict,
68
+ gold_by_text: Dict[str, Dict],
69
+ gold_by_id: Dict[str, Dict]
70
+ ) -> Optional[Dict[str, Any]]:
71
+ """Evaluate a single prediction.
72
+
73
+ Matches by question text first, falls back to question ID if not found.
74
+ """
75
+ question = result.get('question', '').strip()
76
+ qid = result.get('id', '')
77
+
78
+ # Try matching by question text first
79
+ if question in gold_by_text:
80
+ gold_data = gold_by_text[question]
81
+ elif qid and qid in gold_by_id:
82
+ # Fallback to ID-based matching
83
+ gold_data = gold_by_id[qid]
84
+ else:
85
+ return None
86
+ answer = result.get('answer', '')
87
+ citations = result.get('citations', [])
88
+
89
+ # ANLS*
90
+ anls = anls_star(answer, gold_data['answers'])
91
+ correct = anls >= 0.5
92
+
93
+ # Citation F1
94
+ doc_f1 = citation_f1(citations, gold_data['evidence'], level='document')
95
+ page_f1 = citation_f1(citations, gold_data['evidence'], level='page')
96
+
97
+ # Steps (for Kuiper)
98
+ search_history = result.get('search_history', [])
99
+ steps = len(search_history) if search_history else result.get('iterations', 0)
100
+
101
+ return {
102
+ 'question': question,
103
+ 'anls': anls,
104
+ 'correct': correct,
105
+ 'doc_f1': doc_f1['f1'],
106
+ 'page_f1': page_f1['f1'],
107
+ 'steps': steps,
108
+ 'category': gold_data['category'],
109
+ 'domain': gold_data['domain']
110
+ }
111
+
112
+
113
+ def aggregate_metrics(evals: List[Dict]) -> Dict[str, Any]:
114
+ """Aggregate metrics across evaluations."""
115
+ if not evals:
116
+ return {}
117
+
118
+ n = len(evals)
119
+ accuracy = sum(e['correct'] for e in evals) / n
120
+ mean_anls = sum(e['anls'] for e in evals) / n
121
+ mean_doc_f1 = sum(e['doc_f1'] for e in evals) / n
122
+ mean_page_f1 = sum(e['page_f1'] for e in evals) / n
123
+
124
+ # Kuiper
125
+ kuiper = kuiper_statistic(evals)
126
+ wasted = wasted_effort_ratio(evals)
127
+
128
+ return {
129
+ 'n': n,
130
+ 'accuracy': accuracy,
131
+ 'mean_anls': mean_anls,
132
+ 'doc_f1': mean_doc_f1,
133
+ 'page_f1': mean_page_f1,
134
+ 'kuiper_stat': kuiper['kuiper_stat'],
135
+ 'kuiper_degenerate': kuiper['degenerate'],
136
+ 'wasted_effort_ratio': wasted['ratio'],
137
+ 'mean_steps_correct': wasted['mean_steps_correct'],
138
+ 'mean_steps_incorrect': wasted['mean_steps_incorrect'],
139
+ }
140
+
141
+
142
+ def print_metrics(name: str, metrics: Dict, indent: int = 0):
143
+ """Print metrics in a formatted way."""
144
+ prefix = " " * indent
145
+
146
+ if 'n' not in metrics:
147
+ print(f"{prefix}{name}: No data")
148
+ return
149
+
150
+ print(f"{prefix}{name} (n={metrics['n']}):")
151
+ print(f"{prefix} Accuracy (ANLS*≥0.5): {metrics['accuracy']:.1%}")
152
+ print(f"{prefix} Mean ANLS*: {metrics['mean_anls']:.4f}")
153
+ print(f"{prefix} Document F1: {metrics['doc_f1']:.4f}")
154
+ print(f"{prefix} Page F1: {metrics['page_f1']:.4f}")
155
+
156
+ if not metrics.get('kuiper_degenerate'):
157
+ print(f"{prefix} Kuiper Statistic: {metrics['kuiper_stat']:.2f}")
158
+
159
+ if metrics.get('wasted_effort_ratio', 0) < float('inf'):
160
+ print(f"{prefix} Wasted Effort Ratio: {metrics['wasted_effort_ratio']:.3f}")
161
+
162
+
163
+ def evaluate_file(
164
+ filepath: Path,
165
+ gold_by_text: Dict[str, Dict],
166
+ gold_by_id: Dict[str, Dict],
167
+ by_category: bool = False,
168
+ by_domain: bool = False
169
+ ) -> Dict[str, Any]:
170
+ """Evaluate a single results file."""
171
+ results = load_results(filepath)
172
+
173
+ evals = []
174
+ unmatched = 0
175
+
176
+ for result in results:
177
+ ev = evaluate_single(result, gold_by_text, gold_by_id)
178
+ if ev:
179
+ evals.append(ev)
180
+ else:
181
+ unmatched += 1
182
+
183
+ if unmatched > 0:
184
+ print(f" Warning: {unmatched} questions not found in gold standard")
185
+
186
+ # Overall metrics
187
+ overall = aggregate_metrics(evals)
188
+
189
+ output = {'overall': overall}
190
+
191
+ # By category
192
+ if by_category:
193
+ by_cat = defaultdict(list)
194
+ for e in evals:
195
+ by_cat[e['category'] or 'Unknown'].append(e)
196
+ output['by_category'] = {cat: aggregate_metrics(items) for cat, items in sorted(by_cat.items())}
197
+
198
+ # By domain
199
+ if by_domain:
200
+ by_dom = defaultdict(list)
201
+ for e in evals:
202
+ by_dom[e['domain'] or 'Other'].append(e)
203
+ output['by_domain'] = {dom: aggregate_metrics(items) for dom, items in sorted(by_dom.items())}
204
+
205
+ return output
206
+
207
+
208
+ def main():
209
+ parser = argparse.ArgumentParser(
210
+ description="Evaluate model predictions on Agentic Document AI benchmark",
211
+ formatter_class=argparse.RawDescriptionHelpFormatter,
212
+ epilog="""
213
+ Examples:
214
+ python evaluate.py results.jsonl
215
+ python evaluate.py results.jsonl --by-category --by-domain
216
+ python evaluate.py model1.jsonl model2.jsonl --compare
217
+ """
218
+ )
219
+ parser.add_argument('files', nargs='+', type=Path, help='Result JSONL file(s)')
220
+ parser.add_argument('--dataset', default='agentic-document-ai/dataset',
221
+ help='HuggingFace dataset name')
222
+ parser.add_argument('--split', default='dev', help='Dataset split to evaluate on')
223
+ parser.add_argument('--by-category', action='store_true', help='Show metrics by document category')
224
+ parser.add_argument('--by-domain', action='store_true', help='Show metrics by domain')
225
+ parser.add_argument('--compare', action='store_true', help='Compare multiple models side-by-side')
226
+ parser.add_argument('--json', action='store_true', help='Output as JSON')
227
+
228
+ args = parser.parse_args()
229
+
230
+ # Load gold standard
231
+ gold_by_text, gold_by_id = load_gold_standard(args.dataset, args.split)
232
+
233
+ if not gold_by_text:
234
+ print("Error: No gold standard data loaded", file=sys.stderr)
235
+ sys.exit(1)
236
+
237
+ all_results = {}
238
+
239
+ for filepath in args.files:
240
+ if not filepath.exists():
241
+ print(f"Error: File not found: {filepath}", file=sys.stderr)
242
+ continue
243
+
244
+ # Extract model name
245
+ name = filepath.stem
246
+ if name.startswith("results_"):
247
+ name = name[8:]
248
+ if name.endswith("_results"):
249
+ name = name[:-8]
250
+
251
+ print(f"\nEvaluating: {filepath.name}")
252
+ result = evaluate_file(filepath, gold_by_text, gold_by_id, args.by_category, args.by_domain)
253
+ all_results[name] = result
254
+
255
+ # Output
256
+ if args.json:
257
+ # Convert for JSON serialization
258
+ def sanitize(obj):
259
+ if isinstance(obj, float) and (obj != obj or abs(obj) == float('inf')): # NaN or ±inf
260
+ return None
261
+ if isinstance(obj, dict):
262
+ return {k: sanitize(v) for k, v in obj.items()}
263
+ if isinstance(obj, list):
264
+ return [sanitize(v) for v in obj]
265
+ return obj
266
+
267
+ print(json.dumps(sanitize(all_results), indent=2))
268
+ else:
269
+ # Print formatted output
270
+ print("\n" + "=" * 70)
271
+ print("EVALUATION RESULTS")
272
+ print("=" * 70)
273
+
274
+ if args.compare and len(all_results) > 1:
275
+ # Comparison table
276
+ models = list(all_results.keys())
277
+
278
+ print(f"\n{'Model':<35} {'Acc':<8} {'ANLS*':<8} {'Doc F1':<8} {'Page F1':<8} {'Kuiper':<8}")
279
+ print("-" * 75)
280
+
281
+ for model in sorted(models, key=lambda m: -all_results[m]['overall'].get('accuracy', 0)):
282
+ m = all_results[model]['overall']
283
+ kuiper_str = f"{m['kuiper_stat']:.2f}" if not m.get('kuiper_degenerate') else "N/A"
284
+ print(f"{model:<35} {m.get('accuracy', 0):.1%} {m.get('mean_anls', 0):.4f} "
285
+ f"{m.get('doc_f1', 0):.4f} {m.get('page_f1', 0):.4f} {kuiper_str}")
286
+ else:
287
+ # Detailed per-model output
288
+ for model, result in all_results.items():
289
+ print(f"\n{'─' * 40}")
290
+ print_metrics(model, result['overall'])
291
+
292
+ if 'by_category' in result:
293
+ print(f"\n By Category:")
294
+ for cat, metrics in sorted(result['by_category'].items(),
295
+ key=lambda x: -x[1].get('n', 0)):
296
+ print_metrics(cat, metrics, indent=2)
297
+
298
+ if 'by_domain' in result:
299
+ print(f"\n By Domain:")
300
+ for dom, metrics in sorted(result['by_domain'].items(),
301
+ key=lambda x: -x[1].get('n', 0)):
302
+ print_metrics(dom, metrics, indent=2)
303
+
304
+ print()
305
+
306
+
307
+ if __name__ == "__main__":
308
+ main()
309
+
eval/metrics.py ADDED
@@ -0,0 +1,209 @@
1
+ """
2
+ Core evaluation metrics for document QA.
3
+
4
+ Metrics:
5
+ - ANLS*: Average Normalized Levenshtein Similarity (with optimal element alignment)
6
+ - Citation F1: Document-level and Page-level F1 scores
7
+ - Kuiper Statistic: Effort-accuracy calibration measure
8
+ """
9
+
10
+ from typing import Any, Dict, List, Set
11
+ import numpy as np
12
+ from anls_star import anls_score
13
+
14
+
15
+ def anls_star(predicted: Any, ground_truths: List[List[str]]) -> float:
16
+ """
17
+ Calculate ANLS* score (case-insensitive).
18
+
19
+ Args:
20
+ predicted: Predicted answer (string or list)
21
+ ground_truths: List of answer variants, each variant is a list of strings
22
+
23
+ Returns:
24
+ Maximum ANLS* score across all variants (0.0 to 1.0)
25
+ """
26
+ if not ground_truths:
27
+ return 0.0
28
+
29
+ if predicted is None:
30
+ predicted = []
31
+
32
+ if isinstance(predicted, str):
33
+ predicted = [predicted]
34
+
35
+ if not predicted:
36
+ return 0.0
37
+
38
+ # Convert all elements to lowercase strings
39
+ pred_lower = [str(p).lower() for p in predicted]
40
+
41
+ max_score = 0.0
42
+ for gold_variant in ground_truths:
43
+ if isinstance(gold_variant, str):
44
+ gold_variant = [gold_variant]
45
+ gold_lower = [g.lower() if isinstance(g, str) else str(g).lower() for g in gold_variant]
46
+ score = anls_score(pred_lower, gold_lower)
47
+ max_score = max(max_score, score)
48
+
49
+ return max_score
50
+
51
+
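+ # Illustrative usage: each gold variant is itself a list of acceptable strings,
+ # and the score is the maximum over variants, computed case-insensitively:
+ #   anls_star("$1.2M", [["$1.2m"], ["1.2 million dollars"]])  # -> 1.0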
52
+ def citation_f1(
53
+ predicted_citations: List[Dict[str, Any]],
54
+ gold_locations: List[Dict[str, Any]],
55
+ level: str = 'page'
56
+ ) -> Dict[str, float]:
57
+ """
58
+ Calculate Citation F1 at document or page level.
59
+
60
+ Args:
61
+ predicted_citations: List of dicts with 'file'/'document' and 'page' keys
62
+ gold_locations: List of dicts with 'document' and 'page' keys
63
+ level: 'document' or 'page'
64
+
65
+ Returns:
66
+ Dict with 'precision', 'recall', 'f1', 'support'
67
+ """
68
+ if not gold_locations:
69
+ return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'support': 0}
70
+
71
+ # Extract gold citations
72
+ if level == 'document':
73
+ gt_set: Set = {loc.get('document') for loc in gold_locations if loc.get('document')}
74
+ else:
75
+ gt_set = {
76
+ (loc.get('document'), loc.get('page'))
77
+ for loc in gold_locations
78
+ if loc.get('document') is not None
79
+ }
80
+
81
+ # Extract predicted citations
82
+ if not predicted_citations:
83
+ pred_set: Set = set()
84
+ else:
85
+ if level == 'document':
86
+ pred_set = {
87
+ cite.get('file') or cite.get('document')
88
+ for cite in predicted_citations
89
+ if (cite.get('file') or cite.get('document'))
90
+ }
91
+ else:
92
+ pred_set = {
93
+ (cite.get('file') or cite.get('document'), cite.get('page'))
94
+ for cite in predicted_citations
95
+ if (cite.get('file') or cite.get('document')) is not None
96
+ }
97
+
98
+ # Clean None values
99
+ gt_set = {c for c in gt_set if c is not None and (not isinstance(c, tuple) or None not in c)}
100
+ pred_set = {c for c in pred_set if c is not None and (not isinstance(c, tuple) or None not in c)}
101
+
102
+ if not gt_set:
103
+ return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'support': 0}
104
+
105
+ tp = len(gt_set & pred_set)
106
+ precision = tp / len(pred_set) if pred_set else 0.0
107
+ recall = tp / len(gt_set) if gt_set else 0.0
108
+ f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
109
+
110
+ return {'precision': precision, 'recall': recall, 'f1': f1, 'support': len(gt_set)}
111
+
112
+
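+ # Worked example: gold evidence covers pages (a.pdf, 1) and (a.pdf, 2) but the
+ # model cites only (a.pdf, 1); at page level this gives precision = 1/1,
+ # recall = 1/2, and F1 = 2/3.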
113
+ def kuiper_statistic(results: List[Dict]) -> Dict[str, Any]:
114
+ """
115
+ Compute Kuiper calibration statistic for effort-accuracy analysis.
116
+
117
+ Measures dependency between effort (steps) and accuracy. Lower values
118
+ indicate more uniform error distribution across effort levels.
119
+
120
+ Args:
121
+ results: List of dicts with 'steps' (int) and 'correct' (bool)
122
+
123
+ Returns:
124
+ Dict with:
125
+ - kuiper_stat: The Kuiper statistic (lower = better calibration)
126
+ - y_bar: Global mean accuracy
127
+ - max_positive: Maximum positive deviation
128
+ - max_negative: Maximum negative deviation
129
+ - n_samples: Number of valid samples
130
+ - degenerate: True if all samples have same correctness
131
+ """
132
+ valid = [r for r in results if r.get('steps', 0) > 0]
133
+
134
+ if not valid:
135
+ return {
136
+ 'kuiper_stat': float('nan'),
137
+ 'y_bar': 0.0,
138
+ 'max_positive': 0.0,
139
+ 'max_negative': 0.0,
140
+ 'n_samples': 0,
141
+ 'degenerate': True
142
+ }
143
+
144
+ # Sort by steps
145
+ sorted_results = sorted(valid, key=lambda x: x['steps'])
146
+ correctness = [1 if r['correct'] else 0 for r in sorted_results]
147
+
148
+ y_bar = np.mean(correctness)
149
+
150
+ # Degenerate case: all same (0% or 100% accuracy)
151
+ if y_bar == 0.0 or y_bar == 1.0:
152
+ return {
153
+ 'kuiper_stat': float('nan'),
154
+ 'y_bar': float(y_bar),
155
+ 'max_positive': 0.0,
156
+ 'max_negative': 0.0,
157
+ 'n_samples': len(valid),
158
+ 'degenerate': True
159
+ }
160
+
161
+ # Cumulative difference: D_k = Σ(y_i - ȳ)
162
+ residuals = np.array(correctness) - y_bar
163
+ cumulative_diff = np.cumsum(residuals)
164
+
165
+ max_positive = float(np.max(cumulative_diff))
166
+ max_negative = float(np.min(cumulative_diff))
167
+ kuiper_stat = max_positive - max_negative
168
+
169
+ return {
170
+ 'kuiper_stat': kuiper_stat,
171
+ 'y_bar': float(y_bar),
172
+ 'max_positive': max_positive,
173
+ 'max_negative': max_negative,
174
+ 'n_samples': len(valid),
175
+ 'degenerate': False
176
+ }
177
+
178
+
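+ # Worked example (already sorted by steps): correctness = [1, 1, 0, 0] gives
+ # y_bar = 0.5, residuals = [0.5, 0.5, -0.5, -0.5], cumulative sums
+ # [0.5, 1.0, 0.5, 0.0], and kuiper_stat = 1.0 - 0.0 = 1.0. Errors spread
+ # uniformly across effort levels keep the cumulative sum near zero and
+ # therefore yield a smaller statistic.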
179
+ def wasted_effort_ratio(results: List[Dict]) -> Dict[str, float]:
180
+ """
181
+ Compute Wasted Effort Ratio: μ_steps(Incorrect) / μ_steps(Correct).
182
+
183
+ - ρ > 1: Model grinds on unsolved problems (poor calibration)
184
+ - ρ ≈ 1: Model spends similar effort regardless of outcome
185
+ - ρ < 1: Model fails fast (good calibration)
186
+
187
+ Args:
188
+ results: List of dicts with 'steps' and 'correct'
189
+
190
+ Returns:
191
+ Dict with 'ratio', 'mean_steps_correct', 'mean_steps_incorrect', plus 'n_correct' and 'n_incorrect' counts
192
+ """
193
+ correct_steps = [r['steps'] for r in results if r.get('correct') and r.get('steps', 0) > 0]
194
+ incorrect_steps = [r['steps'] for r in results if not r.get('correct') and r.get('steps', 0) > 0]
195
+
196
+ mean_correct = float(np.mean(correct_steps)) if correct_steps else 0.0
197
+ mean_incorrect = float(np.mean(incorrect_steps)) if incorrect_steps else 0.0
198
+
199
+ ratio = mean_incorrect / mean_correct if mean_correct > 0 else float('inf')
200
+
201
+ return {
202
+ 'ratio': ratio,
203
+ 'mean_steps_correct': mean_correct,
204
+ 'mean_steps_incorrect': mean_incorrect,
205
+ 'n_correct': len(correct_steps),
206
+ 'n_incorrect': len(incorrect_steps)
207
+ }
208
+
209
+
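+ # Worked example: correct runs take [2, 4] steps (mean 3.0) and incorrect runs
+ # take [6, 6] steps (mean 6.0), so ratio = 2.0: the model spends twice the
+ # effort on questions it ultimately gets wrong (poor calibration).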
eval/requirements.txt ADDED
@@ -0,0 +1,5 @@
1
+ anls-star>=0.1.0
2
+ datasets>=2.14.0
3
+ numpy>=1.24.0
4
+
5
+
requirements.txt CHANGED
@@ -1,17 +1,10 @@
1
- APScheduler
2
- black
3
- datasets
4
- gradio
5
- gradio[oauth]
6
- gradio_client
7
- gradio_leaderboard==0.0.13
8
- huggingface-hub>=0.18.0
9
- matplotlib
10
- numpy<2.0
11
  pandas
12
  plotly
13
  python-dateutil
14
- sentencepiece
15
- tokenizers>=0.15.0
16
- tqdm
17
- transformers
 
1
+ streamlit>=1.37.0
2
  pandas
3
  plotly
4
+ huggingface-hub>=0.18.0
5
+ numpy<2.0
6
  python-dateutil
7
+ # Evaluation dependencies
8
+ anls-star>=0.1.0
9
+ datasets>=2.14.0
10
+