Borchmann committed on
Commit 7abb0ba · verified · 1 Parent(s): 86d62d0

Upload folder using huggingface_hub

Files changed (8):
  1. .streamlit/config.toml +11 -0
  2. README.md +37 -119
  3. app.py +1623 -380
  4. eval/README.md +82 -0
  5. eval/evaluate.py +309 -0
  6. eval/metrics.py +209 -0
  7. eval/requirements.txt +5 -0
  8. requirements.txt +7 -14
.streamlit/config.toml ADDED
@@ -0,0 +1,11 @@
+ [theme]
+ # Snowflake Blue as primary color (controls tabs, checkboxes, buttons)
+ primaryColor = "#29B5E8"
+ backgroundColor = "#0e1117"
+ secondaryBackgroundColor = "#1a1a2e"
+ textColor = "#ffffff"
+ font = "sans serif"
+
+ [server]
+ headless = true
+
README.md CHANGED
@@ -1,141 +1,59 @@
  ---
  title: Agentic Document AI Leaderboard
- emoji: 🤖📄
- colorFrom: green
  colorTo: indigo
- sdk: gradio
  app_file: app.py
- pinned: true
- license: apache-2.0
- short_description: Leaderboard for evaluating AI agents
- sdk_version: 5.43.1
- tags:
- - leaderboard
- - document-ai
- - agents
  ---

- # 🤖📄 Agentic Document AI Leaderboard

- A leaderboard for evaluating AI agents on complex document understanding tasks that require multi-step reasoning and evidence gathering across documents.

- ## 📊 Metrics

- The benchmark evaluates models using **ANLS (Average Normalized Levenshtein Similarity)** across four task categories:

- 1. **ANLS (Overall)** - Main score across the entire dataset
- 2. **ANLS (Single Evidence)** - Questions requiring single evidence extraction
- 3. **ANLS (Multi-Evidence, Same Doc)** - Combining evidence within one document
- 4. **ANLS (Multi-Evidence, Multi Doc)** - Synthesizing across multiple documents

- Additionally, we track:
- - **Agent Steps**: Total number of reasoning/action steps
- - **Cost (USD)**: Estimated inference cost
-
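Every headline number in this diff is an ANLS score, so it is worth pinning the definition down. Below is a minimal editorial sketch of the metric's usual form (normalized Levenshtein similarity against the closest gold answer, zeroed below a threshold, conventionally 0.5); the repository's own implementation is `anls_star` in `eval/metrics.py` and may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(prediction: str, golds: list[str], threshold: float = 0.5) -> float:
    """Similarity of a prediction to its closest gold answer, in [0, 1]."""
    best = 0.0
    for gold in golds:
        p, g = prediction.strip().lower(), gold.strip().lower()
        denom = max(len(p), len(g)) or 1  # avoid division by zero on empty strings
        best = max(best, 1.0 - levenshtein(p, g) / denom)
    return best if best >= threshold else 0.0
```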
- ## 🚀 How to Submit
-
- ### 1. Run Your Model
-
- Run your model/agent on the Agentic Document AI benchmark dataset.
-
- ### 2. Prepare Your Predictions File
-
- Create a JSONL file where each line contains one prediction (see `submission_template.jsonl` for examples; a validation sketch follows below):
-
- ```jsonl
- {"question": "What is Dr. McElhaney's position at AMRIC?", "answer": ["Senior Scientist"], "citations": [{"file": "1307326.pdf", "page": 1}], "iterations": 1, "id": "q_4"}
- {"question": "Who is the CEO of the company?", "answer": ["John Smith"], "citations": [{"file": "company_report.pdf", "page": 3}], "iterations": 2, "id": "q_5"}
- {"question": "What was the revenue in 2023?", "answer": ["$5.2 million"], "citations": [{"file": "financial_report.pdf", "page": 12}, {"file": "annual_summary.pdf", "page": 4}], "iterations": 3, "id": "q_6"}
  ```

- **Required fields per line:**
- - `question`: The question text (string)
- - `answer`: List of answer strings
- - `citations`: List of dicts with `"file"` and `"page"` keys
- - `iterations`: Number of agent iterations/steps (integer ≥ 0)
- - `id`: Unique question identifier matching the benchmark (string)
-
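Those field requirements translate directly into a pre-submission check. A hypothetical validator (an editorial sketch, not the space's own validation in `src/submission/submit.py`):

```python
import json

# Required fields and their expected JSON types, per the list above.
REQUIRED = {"question": str, "answer": list, "citations": list, "iterations": int, "id": str}


def validate_predictions(path: str) -> list[str]:
    """Return a list of per-line format errors for a predictions JSONL file."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc})")
                continue
            for field, expected in REQUIRED.items():
                if not isinstance(record.get(field), expected):
                    errors.append(f"line {lineno}: missing or mistyped field '{field}'")
            for citation in record.get("citations", []):
                if not isinstance(citation, dict) or not {"file", "page"} <= citation.keys():
                    errors.append(f"line {lineno}: each citation needs 'file' and 'page'")
            if isinstance(record.get("iterations"), int) and record["iterations"] < 0:
                errors.append(f"line {lineno}: 'iterations' must be >= 0")
    return errors
```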
- ### 3. Submit via the Interface
-
- 1. Go to the "🚀 Submit Results" tab
- 2. Fill in:
-    - **Model Name**: A descriptive name for your system (e.g., "GPT-4-Agent-v1")
-    - **Submitted By**: Your name or organization
-    - **Model Type**: Whether your model is behind an API or uses open weights
-    - **Predictions JSONL File**: Upload your JSONL file
- 3. Click "Submit Evaluation"
- 4. The system will:
-    - Validate your JSONL format
-    - Evaluate against the gold standard
-    - Compute ANLS scores automatically
-    - Display results on the leaderboard
-
- ## ⚙️ Configuration

- Most configuration variables are in:
- - `src/envs.py` - Repository paths and API configuration
- - `src/about.py` - Task definitions and benchmark description
-
- ## 🔬 Implementing the Evaluator
-
- **IMPORTANT:** You need to implement the evaluation logic in `src/evaluation/evaluator.py`.
-
- The evaluator should:
- 1. Load your gold standard dataset with correct answers and metadata
- 2. Compute ANLS (Average Normalized Levenshtein Similarity) for each prediction
- 3. Classify questions by evidence type (single/multi-doc same/multi-doc different)
- 4. Aggregate scores by category
- 5. Calculate agent steps and cost metrics
-
- See `src/evaluation/evaluator.py` for the template and detailed TODOs.
-
- **Current Status:** The system uses placeholder scores (0.50) until you implement the evaluator.
-
- To integrate your evaluator:
- 1. Implement functions in `src/evaluation/evaluator.py`
- 2. Uncomment lines 120-122 in `src/submission/submit.py`
- 3. Test with a sample submission
-
- ## 🗂️ Project Structure

- ```
- ├── app.py                     # Main Gradio application
- ├── src/
- │   ├── about.py               # Benchmark description and tasks
- │   ├── envs.py                # Environment configuration
- │   ├── display/
- │   │   ├── utils.py           # Column definitions and data types
- │   │   ├── formatting.py      # Display formatting utilities
- │   │   └── css_html_js.py     # Custom styling
- │   ├── evaluation/
- │   │   └── evaluator.py       # ⚠️ IMPLEMENT THIS: ANLS evaluation logic
- │   ├── leaderboard/
- │   │   └── read_evals.py      # Result parsing logic
- │   ├── submission/
- │   │   ├── submit.py          # Submission handling & validation
- │   │   └── check_validity.py  # Duplicate checking
- │   └── populate.py            # Dataframe population
- ├── eval-queue/                # Submission requests (auto-generated)
- ├── eval-results/              # Predictions & results (auto-generated)
- ├── submission_template.jsonl  # Template for submissions
- └── ADAPTATION_SUMMARY.md      # Detailed adaptation notes
- ```

- ## 🔧 Troubleshooting

- If you encounter problems with the space:
- - Restart the space to clear cached data
- - Check that `eval-queue` and `eval-results` directories are properly synced with HuggingFace datasets
- - Verify your environment variables in `src/envs.py` are correctly configured

- ## 📝 Code Logic

- For advanced customization:
- - **Column definitions**: `src/display/utils.py`
- - **Result parsing**: `src/leaderboard/read_evals.py`
- - **Submission logic**: `src/submission/submit.py` and `src/submission/check_validity.py`
- - **UI layout**: `app.py`

- ## 📚 Additional Documentation

- See `ADAPTATION_SUMMARY.md` for detailed information about the changes made to adapt this from the HuggingFace leaderboard template.
 
  ---
  title: Agentic Document AI Leaderboard
+ emoji: 📄
+ colorFrom: blue
  colorTo: indigo
+ sdk: streamlit
+ sdk_version: "1.37.0"
  app_file: app.py
+ pinned: false
+ hf_oauth: true
  ---

+ # Agentic Document AI Leaderboard - Streamlit Version

+ This is a Streamlit port of the Agentic Document AI Leaderboard.

+ ## Features

+ - 📊 **Leaderboard**: View and filter model performance rankings
+ - 📈 **Visualizations**: Interactive plots of accuracy versus attribution and effort
+ - 📖 **About**: Information about the benchmark and metrics
+ - 📝 **Submit**: Validate and submit your model results

+ ## Installation

+ ```bash
+ cd streamlit_app
+ pip install -r requirements.txt
  ```

+ ## Running the App

+ ```bash
+ streamlit run app.py
+ ```

+ The app will open in your browser at `http://localhost:8501`.

+ ## Color Palette (Snowflake)

+ - SNOWFLAKE BLUE: #29B5E8
+ - MID-BLUE: #11567F
+ - STAR BLUE: #75CDD7
+ - VALENCIA ORANGE: #FF9F36
+ - FIRST LIGHT: #D45B90
+ - MEDIUM GRAY: #5B5B5B

+ ## Differences from Gradio Version

+ 1. **Native Streamlit components** instead of gradio_leaderboard
+ 2. **Simplified submission flow** - validates but doesn't upload to HuggingFace Hub
+ 3. **Native dataframe display** with column configuration
+ 4. **Streamlit tabs** instead of Gradio tabs

+ ## Data

+ The app reads evaluation results from the `../eval-results/` directory (relative to this app).
+ Make sure the eval-results folder exists with JSON result files.
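For orientation, the Streamlit app's `load_eval_results()` (see the app.py diff below) scans `eval-results/<org>/` for files matching `*_results_*.json`. A sketch of one such file, with hypothetical values throughout; the field names mirror what that function reads:

```python
import json
from pathlib import Path

# Hypothetical example submission; every value here is made up for illustration.
result = {
    "model_name": "example-agent",
    "organization": "example-org",
    "metadata": {"model_type": "api"},
    "tags": ["Agentic", "BM25 Search Tool"],
    "results": {
        "overall": {"anls": 61.2, "page_f1": 48.7, "doc_f1": 70.1, "kuiper": 0.21},
        "single_evidence": {"anls": 72.4},
        "multi_evidence_same_doc": {"anls": 55.0},
        "multi_evidence_multi_doc": {"anls": 41.9},
        "by_domain": {"finance": {"anls": 58.3, "n": 120}},
    },
    "submission_date": "2025-01-01",
    "link": "https://example.org/model-card",
    "description": "Example leaderboard entry",
}

out = Path("eval-results/example-org/example-agent_results_2025-01-01.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(result, indent=2))
```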
app.py CHANGED
@@ -1,5 +1,9 @@
  """
- Agentic Document AI Leaderboard
 
  Color palette: Snowflake colors
  - SNOWFLAKE BLUE: #29B5E8
@@ -12,454 +16,1693 @@ Color palette: Snowflake colors
  - PURPLE MOON: #7254A3
  """
 
  import os
- from typing import Optional
 
- import gradio as gr
  import pandas as pd
  import plotly.graph_objects as go
- from apscheduler.schedulers.background import BackgroundScheduler
- from gradio_leaderboard import Leaderboard, SelectColumns
- from huggingface_hub import snapshot_download
-
- from src.about import LLM_BENCHMARKS_TEXT, TITLE
- from src.display.css_html_js import custom_css
- from src.display.utils import BENCHMARK_COLS, COLS, EVAL_COLS, EVAL_TYPES, AutoEvalColumn, ModelType, Tasks, fields
- from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, QUEUE_REPO, REPO_ID, RESULTS_REPO, TOKEN
- from src.populate import get_evaluation_queue_df, get_leaderboard_df
- from src.submission.submit import add_new_eval
-
- # Set static directory for assets
- # Note: Must be absolute path for gr.set_static_paths
- ASSETS_PATH = os.path.abspath("assets")
- gr.set_static_paths(paths=[ASSETS_PATH])
-
-
- # Load SVG icons
- def load_svg(filename):
-     """Load SVG file and return as string"""
-     svg_path = os.path.join("assets", filename)
42
      try:
-         with open(svg_path, "r") as f:
-             return f.read()
      except Exception:
          return ""
 
 
- # Load tab icons
- ICON_MEDAL = load_svg("snow_medal.svg")
- ICON_PLOT = load_svg("snow_eye.svg")
- ICON_DOC = load_svg("snow_docs.svg")
- ICON_WRITE = load_svg("snow_write.svg")
- ICON_CLOUD = load_svg("snow_cloud2.svg")
- ICON_CODE = load_svg("snow_code.svg")
 
- # Tab brand colors
- LEADERBOARD_COLOR = "--body-text-color"  # Snowflake blue
- VISUALIZATIONS_COLOR = "--body-text-color"  # Valencia orange
- ABOUT_COLOR = "--body-text-color"  # Purple moon
- SUBMIT_COLOR = "--body-text-color"  # First light
 
 
 
- def render_tab_header(title: str, icon_svg: Optional[str] = None, color: str = LEADERBOARD_COLOR) -> str:
-     """Generate HTML string for tab header with optional SVG icon."""
-     icon_style = f'style="--tab-icon-color: {color};"' if icon_svg else ""
-     icon_block = f'<span class="tab-icon" {icon_style}>{icon_svg}</span>' if icon_svg else ""
-     return f'<div class="tab-title">{icon_block}<h1 style="color: {color};">{title}</h1></div>'
 
 
- def restart_space():
-     API.restart_space(repo_id=REPO_ID)
 
 
- def create_plot_df(leaderboard_df):
-     """Extract data for plotting from leaderboard dataframe."""
-     if leaderboard_df is None or leaderboard_df.empty:
          return pd.DataFrame()
 
-     # Get the first task column name (Overall ANLS)
-     first_task_col = list(Tasks)[0].value.col_name
 
-     plot_data = []
-     for _, row in leaderboard_df.iterrows():
-         try:
-             # Extract model name (remove markdown links)
-             model_text = row.get(AutoEvalColumn.model.name, "Unknown")
-             if isinstance(model_text, str):
-                 # Extract text from markdown link [text](url)
-                 import re
 
-                 match = re.search(r"\[([^\]]+)\]", model_text)
-                 model_name = match.group(1) if match else model_text
-             else:
-                 model_name = str(model_text)
-
-             plot_data.append(
-                 {
-                     "model": model_name,
-                     "anls": row.get(first_task_col, 0),
-                     "agent_steps": row.get(AutoEvalColumn.agent_steps.name, 0),
-                     "cost_usd": row.get(AutoEvalColumn.cost_usd.name, 0),
-                     "model_type": row.get(AutoEvalColumn.model_type.name, "unknown"),
-                 }
-             )
-         except Exception as e:
-             print(f"Error processing row: {e}")
-             continue
 
-     return pd.DataFrame(plot_data)
 
 
- def create_anls_vs_steps_plot(leaderboard_df):
-     """Create scatter plot of ANLS vs Agent Steps."""
-     df = create_plot_df(leaderboard_df)
 
      if df.empty:
          fig = go.Figure()
          fig.add_annotation(
-             text="No data available", xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False, font=dict(size=20)
          )
          return fig
-
-     # Snowflake color palette
      color_map = {
-         "api": "#D45B90",  # FIRST LIGHT
-         "open-weight": "#29B5E8",  # SNOWFLAKE BLUE
-         "unknown": "#5B5B5B",  # MEDIUM GRAY
      }
-
      fig = go.Figure()
-
-     for model_type in df["model_type"].unique():
-         df_type = df[df["model_type"] == model_type]
-         fig.add_trace(
-             go.Scatter(
-                 x=df_type["agent_steps"],
-                 y=df_type["anls"],
-                 mode="markers+text",
-                 name=model_type,
-                 text=df_type["model"],
-                 textposition="top center",
-                 textfont=dict(size=9),
-                 marker=dict(size=12, color=color_map.get(model_type, "#95A5A6"), line=dict(width=1, color="white")),
-                 hovertemplate="<b>%{text}</b><br>Agent Steps: %{x}<br>ANLS: %{y:.2f}<extra></extra>",
-             )
-         )
-
      fig.update_layout(
-         title="ANLS Score vs Agent Steps",
-         xaxis_title="Agent Steps",
-         yaxis_title="ANLS Score (%)",
          hovermode="closest",
-         template="plotly_white",
-         height=600,
          showlegend=True,
-         legend=dict(title="Model Type", yanchor="top", y=0.99, xanchor="right", x=0.99),
      )
-
      return fig
 
 
- def create_anls_vs_cost_plot(leaderboard_df):
-     """Create scatter plot of ANLS vs Cost (USD)."""
-     df = create_plot_df(leaderboard_df)
-
      if df.empty:
          fig = go.Figure()
          fig.add_annotation(
-             text="No data available", xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False, font=dict(size=20)
          )
          return fig
-
-     # Filter out models with zero cost for better visualization
-     df_with_cost = df[df["cost_usd"] > 0]
-
-     if df_with_cost.empty:
-         df_with_cost = df  # Fall back to all data if no cost data
-
-     # Snowflake color palette
      color_map = {
-         "api": "#FF9F36",  # VALENCIA ORANGE
-         "open-weight": "#29B5E8",  # SNOWFLAKE BLUE
-         "unknown": "#5B5B5B",  # MEDIUM GRAY
      }
-
      fig = go.Figure()
-
-     for model_type in df_with_cost["model_type"].unique():
-         df_type = df_with_cost[df_with_cost["model_type"] == model_type]
-         fig.add_trace(
-             go.Scatter(
-                 x=df_type["cost_usd"],
-                 y=df_type["anls"],
-                 mode="markers+text",
-                 name=model_type,
-                 text=df_type["model"],
-                 textposition="top center",
-                 textfont=dict(size=9),
-                 marker=dict(size=12, color=color_map.get(model_type, "#95A5A6"), line=dict(width=1, color="white")),
-                 hovertemplate="<b>%{text}</b><br>Cost: $%{x:.2f}<br>ANLS: %{y:.2f}<extra></extra>",
-             )
-         )
-
      fig.update_layout(
-         title="ANLS Score vs Cost (USD)",
-         xaxis_title="Cost (USD)",
-         yaxis_title="ANLS Score (%)",
          hovermode="closest",
-         template="plotly_white",
-         height=600,
          showlegend=True,
-         legend=dict(title="Model Type", yanchor="top", y=0.99, xanchor="right", x=0.99),
      )
-
      return fig
 
 
219
- ### Space initialisation
220
- try:
221
- print(EVAL_REQUESTS_PATH)
222
- snapshot_download(
223
- repo_id=QUEUE_REPO,
224
- local_dir=EVAL_REQUESTS_PATH,
225
- repo_type="dataset",
226
- tqdm_class=None,
227
- etag_timeout=30,
228
- token=TOKEN,
229
- )
230
- except Exception:
231
- restart_space()
232
- try:
233
- print(EVAL_RESULTS_PATH)
234
- snapshot_download(
235
- repo_id=RESULTS_REPO,
236
- local_dir=EVAL_RESULTS_PATH,
237
- repo_type="dataset",
238
- tqdm_class=None,
239
- etag_timeout=30,
240
- token=TOKEN,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
241
  )
242
- except Exception:
243
- restart_space()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
244
 
245
 
246
- LEADERBOARD_DF = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, COLS, BENCHMARK_COLS)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
- (
-     finished_eval_queue_df,
-     running_eval_queue_df,
-     pending_eval_queue_df,
- ) = get_evaluation_queue_df(EVAL_REQUESTS_PATH, EVAL_COLS)
 
 
- def init_leaderboard(dataframe):
-     if dataframe is None or dataframe.empty:
-         raise ValueError("Leaderboard DataFrame is empty or None.")
 
-     # Calculate dynamic filter ranges from actual data
-     max_agent_steps = int(dataframe[AutoEvalColumn.agent_steps.name].max()) if len(dataframe) > 0 else 1000
-     max_cost = float(dataframe[AutoEvalColumn.cost_usd.name].max()) if len(dataframe) > 0 else 10.0
 
-     # Add some headroom to max values
-     max_agent_steps = max(max_agent_steps + 100, 1000)
-     max_cost = max(max_cost + 1.0, 10.0)
 
-     return Leaderboard(
-         value=dataframe,
-         datatype=[c.type for c in fields(AutoEvalColumn)],
-         select_columns=SelectColumns(
-             default_selection=[c.name for c in fields(AutoEvalColumn) if c.displayed_by_default],
-             cant_deselect=[c.name for c in fields(AutoEvalColumn) if c.never_hidden],
-             label="Select columns to display:",
-         ),
-         search_columns=[AutoEvalColumn.model.name, AutoEvalColumn.organization.name],
-         hide_columns=[c.name for c in fields(AutoEvalColumn) if c.hidden] + ["Type"],
-         bool_checkboxgroup_label="Hide models",
-         interactive=False,
      )
 
 
- demo = gr.Blocks(
-     css=custom_css,
-     theme=gr.themes.Default(
-         primary_hue=gr.themes.Color(
-             c50="#E6F7FC",
-             c100="#B3E5F5",
-             c200="#80D3ED",
-             c300="#4DC1E5",
-             c400="#29B5E8",  # SNOWFLAKE BLUE
-             c500="#29B5E8",  # SNOWFLAKE BLUE (primary)
-             c600="#11567F",  # MID-BLUE
-             c700="#0D4464",
-             c800="#093248",
-             c900="#05202D",
-             c950="#021018",
-             name="snowflake_blue",
-         ),
-         secondary_hue=gr.themes.Color(
-             c50="#FFF4E6",
-             c100="#FFE4B3",
-             c200="#FFD480",
-             c300="#FFC44D",
-             c400="#FFB41A",
-             c500="#FF9F36",  # VALENCIA ORANGE
-             c600="#E68A1F",
-             c700="#CC7A1B",
-             c800="#B36A17",
-             c900="#995A13",
-             c950="#804A0F",
-             name="valencia_orange",
-         ),
-         neutral_hue=gr.themes.Color(
-             c50="#F5F5F5",
-             c100="#E0E0E0",
-             c200="#CCCCCC",
-             c300="#B8B8B8",
-             c400="#A3A3A3",
-             c500="#8F8F8F",
-             c600="#7A7A7A",
-             c700="#5B5B5B",  # MEDIUM GRAY
-             c800="#474747",
-             c900="#333333",
-             c950="#1F1F1F",
-             name="medium_gray",
-         ),
-     ),
- )
- with demo:
-     gr.HTML(TITLE)
-
-     with gr.Tabs(elem_classes="tab-buttons") as tabs:
-         with gr.TabItem("Leaderboard", elem_id="llm-benchmark-tab-table", id=0):
-             with gr.Row():
-                 with gr.Column():
-                     gr.HTML(render_tab_header("Leaderboard", ICON_MEDAL, LEADERBOARD_COLOR))
-                     leaderboard = init_leaderboard(LEADERBOARD_DF)
-
-         with gr.TabItem("Visualizations", elem_id="llm-benchmark-tab-viz", id=1):
-             with gr.Row():
-                 gr.HTML(render_tab_header("Visualizations", ICON_PLOT, VISUALIZATIONS_COLOR))
-             gr.Markdown("## Performance vs Cost Analysis", elem_classes="markdown-text")
-             with gr.Row():
-                 with gr.Column():
-                     plot_steps = gr.Plot(value=create_anls_vs_steps_plot(LEADERBOARD_DF))
-                 with gr.Column():
-                     plot_cost = gr.Plot(value=create_anls_vs_cost_plot(LEADERBOARD_DF))
-             gr.Markdown(
-                 """
              **Understanding the plots:**
              - Each point represents a model submission
              - **Orange points**: API-based models
              - **Blue points**: Open-weight models
              - Hover over points to see model details
-             - Upper-left quadrant = better performance with lower cost (optimal)
-             """,
-                 elem_classes="markdown-text",
-             )
-
-         with gr.TabItem("About", elem_id="llm-benchmark-tab-about", id=2):
-             with gr.Row():
-                 gr.HTML(render_tab_header("About", ICON_DOC, ABOUT_COLOR))
-             gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-
-         with gr.TabItem("Submit Results", elem_id="llm-benchmark-tab-submit", id=3):
-             with gr.Row():
-                 gr.HTML(render_tab_header("Submit Results", ICON_WRITE, SUBMIT_COLOR))
-             with gr.Column():
-                 with gr.Column():
-                     with gr.Accordion(
-                         f"✅ Finished Evaluations ({len(finished_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             finished_eval_table = gr.components.Dataframe(
-                                 value=finished_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-                     with gr.Accordion(
-                         f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             running_eval_table = gr.components.Dataframe(
-                                 value=running_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-
-                     with gr.Accordion(
-                         f"⏳ Pending Evaluation Queue ({len(pending_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             pending_eval_table = gr.components.Dataframe(
-                                 value=pending_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-             with gr.Row():
-                 gr.Markdown("# ✉️✨ Submit your results here!", elem_classes="markdown-text")
-
-             with gr.Row():
-                 with gr.Column():
-                     model_name_textbox = gr.Textbox(
-                         label="Model Name", placeholder="e.g., GPT-4-Turbo-Agent, Claude-3-Opus-Agent"
-                     )
-                     organization_textbox = gr.Textbox(
-                         label="Organization", placeholder="e.g., OpenAI, Anthropic, Meta, or your organization name"
-                     )
-                     model_type = gr.Dropdown(
-                         choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
-                         label="Model Type",
-                         multiselect=False,
-                         value=None,
-                         interactive=True,
-                     )
-                     link_textbox = gr.Textbox(
-                         label="Link (Optional)",
-                         placeholder="e.g., https://arxiv.org/abs/... or https://github.com/...",
-                         info="Link to paper, code repository, or model card (optional)"
-                     )
-
-                 with gr.Column():
-                     predictions_file = gr.File(label="Predictions JSONL File", file_types=[".jsonl"], type="filepath")
-                     gr.Markdown(
-                         """
-                         **Expected JSONL format (one prediction per line):**
-                         ```json
-                         {"question": "What is Dr. McElhaney's position?", "answer": ["Senior Scientist"], "citations": [{"file": "1307326.pdf", "page": 1}], "iterations": 1, "id": "q_4"}
-                         {"question": "Who is the CEO?", "answer": ["John Smith"], "citations": [{"file": "report.pdf", "page": 3}], "iterations": 2, "id": "q_5"}
-                         ```
-                         **Required fields per line:**
-                         - `question`: The question text
-                         - `answer`: List of answer strings
-                         - `citations`: List of dicts with "file" and "page"
-                         - `iterations`: Number of agent iterations
-                         - `id`: Unique question identifier
-                         """
-                     )
-
-             submit_button = gr.Button("Submit Evaluation", variant="primary")
-             submission_result = gr.Markdown()
-             submit_button.click(
-                 add_new_eval,
-                 [
-                     model_name_textbox,
-                     organization_textbox,
-                     model_type,
-                     predictions_file,
-                     link_textbox,
-                 ],
-                 submission_result,
-             )
-
- scheduler = BackgroundScheduler()
- scheduler.add_job(restart_space, "interval", seconds=1800)
- scheduler.start()
-
-
- demo.queue(default_concurrency_limit=40).launch(allowed_paths=[ASSETS_PATH])
 
  """
+ Agentic Document VQA Leaderboard - Streamlit Version
+
+ Benchmark for evaluating AI systems on document collection question answering.
+ Based on the paper: "Strategic Navigation or Stochastic Search?
+ How Agents and Humans Handle Large Document Collections"
 
  Color palette: Snowflake colors
  - SNOWFLAKE BLUE: #29B5E8
  - PURPLE MOON: #7254A3
  """
 
+ import base64
+ import json
  import os
+ import sys
+ from datetime import datetime, timezone
+ from pathlib import Path
 
  import pandas as pd
  import plotly.graph_objects as go
+ import streamlit as st
+ from huggingface_hub import snapshot_download, HfApi
+
+ # Add eval module to path
+ sys.path.insert(0, str(Path(__file__).parent / "eval"))
+ try:
+     from metrics import anls_star, citation_f1, kuiper_statistic
+     from datasets import load_dataset
+     EVAL_AVAILABLE = True
+ except ImportError:
+     EVAL_AVAILABLE = False
+
+ # Page configuration
+ st.set_page_config(
+     page_title="Agentic Document VQA",
+     page_icon="📄",
+     layout="wide",
+     initial_sidebar_state="collapsed",
+ )
+
+ # HuggingFace Hub configuration
+ TOKEN = os.environ.get("HF_TOKEN")
+ QUEUE_REPO = "agentic-document-ai/backend-requests"
+ RESULTS_REPO = "agentic-document-ai/backend-results"
+ CACHE_PATH = os.getenv("HF_HOME", ".")
+
+
+ def get_hf_user() -> dict | None:
+     """Get the logged-in HuggingFace user info from OAuth.
+
+     Returns dict with 'username', 'name', 'picture' if logged in, None otherwise.
+     Works on HuggingFace Spaces with hf_oauth: true in README.md.
+     """
+     # Check if running on HF Spaces with OAuth enabled
+     if hasattr(st, 'context') and hasattr(st.context, 'headers'):
+         headers = st.context.headers
+         # HF Spaces passes user info in headers when OAuth is enabled
+         hf_user = headers.get("HF-User")
+         if hf_user:
+             return {
+                 'username': hf_user,
+                 'name': headers.get("HF-User-Name", hf_user),
+                 'picture': headers.get("HF-User-Picture", ""),
+             }
+
+     # Check for st.user (Streamlit 1.37+)
+     if hasattr(st, 'user') and st.user.get('email'):
+         return {
+             'username': st.user.get('email', '').split('@')[0],
+             'name': st.user.get('name', ''),
+             'picture': st.user.get('picture', ''),
+         }
+
+     return None
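A hypothetical call site, to make the return contract concrete (editorial sketch, not part of the commit; the submit tab's actual usage appears further down in app.py):

```python
user = get_hf_user()  # None outside Spaces, or when the visitor is not logged in
if user:
    st.caption(f"Signed in as @{user['username']}")
else:
    st.info("Log in with your HuggingFace account to submit results.")
```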
+
+ # Colors
+ SNOWFLAKE_BLUE = "#29B5E8"
+ MID_BLUE = "#11567F"
+ VALENCIA_ORANGE = "#FF9F36"
+ STAR_BLUE = "#75CDD7"
+ FIRST_LIGHT = "#D45B90"
+ PURPLE_MOON = "#7254A3"
+ MEDIUM_GRAY = "#5B5B5B"
+
+ # Available tags for filtering - can be extended
+ AVAILABLE_TAGS = [
+     "Agentic",
+     "Conventional RAG",
+     "BM25 Search Tool",
+     "Semantic Search Tool",
+     "Vision and Language",
+     "Text-only",
+ ]
+
+ # Tag colors for visual distinction (cycling through Snowflake secondary colors)
+ TAG_COLORS = {
+     "Agentic": MID_BLUE,
+     "Conventional RAG": STAR_BLUE,
+     "BM25 Search Tool": VALENCIA_ORANGE,
+     "Semantic Search Tool": FIRST_LIGHT,
+     "Vision and Language": PURPLE_MOON,
+     "Text-only": SNOWFLAKE_BLUE,
+ }
+
+ # Custom CSS following Snowflake Brand Color Guide
+ # Primary: MID-BLUE (#11567F) for accents/sections, SNOWFLAKE BLUE (#29B5E8) sparingly
+ # Use white text on dark backgrounds per accessibility guidelines
+ st.markdown(f"""
+ <style>
+ /* Dark theme base - using near-black for good contrast */
+ .stApp {{
+     background-color: #0e1117;
+ }}
+
+ /* ===== TAB STYLING ===== */
+ .stTabs [data-baseweb="tab-list"] {{
+     gap: 8px;
+     background-color: transparent;
+     border-bottom: 2px solid {MID_BLUE};
+     padding-bottom: 0;
+ }}
+
+ .stTabs [data-baseweb="tab"] {{
+     height: 50px;
+     padding: 0 28px;
+     background-color: transparent !important;
+     border-radius: 0;
+     font-weight: 500;
+     font-size: 18px;
+     color: {MEDIUM_GRAY} !important;
+     border-bottom: 3px solid transparent !important;
+     margin-bottom: -2px;
+ }}
+
+ .stTabs [aria-selected="true"] {{
+     background-color: transparent !important;
+     color: {SNOWFLAKE_BLUE} !important;
+     border-bottom: 3px solid {SNOWFLAKE_BLUE} !important;
+ }}
+
+ .stTabs [data-baseweb="tab"]:hover {{
+     color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* Tab indicator overrides */
+ .stTabs [data-baseweb="tab-highlight"],
+ div[data-baseweb="tab-highlight"] {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ .stTabs [role="tablist"] > div:last-child {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* ===== CHECKBOX STYLING - Clean, no background highlight ===== */
+ .stCheckbox {{
+     background: transparent !important;
+ }}
+
+ .stCheckbox label {{
+     background: transparent !important;
+     color: white !important;
+ }}
+
+ .stCheckbox label span {{
+     background: transparent !important;
+     color: white !important;
+ }}
+
+ /* Remove any highlight/selection background from checkbox labels */
+ .stCheckbox > label,
+ .stCheckbox label > span,
+ .stCheckbox label > div {{
+     background-color: transparent !important;
+     background: none !important;
+ }}
+
+ /* The checkbox box itself */
+ [data-baseweb="checkbox"] > div:first-child {{
+     border-color: {MEDIUM_GRAY} !important;
+     background-color: transparent !important;
+ }}
+
+ [data-baseweb="checkbox"][aria-checked="true"] > div:first-child {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+     border-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* Checkmark icon */
+ [data-baseweb="checkbox"] svg {{
+     color: white !important;
+ }}
+
+ /* ===== BUTTON STYLING - MID-BLUE primary ===== */
+ .stButton > button {{
+     background-color: {MID_BLUE} !important;
+     color: white !important;
+     border: none !important;
+     border-radius: 6px;
+     font-weight: 500;
+     padding: 0.5rem 1.5rem;
+     transition: all 0.2s ease;
+ }}
+
+ .stButton > button:hover {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ .stButton > button:active, .stButton > button:focus {{
+     background-color: {MID_BLUE} !important;
+     box-shadow: 0 0 0 2px {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* Download button */
+ .stDownloadButton > button {{
+     background-color: {MID_BLUE} !important;
+     color: white !important;
+     border: none !important;
+ }}
+
+ .stDownloadButton > button:hover {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* ===== FORM ELEMENTS ===== */
+ /* Text inputs */
+ .stTextInput > div > div > input {{
+     border-color: {MEDIUM_GRAY} !important;
+     background-color: #1a1a2e !important;
+ }}
+
+ .stTextInput > div > div > input:focus {{
+     border-color: {SNOWFLAKE_BLUE} !important;
+     box-shadow: 0 0 0 1px {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* Select boxes */
+ .stSelectbox [data-baseweb="select"] > div {{
+     border-color: {MEDIUM_GRAY} !important;
+     background-color: #1a1a2e !important;
+ }}
+
+ /* Multiselect chips */
+ .stMultiSelect [data-baseweb="tag"] {{
+     background-color: {MID_BLUE} !important;
+     color: white !important;
+ }}
+
+ /* File uploader */
+ [data-testid="stFileUploader"] {{
+     border: 2px dashed {MEDIUM_GRAY} !important;
+     border-radius: 12px;
+     padding: 2rem 1.5rem !important;
+     background-color: transparent !important;
+     transition: all 0.2s ease;
+ }}
+
+ [data-testid="stFileUploader"]:hover {{
+     border-color: {SNOWFLAKE_BLUE} !important;
+     background-color: rgba(17, 86, 127, 0.08) !important;
+ }}
+
+ [data-testid="stFileUploaderDropzone"] {{
+     background-color: transparent !important;
+ }}
+
+ [data-testid="stFileUploader"] section {{
+     padding: 0 !important;
+ }}
+
+ [data-testid="stFileUploader"] section > div {{
+     padding: 0.5rem 0 !important;
+ }}
+
+ /* ===== LINKS - Snowflake Blue for visibility ===== */
+ a {{
+     color: {SNOWFLAKE_BLUE} !important;
+     text-decoration: none !important;
+ }}
+
+ a:hover {{
+     color: {STAR_BLUE} !important;
+     text-decoration: underline !important;
+ }}
+
+ /* ===== SECTION HEADERS ===== */
+ h3 {{
+     color: white;
+ }}
+
+ /* ===== ALERTS/MESSAGES ===== */
+ .stAlert, [data-testid="stAlert"] {{
+     border-radius: 8px !important;
+     border: none !important;
+ }}
+
+ /* Info messages - Snowflake Blue */
+ .stInfo, [data-testid="stAlert"]:has([data-testid="stMarkdownContainer"]) {{
+     background-color: rgba(41, 181, 232, 0.15) !important;
+     border-left: 4px solid {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* Warning messages - Valencia Orange */
+ .stWarning, [role="alert"]:has(svg[data-testid="stIconWarning"]) {{
+     background-color: rgba(255, 159, 54, 0.15) !important;
+     border-left: 4px solid {VALENCIA_ORANGE} !important;
+ }}
+
+ /* Error messages - First Light (pink/red) */
+ .stError, [role="alert"]:has(svg[data-testid="stIconError"]) {{
+     background-color: rgba(212, 91, 144, 0.15) !important;
+     border-left: 4px solid {FIRST_LIGHT} !important;
+ }}
+
+ /* Success messages - Star Blue */
+ .stSuccess, [role="alert"]:has(svg[data-testid="stIconSuccess"]) {{
+     background-color: rgba(117, 205, 215, 0.15) !important;
+     border-left: 4px solid {STAR_BLUE} !important;
+ }}
+
+ /* Alert text and icon colors */
+ .stAlert p, [data-testid="stAlert"] p {{
+     color: rgba(255, 255, 255, 0.9) !important;
+ }}
+
+ /* Override default alert backgrounds */
+ [data-testid="stNotification"] {{
+     background-color: transparent !important;
+ }}
+
+ div[data-baseweb="notification"] {{
+     background-color: rgba(41, 181, 232, 0.15) !important;
+     border-left: 4px solid {SNOWFLAKE_BLUE} !important;
+     border-radius: 8px !important;
+ }}
+
+ /* ===== SPINNER ===== */
+ .stSpinner > div {{
+     border-top-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* ===== EXPANDER ===== */
+ .streamlit-expanderHeader {{
+     border-left: 3px solid {MID_BLUE};
+     background-color: rgba(17, 86, 127, 0.1) !important;
+ }}
+
+ /* ===== CODE BLOCKS ===== */
+ code {{
+     background-color: rgba(17, 86, 127, 0.2);
+     padding: 0.2em 0.4em;
+     border-radius: 3px;
+     color: {STAR_BLUE};
+ }}
+
+ /* ===== SCROLLBAR ===== */
+ ::-webkit-scrollbar {{
+     width: 8px;
+     height: 8px;
+ }}
+
+ ::-webkit-scrollbar-track {{
+     background: #1a1a2e;
+ }}
+
+ ::-webkit-scrollbar-thumb {{
+     background: {MID_BLUE};
+     border-radius: 4px;
+ }}
+
+ ::-webkit-scrollbar-thumb:hover {{
+     background: {SNOWFLAKE_BLUE};
+ }}
+
+ /* ===== ROOT VARIABLES ===== */
+ :root {{
+     --primary-color: {SNOWFLAKE_BLUE} !important;
+ }}
+
+ /* ===== MULTISELECT STYLING ===== */
+ /* Tag filter multiselect - MID_BLUE (gradient start) */
+ div[data-testid="stHorizontalBlock"] > div:first-child .stMultiSelect [data-baseweb="tag"] {{
+     background-color: {MID_BLUE} !important;
+     color: white !important;
+ }}
+
+ /* Column selector multiselect - SNOWFLAKE_BLUE (gradient end) */
+ div[data-testid="stHorizontalBlock"] > div:last-child .stMultiSelect [data-baseweb="tag"] {{
+     background-color: {SNOWFLAKE_BLUE} !important;
+     color: white !important;
+ }}
+
+ /* Default multiselect styling */
+ .stMultiSelect [data-baseweb="tag"] {{
+     border-radius: 12px !important;
+     padding: 2px 10px !important;
+     margin: 2px !important;
+     font-weight: 500 !important;
+ }}
+
+ .stMultiSelect [data-baseweb="tag"] span {{
+     color: inherit !important;
+ }}
+
+ /* Remove button in tag */
+ .stMultiSelect [data-baseweb="tag"] svg {{
+     color: white !important;
+     opacity: 0.8;
+ }}
+
+ .stMultiSelect [data-baseweb="tag"] svg:hover {{
+     opacity: 1;
+ }}
+
+ /* Placeholder text */
+ .stMultiSelect input::placeholder {{
+     color: {MEDIUM_GRAY} !important;
+ }}
+ </style>
+ """, unsafe_allow_html=True)
+
+
+ # Data paths
+ EVAL_RESULTS_PATH = Path(CACHE_PATH) / "eval-results"
+ EVAL_REQUESTS_PATH = Path(CACHE_PATH) / "eval-queue"
+
+
+ @st.cache_data(ttl=300)  # Cache for 5 minutes
+ def download_data():
+     """Download data from HuggingFace Hub."""
      try:
+         snapshot_download(
+             repo_id=QUEUE_REPO,
+             local_dir=str(EVAL_REQUESTS_PATH),
+             repo_type="dataset",
+             tqdm_class=None,
+             etag_timeout=30,
+             token=TOKEN,
+         )
+     except Exception as e:
+         st.warning(f"Could not download queue data: {e}")
+
+     try:
+         snapshot_download(
+             repo_id=RESULTS_REPO,
+             local_dir=str(EVAL_RESULTS_PATH),
+             repo_type="dataset",
+             tqdm_class=None,
+             etag_timeout=30,
+             token=TOKEN,
+         )
+     except Exception as e:
+         st.warning(f"Could not download results data: {e}")
+
+
+ class ModelType:
+     API = "api"
+     OPEN_WEIGHT = "open-weight"
+
+     @staticmethod
+     def get_color(model_type: str) -> str:
+         if model_type == ModelType.API:
+             return VALENCIA_ORANGE
+         elif model_type == ModelType.OPEN_WEIGHT:
+             return STAR_BLUE
+         return MEDIUM_GRAY
+
+
+ # Load SVG icons from local assets folder
+ ASSETS_PATH = Path(__file__).resolve().parent / "assets"
+
+
+ def load_svg_icon(icon_name: str, fill_color: str = None) -> str:
+     """Load SVG icon and return as data URI with optional color replacement.
+
+     This matches the Gradio app's load_svg_data_uri function.
+     """
+     svg_file = ASSETS_PATH / f"{icon_name}.svg"
+     if not svg_file.exists():
+         return ""
+
+     try:
+         with open(svg_file, "r", encoding="utf-8") as f:
+             svg_content = f.read()
+
+         # Replace black fill with specified color for visibility on dark background
+         if fill_color:
+             svg_content = svg_content.replace('fill="black"', f'fill="{fill_color}"')
+             svg_content = svg_content.replace('stroke="black"', f'stroke="{fill_color}"')
+
+         b64 = base64.b64encode(svg_content.encode()).decode()
+         return f"data:image/svg+xml;base64,{b64}"
      except Exception:
          return ""
 
 
+ # Preload icons with Snowflake colors (matching Gradio app)
+ ICON_CLOUD = load_svg_icon("snow_cloud2", VALENCIA_ORANGE)  # Orange cloud for API (same as Gradio)
+ ICON_CODE = load_svg_icon("snow_code", STAR_BLUE)  # Blue code for open-weight (same as Gradio)
+
+ # Tab header icons - use white to match header text color
+ HEADER_ICON_COLOR = "#FFFFFF"
+ ICON_MEDAL = load_svg_icon("snow_medal", HEADER_ICON_COLOR)  # Leaderboard header icon
+ ICON_EYE = load_svg_icon("snow_eye", HEADER_ICON_COLOR)  # Visualizations header icon
+ ICON_DOCS = load_svg_icon("snow_docs", HEADER_ICON_COLOR)  # About header icon
+ ICON_WRITE = load_svg_icon("snow_write", HEADER_ICON_COLOR)  # Submit header icon
+
+
+ def generate_placeholder_description(model_name: str, tags: list, model_type: str) -> str:
+     """Generate a placeholder description based on model metadata."""
+     parts = []
+
+     # Describe model type
+     if model_type == "api":
+         parts.append("API-based")
+     elif model_type == "open-weight":
+         parts.append("Open-weight")
+
+     # Describe approach based on tags
+     if tags:
+         if "Agentic" in tags:
+             parts.append("agentic system")
+         elif "Conventional RAG" in tags:
+             parts.append("RAG pipeline")
+         else:
+             parts.append("model")
+
+         # Add tool/capability info
+         capabilities = []
+         if "BM25 Search Tool" in tags:
+             capabilities.append("BM25 search")
+         if "Semantic Search Tool" in tags:
+             capabilities.append("semantic search")
+         if "Vision and Language" in tags:
+             capabilities.append("vision")
+         if "Text-only" in tags:
+             capabilities.append("text-only")
+
+         if capabilities:
+             parts.append(f"with {', '.join(capabilities)}")
+     else:
+         parts.append("model")
+
+     return " ".join(parts) if parts else ""
+
+
+ def get_model_type_html(model_type: str) -> str:
+     """Get HTML for model type with icon and colored text."""
+     color = ModelType.get_color(model_type)
+     icon_uri = ICON_CLOUD if model_type == ModelType.API else ICON_CODE
+
+     # Fallback text if icon doesn't load (HTML-escaped so "</>" renders literally)
+     fallback_emoji = "☁️" if model_type == ModelType.API else "&lt;/&gt;"
+
+     if icon_uri:
+         return f'''<div style="display: inline-flex; align-items: center; white-space: nowrap;">
+             <img src="{icon_uri}" style="width: 20px; height: 20px; vertical-align: middle;" />
+             <span style="color: {color}; font-weight: 500; margin-left: 6px;">{model_type}</span>
+         </div>'''
+     # Fallback without icon
+     return f'<span style="color: {color}; font-weight: 500;">{fallback_emoji} {model_type}</span>'
 
 
+ @st.cache_data(ttl=300)  # Cache for 5 minutes
+ def load_eval_results() -> pd.DataFrame:
+     """Load evaluation results from JSON files."""
+     results = []
+
+     results_path = Path(EVAL_RESULTS_PATH)
+     if not results_path.exists():
          return pd.DataFrame()
+
+     for org_dir in results_path.iterdir():
+         if org_dir.is_dir() and not org_dir.name.startswith('.'):
+             for result_file in org_dir.glob("*_results_*.json"):
+                 try:
+                     with open(result_file) as f:
+                         data = json.load(f)
+
+                     # Extract data
+                     model_name = data.get("model_name", "Unknown")
+                     metadata = data.get("metadata", {})
+                     result_scores = data.get("results", {})
+
+                     # Get tags - default to ["Agentic"] if not specified
+                     tags = data.get("tags", metadata.get("tags", ["Agentic"]))
+                     if isinstance(tags, str):
+                         tags = [tags]  # Convert single tag to list
+
+                     # Get per-domain scores if available
+                     by_domain = result_scores.get("by_domain", {})
+
+                     results.append({
+                         "Model": model_name,
+                         "Organization": data.get("organization", data.get("submitted_by", org_dir.name)),
+                         "Model Type": metadata.get("model_type", "unknown"),
+                         "Tags": tags,  # Store as list
+                         # Answer correctness metrics (ANLS*)
+                         "Accuracy (ANLS*)": result_scores.get("overall", {}).get("anls", 0.0),
+                         "Acc. Single-Hop": result_scores.get("single_evidence", {}).get("anls", 0.0),
+                         "Acc. Cross-Page": result_scores.get("multi_evidence_same_doc", {}).get("anls", 0.0),
+                         "Acc. Cross-Doc": result_scores.get("multi_evidence_multi_doc", {}).get("anls", 0.0),
+                         # Attribution metrics
+                         "Attribution (Page F1)": result_scores.get("overall", {}).get("page_f1", 0.0),
+                         "Attribution (Doc F1)": result_scores.get("overall", {}).get("doc_f1", 0.0),
+                         # Calibration metric
+                         "Effort (Kuiper)": result_scores.get("overall", {}).get("kuiper", 0.0),
+                         "Submission Date": data.get("submission_date", ""),
+                         "Link": data.get("link", ""),
+                         "Description": data.get("description", metadata.get("description", "")) or
+                             generate_placeholder_description(model_name, tags, metadata.get("model_type", "")),
+                         # Per-domain scores (stored as JSON string for DataFrame compatibility)
+                         "_by_domain": json.dumps(by_domain) if by_domain else "{}",
+                     })
+                 except Exception as e:
+                     st.warning(f"Error loading {result_file}: {e}")
+
+     if not results:
+         return pd.DataFrame()
+
+     df = pd.DataFrame(results)
+     df = df.sort_values("Accuracy (ANLS*)", ascending=False).reset_index(drop=True)
+     return df
 
 
+ def get_all_tags_from_df(df: pd.DataFrame) -> list:
+     """Extract all unique tags from the DataFrame."""
+     all_tags = set()
+     if "Tags" in df.columns:
+         for tags in df["Tags"]:
+             if isinstance(tags, list):
+                 all_tags.update(tags)
+     return sorted(list(all_tags))
 
+ def filter_df_by_tags(df: pd.DataFrame, selected_tags: list) -> pd.DataFrame:
+     """Filter DataFrame to show only rows that have at least one of the selected tags."""
+     if not selected_tags:
+         return df
+
+     def has_any_tag(row_tags):
+         if not isinstance(row_tags, list):
+             return False
+         return any(tag in row_tags for tag in selected_tags)
+
+     return df[df["Tags"].apply(has_any_tag)]
+
+
+ def render_tags_html(tags: list) -> str:
+     """Render tags as styled badges."""
+     if not tags or not isinstance(tags, list):
+         return ""
+
+     badges = []
+     for tag in tags:
+         color = TAG_COLORS.get(tag, MID_BLUE)
+         # Use lighter background with colored border for better readability
+         badge = f'''<span style="
+             display: inline-block;
+             padding: 2px 8px;
+             margin: 2px 3px;
+             border-radius: 12px;
+             font-size: 11px;
+             font-weight: 500;
+             background-color: {color}20;
+             color: {color};
+             border: 1px solid {color};
+             white-space: nowrap;
+         ">{tag}</span>'''
+         badges.append(badge)
+
+     return "".join(badges)
+
+
+ def format_model_name(row) -> str:
+     """Format model name with optional link."""
+     model_name = row["Model"]
+     link = row.get("Link", "")
+     if link and link.strip():
+         return f'<a href="{link}" target="_blank">{model_name}</a>'
+     return model_name
+
+
+ def format_model_type(model_type: str) -> str:
+     """Format model type with colored text (ModelType defines no get_icon; icon rendering is handled by get_model_type_html)."""
+     color = ModelType.get_color(model_type)
+     return f'<span style="color: {color};">{model_type}</span>'
+
+
+ # Metric tooltips for table headers
+ METRIC_TOOLTIPS = {
+     "Accuracy (ANLS*)": "Overall answer accuracy using ANLS* (Average Normalized Levenshtein Similarity). Higher is better.",
+     "Acc. Single-Hop": "Accuracy on questions requiring evidence from a single page.",
+     "Acc. Cross-Page": "Accuracy on multi-hop questions requiring evidence from multiple pages within the same document.",
+     "Acc. Cross-Doc": "Accuracy on multi-hop questions requiring evidence from multiple documents.",
+     "Attribution (Page F1)": "F1 score for page-level attribution. Measures overlap between cited pages and gold evidence. Higher is better.",
+     "Attribution (Doc F1)": "F1 score for document-level attribution. Measures whether the correct documents were identified. Higher is better.",
+     "Effort (Kuiper)": "Effort calibration metric (Kuiper statistic). Measures if effort correlates with problem difficulty. Lower is better.",
+     "Model Type": "API = cloud-based model, open-weight = downloadable weights",
+     "Tags": "Approach characteristics: Agentic, RAG, search tools, vision capabilities, etc.",
+ }
+
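The "Effort (Kuiper)" tooltip compresses a lot. For orientation, the classical two-sample Kuiper statistic is the sum of the largest positive and largest negative gaps between two empirical CDFs; how `kuiper_statistic` in `eval/metrics.py` applies it to effort calibration is defined there, not in this editorial sketch.

```python
import numpy as np


def kuiper_two_sample(x: np.ndarray, y: np.ndarray) -> float:
    """Classical two-sample Kuiper statistic: V = max(F_x - F_y) + max(F_y - F_x),
    evaluated over the pooled sample points."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return float(np.max(cdf_x - cdf_y) + np.max(cdf_y - cdf_x))
```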
+ def render_leaderboard_table(df: pd.DataFrame, columns: list):
+     """Render an HTML table matching the Gradio leaderboard style."""
+     if df.empty:
+         st.warning("No data available")
+         return
+
+     # Build table HTML with tooltips
+     header_cells = []
+     for col in columns:
+         # Add line break before brackets for cleaner display
+         display_col = col.replace(" (", "<br>(") if " (" in col else col
+         tooltip = METRIC_TOOLTIPS.get(col, "")
+         if tooltip:
+             header_cells.append(f'<th title="{tooltip}" style="cursor: help;">{display_col}</th>')
+         else:
+             header_cells.append(f'<th>{display_col}</th>')
+     header_cells = "".join(header_cells)
+
+     rows_html = ""
+     for _, row in df.iterrows():
+         cells = []
+         for col in columns:
+             value = row.get(col, "")
+
+             if col == "Model":
+                 # Model name with optional link and description
+                 link = row.get("Link", "")
+                 description = row.get("Description", "")
+
+                 if link and str(link).strip():
+                     name_html = f'<a href="{link}" target="_blank" style="color: #29B5E8; font-weight: 500;">{value}</a>'
+                 else:
+                     name_html = f'<span style="font-weight: 500;">{value}</span>'
+
+                 if description and str(description).strip():
+                     cell_html = f'{name_html}<br><span style="font-size: 12px; color: {MEDIUM_GRAY}; font-weight: normal;">{description}</span>'
+                 else:
+                     cell_html = name_html
+             elif col == "Model Type":
+                 # Model type with icon
+                 cell_html = get_model_type_html(str(value))
+             elif col == "Tags":
+                 # Render tags as badges
+                 cell_html = render_tags_html(value)
+             elif col == "Accuracy (ANLS*)" or col.startswith("Acc."):
+                 # Format accuracy scores (ANLS*, scale 0-100)
+                 try:
+                     cell_html = f"{float(value):.1f}" if value else "0"
+                 except (ValueError, TypeError):
+                     cell_html = str(value)
+             elif col.startswith("Attribution"):
+                 # Format F1 scores (scale 0-100)
+                 try:
+                     cell_html = f"{float(value):.1f}" if value else "0"
+                 except (ValueError, TypeError):
+                     cell_html = str(value)
+             elif col == "Effort (Kuiper)":
+                 # Format Kuiper statistic (lower is better for calibration)
+                 try:
+                     cell_html = f"{float(value):.3f}" if value else "0"
+                 except (ValueError, TypeError):
+                     cell_html = str(value)
+             else:
+                 cell_html = str(value) if value else ""
+
+             cells.append(f'<td>{cell_html}</td>')
+
+         rows_html += f'<tr>{"".join(cells)}</tr>'
+
+     table_html = f'''
+     <style>
+     .leaderboard-wrapper {{
+         border: 2px solid {MID_BLUE};
+         border-radius: 8px;
+         overflow: hidden;
+         font-size: 0;
+     }}
+     .leaderboard-table {{
+         width: 100%;
+         border-collapse: collapse;
+         border-spacing: 0;
+         font-size: 14px;
+         background-color: #0e1117;
+         margin: 0;
+         padding: 0;
+         border: none;
+     }}
+     .leaderboard-table thead tr {{
+         background: linear-gradient(135deg, {MID_BLUE} 0%, {SNOWFLAKE_BLUE} 100%);
+     }}
+     .leaderboard-table thead th {{
+         background: transparent;
+         color: white;
+         text-align: center;
+         padding: 1.2em 0.75em;
+         font-weight: 500;
+         border: none;
+         text-transform: none;
+     }}
+     .leaderboard-table thead th:not(:last-child) {{
+         border-right: 1px solid rgba(255,255,255,0.15);
+     }}
+     .leaderboard-table tbody td {{
+         padding: 0.75em;
+         border-bottom: 1px solid {MEDIUM_GRAY}40;
+         vertical-align: middle;
+         color: white;
+     }}
+     .leaderboard-table tbody tr:last-child td {{
+         border-bottom: none;
+     }}
+     .leaderboard-table tbody tr:nth-child(even) {{
+         background-color: rgba(17, 86, 127, 0.12);
+     }}
+     .leaderboard-table tbody tr:hover {{
+         background-color: rgba(17, 86, 127, 0.25);
+     }}
+     .leaderboard-table td:first-child {{
+         min-width: 280px;
+         max-width: 350px;
+         word-wrap: break-word;
+     }}
+     /* Links in table use Snowflake Blue */
+     .leaderboard-table a {{
+         color: {SNOWFLAKE_BLUE};
+         text-decoration: none;
+     }}
+     .leaderboard-table a:hover {{
+         color: {STAR_BLUE};
+         text-decoration: underline;
+     }}
+     </style>
+     <div class="leaderboard-wrapper">
+         <table class="leaderboard-table">
+             <thead>
+                 <tr>{header_cells}</tr>
+             </thead>
+             <tbody>
+                 {rows_html}
+             </tbody>
+         </table>
+     </div>
+     '''
+
+     st.markdown(table_html, unsafe_allow_html=True)
 
 
+ def create_accuracy_vs_attribution_plot(df: pd.DataFrame) -> go.Figure:
+     """Create scatter plot of Accuracy vs Attribution."""
      if df.empty:
          fig = go.Figure()
          fig.add_annotation(
+             text="No data available",
+             xref="paper", yref="paper",
+             x=0.5, y=0.5, showarrow=False,
+             font=dict(size=20, color="white")
          )
          return fig
+
      color_map = {
+         "api": VALENCIA_ORANGE,  # Orange for API
+         "open-weight": STAR_BLUE,  # Star Blue for open-weight
      }
+
      fig = go.Figure()
+
+     for model_type in df["Model Type"].unique():
+         df_type = df[df["Model Type"] == model_type]
+         fig.add_trace(go.Scatter(
+             x=df_type["Attribution (Page F1)"],
+             y=df_type["Accuracy (ANLS*)"],
+             mode="markers+text",
+             name=model_type,
+             text=df_type["Model"],
+             textposition="top center",
+             textfont=dict(size=9, color="#ccc"),
+             marker=dict(
+                 size=14,
+                 color=color_map.get(model_type, MEDIUM_GRAY),
+                 line=dict(width=2, color="white")
+             ),
+             hovertemplate="<b>%{text}</b><br>Attribution: %{x:.1f}<br>Accuracy: %{y:.1f}<extra></extra>",
+         ))
+
      fig.update_layout(
+         title=dict(text="Accuracy vs Attribution", font=dict(color="white")),
+         xaxis_title="Attribution (Page F1)",
+         yaxis_title="Accuracy (ANLS*)",
          hovermode="closest",
+         template="plotly_dark",
+         height=500,
          showlegend=True,
+         legend=dict(title="Model Type", yanchor="top", y=0.99, xanchor="right", x=0.99, font=dict(color="#ccc")),
+         paper_bgcolor="rgba(0,0,0,0)",
+         plot_bgcolor="rgba(14,17,23,0.8)",
+         xaxis=dict(gridcolor=MID_BLUE, zerolinecolor=MID_BLUE),
+         yaxis=dict(gridcolor=MID_BLUE, zerolinecolor=MID_BLUE),
      )
+
      return fig
 
 
+ def create_accuracy_vs_effort_plot(df: pd.DataFrame) -> go.Figure:
+     """Create scatter plot of Accuracy vs Effort (Kuiper)."""
      if df.empty:
          fig = go.Figure()
          fig.add_annotation(
+             text="No data available",
+             xref="paper", yref="paper",
+             x=0.5, y=0.5, showarrow=False,
+             font=dict(size=20, color="white")
          )
          return fig
+
      color_map = {
+         "api": VALENCIA_ORANGE,  # Orange for API
+         "open-weight": STAR_BLUE,  # Star Blue for open-weight
      }
+
      fig = go.Figure()
+
+     for model_type in df["Model Type"].unique():
+         df_type = df[df["Model Type"] == model_type]
+         fig.add_trace(go.Scatter(
+             x=df_type["Effort (Kuiper)"],
+             y=df_type["Accuracy (ANLS*)"],
+             mode="markers+text",
+             name=model_type,
+             text=df_type["Model"],
+             textposition="top center",
+             textfont=dict(size=9, color="#ccc"),
+             marker=dict(
+                 size=14,
+                 color=color_map.get(model_type, MEDIUM_GRAY),
+                 line=dict(width=2, color="white")
+             ),
+             hovertemplate="<b>%{text}</b><br>Effort: %{x:.3f}<br>Accuracy: %{y:.1f}<extra></extra>",
+         ))
+
      fig.update_layout(
+         title=dict(text="Accuracy vs Effort", font=dict(color="white")),
+         xaxis_title="Effort (Kuiper) — lower is better",
+         yaxis_title="Accuracy (ANLS*)",
          hovermode="closest",
+         template="plotly_dark",
+         height=500,
          showlegend=True,
+         legend=dict(title="Model Type", yanchor="top", y=0.99, xanchor="right", x=0.99, font=dict(color="#ccc")),
+         paper_bgcolor="rgba(0,0,0,0)",
+         plot_bgcolor="rgba(14,17,23,0.8)",
+         xaxis=dict(gridcolor=MID_BLUE, zerolinecolor=MID_BLUE),
+         yaxis=dict(gridcolor=MID_BLUE, zerolinecolor=MID_BLUE),
      )
+
      return fig
 
 
+ def create_domain_accuracy_chart(by_domain: dict, model_name: str, overall_accuracy: float = 0) -> go.Figure:
970
+ """Create a horizontal bar chart showing accuracy by domain."""
971
+ # Filter out "Other" category
972
+ filtered_domain = {k: v for k, v in by_domain.items() if k.lower() != 'other'}
973
+
974
+ if not filtered_domain:
975
+ fig = go.Figure()
976
+ fig.add_annotation(
977
+ text="No per-domain data available",
978
+ xref="paper", yref="paper",
979
+ x=0.5, y=0.5, showarrow=False,
980
+ font=dict(size=16, color="white")
981
+ )
982
+ fig.update_layout(
983
+ template="plotly_dark",
984
+ paper_bgcolor="rgba(0,0,0,0)",
985
+ plot_bgcolor="rgba(14,17,23,0.8)",
986
+ )
987
+ return fig
988
+
989
+ # Sort domains by accuracy (descending)
990
+ sorted_domains = sorted(filtered_domain.items(), key=lambda x: x[1].get('anls', 0), reverse=True)
991
+
992
+ domains = [d[0] for d in sorted_domains]
993
+ accuracies = [d[1].get('anls', 0) for d in sorted_domains]
994
+ counts = [d[1].get('n', 0) for d in sorted_domains]
995
+
996
+ # Color based on above/below overall accuracy
997
+ colors = [SNOWFLAKE_BLUE if acc >= overall_accuracy else VALENCIA_ORANGE for acc in accuracies]
998
+
999
+ fig = go.Figure()
1000
+
1001
+ fig.add_trace(go.Bar(
1002
+ y=domains,
1003
+ x=accuracies,
1004
+ orientation='h',
1005
+ marker=dict(
1006
+ color=colors,
1007
+ line=dict(width=1, color='white')
1008
+ ),
1009
+ text=[f"{acc:.1f}% (n={n})" for acc, n in zip(accuracies, counts)],
1010
+ textposition='auto',
1011
+ textfont=dict(color='white', size=11),
1012
+ hovertemplate="<b>%{y}</b><br>Accuracy: %{x:.1f}%<extra></extra>",
1013
+ ))
1014
+
1015
+ fig.update_layout(
1016
+ title=dict(
1017
+ text=f"Accuracy by Domain: {model_name}",
1018
+ font=dict(color="white", size=16)
1019
+ ),
1020
+ xaxis_title="Accuracy (ANLS* %)",
1021
+ yaxis_title="",
1022
+ template="plotly_dark",
1023
+ height=max(400, len(domains) * 35), # Dynamic height based on number of domains
1024
+ paper_bgcolor="rgba(0,0,0,0)",
1025
+ plot_bgcolor="rgba(14,17,23,0.8)",
1026
+ xaxis=dict(
1027
+ gridcolor=MID_BLUE,
1028
+ zerolinecolor=MID_BLUE,
1029
+ range=[0, 100]
1030
+ ),
1031
+ yaxis=dict(
1032
+ gridcolor=MID_BLUE,
1033
+ autorange="reversed" # Keep highest at top
1034
+ ),
1035
+ margin=dict(l=150, r=50, t=60, b=50),
1036
  )
1037
+
1038
+ return fig
1039
+
1040
+
1041
+ def show_model_details(model_name: str):
1042
+ """Show detailed per-domain breakdown for a model."""
1043
+ # Load model data from cached DataFrame
1044
+ df = load_eval_results()
1045
+
1046
+ if df.empty:
1047
+ st.warning("No model data available")
1048
+ return
1049
+
1050
+ model_row = df[df["Model"] == model_name]
1051
+ if model_row.empty:
1052
+ st.warning(f"Model '{model_name}' not found")
1053
+ return
1054
+
1055
+ model_data = model_row.iloc[0]
1056
+
1057
+ # Display model info
1058
+ col1, col2, col3 = st.columns(3)
1059
+ with col1:
1060
+ st.metric("Overall Accuracy", f"{model_data['Accuracy (ANLS*)']:.1f}%")
1061
+ with col2:
1062
+ st.metric("Attribution (Page F1)", f"{model_data['Attribution (Page F1)']:.1f}%")
1063
+ with col3:
1064
+ kuiper = model_data.get('Effort (Kuiper)', 0)
1065
+ st.metric("Effort (Kuiper)", f"{kuiper:.2f}" if kuiper else "N/A")
1066
+
1067
+ # Get per-domain data
1068
+ by_domain_str = model_data.get('_by_domain', '{}')
1069
+ try:
1070
+ by_domain = json.loads(by_domain_str) if isinstance(by_domain_str, str) else by_domain_str
1071
+ except (json.JSONDecodeError, TypeError):
1072
+ by_domain = {}
1073
+
1074
+ if by_domain:
1075
+ # Show per-domain chart (use overall accuracy as threshold for coloring)
1076
+ overall_accuracy = model_data.get('Accuracy (ANLS*)', 0)
1077
+ fig = create_domain_accuracy_chart(by_domain, model_name, overall_accuracy)
1078
+ st.plotly_chart(fig, width="stretch")
1079
+ else:
1080
+ st.info("Per-domain breakdown not available for this submission. Newer submissions will include this data.")
1081
 
1082
 
1083
+ def validate_jsonl_submission(file_content: str) -> tuple[bool, str, list]:
1084
+ """Validate JSONL submission format and return parsed predictions."""
1085
+ try:
1086
+ lines = file_content.strip().split("\n")
1087
+ if not lines or (len(lines) == 1 and not lines[0].strip()):
1088
+ return False, "File is empty", []
1089
+
1090
+ predictions = []
1091
+ for line_num, line in enumerate(lines, 1):
1092
+ line = line.strip()
1093
+ if not line:
1094
+ continue
1095
+
1096
+ try:
1097
+ pred = json.loads(line)
1098
+ except json.JSONDecodeError as e:
1099
+ return False, f"Line {line_num}: Invalid JSON - {str(e)}", []
1100
+
1101
+ # Required: question and answer
1102
+ if "question" not in pred:
1103
+ return False, f"Line {line_num}: Missing required field 'question'", []
1104
+ if "answer" not in pred:
1105
+ return False, f"Line {line_num}: Missing required field 'answer'", []
1106
+
1107
+ predictions.append(pred)
1108
+
1109
+ return True, "", predictions
1110
+
1111
+ except Exception as e:
1112
+ return False, f"Error reading file: {str(e)}", []
1113
 
1114
 
1115
+ @st.cache_data(ttl=3600) # Cache for 1 hour
1116
+ def load_gold_standard(dataset_name: str = "agentic-document-ai/dataset-PRIVATE", split: str = "test"):
1117
+ """Load gold standard from HuggingFace dataset.
1118
+
1119
+ Note: Uses dataset-PRIVATE for test split (contains gold answers).
1120
+ """
1121
+ if not EVAL_AVAILABLE:
1122
+ return {}, {}
1123
+
1124
+ try:
1125
+ dataset = load_dataset(dataset_name, split=split)
1126
+
1127
+ by_text = {}
1128
+ by_id = {}
1129
+
1130
+ for ex in dataset:
1131
+ question = ex['question'].strip()
1132
+ qid = ex.get('id', '')
1133
+
1134
+ # Try multiple field names for answers (different splits may use different names)
1135
+ answers = ex.get('answer_variants') or ex.get('answers') or []
1136
+ # If answers is a string, wrap it in a list
1137
+ if isinstance(answers, str):
1138
+ answers = [[answers]]
1139
+ # If answers is a flat list of strings, wrap each in a list
1140
+ elif answers and isinstance(answers[0], str):
1141
+ answers = [answers]
1142
+
1143
+ gold_data = {
1144
+ 'answers': answers,
1145
+ 'evidence': ex.get('evidence', []),
1146
+ 'category': ex.get('document_category', ''),
1147
+ 'domain': ex.get('domain', ''),
1148
+ 'hop_type': ex.get('hop_type', 'single')
1149
+ }
1150
+
1151
+ by_text[question] = gold_data
1152
+ if qid:
1153
+ by_id[qid] = gold_data
1154
+
1155
+ return by_text, by_id
1156
+ except Exception as e:
1157
+ st.error(f"Error loading dataset: {e}")
1158
+ return {}, {}
1159
 
1160
+
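+ # Shape normalization above (illustrative): a gold answer stored as the string
+ # "Paris" becomes [["Paris"]], and a flat list ["Paris", "Lyon"] becomes
+ # [["Paris", "Lyon"]], so the evaluator always receives a list of answer
+ # variants, each variant being a list of strings.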
1161
+ def evaluate_predictions(predictions: list, gold_by_text: dict, gold_by_id: dict) -> dict:
1162
+ """Evaluate predictions against gold standard."""
1163
+ if not EVAL_AVAILABLE:
1164
+ return {"error": "Evaluation module not available"}
1165
 
1166
+ evals = []
1167
+ unmatched = []
 
1168
 
1169
+ for pred in predictions:
1170
+ question = pred.get('question', '').strip()
1171
+ qid = pred.get('id', '')
1172
+
1173
+ # Match to gold
1174
+ if question in gold_by_text:
1175
+ gold_data = gold_by_text[question]
1176
+ elif qid and qid in gold_by_id:
1177
+ gold_data = gold_by_id[qid]
1178
+ else:
1179
+ unmatched.append(question[:50] + "..." if len(question) > 50 else question)
1180
+ continue
1181
+
1182
+ # Get prediction data
1183
+ answer = pred.get('answer', '')
1184
+ citations = pred.get('citations', [])
1185
+ search_history = pred.get('search_history', [])
1186
+ steps = len(search_history) if search_history else pred.get('iterations', 0)
1187
+
1188
+ # Calculate metrics
1189
+ anls = anls_star(answer, gold_data['answers'])
1190
+ correct = anls >= 0.5
1191
+ doc_f1 = citation_f1(citations, gold_data['evidence'], level='document')
1192
+ page_f1 = citation_f1(citations, gold_data['evidence'], level='page')
1193
+
1194
+ evals.append({
1195
+ 'question': question,
1196
+ 'anls': anls,
1197
+ 'correct': correct,
1198
+ 'doc_f1': doc_f1['f1'],
1199
+ 'page_f1': page_f1['f1'],
1200
+ 'steps': steps,
1201
+ 'hop_type': gold_data.get('hop_type', 'single'),
1202
+ 'category': gold_data['category'],
1203
+ 'domain': gold_data['domain']
1204
+ })
1205
 
1206
+ if not evals:
1207
+ return {"error": "No predictions matched the gold standard"}
1208
+
1209
+ # Aggregate overall metrics
1210
+ n = len(evals)
1211
+ accuracy = sum(e['correct'] for e in evals) / n * 100 # Scale to 0-100
1212
+ mean_anls = sum(e['anls'] for e in evals) / n * 100
1213
+ mean_doc_f1 = sum(e['doc_f1'] for e in evals) / n * 100
1214
+ mean_page_f1 = sum(e['page_f1'] for e in evals) / n * 100
1215
+
1216
+ # Kuiper statistic
1217
+ kuiper = kuiper_statistic(evals)
1218
+
1219
+ # By hop type
1220
+ single_hop = [e for e in evals if e['hop_type'] == 'single']
1221
+ cross_page = [e for e in evals if e['hop_type'] == 'cross_page']
1222
+ cross_doc = [e for e in evals if e['hop_type'] == 'cross_doc']
1223
+
1224
+ # By domain
1225
+ from collections import defaultdict
1226
+ by_domain = defaultdict(list)
1227
+ for e in evals:
1228
+ domain = e['domain'] or 'Other'
1229
+ by_domain[domain].append(e)
1230
+
1231
+ domain_scores = {}
1232
+ for domain, domain_evals in sorted(by_domain.items()):
1233
+ domain_scores[domain] = {
1234
+ 'anls': sum(e['anls'] for e in domain_evals) / len(domain_evals) * 100,
1235
+ 'n': len(domain_evals)
1236
+ }
1237
+
1238
+ results = {
1239
+ 'n_evaluated': n,
1240
+ 'n_unmatched': len(unmatched),
1241
+ 'unmatched_samples': unmatched[:5], # Show first 5
1242
+ 'overall': {
1243
+ 'anls': mean_anls,
1244
+ 'accuracy': accuracy,
1245
+ 'doc_f1': mean_doc_f1,
1246
+ 'page_f1': mean_page_f1,
1247
+ 'kuiper': kuiper['kuiper_stat'] if not kuiper.get('degenerate') else None,
1248
+ },
1249
+ 'single_evidence': {
1250
+ 'anls': sum(e['anls'] for e in single_hop) / len(single_hop) * 100 if single_hop else 0,
1251
+ 'n': len(single_hop)
1252
+ },
1253
+ 'multi_evidence_same_doc': {
1254
+ 'anls': sum(e['anls'] for e in cross_page) / len(cross_page) * 100 if cross_page else 0,
1255
+ 'n': len(cross_page)
1256
+ },
1257
+ 'multi_evidence_multi_doc': {
1258
+ 'anls': sum(e['anls'] for e in cross_doc) / len(cross_doc) * 100 if cross_doc else 0,
1259
+ 'n': len(cross_doc)
1260
+ },
1261
+ 'by_domain': domain_scores
1262
+ }
1263
+
1264
+ return results
1265
+
1266
+
1267
+ @st.fragment
1268
+ def submit_results_fragment():
1269
+ """Fragment for file upload and evaluation to prevent full page reruns."""
1270
+ # Check HuggingFace login
1271
+ hf_user = get_hf_user()
1272
+
1273
+ if not hf_user:
1274
+ st.warning("🔐 **Login Required**: Please sign in with your HuggingFace account to submit results.")
1275
+
1276
+ # Show login button (works on HF Spaces with hf_oauth: true)
1277
+ if hasattr(st, 'login_button'):
1278
+ st.login_button("huggingface", use_container_width=True)
1279
+ else:
1280
+ st.info("""
1281
+ To enable login:
1282
+ 1. Deploy this app on HuggingFace Spaces
1283
+ 2. Add `hf_oauth: true` to your Space's README.md metadata
1284
+
1285
+ Or run locally with a test user by setting environment variables.
1286
+ """)
1287
+ return
1288
+
1289
+ # Show logged-in user
1290
+ st.success(f"✅ Logged in as **{hf_user['username']}**")
1291
+
1292
+ # Step 1: Upload and Evaluate
1293
+ st.markdown("### Step 1: Upload Predictions")
1294
+
1295
+ uploaded_file = st.file_uploader(
1296
+ "Upload your predictions JSONL file",
1297
+ type=["jsonl"],
1298
+ help="One prediction per line with 'question' and 'answer' fields",
1299
+ key="predictions_uploader"
1300
  )
1301
+
1302
+ with st.expander("📋 Expected JSONL format"):
1303
+ st.code('''{"question": "What is the total revenue?", "answer": "$1.2M", "citations": [{"file": "report.pdf", "page": 5}], "iterations": 3}
1304
+ {"question": "Who signed the contract?", "answer": ["John Smith", "Jane Doe"], "citations": [{"file": "contract.pdf", "page": 12}], "iterations": 2}''', language="json")
1305
+ st.markdown("""
1306
+ **Required fields:**
1307
+ - `question`: The question text (must match dataset)
1308
+ - `answer`: Predicted answer (string or list)
1309
+
1310
+ **Optional fields (for full metrics):**
1311
+ - `citations`: List of `{"file": "...", "page": N}` for attribution metrics
1312
+ - `iterations` or `search_history`: For effort/calibration metrics
1313
+ - `id`: Question ID (fallback matching)
1314
+ """)
1315
+
1316
+ # Initialize session state for evaluation results
1317
+ if 'eval_results' not in st.session_state:
1318
+ st.session_state.eval_results = None
1319
+ if 'predictions' not in st.session_state:
1320
+ st.session_state.predictions = None
1321
+
1322
+ if uploaded_file is not None:
1323
+ file_content = uploaded_file.read().decode("utf-8")
1324
+ is_valid, error_msg, predictions = validate_jsonl_submission(file_content)
1325
+
1326
+ if not is_valid:
1327
+ st.error(f"❌ Invalid file: {error_msg}")
1328
+ else:
1329
+ st.success(f"✅ Loaded {len(predictions)} predictions")
1330
+ st.session_state.predictions = predictions
1331
+
1332
+ # Evaluate button
1333
+ if st.button("🔬 Run Evaluation", type="primary"):
1334
+ with st.spinner("Loading gold standard and evaluating..."):
1335
+ gold_by_text, gold_by_id = load_gold_standard()
1336
+
1337
+ if not gold_by_text:
1338
+ st.error("Failed to load gold standard dataset")
1339
+ else:
1340
+ results = evaluate_predictions(predictions, gold_by_text, gold_by_id)
1341
+ st.session_state.eval_results = results
1342
+
1343
+ # Show evaluation results
1344
+ if st.session_state.eval_results:
1345
+ results = st.session_state.eval_results
1346
+
1347
+ if 'error' in results:
1348
+ st.error(results['error'])
1349
+ else:
1350
+ st.markdown("### 📊 Evaluation Results")
1351
+
1352
+ # Summary metrics
1353
+ col1, col2, col3, col4 = st.columns(4)
1354
+ with col1:
1355
+ st.metric("Accuracy (ANLS*)", f"{results['overall']['anls']:.1f}")
1356
+ with col2:
1357
+ st.metric("Attribution (Page F1)", f"{results['overall']['page_f1']:.1f}")
1358
+ with col3:
1359
+ kuiper_val = results['overall']['kuiper']
1360
+ st.metric("Effort (Kuiper)", f"{kuiper_val:.3f}" if kuiper_val else "N/A")
1361
+ with col4:
1362
+ st.metric("Evaluated", f"{results['n_evaluated']} / {results['n_evaluated'] + results['n_unmatched']}")
1363
+
1364
+ # Detailed breakdown
1365
+ with st.expander("📈 Detailed Breakdown"):
1366
+ st.markdown(f"""
1367
+ | Metric | Value |
1368
+ |--------|-------|
1369
+ | **Overall ANLS*** | {results['overall']['anls']:.1f} |
1370
+ | **Acc. Single-Hop** (n={results['single_evidence']['n']}) | {results['single_evidence']['anls']:.1f} |
1371
+ | **Acc. Cross-Page** (n={results['multi_evidence_same_doc']['n']}) | {results['multi_evidence_same_doc']['anls']:.1f} |
1372
+ | **Acc. Cross-Doc** (n={results['multi_evidence_multi_doc']['n']}) | {results['multi_evidence_multi_doc']['anls']:.1f} |
1373
+ | **Attribution (Doc F1)** | {results['overall']['doc_f1']:.1f} |
1374
+ | **Attribution (Page F1)** | {results['overall']['page_f1']:.1f} |
1375
+ """)
1376
+
1377
+ if results['n_unmatched'] > 0:
1378
+ with st.expander(f"⚠️ {results['n_unmatched']} unmatched questions"):
1379
+ for q in results['unmatched_samples']:
1380
+ st.text(f"• {q}")
1381
+ if results['n_unmatched'] > 5:
1382
+ st.text(f"... and {results['n_unmatched'] - 5} more")
1383
+
1384
+ # Step 2: Model Information
1385
+ st.markdown("---")
1386
+ st.markdown("### Step 2: Model Information")
1387
+
1388
+ col1, col2 = st.columns(2)
1389
+
1390
+ with col1:
1391
+ model_name = st.text_input("Model Name *", placeholder="e.g., GPT-4o-Agent")
1392
+ organization = st.text_input("Organization *", placeholder="e.g., OpenAI")
1393
+ model_type = st.selectbox("Model Type *", options=["", "api", "open-weight"])
1394
+
1395
+ with col2:
1396
+ description = st.text_area(
1397
+ "Description",
1398
+ placeholder="Brief description of your approach (e.g., 'Vision-language model with BM25 search tool')",
1399
+ height=80
1400
+ )
1401
+ link = st.text_input("Link (Optional)", placeholder="https://arxiv.org/abs/... or https://github.com/...")
1402
+ selected_tags = st.multiselect(
1403
+ "Tags",
1404
+ options=AVAILABLE_TAGS,
1405
+ default=["Agentic"],
1406
+ help="Select tags that describe your approach"
1407
+ )
1408
+
1409
+ # Step 3: Submit
1410
+ st.markdown("---")
1411
+ st.markdown("### Step 3: Submit to Leaderboard")
1412
+
1413
+ if st.button("🚀 Submit to Leaderboard", type="primary", disabled=not (model_name and organization and model_type)):
1414
+ if not model_name or not organization or not model_type:
1415
+ st.error("Please fill in all required fields (Model Name, Organization, Model Type)")
1416
+ else:
1417
+ # Get current user for submission tracking
1418
+ hf_user = get_hf_user()
1419
+
1420
+ # Prepare submission data
1421
+ submission = {
1422
+ "model_name": model_name.strip(),
1423
+ "organization": organization.strip(),
1424
+ "description": description.strip() if description else "",
1425
+ "link": link.strip() if link else "",
1426
+ "tags": selected_tags,
1427
+ "submitted_by": hf_user['username'] if hf_user else "anonymous",
1428
+ "metadata": {
1429
+ "model_type": model_type,
1430
+ },
1431
+ "results": {
1432
+ "overall": {
1433
+ "anls": results['overall']['anls'],
1434
+ "page_f1": results['overall']['page_f1'],
1435
+ "doc_f1": results['overall']['doc_f1'],
1436
+ "kuiper": results['overall']['kuiper'],
1437
+ },
1438
+ "single_evidence": results['single_evidence'],
1439
+ "multi_evidence_same_doc": results['multi_evidence_same_doc'],
1440
+ "multi_evidence_multi_doc": results['multi_evidence_multi_doc'],
1441
+ "by_domain": results.get('by_domain', {}),
1442
+ },
1443
+ "submission_date": datetime.now(timezone.utc).isoformat(),
1444
+ }
1445
+
1446
+ # Upload to HuggingFace Hub
1447
+ with st.spinner("Uploading to leaderboard..."):
1448
+ try:
1449
+ # Create path matching expected structure: {org}/{model}_results_{timestamp}.json
1450
+ safe_org = organization.strip().replace(" ", "_").replace("/", "-")
1451
+ safe_model = model_name.strip().replace(" ", "_").replace("/", "-")
1452
+ timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
1453
+ filename = f"{safe_model}_results_{timestamp}.json"
1454
+ path_in_repo = f"{safe_org}/{filename}"
1455
+
1456
+ # Upload using HfApi
1457
+ api = HfApi()
1458
+ api.upload_file(
1459
+ path_or_fileobj=json.dumps(submission, indent=2).encode("utf-8"),
1460
+ path_in_repo=path_in_repo,
1461
+ repo_id=RESULTS_REPO,
1462
+ repo_type="dataset",
1463
+ token=TOKEN,
1464
+ commit_message=f"Add results for {organization}/{model_name}"
1465
+ )
1466
+
1467
+ st.success(f"✅ Successfully submitted to leaderboard!")
1468
+ st.balloons()
1469
+
1470
+ with st.expander("📄 Submission Details"):
1471
+ st.code(json.dumps(submission, indent=2), language="json")
1472
+
1473
+ # Clear cache to force refresh
1474
+ download_data.clear()
1475
+ load_eval_results.clear()
1476
+
1477
+ st.info("✨ Your submission has been saved! Click below to see it on the leaderboard.")
1478
+ if st.button("🔄 View Updated Leaderboard", type="primary"):
1479
+ st.rerun(scope="app") # Full page rerun, not just fragment
1480
+
1481
+ except Exception as e:
1482
+ st.error(f"❌ Upload failed: {str(e)}")
1483
+ st.warning("Please ensure HF_TOKEN environment variable is set with write access to the repository.")
1484
+
1485
+ with st.expander("📄 Submission JSON (for manual upload)"):
1486
+ st.code(json.dumps(submission, indent=2), language="json")
1487
+
1488
+ st.info(f"""
1489
+ **To submit manually:**
1490
+ 1. Copy the JSON above
1491
+ 2. Save as `{path_in_repo}`
1492
+ 3. Upload to `{RESULTS_REPO}` on HuggingFace Hub
1493
+
1494
+ Or contact lukasz.borchmann@snowflake.com
1495
+ """)
1496
 
1497
 
1498
+ def main():
1499
+ # Download data from HuggingFace Hub
1500
+ with st.spinner("Loading data from HuggingFace Hub..."):
1501
+ download_data()
1502
+
1503
+ # Load data
1504
+ df = load_eval_results()
1505
+
1506
+ # Tabs - matching Gradio style (no emojis)
1507
+ tab1, tab2, tab3, tab4 = st.tabs(["Leaderboard", "Visualizations", "About", "Submit Results"])
1508
+
1509
+ # ===== LEADERBOARD TAB =====
1510
+ with tab1:
1511
+ # Header with icon (fallback to emoji if icon doesn't load)
1512
+ if ICON_MEDAL:
1513
+ icon_html = f'<img src="{ICON_MEDAL}" style="width: 40px; height: 40px; vertical-align: middle; margin-right: 12px;" />'
1514
+ else:
1515
+ icon_html = '<span style="font-size: 36px; margin-right: 12px;">🏆</span>'
1516
+ st.markdown(f'<h3 style="display: flex; align-items: center; margin-top: 1.5rem; margin-bottom: 1.2rem;">{icon_html} Leaderboard</h3>', unsafe_allow_html=True)
1517
+
1518
+ if df.empty:
1519
+ st.warning("No evaluation results found. Submit your results to appear on the leaderboard!")
1520
+ else:
1521
+ # ===== FILTERS SIDE BY SIDE =====
1522
+ filter_col1, filter_col2 = st.columns(2)
1523
+
1524
+ with filter_col1:
1525
+ # TAG FILTER - chips use MID_BLUE (darker, gradient start)
1526
+ tags_in_data = get_all_tags_from_df(df)
1527
+ all_available_tags = sorted(list(set(AVAILABLE_TAGS + tags_in_data)))
1528
+
1529
+ selected_tags = st.multiselect(
1530
+ "Filter by techniques/features:",
1531
+ options=all_available_tags,
1532
+ default=["Agentic"],
1533
+ placeholder="Click to filter by tags...",
1534
+ key="tag_filter",
1535
+ )
1536
+
1537
+ with filter_col2:
1538
+ # COLUMN SELECTOR - chips use SNOWFLAKE_BLUE (lighter, gradient end)
1539
+ # Mapping: short chip name -> full column name
1540
+ COLUMN_CHIP_NAMES = {
1541
+ "Accuracy": "Accuracy (ANLS*)",
1542
+ "Acc. Single-Hop": "Acc. Single-Hop",
1543
+ "Acc. Cross-Page": "Acc. Cross-Page",
1544
+ "Acc. Cross-Doc": "Acc. Cross-Doc",
1545
+ "Attribution": "Attribution (Page F1)",
1546
+ "Attribution (Doc)": "Attribution (Doc F1)",
1547
+ "Effort": "Effort (Kuiper)",
1548
+ "Model Type": "Model Type",
1549
+ "Tags": "Tags",
1550
+ }
1551
+ # Reverse mapping for lookup
1552
+ CHIP_TO_COLUMN = COLUMN_CHIP_NAMES
1553
+ COLUMN_TO_CHIP = {v: k for k, v in COLUMN_CHIP_NAMES.items()}
1554
+
1555
+ all_columns = list(df.columns)
1556
+ # Model and Organization are always visible (not in selector)
1557
+ always_visible = ["Model", "Organization"]
1558
+ # Hidden columns (used internally but not shown as separate columns)
1559
+ hidden_cols = ["Link", "Submission Date", "Description", "_by_domain"]
1560
+ # Full column names that are optional (Tags moved to end)
1561
+ optional_full_cols = [c for c in all_columns if c not in hidden_cols + always_visible and c != "Tags"]
1562
+ optional_full_cols.append("Tags") # Add Tags at the end
1563
+ # Convert to chip names for display
1564
+ optional_chips = [COLUMN_TO_CHIP.get(c, c) for c in optional_full_cols]
1565
+
1566
+ default_chips = ["Model Type", "Tags", "Accuracy", "Attribution", "Effort"]
1567
+ default_selected = [c for c in default_chips if c in optional_chips]
1568
+
1569
+ selected_chips = st.multiselect(
1570
+ "Select columns to display:",
1571
+ options=optional_chips,
1572
+ default=default_selected,
1573
+ key="column_selector",
1574
+ )
1575
+
1576
+ # Convert selected chips back to full column names
1577
+ selected_optional = [CHIP_TO_COLUMN.get(c, c) for c in selected_chips]
1578
+
1579
+ # Apply tag filter
1580
+ filtered_df = filter_df_by_tags(df, selected_tags)
1581
+
1582
+ # Show filter status
1583
+ if selected_tags:
1584
+ st.caption(f"Showing {len(filtered_df)} of {len(df)} models matching selected tags")
1585
+
1586
+ # Model and Organization are always included first
1587
+ selected_columns = ["Model", "Organization"] + [c for c in optional_full_cols if c in selected_optional]
1588
+
1589
+ if selected_columns:
1590
+ # Render HTML table with proper styling
1591
+ render_leaderboard_table(filtered_df, selected_columns)
1592
+
1593
+ # Download button
1594
+ st.markdown("") # Small spacing
1595
+ csv = filtered_df.to_csv(index=False)
1596
+ st.download_button(
1597
+ label="Download as CSV",
1598
+ data=csv,
1599
+ file_name="leaderboard.csv",
1600
+ mime="text/csv",
1601
+ )
1602
+
1603
+ # ===== VISUALIZATIONS TAB =====
1604
+ with tab2:
1605
+ if ICON_EYE:
1606
+ icon_html = f'<img src="{ICON_EYE}" style="width: 40px; height: 40px; vertical-align: middle; margin-right: 12px;" />'
1607
+ else:
1608
+ icon_html = '<span style="font-size: 36px; margin-right: 12px;">📈</span>'
1609
+ st.markdown(f'<h3 style="display: flex; align-items: center; margin-top: 1.5rem; margin-bottom: 1.2rem;">{icon_html} Visualizations</h3>', unsafe_allow_html=True)
1610
+
1611
+ if df.empty:
1612
+ st.warning("No data available for visualization.")
1613
+ else:
1614
+ # Two plots side by side
1615
+ col1, col2 = st.columns(2)
1616
+
1617
+ with col1:
1618
+ fig_attribution = create_accuracy_vs_attribution_plot(df)
1619
+ st.plotly_chart(fig_attribution, width="stretch")
1620
+
1621
+ with col2:
1622
+ fig_effort = create_accuracy_vs_effort_plot(df)
1623
+ st.plotly_chart(fig_effort, width="stretch")
1624
+
1625
+ st.markdown("""
1626
  **Understanding the plots:**
1627
  - Each point represents a model submission
1628
  - **Orange points**: API-based models
1629
  - **Blue points**: Open-weight models
1630
  - Hover over points to see model details
1631
+ - **Left plot**: Upper-right = high accuracy with good attribution (optimal)
1632
+ - **Right plot**: Upper-left = high accuracy with good effort calibration (optimal)
1633
+ """)
1634
+
1635
+ # Model details selector
1636
+ st.markdown("---")
1637
+ st.markdown("### 📊 Model Details")
1638
+
1639
+ model_names = df["Model"].tolist()
1640
+ selected_model = st.selectbox("Select a model to view per-domain breakdown:", model_names)
1641
+
1642
+ if selected_model:
1643
+ show_model_details(selected_model)
1644
+
1645
+ # ===== ABOUT TAB =====
1646
+ with tab3:
1647
+ if ICON_DOCS:
1648
+ icon_html = f'<img src="{ICON_DOCS}" style="width: 40px; height: 40px; vertical-align: middle; margin-right: 12px;" />'
1649
+ else:
1650
+ icon_html = '<span style="font-size: 36px; margin-right: 12px;">📖</span>'
1651
+ st.markdown(f'<h3 style="display: flex; align-items: center; margin-top: 1.5rem; margin-bottom: 1.2rem;">{icon_html} About</h3>', unsafe_allow_html=True)
1652
+
1653
+ st.markdown("""
1654
+ ## Agentic Document VQA Benchmark
1655
+
1656
+ This benchmark evaluates AI systems on **Agentic Document Collection Visual Question Answering** —
1657
+ a task requiring systems to navigate, retrieve, reason over, and aggregate information from
1658
+ heterogeneous document collections.
1659
+
1660
+ ### Dataset
1661
+ - **2,266** human-authored question-answer pairs
1662
+ - **769** multi-page PDF documents from diverse real-world domains
1663
+ - **16,652** total pages with rich visual layouts
1664
+ - **17.3%** multi-hop questions (cross-page and cross-document)
1665
+ - **61** document categories across 13 high-level domains
1666
+
1667
+ ### Task Properties
1668
+ The task is characterized by five formal properties:
1669
+ 1. **Extractive**: Answers are drawn from evidence pages, not generated abstractly
1670
+ 2. **Multi-Hop**: Evidence may span multiple disjoint pages requiring aggregation
1671
+ 3. **Closed-World**: Answers must be derivable solely from the corpus
1672
+ 4. **Grounded Attribution**: Answers must be faithfully attributed to minimal evidence
1673
+ 5. **Agentic**: Requires iterative retrieval and reasoning (planning, navigation, aggregation)
1674
+
1675
+ ## Metrics
1676
+
1677
+ ### Accuracy (ANLS*)
1678
+ - **Accuracy (ANLS*)**: Main score using Average Normalized Levenshtein Similarity with optimal element alignment for lists/sets
1679
+ - **Acc. Single-Hop**: Accuracy on questions requiring a single evidence page
1680
+ - **Acc. Cross-Page**: Accuracy on multi-hop questions within the same document
1681
+ - **Acc. Cross-Doc**: Accuracy on multi-hop questions spanning multiple documents
1682
+
1683
+ ### Attribution (Page F1)
1684
+ - **Attribution (Page F1)**: F1 score measuring overlap between cited pages and gold evidence pages (penalizes both missing and spurious citations)
1685
+ - **Attribution (Doc F1)**: Document-level attribution accuracy (whether the correct documents were identified)
1686
+
1687
+ ### Effort (Kuiper)
1688
+ - **Effort (Kuiper)**: Measures whether computational effort correlates with problem difficulty. Lower values indicate better calibration—the system "knows what it knows" and doesn't waste effort on unsolvable queries
1689
+ """)
1690
+
1691
+ # ===== SUBMIT TAB =====
1692
+ with tab4:
1693
+ if ICON_WRITE:
1694
+ icon_html = f'<img src="{ICON_WRITE}" style="width: 40px; height: 40px; vertical-align: middle; margin-right: 12px;" />'
1695
+ else:
1696
+ icon_html = '<span style="font-size: 36px; margin-right: 12px;">📝</span>'
1697
+ st.markdown(f'<h3 style="display: flex; align-items: center; margin-top: 1.5rem; margin-bottom: 1.2rem;">{icon_html} Submit Results</h3>', unsafe_allow_html=True)
1698
+
1699
+ if not EVAL_AVAILABLE:
1700
+ st.warning("⚠️ Evaluation module not available. Please install dependencies: `pip install anls-star datasets`")
1701
+
1702
+ # Use fragment to prevent tab switch on file upload
1703
+ submit_results_fragment()
1704
+
1705
+
1706
+ if __name__ == "__main__":
1707
+ main()
1708
 
eval/README.md ADDED
@@ -0,0 +1,82 @@
1
+ # Agentic Document AI Evaluation
2
+
3
+ Evaluation library for the [agentic-document-ai/dataset](https://huggingface.co/datasets/agentic-document-ai/dataset) benchmark.
4
+
5
+ ## Installation
6
+
7
+ ```bash
8
+ pip install -r requirements.txt
9
+ ```
10
+
11
+ ## Usage
12
+
13
+ ### Command Line
14
+
15
+ ```bash
16
+ # Basic evaluation
17
+ python evaluate.py results.jsonl
18
+
19
+ # With category/domain breakdown
20
+ python evaluate.py results.jsonl --by-category --by-domain
21
+
22
+ # Compare multiple models
23
+ python evaluate.py model1.jsonl model2.jsonl model3.jsonl --compare
24
+
25
+ # Output as JSON
26
+ python evaluate.py results.jsonl --json
27
+ ```
28
+
29
+ ### Expected Input Format
30
+
31
+ JSONL file with one prediction per line:
32
+
33
+ ```json
34
+ {"id": "test/0", "question": "What is the total revenue?", "answer": "$1.2M", "citations": [{"document": "report.pdf", "page": 5}], "search_history": ["query1", "query2"]}
35
+ ```
36
+
37
+ Required fields:
38
+ - `question`: The question text (used to match with gold standard)
39
+ - `answer`: Predicted answer string
40
+
41
+ Optional fields:
42
+ - `id`: Question ID (fallback if question text doesn't match)
43
+ - `citations`: List of `{document, page}` for citation evaluation
44
+ - `search_history`: List of search queries (for Kuiper effort analysis)
45
+ - `iterations`: Alternative to `search_history` length
46
+
47
+ ### Dataset Splits
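+ A minimal sketch of producing this file (`run_agent` below is a hypothetical placeholder for your own system; only the serialized fields are prescribed):
+ 
+ ```python
+ import json
+ 
+ def run_agent(question: str) -> dict:
+     # Hypothetical stand-in -- replace with your model/agent call.
+     return {
+         "answer": "$1.2M",
+         "citations": [{"document": "report.pdf", "page": 5}],
+         "search_history": ["total revenue", "revenue 2023"],
+     }
+ 
+ questions = [{"id": "test/0", "question": "What is the total revenue?"}]
+ 
+ with open("results.jsonl", "w") as f:
+     for q in questions:
+         pred = run_agent(q["question"])
+         f.write(json.dumps({"id": q["id"], "question": q["question"], **pred}) + "\n")
+ ```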
48
+
49
+ By default, evaluates against the `dev` split. Use `--split test` for test set evaluation.
50
+
51
+ ## Metrics
52
+
53
+ | Metric | Description |
54
+ |--------|-------------|
55
+ | **ANLS\*** | Average Normalized Levenshtein Similarity with optimal element alignment (0-1) |
56
+ | **Accuracy** | Fraction with ANLS* ≥ 0.5 |
57
+ | **Document F1** | Citation accuracy at document level |
58
+ | **Page F1** | Citation accuracy at page level |
59
+ | **Kuiper Statistic** | Effort-accuracy calibration (lower = better) |
60
+ | **Wasted Effort Ratio** | μ_steps(incorrect) / μ_steps(correct) |
61
+
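+ For intuition, Accuracy is simply the fraction of questions whose ANLS\* clears the 0.5 threshold:
+ 
+ ```python
+ scores = [0.92, 0.48, 0.61]  # per-question ANLS* scores
+ accuracy = sum(s >= 0.5 for s in scores) / len(scores)  # 2/3
+ ```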
62
+ ## Python API
63
+
64
+ ```python
65
+ from metrics import anls_star, citation_f1, kuiper_statistic
66
+
67
+ # ANLS* score
68
+ score = anls_star("$1.2 million", [["$1.2M", "1.2 million dollars"]])
69
+
70
+ # Citation F1
71
+ f1 = citation_f1(
72
+ predicted=[{"document": "a.pdf", "page": 1}],
73
+ gold_locations=[{"document": "a.pdf", "page": 1}, {"document": "a.pdf", "page": 2}],
74
+ level='page'
75
+ )
76
+
77
+ # Kuiper statistic
78
+ results = [{"steps": 3, "correct": True}, {"steps": 7, "correct": False}, ...]
79
+ kuiper = kuiper_statistic(results)
80
+ ```
81
+
82
+
eval/evaluate.py ADDED
@@ -0,0 +1,309 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Evaluation CLI for Agentic Document AI.
4
+
5
+ Evaluates model predictions against the agentic-document-ai/dataset benchmark.
6
+
7
+ Usage:
8
+ python evaluate.py results.jsonl [--by-category] [--by-domain]
9
+ python evaluate.py results_*.jsonl --compare
10
+ """
11
+
12
+ import argparse
13
+ import json
14
+ import sys
15
+ from collections import defaultdict
16
+ from pathlib import Path
17
+ from typing import Any, Dict, List, Optional, Tuple
18
+
19
+ from datasets import load_dataset
20
+
21
+ from metrics import anls_star, citation_f1, kuiper_statistic, wasted_effort_ratio
22
+
23
+
24
+ def load_gold_standard(dataset_name: str = "agentic-document-ai/dataset", split: str = "dev"):
25
+ """Load gold standard from HuggingFace dataset.
26
+
27
+ Returns two mappings:
28
+ - by_text: question text -> gold data (primary)
29
+ - by_id: question id -> gold data (fallback)
30
+ """
31
+ print(f"Loading {dataset_name} ({split} split)...")
32
+ dataset = load_dataset(dataset_name, split=split)
33
+
34
+ by_text = {}
35
+ by_id = {}
36
+
37
+ for ex in dataset:
38
+ question = ex['question'].strip()
39
+ qid = ex.get('id', '')
40
+
41
+ gold_data = {
42
+ 'answers': ex.get('answer_variants', []),
43
+ 'evidence': ex.get('evidence', []),
44
+ 'category': ex.get('document_category', ''),
45
+ 'domain': ex.get('domain', '')
46
+ }
47
+
48
+ by_text[question] = gold_data
49
+ if qid:
50
+ by_id[qid] = gold_data
51
+
52
+ print(f"Loaded {len(by_text)} gold examples")
53
+ return by_text, by_id
54
+
55
+
56
+ def load_results(filepath: Path) -> List[Dict]:
57
+ """Load results from JSONL file."""
58
+ results = []
59
+ with open(filepath) as f:
60
+ for line in f:
61
+ if line.strip():
62
+ results.append(json.loads(line))
63
+ return results
64
+
65
+
66
+ def evaluate_single(
67
+ result: Dict,
68
+ gold_by_text: Dict[str, Dict],
69
+ gold_by_id: Dict[str, Dict]
70
+ ) -> Optional[Dict[str, Any]]:
71
+ """Evaluate a single prediction.
72
+
73
+ Matches by question text first, falls back to question ID if not found.
74
+ """
75
+ question = result.get('question', '').strip()
76
+ qid = result.get('id', '')
77
+
78
+ # Try matching by question text first
79
+ if question in gold_by_text:
80
+ gold_data = gold_by_text[question]
81
+ elif qid and qid in gold_by_id:
82
+ # Fallback to ID-based matching
83
+ gold_data = gold_by_id[qid]
84
+ else:
85
+ return None
86
+ answer = result.get('answer', '')
87
+ citations = result.get('citations', [])
88
+
89
+ # ANLS*
90
+ anls = anls_star(answer, gold_data['answers'])
91
+ correct = anls >= 0.5
92
+
93
+ # Citation F1
94
+ doc_f1 = citation_f1(citations, gold_data['evidence'], level='document')
95
+ page_f1 = citation_f1(citations, gold_data['evidence'], level='page')
96
+
97
+ # Steps (for Kuiper)
98
+ search_history = result.get('search_history', [])
99
+ steps = len(search_history) if search_history else result.get('iterations', 0)
100
+
101
+ return {
102
+ 'question': question,
103
+ 'anls': anls,
104
+ 'correct': correct,
105
+ 'doc_f1': doc_f1['f1'],
106
+ 'page_f1': page_f1['f1'],
107
+ 'steps': steps,
108
+ 'category': gold_data['category'],
109
+ 'domain': gold_data['domain']
110
+ }
111
+
112
+
113
+ def aggregate_metrics(evals: List[Dict]) -> Dict[str, Any]:
114
+ """Aggregate metrics across evaluations."""
115
+ if not evals:
116
+ return {}
117
+
118
+ n = len(evals)
119
+ accuracy = sum(e['correct'] for e in evals) / n
120
+ mean_anls = sum(e['anls'] for e in evals) / n
121
+ mean_doc_f1 = sum(e['doc_f1'] for e in evals) / n
122
+ mean_page_f1 = sum(e['page_f1'] for e in evals) / n
123
+
124
+ # Kuiper
125
+ kuiper = kuiper_statistic(evals)
126
+ wasted = wasted_effort_ratio(evals)
127
+
128
+ return {
129
+ 'n': n,
130
+ 'accuracy': accuracy,
131
+ 'mean_anls': mean_anls,
132
+ 'doc_f1': mean_doc_f1,
133
+ 'page_f1': mean_page_f1,
134
+ 'kuiper_stat': kuiper['kuiper_stat'],
135
+ 'kuiper_degenerate': kuiper['degenerate'],
136
+ 'wasted_effort_ratio': wasted['ratio'],
137
+ 'mean_steps_correct': wasted['mean_steps_correct'],
138
+ 'mean_steps_incorrect': wasted['mean_steps_incorrect'],
139
+ }
140
+
141
+
142
+ def print_metrics(name: str, metrics: Dict, indent: int = 0):
143
+ """Print metrics in a formatted way."""
144
+ prefix = " " * indent
145
+
146
+ if 'n' not in metrics:
147
+ print(f"{prefix}{name}: No data")
148
+ return
149
+
150
+ print(f"{prefix}{name} (n={metrics['n']}):")
151
+ print(f"{prefix} Accuracy (ANLS*≥0.5): {metrics['accuracy']:.1%}")
152
+ print(f"{prefix} Mean ANLS*: {metrics['mean_anls']:.4f}")
153
+ print(f"{prefix} Document F1: {metrics['doc_f1']:.4f}")
154
+ print(f"{prefix} Page F1: {metrics['page_f1']:.4f}")
155
+
156
+ if not metrics.get('kuiper_degenerate'):
157
+ print(f"{prefix} Kuiper Statistic: {metrics['kuiper_stat']:.2f}")
158
+
159
+ if metrics.get('wasted_effort_ratio', 0) < float('inf'):
160
+ print(f"{prefix} Wasted Effort Ratio: {metrics['wasted_effort_ratio']:.3f}")
161
+
162
+
163
+ def evaluate_file(
164
+ filepath: Path,
165
+ gold_by_text: Dict[str, Dict],
166
+ gold_by_id: Dict[str, Dict],
167
+ by_category: bool = False,
168
+ by_domain: bool = False
169
+ ) -> Dict[str, Any]:
170
+ """Evaluate a single results file."""
171
+ results = load_results(filepath)
172
+
173
+ evals = []
174
+ unmatched = 0
175
+
176
+ for result in results:
177
+ ev = evaluate_single(result, gold_by_text, gold_by_id)
178
+ if ev:
179
+ evals.append(ev)
180
+ else:
181
+ unmatched += 1
182
+
183
+ if unmatched > 0:
184
+ print(f" Warning: {unmatched} questions not found in gold standard")
185
+
186
+ # Overall metrics
187
+ overall = aggregate_metrics(evals)
188
+
189
+ output = {'overall': overall}
190
+
191
+ # By category
192
+ if by_category:
193
+ by_cat = defaultdict(list)
194
+ for e in evals:
195
+ by_cat[e['category'] or 'Unknown'].append(e)
196
+ output['by_category'] = {cat: aggregate_metrics(items) for cat, items in sorted(by_cat.items())}
197
+
198
+ # By domain
199
+ if by_domain:
200
+ by_dom = defaultdict(list)
201
+ for e in evals:
202
+ by_dom[e['domain'] or 'Other'].append(e)
203
+ output['by_domain'] = {dom: aggregate_metrics(items) for dom, items in sorted(by_dom.items())}
204
+
205
+ return output
206
+
207
+
208
+ def main():
209
+ parser = argparse.ArgumentParser(
210
+ description="Evaluate model predictions on Agentic Document AI benchmark",
211
+ formatter_class=argparse.RawDescriptionHelpFormatter,
212
+ epilog="""
213
+ Examples:
214
+ python evaluate.py results.jsonl
215
+ python evaluate.py results.jsonl --by-category --by-domain
216
+ python evaluate.py model1.jsonl model2.jsonl --compare
217
+ """
218
+ )
219
+ parser.add_argument('files', nargs='+', type=Path, help='Result JSONL file(s)')
220
+ parser.add_argument('--dataset', default='agentic-document-ai/dataset',
221
+ help='HuggingFace dataset name')
222
+ parser.add_argument('--split', default='dev', help='Dataset split to evaluate on')
223
+ parser.add_argument('--by-category', action='store_true', help='Show metrics by document category')
224
+ parser.add_argument('--by-domain', action='store_true', help='Show metrics by domain')
225
+ parser.add_argument('--compare', action='store_true', help='Compare multiple models side-by-side')
226
+ parser.add_argument('--json', action='store_true', help='Output as JSON')
227
+
228
+ args = parser.parse_args()
229
+
230
+ # Load gold standard
231
+ gold_by_text, gold_by_id = load_gold_standard(args.dataset, args.split)
232
+
233
+ if not gold_by_text:
234
+ print("Error: No gold standard data loaded", file=sys.stderr)
235
+ sys.exit(1)
236
+
237
+ all_results = {}
238
+
239
+ for filepath in args.files:
240
+ if not filepath.exists():
241
+ print(f"Error: File not found: {filepath}", file=sys.stderr)
242
+ continue
243
+
244
+ # Extract model name
245
+ name = filepath.stem
246
+ if name.startswith("results_"):
247
+ name = name[8:]
248
+ if name.endswith("_results"):
249
+ name = name[:-8]
250
+
251
+ print(f"\nEvaluating: {filepath.name}")
252
+ result = evaluate_file(filepath, gold_by_text, gold_by_id, args.by_category, args.by_domain)
253
+ all_results[name] = result
254
+
255
+ # Output
256
+ if args.json:
257
+ # Convert for JSON serialization
258
+ def sanitize(obj):
259
+ if isinstance(obj, float) and (obj != obj or abs(obj) == float('inf')): # NaN or ±inf
260
+ return None
261
+ if isinstance(obj, dict):
262
+ return {k: sanitize(v) for k, v in obj.items()}
263
+ if isinstance(obj, list):
264
+ return [sanitize(v) for v in obj]
265
+ return obj
266
+
267
+ print(json.dumps(sanitize(all_results), indent=2))
268
+ else:
269
+ # Print formatted output
270
+ print("\n" + "=" * 70)
271
+ print("EVALUATION RESULTS")
272
+ print("=" * 70)
273
+
274
+ if args.compare and len(all_results) > 1:
275
+ # Comparison table
276
+ models = list(all_results.keys())
277
+
278
+ print(f"\n{'Model':<35} {'Acc':<8} {'ANLS*':<8} {'Doc F1':<8} {'Page F1':<8} {'Kuiper':<8}")
279
+ print("-" * 75)
280
+
281
+ for model in sorted(models, key=lambda m: -all_results[m]['overall'].get('accuracy', 0)):
282
+ m = all_results[model]['overall']
283
+ kuiper_str = f"{m['kuiper_stat']:.2f}" if not m.get('kuiper_degenerate') else "N/A"
284
+ print(f"{model:<35} {m.get('accuracy', 0):.1%} {m.get('mean_anls', 0):.4f} "
285
+ f"{m.get('doc_f1', 0):.4f} {m.get('page_f1', 0):.4f} {kuiper_str}")
286
+ else:
287
+ # Detailed per-model output
288
+ for model, result in all_results.items():
289
+ print(f"\n{'─' * 40}")
290
+ print_metrics(model, result['overall'])
291
+
292
+ if 'by_category' in result:
293
+ print(f"\n By Category:")
294
+ for cat, metrics in sorted(result['by_category'].items(),
295
+ key=lambda x: -x[1].get('n', 0)):
296
+ print_metrics(cat, metrics, indent=2)
297
+
298
+ if 'by_domain' in result:
299
+ print(f"\n By Domain:")
300
+ for dom, metrics in sorted(result['by_domain'].items(),
301
+ key=lambda x: -x[1].get('n', 0)):
302
+ print_metrics(dom, metrics, indent=2)
303
+
304
+ print()
305
+
306
+
307
+ if __name__ == "__main__":
308
+ main()
309
+
eval/metrics.py ADDED
@@ -0,0 +1,209 @@
1
+ """
2
+ Core evaluation metrics for document QA.
3
+
4
+ Metrics:
5
+ - ANLS*: Average Normalized Levenshtein Similarity (with optimal element alignment)
6
+ - Citation F1: Document-level and Page-level F1 scores
7
+ - Kuiper Statistic: Effort-accuracy calibration measure
8
+ """
9
+
10
+ from typing import Any, Dict, List, Set
11
+ import numpy as np
12
+ from anls_star import anls_score
13
+
14
+
15
+ def anls_star(predicted: Any, ground_truths: List[List[str]]) -> float:
16
+ """
17
+ Calculate ANLS* score (case-insensitive).
18
+
19
+ Args:
20
+ predicted: Predicted answer (string or list)
21
+ ground_truths: List of answer variants, each variant is a list of strings
22
+
23
+ Returns:
24
+ Maximum ANLS* score across all variants (0.0 to 1.0)
25
+ """
26
+ if not ground_truths:
27
+ return 0.0
28
+
29
+ if predicted is None:
30
+ predicted = []
31
+
32
+ if isinstance(predicted, str):
33
+ predicted = [predicted]
34
+
35
+ if not predicted:
36
+ return 0.0
37
+
38
+ # Convert all elements to lowercase strings
39
+ pred_lower = [str(p).lower() for p in predicted]
40
+
41
+ max_score = 0.0
42
+ for gold_variant in ground_truths:
43
+ if isinstance(gold_variant, str):
44
+ gold_variant = [gold_variant]
45
+ gold_lower = [g.lower() if isinstance(g, str) else str(g).lower() for g in gold_variant]
46
+ score = anls_score(pred_lower, gold_lower)
47
+ max_score = max(max_score, score)
48
+
49
+ return max_score
50
+
51
+
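+ # Illustrative usage: each gold variant is itself a list of acceptable strings,
+ # and the score is the maximum over variants, computed case-insensitively:
+ #   anls_star("$1.2M", [["$1.2m"], ["1.2 million dollars"]])  # -> 1.0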
52
+ def citation_f1(
53
+ predicted_citations: List[Dict[str, Any]],
54
+ gold_locations: List[Dict[str, Any]],
55
+ level: str = 'page'
56
+ ) -> Dict[str, float]:
57
+ """
58
+ Calculate Citation F1 at document or page level.
59
+
60
+ Args:
61
+ predicted_citations: List of dicts with 'file'/'document' and 'page' keys
62
+ gold_locations: List of dicts with 'document' and 'page' keys
63
+ level: 'document' or 'page'
64
+
65
+ Returns:
66
+ Dict with 'precision', 'recall', 'f1', 'support'
67
+ """
68
+ if not gold_locations:
69
+ return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'support': 0}
70
+
71
+ # Extract gold citations
72
+ if level == 'document':
73
+ gt_set: Set = {loc.get('document') for loc in gold_locations if loc.get('document')}
74
+ else:
75
+ gt_set = {
76
+ (loc.get('document'), loc.get('page'))
77
+ for loc in gold_locations
78
+ if loc.get('document') is not None
79
+ }
80
+
81
+ # Extract predicted citations
82
+ if not predicted_citations:
83
+ pred_set: Set = set()
84
+ else:
85
+ if level == 'document':
86
+ pred_set = {
87
+ cite.get('file') or cite.get('document')
88
+ for cite in predicted_citations
89
+ if (cite.get('file') or cite.get('document'))
90
+ }
91
+ else:
92
+ pred_set = {
93
+ (cite.get('file') or cite.get('document'), cite.get('page'))
94
+ for cite in predicted_citations
95
+ if (cite.get('file') or cite.get('document')) is not None
96
+ }
97
+
98
+ # Clean None values
99
+ gt_set = {c for c in gt_set if c is not None and (not isinstance(c, tuple) or None not in c)}
100
+ pred_set = {c for c in pred_set if c is not None and (not isinstance(c, tuple) or None not in c)}
101
+
102
+ if not gt_set:
103
+ return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'support': 0}
104
+
105
+ tp = len(gt_set & pred_set)
106
+ precision = tp / len(pred_set) if pred_set else 0.0
107
+ recall = tp / len(gt_set) if gt_set else 0.0
108
+ f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
109
+
110
+ return {'precision': precision, 'recall': recall, 'f1': f1, 'support': len(gt_set)}
111
+
112
+
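+ # Worked example: gold evidence covers pages (a.pdf, 1) and (a.pdf, 2) but the
+ # model cites only (a.pdf, 1); at page level this gives precision = 1/1,
+ # recall = 1/2, and F1 = 2/3.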
113
+ def kuiper_statistic(results: List[Dict]) -> Dict[str, Any]:
114
+ """
115
+ Compute Kuiper calibration statistic for effort-accuracy analysis.
116
+
117
+ Measures dependency between effort (steps) and accuracy. Lower values
118
+ indicate more uniform error distribution across effort levels.
119
+
120
+ Args:
121
+ results: List of dicts with 'steps' (int) and 'correct' (bool)
122
+
123
+ Returns:
124
+ Dict with:
125
+ - kuiper_stat: The Kuiper statistic (lower = better calibration)
126
+ - y_bar: Global mean accuracy
127
+ - max_positive: Maximum positive deviation
128
+ - max_negative: Maximum negative deviation
129
+ - n_samples: Number of valid samples
130
+ - degenerate: True if all samples have same correctness
131
+ """
132
+ valid = [r for r in results if r.get('steps', 0) > 0]
133
+
134
+ if not valid:
135
+ return {
136
+ 'kuiper_stat': float('nan'),
137
+ 'y_bar': 0.0,
138
+ 'max_positive': 0.0,
139
+ 'max_negative': 0.0,
140
+ 'n_samples': 0,
141
+ 'degenerate': True
142
+ }
143
+
144
+ # Sort by steps
145
+ sorted_results = sorted(valid, key=lambda x: x['steps'])
146
+ correctness = [1 if r['correct'] else 0 for r in sorted_results]
147
+
148
+ y_bar = np.mean(correctness)
149
+
150
+ # Degenerate case: all same (0% or 100% accuracy)
151
+ if y_bar == 0.0 or y_bar == 1.0:
152
+ return {
153
+ 'kuiper_stat': float('nan'),
154
+ 'y_bar': float(y_bar),
155
+ 'max_positive': 0.0,
156
+ 'max_negative': 0.0,
157
+ 'n_samples': len(valid),
158
+ 'degenerate': True
159
+ }
160
+
161
+ # Cumulative difference: D_k = Σ(y_i - ȳ)
162
+ residuals = np.array(correctness) - y_bar
163
+ cumulative_diff = np.cumsum(residuals)
164
+
165
+ max_positive = float(np.max(cumulative_diff))
166
+ max_negative = float(np.min(cumulative_diff))
167
+ kuiper_stat = max_positive - max_negative
168
+
169
+ return {
170
+ 'kuiper_stat': kuiper_stat,
171
+ 'y_bar': float(y_bar),
172
+ 'max_positive': max_positive,
173
+ 'max_negative': max_negative,
174
+ 'n_samples': len(valid),
175
+ 'degenerate': False
176
+ }
177
+
178
+
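+ # Worked example (already sorted by steps): correctness = [1, 1, 0, 0] gives
+ # y_bar = 0.5, residuals = [0.5, 0.5, -0.5, -0.5], cumulative sums
+ # [0.5, 1.0, 0.5, 0.0], and kuiper_stat = 1.0 - 0.0 = 1.0. Errors spread
+ # uniformly across effort levels keep the cumulative sum near zero and
+ # therefore yield a smaller statistic.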
179
+ def wasted_effort_ratio(results: List[Dict]) -> Dict[str, float]:
180
+ """
181
+ Compute Wasted Effort Ratio: μ_steps(Incorrect) / μ_steps(Correct).
182
+
183
+ - ρ > 1: Model grinds on unsolved problems (poor calibration)
184
+ - ρ ≈ 1: Model spends similar effort regardless of outcome
185
+ - ρ < 1: Model fails fast (good calibration)
186
+
187
+ Args:
188
+ results: List of dicts with 'steps' and 'correct'
189
+
190
+ Returns:
191
+ Dict with 'ratio', 'mean_steps_correct', 'mean_steps_incorrect', plus 'n_correct' and 'n_incorrect' counts
192
+ """
193
+ correct_steps = [r['steps'] for r in results if r.get('correct') and r.get('steps', 0) > 0]
194
+ incorrect_steps = [r['steps'] for r in results if not r.get('correct') and r.get('steps', 0) > 0]
195
+
196
+ mean_correct = float(np.mean(correct_steps)) if correct_steps else 0.0
197
+ mean_incorrect = float(np.mean(incorrect_steps)) if incorrect_steps else 0.0
198
+
199
+ ratio = mean_incorrect / mean_correct if mean_correct > 0 else float('inf')
200
+
201
+ return {
202
+ 'ratio': ratio,
203
+ 'mean_steps_correct': mean_correct,
204
+ 'mean_steps_incorrect': mean_incorrect,
205
+ 'n_correct': len(correct_steps),
206
+ 'n_incorrect': len(incorrect_steps)
207
+ }
208
+
209
+
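+ # Worked example: correct runs take [2, 4] steps (mean 3.0) and incorrect runs
+ # take [6, 6] steps (mean 6.0), so ratio = 2.0: the model spends twice the
+ # effort on questions it ultimately gets wrong (poor calibration).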
eval/requirements.txt ADDED
@@ -0,0 +1,5 @@
1
+ anls-star>=0.1.0
2
+ datasets>=2.14.0
3
+ numpy>=1.24.0
4
+
5
+
requirements.txt CHANGED
@@ -1,17 +1,10 @@
1
- APScheduler
2
- black
3
- datasets
4
- gradio
5
- gradio[oauth]
6
- gradio_client
7
- gradio_leaderboard==0.0.13
8
- huggingface-hub>=0.18.0
9
- matplotlib
10
- numpy<2.0
11
  pandas
12
  plotly
13
  python-dateutil
14
- sentencepiece
15
- tokenizers>=0.15.0
16
- tqdm
17
- transformers
 
1
+ streamlit>=1.37.0
2
  pandas
3
  plotly
4
+ huggingface-hub>=0.18.0
5
+ numpy<2.0
6
  python-dateutil
7
+ # Evaluation dependencies
8
+ anls-star>=0.1.0
9
+ datasets>=2.14.0
10
+