File size: 3,679 Bytes
1c6306b
 
 
6da8289
1c6306b
 
 
 
 
 
 
 
 
 
6da8289
86d62d0
 
 
 
1c6306b
 
 
6da8289
 
1c6306b
 
 
6da8289
 
 
 
 
 
 
 
 
1c6306b
 
 
6da8289
1c6306b
 
6da8289
 
 
 
 
1c6306b
6da8289
1c6306b
6da8289
 
 
 
 
1c6306b
6da8289
 
 
 
 
1c6306b
6da8289
1c6306b
6da8289
 
1c6306b
6da8289
 
 
 
1c6306b
6da8289
 
 
 
 
 
1c6306b
6da8289
 
1c6306b
6da8289
1c6306b
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("overall", "anls", "ANLS (Overall)")
    task1 = Task("single_evidence", "anls", "ANLS (Single Evidence)")
    task2 = Task("multi_evidence_same_doc", "anls", "ANLS (Multi-Evidence, Same Doc)")
    task3 = Task("multi_evidence_multi_doc", "anls", "ANLS (Multi-Evidence, Multi Doc)")


# Your leaderboard name
# Static files are served relative to the static path set in app.py
TITLE = ""  #  """<h1 align="center" id="space-title">Agentic Document AI Benchmark</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
Welcome to the **Agentic Document AI Leaderboard**! This benchmark evaluates the performance of AI agents on complex document understanding tasks that require multi-step reasoning and evidence gathering across documents.

The benchmark uses **ANLS (Average Normalized Levenshtein Similarity)** as the primary evaluation metric, measuring how well models can extract and synthesize information from documents. We evaluate performance on:
- **Overall accuracy** across the entire dataset
- **Single evidence** questions (information from one source)
- **Multi-evidence, same document** questions (combining information within a document)
- **Multi-evidence, multi-document** questions (synthesizing across multiple documents)

We also track **inference costs** in terms of agent steps and USD to help understand the efficiency of different approaches.
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
## How it works

The Agentic Document AI benchmark evaluates AI systems on their ability to:
1. Navigate and understand complex document structures
2. Extract relevant evidence to answer questions
3. Synthesize information from multiple sources
4. Perform multi-step reasoning with document context

## Metrics

### Performance Metrics (ANLS-based)
- **ANLS (Overall)**: Main score - Average Normalized Levenshtein Similarity across the entire dataset
- **ANLS (Single Evidence)**: Performance on questions requiring single evidence extraction
- **ANLS (Multi-Evidence, Same Doc)**: Performance when combining evidence within one document
- **ANLS (Multi-Evidence, Multi Doc)**: Performance when synthesizing across multiple documents

### Efficiency Metrics
- **Agent Steps**: Total number of reasoning/action steps taken by the agent
- **Cost (USD)**: Estimated inference cost in US dollars

## Reproducibility

To submit your results:

1. Run your model/agent on the benchmark dataset
2. Generate a JSONL file where each line contains one prediction:

```json
{"question": "What is Dr. McElhaney's position?", "answer": ["Senior Scientist"], "citations": [{"file": "1307326.pdf", "page": 1}], "iterations": 1, "id": "q_4"}
{"question": "Who is the CEO?", "answer": ["John Smith"], "citations": [{"file": "report.pdf", "page": 3}], "iterations": 2, "id": "q_5"}
```

**Required fields:**
- `question`: The question text (string)
- `answer`: List of answer strings
- `citations`: List of citation dicts with "file" and "page"
- `iterations`: Number of agent iterations/steps (integer)
- `id`: Unique question identifier (string)

3. Submit your JSONL file through the submission tab
4. The system will evaluate your predictions against the gold standard and compute ANLS scores

See `submission_template.jsonl` for a complete example.
"""