---
title: DISBench Leaderboard
emoji: πŸ†
colorFrom: yellow
colorTo: red
sdk: docker
app_port: 7860
pinned: true
---

# πŸ† DISBench Leaderboard

Welcome to the official leaderboard for **DISBench (DeepImageSearch Benchmark)**!

DISBench is a comprehensive benchmark for evaluating DeepImageSearch methods, i.e., context-aware image retrieval over personal photo collections. This leaderboard tracks and compares the performance of different approaches on standardized evaluation metrics.

## πŸ“Š Evaluation Metrics

### Core Metrics

- **Exact Match (EM)**: Percentage of queries where the predicted photo set exactly matches the ground truth
- **F1 Score**: Harmonic mean of precision and recall at the set level
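As a minimal sketch of how these set-level metrics can be computed (illustrative only, not the official evaluation code; function names are our own):

```python
def exact_match(pred: set, gold: set) -> float:
    """1.0 if the predicted photo set equals the ground truth exactly, else 0.0."""
    return float(pred == gold)

def set_f1(pred: set, gold: set) -> float:
    """Harmonic mean of set-level precision and recall over photo IDs."""
    if not pred or not gold:
        # Treat two empty sets as a perfect match; an empty prediction
        # against a non-empty gold set (or vice versa) scores 0.0.
        return float(pred == gold)
    tp = len(pred & gold)          # photos both predicted and in the gold set
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting one of two gold photos yields precision 1.0, recall 0.5, and F1 of 2/3, while EM is 0.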

### Query Types

1. **Intra-Event**: Search within a single event or time period (e.g., "sunset photos from our beach vacation")
2. **Inter-Event**: Search across multiple events or time periods (e.g., "all birthday party photos from last year")

### Tracks

- **Standard Track**: Uses predefined constraints and standard model configurations
- **Open Track**: Allows custom models, additional training data, and external resources

All metrics are reported as:
- Overall (all queries)
- Intra-event only
- Inter-event only

## πŸš€ How to Submit

### Step-by-Step Guide

1. **Prepare Your Results**
   - Run your method on the DISBench test set
   - Format predictions according to the submission schema (see below)

2. **Submit via Web Interface**
   - Navigate to the **Submit** tab on this Space
   - Upload your JSON file containing metadata and predictions
   - Click "Submit"

3. **Automated Processing**
   - The system validates your submission format
   - A Pull Request is automatically created
   - Maintainers review the submission

4. **Leaderboard Update**
   - Once approved and merged, the Space automatically rebuilds
   - Your results are evaluated against ground truth
   - The leaderboard updates with your scores

### Submission Format

```json
{
  "meta": {
    "method_name": "Your Method Name",
    "organization": "Your Organization",
    "track": "Standard",
    "agent_framework": "Your Agent Framework (if applicable)",
    "backbone_model": "Your Backbone Model",
    "retriever_model": "Your Retriever Model (if applicable)",
    "project_url": "https://github.com/your-repo"
  },
  "predictions": {
    "1": ["photo_id_1", "photo_id_2", "photo_id_3"],
    "2": ["photo_id_4"],
    "3": ["photo_id_5", "photo_id_6"],
    ...
  }
}
```

### Field Descriptions

**Meta Fields:**
- `method_name` (required): Name of your method/system
- `organization` (optional): Your institution or organization
- `track` (required): Either "Standard" or "Open"
- `agent_framework` (optional): Agent framework used (e.g., "ReAct", "AutoGPT")
- `backbone_model` (required): Core model used (e.g., "GPT-4", "Claude-3")
- `retriever_model` (optional): Retrieval model used (e.g., "CLIP-ViT-L/14", "BM25")
- `project_url` (optional): Link to your project/paper

**Predictions:**
- Keys are query IDs (as strings)
- Values are arrays of photo IDs (as strings)
- Photo IDs should match those in the ground truth dataset
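A quick structural check before uploading can catch most format errors early. This is a hypothetical validator sketched from the schema above, not the Space's actual validation code:

```python
import json

# Required meta fields per the schema above.
REQUIRED_META = {"method_name", "track", "backbone_model"}

def validate_submission(path: str) -> None:
    """Raise ValueError if the submission file violates the basic schema."""
    with open(path) as f:
        sub = json.load(f)
    missing = REQUIRED_META - set(sub.get("meta", {}))
    if missing:
        raise ValueError(f"missing required meta fields: {sorted(missing)}")
    if sub["meta"]["track"] not in ("Standard", "Open"):
        raise ValueError('track must be "Standard" or "Open"')
    for qid, photos in sub.get("predictions", {}).items():
        # JSON object keys are strings; values must be lists of photo-ID strings.
        if not isinstance(photos, list) or not all(isinstance(p, str) for p in photos):
            raise ValueError(f"query {qid!r}: value must be a list of photo-ID strings")
```

Run it on your JSON file before submitting; a clean pass means the shape matches the schema, though photo IDs are only checked against the ground truth server-side.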

## πŸ“‹ Leaderboard Rules

### Uniqueness & Deduplication

Each entry is uniquely identified by the combination of:
- Method name
- Agent framework
- Backbone model
- Retriever model
- Track

If you submit multiple times with the same configuration, only the **latest submission** will appear on the leaderboard.
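The deduplication rule above amounts to collapsing entries that share the same identity tuple, keeping the newest. A minimal sketch, assuming entries are plain dicts ordered oldest to newest (field names mirror the meta schema; this is not the leaderboard's actual code):

```python
# Fields that jointly identify a leaderboard entry.
KEY_FIELDS = ("method_name", "agent_framework", "backbone_model",
              "retriever_model", "track")

def dedup_latest(entries: list[dict]) -> list[dict]:
    """Keep only the most recent entry per configuration.

    Assumes `entries` is sorted oldest to newest, so later entries
    overwrite earlier ones with the same identity tuple.
    """
    latest: dict[tuple, dict] = {}
    for entry in entries:
        key = tuple(entry.get(field) for field in KEY_FIELDS)
        latest[key] = entry  # newer submission replaces the older one
    return list(latest.values())
```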

### Ranking

Entries are ranked by **Overall EM Score** in descending order. The leaderboard displays:
- Overall EM & F1
- Intra-event EM & F1
- Inter-event EM & F1

### Separate Tracks

Standard and Open track submissions are ranked separately to ensure fair comparison.

## πŸ“„ Citation

If you use DISBench in your research, please cite:

```bibtex
@misc{deng2026deepimagesearchbenchmarkingmultimodalagents,
  title={DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories}, 
  author={Chenlong Deng and Mengjie Deng and Junjie Wu and Dun Zeng and Teng Wang and Qingsong Xie and Jiadeng Huang and Shengjie Ma and Changwang Zhang and Zhaoxiang Wang and Jun Wang and Yutao Zhu and Zhicheng Dou},
  year={2026},
  eprint={2602.10809},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.10809}
}
```