<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DISBench Leaderboard</title>
<!-- Favicon: RUC Logo -->
<link rel="icon" type="image/png" href="{{ url_for('static', filename='ruc-logo.png') }}" />
<link rel="stylesheet" href="{{ url_for('static', filename='styles.css') }}" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css" />
</head>
<body>
<div class="page">
<!-- Paper Header -->
<header class="paper-header">
<!-- RUC Logo -->
<!-- <div class="institution-logo">
<img src="{{ url_for('static', filename='ruc-logo.png') }}" alt="Renmin University of China" class="ruc-logo" />
</div> -->
<h1 class="paper-title">DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories</h1>
<p class="paper-authors">
Chenlong Deng<sup>1</sup>, Mengjie Deng<sup>1</sup>, Junjie Wu<sup>2</sup>, Dun Zeng<sup>2</sup>, Teng Wang<sup>2</sup>, Qingsong Xie<sup>2</sup>, Jiadeng Huang<sup>2</sup>,
Shengjie Ma<sup>1</sup>, Changwang Zhang<sup>2</sup>, Zhaoxiang Wang<sup>2</sup>, Jun Wang<sup>2</sup>, Yutao Zhu<sup>1</sup>, Zhicheng Dou<sup>1</sup>
</p>
<p class="paper-affiliations">
<span class="affil"><sup>1</sup> Gaoling School of Artificial Intelligence, Renmin University of China</span>
<span class="affil-sep">&middot;</span>
<span class="affil"><sup>2</sup> OPPO Research Institute</span>
</p>
<div class="paper-links">
<a class="badge-link" href="https://arxiv.org/abs/2602.10809" target="_blank"><i class="fas fa-file-pdf"></i> Paper</a>
<a class="badge-link" href="https://github.com/RUC-NLPIR/DeepImageSearch" target="_blank"><i class="fab fa-github"></i> Code</a>
<a class="badge-link" href="https://huggingface.co/datasets/RUC-NLPIR/DISBench" target="_blank"><i class="fas fa-database"></i> Dataset</a>
</div>
</header>
<!-- Leaderboard Title -->
<section class="leaderboard-hero">
<div class="hero-divider"></div>
<h2 class="leaderboard-title"><i class="fas fa-ranking-star"></i> DISBench Leaderboard</h2>
<p class="leaderboard-subtitle">Track and compare multimodal agents on the DeepImageSearch task</p>
</section>
<main>
<!-- Info Cards -->
<section class="info-section">
<div class="info-grid">
<div class="info-card">
<h3><i class="fas fa-lightbulb"></i> What is DeepImageSearch and DISBench?</h3>
<p>
<strong>DeepImageSearch</strong> marks a paradigm shift in image retrieval, moving from independent image matching to <strong>corpus-level contextual reasoning over visual histories</strong>.
People capture thousands of photos over the years, forming rich episodic memories in which information is distributed across temporal sequences rather than confined to single snapshots.
Many real-world queries over such memories cannot be resolved by evaluating each image independently: the target images can only be identified by exploring and reasoning over the entire corpus.
This <strong>corpus-level contextual reasoning</strong> makes <strong>agentic capabilities essential rather than auxiliary</strong>.
</p>
<p>
<strong>DISBench</strong> is the <strong>first benchmark</strong> designed for this task.
Given a user's photo collection and a natural language query,
agents must autonomously plan search trajectories, discover latent cross-image associations,
and chain scattered visual evidence through multi-step exploration to return the exact set of qualifying images.
The benchmark covers two reasoning patterns:
<strong>Intra-Event</strong> queries that require locating a target event via contextual clues and then filtering within it,
and <strong>Inter-Event</strong> queries that demand scanning across multiple events to find recurring elements under temporal or spatial constraints.
</p>
</div>
<div class="info-card">
<h3><i class="fas fa-book-open"></i> How to Read the Leaderboard</h3>
<p>
<strong>Champion List</strong> shows top results per track. Use the sub-tabs to switch between:
</p>
<ul class="info-list">
<li><span class="track-tag Standard">Standard</span> Pre-processing is limited to encoding images into embeddings for building a retrieval index. No additional pre-computation (e.g., captioning, graph construction) is allowed. Tests agentic reasoning over raw visual data.</li>
<li><span class="track-tag Open">Open</span> Arbitrary pre-processing is permitted (captioning, knowledge graph construction, structured indexing, etc.). Tests system-level upper bounds with full engineering freedom.</li>
</ul>
<p>
<strong>Full Analysis</strong> lets you compare across tracks and filter by agent framework, backbone model, or retriever. Click any score column header to sort and highlight.
</p>
</div>
<div class="info-card">
<h3><i class="fas fa-ruler-combined"></i> Evaluation Metrics</h3>
<p>
All metrics are computed at the <strong>set level</strong>: models must predict the exact set of target images for each query.
</p>
<ul class="info-list">
<li><strong>EM (Exact Match)</strong>: the predicted set must be identical to the ground truth (no extra, no missing).</li>
<li><strong>F1 Score</strong>: harmonic mean of precision and recall over the predicted vs. ground-truth image sets.</li>
</ul>
<p>
Scores are reported across three dimensions:
<strong>Overall</strong> (all queries),
<strong>Intra-Event</strong> (locate a specific event, then filter targets within it), and
<strong>Inter-Event</strong> (scan across multiple events to find recurring elements under temporal/spatial constraints).
</p>
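<p>The two set-level metrics can be sketched in a few lines of Python (an illustrative sketch, not the official DISBench evaluation script; the photo IDs are made up):</p>

```python
def exact_match(pred: set, gold: set) -> int:
    """1 if the predicted image set is identical to the ground truth, else 0."""
    return int(pred == gold)

def set_f1(pred: set, gold: set) -> float:
    """Harmonic mean of precision and recall over predicted vs. gold image sets."""
    tp = len(pred & gold)  # true positives: images in both sets
    if tp == 0 or not pred or not gold:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# One extra predicted image breaks EM but only lowers F1:
pred = {"photo_1", "photo_2", "photo_9"}
gold = {"photo_1", "photo_2"}
print(exact_match(pred, gold))        # 0
print(round(set_f1(pred, gold), 2))   # 0.8
```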
</div>
<div class="info-card">
<h3><i class="fas fa-cloud-arrow-up"></i> How to Submit</h3>
<p>
Prepare a <code>.json</code> file with two fields: <code>meta</code> (your method info) and <code>predictions</code> (your model outputs).
Go to the <strong>Submit</strong> tab, upload the file, and the system will automatically create a
<strong>Pull Request</strong> on the Space repository for review.
</p>
<p>
After maintainers merge your PR, the evaluation script will compute scores and update the leaderboard.
</p>
<p>Required fields in <code>meta</code>:</p>
<ul class="info-list compact">
<li><code>method_name</code>: display name for your method</li>
<li><code>agent_framework</code>, <code>backbone_model</code>, <code>retriever_model</code></li>
<li><code>track</code>: <code>"Standard"</code> or <code>"Open"</code></li>
</ul>
<p>See the <strong>Submit</strong> tab for the full JSON template and format details.</p>
</div>
</div>
</section>
<!-- Main Navigation -->
<nav class="main-nav">
<button class="nav-btn active" onclick="switchMainTab('leaderboard')">
<i class="fas fa-trophy"></i> Champion List
</button>
<button class="nav-btn" onclick="switchMainTab('full-metrics')">
<i class="fas fa-chart-bar"></i> Full Analysis
</button>
<button class="nav-btn" onclick="switchMainTab('submit')">
<i class="fas fa-upload"></i> Submit
</button>
</nav>
<!-- Champion List View -->
<section id="view-leaderboard" class="view-section active">
<div class="sub-tabs-container">
<button class="sub-tab-btn active" data-track="Standard">Standard Track</button>
<button class="sub-tab-btn" data-track="Open">Open Track</button>
</div>
<div class="table-container">
<table id="champion-table">
<thead>
<tr>
<th class="rank-col">Rank</th>
<th class="method-col align-left">Method</th>
<th>Agent</th>
<th>Backbone</th>
<th>Retriever</th>
<th class="sortable active-sort" data-key="overall_em">Overall EM ↓</th>
<th class="sortable" data-key="overall_f1">Overall F1</th>
<th class="sortable" data-key="intra_em">Intra EM</th>
<th class="sortable" data-key="intra_f1">Intra F1</th>
<th class="sortable" data-key="inter_em">Inter EM</th>
<th class="sortable" data-key="inter_f1">Inter F1</th>
</tr>
</thead>
<tbody></tbody>
</table>
</div>
</section>
<!-- Full Analysis View -->
<section id="view-full-metrics" class="view-section" style="display: none;">
<div class="filters-toolbar">
<div class="filter-group">
<label>Track:</label>
<select id="filter-track"><option value="all">All Tracks</option><option value="Standard">Standard</option><option value="Open">Open</option></select>
</div>
<div class="filter-group">
<label>Agent:</label>
<input type="text" id="filter-agent" placeholder="e.g. ImageSeeker">
</div>
<div class="filter-group">
<label>Backbone:</label>
<input type="text" id="filter-backbone" placeholder="e.g. Gemini">
</div>
<div class="filter-group">
<label>Retriever:</label>
<input type="text" id="filter-retriever" placeholder="e.g. CLIP">
</div>
</div>
<div class="table-container">
<table id="full-table">
<thead>
<tr>
<th class="rank-col">Rank</th>
<th class="method-col align-left">Method</th>
<th class="track-col">Track</th>
<th>Agent</th>
<th>Backbone</th>
<th>Retriever</th>
<th class="sortable active-sort" data-key="overall_em">Overall EM ↓</th>
<th class="sortable" data-key="overall_f1">Overall F1</th>
<th class="sortable" data-key="intra_em">Intra EM</th>
<th class="sortable" data-key="intra_f1">Intra F1</th>
<th class="sortable" data-key="inter_em">Inter EM</th>
<th class="sortable" data-key="inter_f1">Inter F1</th>
</tr>
</thead>
<tbody></tbody>
</table>
</div>
</section>
<!-- Submit View -->
<section id="view-submit" class="view-section" style="display: none;">
<div class="submit-container">
<div class="submit-card">
<h2><i class="fas fa-code-pull-request"></i> Submit via Pull Request</h2>
<p>
Upload a <code>.json</code> file containing your method metadata and predictions.
The system will create a <strong>Pull Request</strong> on the
<a href="https://huggingface.co/spaces/" target="_blank">Space repository</a>.
Maintainers will review, run evaluation, and merge your results into the leaderboard.
</p>
<div class="submit-flow">
<div class="flow-step">
<div class="flow-icon"><i class="fas fa-upload"></i></div>
<div class="flow-text"><strong>1. Upload</strong><br>Submit your JSON file below</div>
</div>
<div class="flow-arrow"><i class="fas fa-arrow-right"></i></div>
<div class="flow-step">
<div class="flow-icon"><i class="fas fa-code-pull-request"></i></div>
<div class="flow-text"><strong>2. PR Created</strong><br>A Pull Request is opened automatically</div>
</div>
<div class="flow-arrow"><i class="fas fa-arrow-right"></i></div>
<div class="flow-step">
<div class="flow-icon"><i class="fas fa-check-circle"></i></div>
<div class="flow-text"><strong>3. Review &amp; Merge</strong><br>Maintainers evaluate and publish scores</div>
</div>
</div>
<h4>JSON Format</h4>
<pre class="code-block">{
  "meta": {
    "method_name": "My-Agent",
    "organization": "My-Org",
    "project_url": "https://github.com/...",
    "agent_framework": "ImageSeeker",
    "backbone_model": "Gemini-3-Pro",
    "retriever_model": "Qwen3-VL-Embedding-8B",
    "track": "Standard"
  },
  "predictions": {
    "query_001": ["photo_id_1", "photo_id_2"],
    "query_002": ["photo_id_5"],
    ...
  }
}</pre>
<h4>Field Descriptions</h4>
<div class="field-desc">
<p><code>meta.method_name</code>: Display name shown on the leaderboard (required)</p>
<p><code>meta.organization</code>: Your team or organization name (optional, not displayed)</p>
<p><code>meta.project_url</code>: Link to paper, code, or project page (optional)</p>
<p><code>meta.agent_framework</code>: Agent framework used (e.g. ReAct, ImageSeeker)</p>
<p><code>meta.backbone_model</code>: Main LLM/VLM backbone (e.g. GPT-4o, Gemini-3-Pro)</p>
<p><code>meta.retriever_model</code>: Retrieval model used (e.g. Qwen3-VL-Embedding-8B)</p>
<p><code>meta.track</code>: Must be <code>"Standard"</code> or <code>"Open"</code></p>
<p><code>predictions</code>: A dict mapping each query ID to a list of predicted photo IDs</p>
</div>
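<p>Before uploading, it can help to sanity-check the file against the field list above. The snippet below is a hypothetical pre-flight check, not part of the official submission tooling:</p>

```python
import json

# Required meta fields, per the field descriptions above.
REQUIRED_META = {"method_name", "agent_framework", "backbone_model",
                 "retriever_model", "track"}

def validate_submission(path: str) -> list[str]:
    """Return a list of problems found in the submission JSON (empty = looks OK)."""
    errors = []
    with open(path) as f:
        data = json.load(f)
    meta = data.get("meta", {})
    missing = REQUIRED_META - meta.keys()
    if missing:
        errors.append(f"meta is missing required fields: {sorted(missing)}")
    if meta.get("track") not in ("Standard", "Open"):
        errors.append('meta.track must be "Standard" or "Open"')
    preds = data.get("predictions")
    if not isinstance(preds, dict):
        errors.append("predictions must be a dict mapping query IDs to photo ID lists")
    else:
        for qid, photos in preds.items():
            if not isinstance(photos, list) or not all(isinstance(p, str) for p in photos):
                errors.append(f"predictions[{qid!r}] must be a list of photo ID strings")
    return errors
```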
<form id="submit-form" class="upload-form">
<input type="file" name="file" id="submit-file" accept=".json" required>
<button type="submit" class="btn primary" id="submit-btn">
<i class="fas fa-paper-plane"></i> Submit &amp; Create PR
</button>
</form>
<!-- Submission Result -->
<div id="submit-result" style="display: none;"></div>
</div>
</div>
</section>
</main>
<!-- Citation -->
<section class="citation-section">
<h3><i class="fas fa-quote-left"></i> Citation</h3>
<pre class="code-block citation-block">@misc{deng2026deepimagesearchbenchmarkingmultimodalagents,
  title={DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories},
  author={Chenlong Deng and Mengjie Deng and Junjie Wu and Dun Zeng and Teng Wang and Qingsong Xie and Jiadeng Huang and Shengjie Ma and Changwang Zhang and Zhaoxiang Wang and Jun Wang and Yutao Zhu and Zhicheng Dou},
  year={2026},
  eprint={2602.10809},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.10809}
}</pre>
</section>
<footer class="footer">
<p>DISBench Leaderboard &middot; Powered by Hugging Face Spaces</p>
</footer>
</div>
<script>
window.SERVER_DATA = {{ data | default([]) | tojson }};
</script>
<script src="{{ url_for('static', filename='script.js') }}"></script>
</body>
</html>