<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="DDR-Bench: A Deep Data Research Agent Benchmark for LLMs">
<title>DDR-Bench | Deep Data Research Benchmark</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
<script src="https://cdn.plot.ly/plotly-2.27.0.min.js" defer></script>
<script src="data.js" defer></script>
<script src="entropy_data.js" defer></script>
<script src="charts.js" defer></script>
<link rel="stylesheet" href="styles.css">
<style>
/* Inline critical CSS for chart loading states */
.chart-loading {
display: flex;
align-items: center;
justify-content: center;
min-height: 300px;
color: #86868b;
font-size: 14px;
}
.chart-loading::after {
content: 'Loading chart...';
animation: pulse 1.5s ease-in-out infinite;
}
@keyframes pulse {
0%,
100% {
opacity: 0.4;
}
50% {
opacity: 1;
}
}
</style>
</head>
<body>
<!-- Hero Section -->
<header class="hero">
<div class="hero-content">
<div class="badge">🔬 Research Benchmark</div>
<h1>DDR-Bench</h1>
<p class="subtitle">Deep Data Research Agent Benchmark for Large Language Models</p>
<p class="description">
A comprehensive evaluation framework measuring AI agents' ability to conduct deep, iterative data
exploration across medical records (MIMIC), financial filings (10-K), and behavioral data (GLOBEM).
</p>
<div class="stats-row">
<div class="stat-item">
<span class="stat-value">22+</span>
<span class="stat-label">Models Evaluated</span>
</div>
<div class="stat-item">
<span class="stat-value">3</span>
<span class="stat-label">Diverse Datasets</span>
</div>
<div class="stat-item">
<span class="stat-value">5</span>
<span class="stat-label">Analysis Dimensions</span>
</div>
</div>
</div>
</header>
<!-- Main Content - All sections visible -->
<main class="content">
<!-- 1. Scaling Analysis Section -->
<section id="scaling" class="section visible">
<div class="section-header">
<h2>📈 Scaling Analysis</h2>
<p>Explore how model performance scales with interaction turns, token usage, and inference cost.</p>
</div>
<div class="dimension-toggle">
<button class="dim-btn active" data-dim="turn">🔄 Turns</button>
<button class="dim-btn" data-dim="token">📊 Tokens</button>
<button class="dim-btn" data-dim="cost">💰 Cost</button>
</div>
<div id="scaling-legend" class="shared-legend"></div>
<div class="charts-grid three-col">
<div class="chart-card">
<h3>MIMIC</h3>
<div id="scaling-mimic" class="chart-container"></div>
</div>
<div class="chart-card">
<h3>10-K</h3>
<div id="scaling-10k" class="chart-container"></div>
</div>
<div class="chart-card">
<h3>GLOBEM</h3>
<div id="scaling-globem" class="chart-container"></div>
</div>
</div>
</section>
<!-- 2. Ranking Comparison Section -->
<section id="ranking" class="section visible">
<div class="section-header">
<h2>🏆 Ranking Comparison</h2>
<p>Novelty (Bradley-Terry) vs Accuracy ranking. ● = Novelty, ◇ = Accuracy. Purple = Proprietary, Green =
Open-source.</p>
</div>
<div class="dimension-toggle">
<button class="dim-btn ranking-dim active" data-mode="novelty">🎯 Sort by Novelty</button>
<button class="dim-btn ranking-dim" data-mode="accuracy">📊 Sort by Accuracy</button>
</div>
<div class="charts-grid three-col">
<div class="chart-card">
<h3>MIMIC</h3>
<div id="ranking-mimic" class="chart-container-tall"></div>
</div>
<div class="chart-card">
<h3>10-K</h3>
<div id="ranking-10k" class="chart-container-tall"></div>
</div>
<div class="chart-card">
<h3>GLOBEM</h3>
<div id="ranking-globem" class="chart-container-tall"></div>
</div>
</div>
</section>
<!-- 3. Turn Distribution Section -->
<section id="turn" class="section visible">
<div class="section-header">
<h2>🔄 Turn Distribution</h2>
<p>Analyze the distribution of interaction turns across different models and datasets.</p>
</div>
<div class="charts-grid three-col">
<div class="chart-card">
<h3>MIMIC</h3>
<div id="turn-mimic" class="chart-container-tall"></div>
</div>
<div class="chart-card">
<h3>10-K</h3>
<div id="turn-10k" class="chart-container-tall"></div>
</div>
<div class="chart-card">
<h3>GLOBEM</h3>
<div id="turn-globem" class="chart-container-tall"></div>
</div>
</div>
</section>
<!-- 4. Entropy Analysis Section -->
<section id="entropy" class="section visible">
<div class="section-header">
<h2>🔬 Entropy Analysis</h2>
<p>Scatter plots of Access Entropy vs Coverage, one panel per model. Marker opacity represents accuracy.
Higher entropy = more uniform access; higher coverage = more fields explored.</p>
</div>
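<!--
Maintainer note: a minimal sketch of how the two axes described above could be
computed, assuming Access Entropy is the Shannon entropy (in bits) of per-field
access counts and Coverage is the fraction of available fields accessed at least
once. Both definitions, the function names, and the input shapes are assumptions
made for illustration; they are not taken from charts.js or entropy_data.js.

```javascript
// Hypothetical helper: Shannon entropy (bits) of how often each field was accessed.
// `counts` maps a field name to its access count (assumed input shape).
function accessEntropy(counts) {
  const values = Object.values(counts).filter((c) => c > 0);
  const total = values.reduce((sum, c) => sum + c, 0);
  if (total === 0) return 0; // no accesses at all: define entropy as 0
  let h = 0;
  for (const c of values) {
    const p = c / total;
    h -= p * Math.log2(p);
  }
  return h;
}

// Hypothetical helper: fraction of available fields accessed at least once.
// `allFields` is the list of field names that could have been accessed.
function coverage(counts, allFields) {
  if (allFields.length === 0) return 0;
  const touched = allFields.filter((f) => (counts[f] || 0) > 0).length;
  return touched / allFields.length;
}
```

Under these assumptions, an agent that spreads its accesses evenly over many
fields scores high on both axes, while one that repeatedly queries a single
field has zero entropy and low coverage.
-->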
<div class="dimension-toggle">
<button class="toggle-btn active" data-entropy-scenario="10k">10-K</button>
<button class="toggle-btn" data-entropy-scenario="mimic">MIMIC</button>
</div>
<div class="charts-grid three-col">
<div class="chart-card">
<h3 id="entropy-model-0-title">GPT-5.2</h3>
<div id="entropy-model-0" class="chart-container-tall"></div>
</div>
<div class="chart-card">
<h3 id="entropy-model-1-title">Claude-4.5-Sonnet</h3>
<div id="entropy-model-1" class="chart-container-tall"></div>
</div>
<div class="chart-card">
<h3 id="entropy-model-2-title">Gemini-3-Flash</h3>
<div id="entropy-model-2" class="chart-container-tall"></div>
</div>
<div class="chart-card">
<h3 id="entropy-model-3-title">GLM-4.6</h3>
<div id="entropy-model-3" class="chart-container-tall"></div>
</div>
<div class="chart-card">
<h3 id="entropy-model-4-title">Qwen3-Next-80B-A3B</h3>
<div id="entropy-model-4" class="chart-container-tall"></div>
</div>
<div class="chart-card">
<h3 id="entropy-model-5-title">DeepSeek-V3.2</h3>
<div id="entropy-model-5" class="chart-container-tall"></div>
</div>
</div>
</section>
<!-- 5. Error Analysis Section -->
<section id="error" class="section visible">
<div class="section-header">
<h2>⚠️ Error Analysis</h2>
<p>Breakdown of error types encountered during agent interactions, grouped by main categories.</p>
</div>
<div class="charts-grid single">
<div class="chart-card wide">
<div id="error-chart" class="chart-container-double"></div>
</div>
</div>
</section>
<!-- 6. Probing Results Section -->
<section id="probing" class="section visible">
<div class="section-header">
<h2>🔍 Probing Results</h2>
<p>Analyze the average log probability of FINISH messages across conversation turns and progress.</p>
</div>
<div id="probing-legend" class="shared-legend"></div>
<div class="charts-grid three-col">
<div class="chart-card">
<h3>MIMIC</h3>
<div id="probing-mimic" class="chart-container-tall"></div>
</div>
<div class="chart-card">
<h3>GLOBEM</h3>
<div id="probing-globem" class="chart-container-tall"></div>
</div>
<div class="chart-card">
<h3>10-K</h3>
<div id="probing-10k" class="chart-container-tall"></div>
</div>
</div>
</section>
</main>
<!-- Footer -->
<footer class="footer">
<p>DDR-Bench © 2026 | Deep Data Research Agent Benchmark</p>
</footer>
<!-- Scripts loaded via defer in head for better parallelization -->
</body>
</html>