| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | |
| <meta name="description" content="DDR-Bench: A Deep Data Research Agent Benchmark for LLMs"> | |
| <title>DDR-Bench | Deep Data Research Benchmark</title> | |
| <link rel="preconnect" href="https://fonts.googleapis.com"> | |
| <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin> | |
| <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet"> | |
| <script src="https://cdn.plot.ly/plotly-2.27.0.min.js"></script> | |
| <script src="data.js" defer></script> | |
| <script src="entropy_data.js" defer></script> | |
| <script src="charts.js" defer></script> | |
| <link rel="stylesheet" href="styles.css"> | |
| <style> | |
| /* Inline critical CSS for chart loading states */ | |
| .chart-loading { | |
| display: flex; | |
| align-items: center; | |
| justify-content: center; | |
| min-height: 300px; | |
| color: #86868b; | |
| font-size: 14px; | |
| } | |
| .chart-loading::after { | |
| content: 'Loading chart...'; | |
| animation: pulse 1.5s ease-in-out infinite; | |
| } | |
| @keyframes pulse { | |
| 0%, | |
| 100% { | |
| opacity: 0.4; | |
| } | |
| 50% { | |
| opacity: 1; | |
| } | |
| } | |
| </style> | |
| </head> | |
| <body> | |
| <!-- Hero Section --> | |
| <header class="hero"> | |
| <div class="hero-content"> | |
| <div class="badge">Research Benchmark</div> | |
| <h1>DDR-Bench</h1> | |
| <p class="subtitle">Deep Data Research Agent Benchmark for Large Language Models</p> | |
| <p class="description"> | |
| A comprehensive evaluation framework measuring AI agents' ability to conduct deep, iterative data | |
| exploration across medical records (MIMIC), financial filings (10-K), and behavioral data (GLOBEM). | |
| </p> | |
| <div class="stats-row"> | |
| <div class="stat-item"> | |
| <span class="stat-value">22+</span> | |
| <span class="stat-label">Models Evaluated</span> | |
| </div> | |
| <div class="stat-item"> | |
| <span class="stat-value">3</span> | |
| <span class="stat-label">Diverse Datasets</span> | |
| </div> | |
| <div class="stat-item"> | |
| <span class="stat-value">5</span> | |
| <span class="stat-label">Analysis Dimensions</span> | |
| </div> | |
| </div> | |
| </div> | |
| </header> | |
| <!-- Main Content - All sections visible --> | |
| <main class="content"> | |
| <!-- 1. Scaling Analysis Section --> | |
| <section id="scaling" class="section visible"> | |
| <div class="section-header"> | |
| <h2>Scaling Analysis</h2> | |
| <p>Explore how model performance scales with interaction turns, token usage, and inference cost.</p> | |
| </div> | |
| <div class="dimension-toggle"> | |
| <button class="dim-btn active" data-dim="turn">Turns</button> | |
| <button class="dim-btn" data-dim="token">Tokens</button> | |
| <button class="dim-btn" data-dim="cost">Cost</button> | |
| </div> | |
| <div id="scaling-legend" class="shared-legend"></div> | |
| <div class="charts-grid three-col"> | |
| <div class="chart-card"> | |
| <h3>MIMIC</h3> | |
| <div id="scaling-mimic" class="chart-container"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3>10-K</h3> | |
| <div id="scaling-10k" class="chart-container"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3>GLOBEM</h3> | |
| <div id="scaling-globem" class="chart-container"></div> | |
| </div> | |
| </div> | |
| </section> | |
| <!-- 2. Ranking Comparison Section --> | |
| <section id="ranking" class="section visible"> | |
| <div class="section-header"> | |
| <h2>Ranking Comparison</h2> | |
| <p>Novelty (Bradley-Terry) ranking vs. Accuracy ranking, drawn with distinct markers. Purple = proprietary, green = | |
| open-source.</p> | |
| </div> | |
| <div class="dimension-toggle"> | |
| <button class="dim-btn ranking-dim active" data-mode="novelty">Sort by Novelty</button> | |
| <button class="dim-btn ranking-dim" data-mode="accuracy">Sort by Accuracy</button> | |
| </div> | |
| <div class="charts-grid three-col"> | |
| <div class="chart-card"> | |
| <h3>MIMIC</h3> | |
| <div id="ranking-mimic" class="chart-container-tall"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3>10-K</h3> | |
| <div id="ranking-10k" class="chart-container-tall"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3>GLOBEM</h3> | |
| <div id="ranking-globem" class="chart-container-tall"></div> | |
| </div> | |
| </div> | |
| </section> | |
| <!-- 3. Turn Distribution Section --> | |
| <section id="turn" class="section visible"> | |
| <div class="section-header"> | |
| <h2>Turn Distribution</h2> | |
| <p>Analyze the distribution of interaction turns across different models and datasets.</p> | |
| </div> | |
| <div class="charts-grid three-col"> | |
| <div class="chart-card"> | |
| <h3>MIMIC</h3> | |
| <div id="turn-mimic" class="chart-container-tall"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3>10-K</h3> | |
| <div id="turn-10k" class="chart-container-tall"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3>GLOBEM</h3> | |
| <div id="turn-globem" class="chart-container-tall"></div> | |
| </div> | |
| </div> | |
| </section> | |
| <!-- 4. Entropy Analysis Section --> | |
| <section id="entropy" class="section visible"> | |
| <div class="section-header"> | |
| <h2>Entropy Analysis</h2> | |
| <p>Scatter plots of access entropy vs. coverage, one panel per model. Marker opacity encodes accuracy. Higher entropy | |
| means more uniform access; higher coverage means more fields explored.</p> | |
| </div> | |
| <div class="dimension-toggle"> | |
| <button class="toggle-btn active" data-entropy-scenario="10k">10-K</button> | |
| <button class="toggle-btn" data-entropy-scenario="mimic">MIMIC</button> | |
| </div> | |
| <div class="charts-grid three-col"> | |
| <div class="chart-card"> | |
| <h3 id="entropy-model-0-title">GPT-5.2</h3> | |
| <div id="entropy-model-0" class="chart-container-tall"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3 id="entropy-model-1-title">Claude-4.5-Sonnet</h3> | |
| <div id="entropy-model-1" class="chart-container-tall"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3 id="entropy-model-2-title">Gemini-3-Flash</h3> | |
| <div id="entropy-model-2" class="chart-container-tall"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3 id="entropy-model-3-title">GLM-4.6</h3> | |
| <div id="entropy-model-3" class="chart-container-tall"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3 id="entropy-model-4-title">Qwen3-Next-80B-A3B</h3> | |
| <div id="entropy-model-4" class="chart-container-tall"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3 id="entropy-model-5-title">DeepSeek-V3.2</h3> | |
| <div id="entropy-model-5" class="chart-container-tall"></div> | |
| </div> | |
| </div> | |
| </section> | |
| <!-- 5. Error Analysis Section --> | |
| <section id="error" class="section visible"> | |
| <div class="section-header"> | |
| <h2>Error Analysis</h2> | |
| <p>Breakdown of error types encountered during agent interactions, grouped by main categories.</p> | |
| </div> | |
| <div class="charts-grid single"> | |
| <div class="chart-card wide"> | |
| <div id="error-chart" class="chart-container-double"></div> | |
| </div> | |
| </div> | |
| </section> | |
| <!-- 6. Probing Results Section --> | |
| <section id="probing" class="section visible"> | |
| <div class="section-header"> | |
| <h2>Probing Results</h2> | |
| <p>Analyze the average log probability of FINISH messages across conversation turns and progress.</p> | |
| </div> | |
| <div id="probing-legend" class="shared-legend"></div> | |
| <div class="charts-grid three-col"> | |
| <div class="chart-card"> | |
| <h3>MIMIC</h3> | |
| <div id="probing-mimic" class="chart-container-tall"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3>GLOBEM</h3> | |
| <div id="probing-globem" class="chart-container-tall"></div> | |
| </div> | |
| <div class="chart-card"> | |
| <h3>10-K</h3> | |
| <div id="probing-10k" class="chart-container-tall"></div> | |
| </div> | |
| </div> | |
| </section> | |
| </main> | |
| <!-- Footer --> | |
| <footer class="footer"> | |
| <p>DDR-Bench © 2026 | Deep Data Research Agent Benchmark</p> | |
| </footer> | |
| <!-- Scripts loaded via defer in head for better parallelization --> | |
| </body> | |
| </html> |