---
title: SDR-Arena
emoji: 🎯
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---
# DR-Bench: Deep Research Agent Leaderboard

An open benchmark and leaderboard for evaluating Deep Research agents on real-world business development research tasks.

**Live Leaderboard:** Hugging Face Spaces (coming soon)
## Overview
DR-Bench measures how well AI research agents can:
- Research companies using web search
- Synthesize information from multiple sources
- Generate targeted, fact-based sales pitch points
All agents are benchmarked under identical conditions:
- Same LLM (configurable, default: GPT-4o)
- Same search provider (Brightdata SERP API + Crawl4AI content fetching)
- Same prompt dataset (PHI Prompts - business development research tasks)
Only the agent's orchestration logic differs: how it decomposes topics, generates search queries, iterates on results, and synthesizes final outputs.
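The fixed-conditions contract can be sketched in a few lines: every agent receives the same client objects, and only the orchestration callable changes. (`StubLLM`, `StubWebSearch`, and `run_agent` below are illustrative stand-ins for this sketch, not the benchmark's actual API.)

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for the shared clients; the real benchmark
# injects its own LLM and WebSearch clients in the same role.
class StubLLM:
    def complete(self, prompt: str) -> str:
        return f"summary of: {prompt}"

class StubWebSearch:
    def search(self, query: str) -> list[str]:
        return [f"result for {query}"]

@dataclass
class RunLog:
    searches_made: list = field(default_factory=list)

def run_agent(orchestrate, topic: str) -> RunLog:
    """Every agent gets identical clients; only `orchestrate` differs."""
    llm, web = StubLLM(), StubWebSearch()
    log = RunLog()
    orchestrate(topic, llm, web, log)
    return log

def single_pass(topic, llm, web, log):
    # One query, one synthesis call.
    log.searches_made.append(topic)
    llm.complete(web.search(topic)[0])

def iterative(topic, llm, web, log):
    # Several refined queries before synthesis.
    for q in (topic, f"{topic} news", f"{topic} competitors"):
        log.searches_made.append(q)
        web.search(q)

print(len(run_agent(single_pass, "Acme Corp").searches_made))  # 1
print(len(run_agent(iterative, "Acme Corp").searches_made))    # 3
```

Because the clients are held constant, any difference in leaderboard results is attributable to the orchestration strategy alone.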
## Repository Structure
```
dr-bench/
├── app.py                  # HF Spaces entry point (Gradio)
├── requirements.txt        # Python dependencies
│
├── leaderboard/            # Phase 1: Interactive leaderboard UI
│   ├── app.py              # Main Gradio application
│   ├── data_loader.py      # Loads leaderboard data
│   └── tabs/               # UI tabs (leaderboard, comparison, explorer, etc.)
│
├── benchmark/              # Benchmark framework
│   ├── interface.py        # BaseResearchAgent - the standard agent interface
│   ├── runner.py           # Benchmark orchestrator
│   ├── schemas.py          # Data models
│   ├── websearch.py        # WebSearch client (Brightdata)
│   ├── llm.py              # Standardised LLM client wrapper
│   └── prompts.py          # PHI prompts loader
│
├── agents/                 # Reference agent implementations
│   ├── baseline_agent.py   # Simple single-pass agent
│   └── iterative_agent.py  # Multi-turn research agent
│
├── services/               # Supporting microservices
│   └── websearch/          # Brightdata-based WebSearch service
│
├── evaluation/             # Evaluation framework (pluggable)
│   └── evaluator.py        # Placeholder for scoring pipeline
│
├── data/                   # Benchmark data
│   ├── leaderboard.json    # Current leaderboard state
│   ├── phi_prompts/        # Benchmark prompt files
│   └── results/            # Per-agent result files
│
└── scripts/                # CLI utilities
    ├── run_benchmark.py    # Run benchmark for an agent
    ├── convert_results.py  # Convert old format results
    └── start_services.sh   # Start WebSearch service
```
## Quick Start

### View the Leaderboard (Phase 1)

```bash
pip install -r requirements.txt
python app.py
# Open http://localhost:7860
```
### Run a Benchmark (Phase 3)

- Start the WebSearch service:

```bash
# Set up environment
cp .env.example .env
# Edit .env with your Brightdata API credentials

# Install websearch dependencies
pip install -r services/websearch/requirements.txt

# Start the service
chmod +x scripts/start_services.sh
./scripts/start_services.sh
```
- Run an agent:

```bash
# Run the baseline agent on the first 5 prompts
python scripts/run_benchmark.py \
    --agent agents/baseline_agent.py \
    --prompts data/phi_prompts/ \
    --limit 5 \
    --update-leaderboard

# Run with a different model
python scripts/run_benchmark.py \
    --agent agents/iterative_agent.py \
    --model gpt-4o-mini \
    --prompts data/phi_prompts/
```
### Submit Your Own Agent (Phase 2)

- Create a Python file implementing `BaseResearchAgent`:

```python
from benchmark.interface import BaseResearchAgent, ResearchOutput


class MyAgent(BaseResearchAgent):
    @property
    def name(self) -> str:
        return "my-research-agent"

    @property
    def description(self) -> str:
        return "My custom research methodology"

    @property
    def author(self) -> str:
        return "Your Name"

    async def research(self, topic, llm, websearch, **kwargs):
        # Your orchestration logic here
        # Use llm for LLM calls, websearch for web searches
        return ResearchOutput(report="...", searches_made=[...])
```
- Upload via the **Submit Agent** tab on the leaderboard, or run locally:

```bash
python scripts/run_benchmark.py --agent path/to/my_agent.py --update-leaderboard
```
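Before running the full benchmark, it can help to smoke-test your orchestration logic in isolation. A minimal sketch: `FakeLLM` and `FakeWebSearch` are hypothetical stand-ins, and the free-standing `research` coroutine only mirrors the shape of `BaseResearchAgent.research`; neither is part of the benchmark package.

```python
import asyncio

# Hypothetical stand-ins for the clients the runner normally injects.
class FakeLLM:
    async def complete(self, prompt: str) -> str:
        return "three pitch points about " + prompt

class FakeWebSearch:
    async def search(self, query: str) -> list[dict]:
        return [{"title": query, "url": "https://example.com", "snippet": ""}]

async def research(topic, llm, websearch):
    # Mirrors the shape of BaseResearchAgent.research without importing it:
    # search, synthesize, return a report plus the queries made.
    results = await websearch.search(topic)
    report = await llm.complete(results[0]["title"])
    return {"report": report, "searches_made": [topic]}

out = asyncio.run(research("Acme Corp", FakeLLM(), FakeWebSearch()))
print(out["searches_made"])  # ['Acme Corp']
```

Swapping the fakes for the real clients is the runner's job; your agent code itself should never construct them.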
## Metrics
| Metric | Description |
|---|---|
| Quality Score | LLM-as-judge evaluation (when available) |
| Success Rate | % of prompts completed without errors |
| Avg Duration | Average time per prompt (seconds) |
| Avg Tokens/Prompt | Average LLM tokens consumed per prompt |
| Avg Searches/Prompt | Average web search queries per prompt |
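As a sketch of how the aggregate metrics could fall out of per-prompt results: the field names below (`ok`, `duration_s`, `tokens`, `searches`) are illustrative only, not the actual schema in `benchmark/schemas.py`.

```python
# Hypothetical per-prompt results from one agent's benchmark run.
results = [
    {"ok": True,  "duration_s": 42.0, "tokens": 8100, "searches": 6},
    {"ok": True,  "duration_s": 35.5, "tokens": 7400, "searches": 4},
    {"ok": False, "duration_s": 12.0, "tokens": 1200, "searches": 1},
]

n = len(results)
success_rate = 100.0 * sum(r["ok"] for r in results) / n   # % without errors
avg_duration = sum(r["duration_s"] for r in results) / n   # seconds
avg_tokens   = sum(r["tokens"] for r in results) / n       # LLM tokens
avg_searches = sum(r["searches"] for r in results) / n     # search queries

print(f"Success Rate: {success_rate:.1f}%")  # 66.7%
print(f"Avg Duration: {avg_duration:.1f}s")  # 29.8s
```

Note that the averages include failed prompts; the Quality Score, by contrast, comes from a separate LLM-as-judge pass and is not derivable from these counters.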
## Converting Existing Results

If you have results from `run_phi_benchmark_v2.py`:

```bash
python scripts/convert_results.py path/to/old_results.json -o data/leaderboard.json
```
## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `LLM_API_KEY` | API key for the LLM | (required) |
| `LLM_MODEL` | Model name | `gpt-4o` |
| `LLM_PROVIDER` | Provider (`openai`/`azure`/`custom`) | `openai` |
| `LLM_BASE_URL` | API base URL | `https://api.openai.com/v1` |
| `BRIGHTDATA_API_TOKEN` | Brightdata SERP API token | (required for benchmarks) |
| `BRIGHTDATA_SERP_ZONE` | Brightdata zone name | `serp_api_bdr` |
| `WEBSEARCH_PORT` | WebSearch service port | `8002` |
## License

(To be determined)