---
title: SDR-Arena
emoji: 🎯
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---

# DR-Bench: Deep Research Agent Leaderboard

An open benchmark and leaderboard for evaluating Deep Research agents on real-world business development research tasks.

**Live Leaderboard:** Hugging Face Spaces (coming soon)

## Overview

DR-Bench measures how well AI research agents can:

- Research companies using web search
- Synthesize information from multiple sources
- Generate targeted, fact-based sales pitch points

All agents are benchmarked under identical conditions:

- Same LLM (configurable, default: GPT-4o)
- Same search provider (Brightdata SERP API + Crawl4AI content fetching)
- Same prompt dataset (PHI Prompts - business development research tasks)

Only the agent's orchestration logic differs: how it decomposes topics, generates search queries, iterates on results, and synthesizes the final output.
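That fixed-conditions contract can be sketched as follows. This is an illustrative sketch only: `run_agent` and the simplified `ResearchOutput` below are assumptions for this example, not the actual API of `benchmark/runner.py` or `benchmark/interface.py`.

```python
# Illustrative sketch: the runner builds ONE shared llm client and ONE shared
# websearch client, then hands both to every agent. Only agent.research()
# (the orchestration logic) differs between submissions.
import asyncio
from dataclasses import dataclass, field


@dataclass
class ResearchOutput:
    """Simplified stand-in for the benchmark's output schema (assumed)."""
    report: str
    searches_made: list = field(default_factory=list)


async def run_agent(agent, topics, llm, websearch):
    """Run one agent over all prompts under identical conditions."""
    results = []
    for topic in topics:
        # llm and websearch are the same fixed clients for every agent;
        # how the agent uses them is the only variable being benchmarked.
        results.append(await agent.research(topic, llm, websearch))
    return results
```

In this framing, swapping one agent for another changes nothing but the body of `research()`, so leaderboard differences reflect orchestration strategy rather than model or search-provider choice.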

## Repository Structure

```
dr-bench/
├── app.py                      # HF Spaces entry point (Gradio)
├── requirements.txt            # Python dependencies
│
├── leaderboard/                # Phase 1: Interactive leaderboard UI
│   ├── app.py                  # Main Gradio application
│   ├── data_loader.py          # Loads leaderboard data
│   └── tabs/                   # UI tabs (leaderboard, comparison, explorer, etc.)
│
├── benchmark/                  # Benchmark framework
│   ├── interface.py            # BaseResearchAgent - the standard agent interface
│   ├── runner.py               # Benchmark orchestrator
│   ├── schemas.py              # Data models
│   ├── websearch.py            # WebSearch client (Brightdata)
│   ├── llm.py                  # Standardised LLM client wrapper
│   └── prompts.py              # PHI prompts loader
│
├── agents/                     # Reference agent implementations
│   ├── baseline_agent.py       # Simple single-pass agent
│   └── iterative_agent.py      # Multi-turn research agent
│
├── services/                   # Supporting microservices
│   └── websearch/              # Brightdata-based WebSearch service
│
├── evaluation/                 # Evaluation framework (pluggable)
│   └── evaluator.py            # Placeholder for scoring pipeline
│
├── data/                       # Benchmark data
│   ├── leaderboard.json        # Current leaderboard state
│   ├── phi_prompts/            # Benchmark prompt files
│   └── results/                # Per-agent result files
│
└── scripts/                    # CLI utilities
    ├── run_benchmark.py        # Run benchmark for an agent
    ├── convert_results.py      # Convert old format results
    └── start_services.sh       # Start WebSearch service
```

## Quick Start

### View the Leaderboard (Phase 1)

```bash
pip install -r requirements.txt
python app.py
# Open http://localhost:7860
```

### Run a Benchmark (Phase 3)

1. Start the WebSearch service:

   ```bash
   # Set up environment
   cp .env.example .env
   # Edit .env with your Brightdata API credentials

   # Install websearch dependencies
   pip install -r services/websearch/requirements.txt

   # Start the service
   chmod +x scripts/start_services.sh
   ./scripts/start_services.sh
   ```

2. Run an agent:

   ```bash
   # Run the baseline agent on the first 5 prompts
   python scripts/run_benchmark.py \
       --agent agents/baseline_agent.py \
       --prompts data/phi_prompts/ \
       --limit 5 \
       --update-leaderboard

   # Run with a different model
   python scripts/run_benchmark.py \
       --agent agents/iterative_agent.py \
       --model gpt-4o-mini \
       --prompts data/phi_prompts/
   ```

### Submit Your Own Agent (Phase 2)

1. Create a Python file implementing `BaseResearchAgent`:

   ```python
   from benchmark.interface import BaseResearchAgent, ResearchOutput

   class MyAgent(BaseResearchAgent):
       @property
       def name(self) -> str:
           return "my-research-agent"

       @property
       def description(self) -> str:
           return "My custom research methodology"

       @property
       def author(self) -> str:
           return "Your Name"

       async def research(self, topic, llm, websearch, **kwargs):
           # Your orchestration logic here
           # Use llm for LLM calls, websearch for web searches
           return ResearchOutput(report="...", searches_made=[...])
   ```

2. Upload via the Submit Agent tab on the leaderboard, or run locally:

   ```bash
   python scripts/run_benchmark.py --agent path/to/my_agent.py --update-leaderboard
   ```

## Metrics

| Metric | Description |
|--------|-------------|
| Quality Score | LLM-as-judge evaluation (when available) |
| Success Rate | % of prompts completed without errors |
| Avg Duration | Average time per prompt (seconds) |
| Avg Tokens/Prompt | Average LLM tokens consumed per prompt |
| Avg Searches/Prompt | Average web search queries per prompt |
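Assuming per-prompt result records with fields like the ones below (the real schema lives in `benchmark/schemas.py`; the field names `error`, `duration_s`, `tokens`, and `searches` are hypothetical), the aggregate metrics could be computed along these lines:

```python
# Sketch of metric aggregation over per-prompt result records.
# Field names are assumptions, not the repository's actual schema.
def aggregate_metrics(results: list[dict]) -> dict:
    # A prompt counts as successful if it produced no error.
    ok = [r for r in results if not r.get("error")]
    n = len(results)

    def avg(key: str) -> float:
        return sum(r[key] for r in ok) / len(ok) if ok else 0.0

    return {
        "success_rate": 100.0 * len(ok) / n if n else 0.0,  # % without errors
        "avg_duration": avg("duration_s"),                  # seconds per prompt
        "avg_tokens": avg("tokens"),                        # LLM tokens per prompt
        "avg_searches": avg("searches"),                    # web queries per prompt
    }
```

Note that the per-prompt averages here are taken over successful runs only; whether failed runs should count toward duration and token averages is a design choice for the actual scoring pipeline.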

## Converting Existing Results

If you have results from `run_phi_benchmark_v2.py`:

```bash
python scripts/convert_results.py path/to/old_results.json -o data/leaderboard.json
```

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `LLM_API_KEY` | API key for the LLM | (required) |
| `LLM_MODEL` | Model name | `gpt-4o` |
| `LLM_PROVIDER` | Provider (`openai`/`azure`/`custom`) | `openai` |
| `LLM_BASE_URL` | API base URL | `https://api.openai.com/v1` |
| `BRIGHTDATA_API_TOKEN` | Brightdata SERP API token | (required for benchmarks) |
| `BRIGHTDATA_SERP_ZONE` | Brightdata zone name | `serp_api_bdr` |
| `WEBSEARCH_PORT` | WebSearch service port | `8002` |
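A minimal sketch of applying these defaults in Python; the `load_config` helper is hypothetical and the repository's actual configuration loading may differ:

```python
import os

# Hypothetical helper showing how the table's defaults might be applied.
# Only LLM_API_KEY has no default and must be set.
def load_config(env=os.environ) -> dict:
    api_key = env.get("LLM_API_KEY")
    if not api_key:
        raise RuntimeError("LLM_API_KEY is required")
    return {
        "api_key": api_key,
        "model": env.get("LLM_MODEL", "gpt-4o"),
        "provider": env.get("LLM_PROVIDER", "openai"),
        "base_url": env.get("LLM_BASE_URL", "https://api.openai.com/v1"),
        "serp_zone": env.get("BRIGHTDATA_SERP_ZONE", "serp_api_bdr"),
        "websearch_port": int(env.get("WEBSEARCH_PORT", "8002")),
    }
```

`BRIGHTDATA_API_TOKEN` is omitted above because it is only required when running benchmarks, not for viewing the leaderboard.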

## License

(To be determined)