---
title: SDR-Arena
emoji: 🎯
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---

# DR-Bench: Deep Research Agent Leaderboard

An open benchmark and leaderboard for evaluating Deep Research agents on real-world business development research tasks.

**Live Leaderboard:** Hugging Face Spaces (coming soon)

## Overview

DR-Bench measures how well AI research agents can:

- Research companies using web search
- Synthesize information from multiple sources
- Generate targeted, fact-based sales pitch points

All agents are benchmarked under identical conditions:

- Same LLM (configurable, default: GPT-4o)
- Same search provider (Brightdata SERP API + Crawl4AI content fetching)
- Same prompt dataset (PHI Prompts - business development research tasks)

Only the agent's orchestration logic differs: how it decomposes topics, generates search queries, iterates on results, and synthesizes the final output.
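That fixed-conditions contract can be sketched as follows. This is an illustrative sketch only: `run_agent` and the simplified `ResearchOutput` below are assumptions for this example, not the actual API of `benchmark/runner.py` or `benchmark/interface.py`.

```python
# Illustrative sketch: the runner builds ONE shared llm client and ONE shared
# websearch client, then hands both to every agent. Only agent.research()
# (the orchestration logic) differs between submissions.
import asyncio
from dataclasses import dataclass, field


@dataclass
class ResearchOutput:
    """Simplified stand-in for the benchmark's output schema (assumed)."""
    report: str
    searches_made: list = field(default_factory=list)


async def run_agent(agent, topics, llm, websearch):
    """Run one agent over all prompts under identical conditions."""
    results = []
    for topic in topics:
        # llm and websearch are the same fixed clients for every agent;
        # how the agent uses them is the only variable being benchmarked.
        results.append(await agent.research(topic, llm, websearch))
    return results
```

In this framing, swapping one agent for another changes nothing but the body of `research()`, so leaderboard differences reflect orchestration strategy rather than model or search-provider choice.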

## Repository Structure

```
dr-bench/
├── app.py                      # HF Spaces entry point (Gradio)
├── requirements.txt            # Python dependencies
│
├── leaderboard/                # Phase 1: Interactive leaderboard UI
│   ├── app.py                  # Main Gradio application
│   ├── data_loader.py          # Loads leaderboard data
│   └── tabs/                   # UI tabs (leaderboard, comparison, explorer, etc.)
│
├── benchmark/                  # Benchmark framework
│   ├── interface.py            # BaseResearchAgent - the standard agent interface
│   ├── runner.py               # Benchmark orchestrator
│   ├── schemas.py              # Data models
│   ├── websearch.py            # WebSearch client (Brightdata)
│   ├── llm.py                  # Standardised LLM client wrapper
│   └── prompts.py              # PHI prompts loader
│
├── agents/                     # Reference agent implementations
│   ├── baseline_agent.py       # Simple single-pass agent
│   └── iterative_agent.py      # Multi-turn research agent
│
├── services/                   # Supporting microservices
│   └── websearch/              # Brightdata-based WebSearch service
│
├── evaluation/                 # Evaluation framework (pluggable)
│   └── evaluator.py            # Placeholder for scoring pipeline
│
├── data/                       # Benchmark data
│   ├── leaderboard.json        # Current leaderboard state
│   ├── phi_prompts/            # Benchmark prompt files
│   └── results/                # Per-agent result files
│
└── scripts/                    # CLI utilities
    ├── run_benchmark.py        # Run benchmark for an agent
    ├── convert_results.py      # Convert old format results
    └── start_services.sh       # Start WebSearch service
```

## Quick Start

### View the Leaderboard (Phase 1)

```bash
pip install -r requirements.txt
python app.py
# Open http://localhost:7860
```

### Run a Benchmark (Phase 3)

1. Start the WebSearch service:

   ```bash
   # Set up environment
   cp .env.example .env
   # Edit .env with your Brightdata API credentials

   # Install websearch dependencies
   pip install -r services/websearch/requirements.txt

   # Start the service
   chmod +x scripts/start_services.sh
   ./scripts/start_services.sh
   ```

2. Run an agent:

   ```bash
   # Run the baseline agent on the first 5 prompts
   python scripts/run_benchmark.py \
       --agent agents/baseline_agent.py \
       --prompts data/phi_prompts/ \
       --limit 5 \
       --update-leaderboard

   # Run with a different model
   python scripts/run_benchmark.py \
       --agent agents/iterative_agent.py \
       --model gpt-4o-mini \
       --prompts data/phi_prompts/
   ```

### Submit Your Own Agent (Phase 2)

1. Create a Python file implementing `BaseResearchAgent`:

   ```python
   from benchmark.interface import BaseResearchAgent, ResearchOutput

   class MyAgent(BaseResearchAgent):
       @property
       def name(self) -> str:
           return "my-research-agent"

       @property
       def description(self) -> str:
           return "My custom research methodology"

       @property
       def author(self) -> str:
           return "Your Name"

       async def research(self, topic, llm, websearch, **kwargs):
           # Your orchestration logic here
           # Use llm for LLM calls, websearch for web searches
           return ResearchOutput(report="...", searches_made=[...])
   ```

2. Upload via the Submit Agent tab on the leaderboard, or run locally:

   ```bash
   python scripts/run_benchmark.py --agent path/to/my_agent.py --update-leaderboard
   ```

## Metrics

| Metric | Description |
|--------|-------------|
| Quality Score | LLM-as-judge evaluation (when available) |
| Success Rate | % of prompts completed without errors |
| Avg Duration | Average time per prompt (seconds) |
| Avg Tokens/Prompt | Average LLM tokens consumed per prompt |
| Avg Searches/Prompt | Average web search queries per prompt |
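Assuming per-prompt result records with fields like the ones below (the real schema lives in `benchmark/schemas.py`; the field names `error`, `duration_s`, `tokens`, and `searches` are hypothetical), the aggregate metrics could be computed along these lines:

```python
# Sketch of metric aggregation over per-prompt result records.
# Field names are assumptions, not the repository's actual schema.
def aggregate_metrics(results: list[dict]) -> dict:
    # A prompt counts as successful if it produced no error.
    ok = [r for r in results if not r.get("error")]
    n = len(results)

    def avg(key: str) -> float:
        return sum(r[key] for r in ok) / len(ok) if ok else 0.0

    return {
        "success_rate": 100.0 * len(ok) / n if n else 0.0,  # % without errors
        "avg_duration": avg("duration_s"),                  # seconds per prompt
        "avg_tokens": avg("tokens"),                        # LLM tokens per prompt
        "avg_searches": avg("searches"),                    # web queries per prompt
    }
```

Note that the per-prompt averages here are taken over successful runs only; whether failed runs should count toward duration and token averages is a design choice for the actual scoring pipeline.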

## Converting Existing Results

If you have results from `run_phi_benchmark_v2.py`:

```bash
python scripts/convert_results.py path/to/old_results.json -o data/leaderboard.json
```

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `LLM_API_KEY` | API key for the LLM | (required) |
| `LLM_MODEL` | Model name | `gpt-4o` |
| `LLM_PROVIDER` | Provider (`openai`/`azure`/`custom`) | `openai` |
| `LLM_BASE_URL` | API base URL | `https://api.openai.com/v1` |
| `BRIGHTDATA_API_TOKEN` | Brightdata SERP API token | (required for benchmarks) |
| `BRIGHTDATA_SERP_ZONE` | Brightdata zone name | `serp_api_bdr` |
| `WEBSEARCH_PORT` | WebSearch service port | `8002` |
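A minimal sketch of applying these defaults in Python; the `load_config` helper is hypothetical and the repository's actual configuration loading may differ:

```python
import os

# Hypothetical helper showing how the table's defaults might be applied.
# Only LLM_API_KEY has no default and must be set.
def load_config(env=os.environ) -> dict:
    api_key = env.get("LLM_API_KEY")
    if not api_key:
        raise RuntimeError("LLM_API_KEY is required")
    return {
        "api_key": api_key,
        "model": env.get("LLM_MODEL", "gpt-4o"),
        "provider": env.get("LLM_PROVIDER", "openai"),
        "base_url": env.get("LLM_BASE_URL", "https://api.openai.com/v1"),
        "serp_zone": env.get("BRIGHTDATA_SERP_ZONE", "serp_api_bdr"),
        "websearch_port": int(env.get("WEBSEARCH_PORT", "8002")),
    }
```

`BRIGHTDATA_API_TOKEN` is omitted above because it is only required when running benchmarks, not for viewing the leaderboard.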

## License

(To be determined)