Phase 1 Quick Start Guide

Launch the App

# from the directory that contains leaderboard/:
python -m leaderboard.app
# or run app.py directly as a script:
cd leaderboard && python app.py

Then open the Gradio URL printed in the terminal (usually http://localhost:7860) in your browser.
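
If the page does not load, it can help to confirm that something is actually listening on the port before debugging the browser. The following is a minimal sketch using only the Python standard library. The helper name `is_listening` is hypothetical and not part of the app; host 127.0.0.1 and port 7860 are assumed defaults that your launch output may override.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if is_listening():
        print("Something is listening on port 7860; open the URL in your browser.")
    else:
        print("Nothing is listening on port 7860; check the terminal for the actual URL.")
```

A successful connection only shows that some process accepts connections on that port, not that it is this app, so treat the result as a first check.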

What's New - At a Glance

Tab 1: Leaderboard (Unchanged)

Main ranking table sorted by quality score
Agent details selector
Metrics overview cards

Tab 2: Comparison & Analysis ✨ NEW

Radar Chart: See 5 metrics for 2+ agents simultaneously
Bar Chart: Focus on one metric across agents
Comparison Table: Exact numbers for all metrics
Strategy Heatmap: See which agents use which strategies

How to use:

Check boxes to select agents
Charts update automatically
Scroll down to see strategy heatmap
Use metric dropdown to change bar chart focus
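
To make the selection behavior above concrete, here is a rough sketch of how a comparison table and a single-metric bar series could be derived from the checked agents. It is not the code in comparison_tab.py: the function names, the data layout, and the agent and metric values are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Illustrative data only: agent name -> metric name -> value (not real benchmark results).
RESULTS: Dict[str, Dict[str, float]] = {
    "agent_a": {"quality": 0.82, "searches": 12.0},
    "agent_b": {"quality": 0.74, "searches": 5.0},
    "agent_c": {"quality": 0.69, "searches": 8.0},
}


def comparison_rows(results: Dict[str, Dict[str, float]], selected: List[str]) -> List[list]:
    """Build a header row plus one row of exact numbers per selected agent."""
    if len(selected) < 2:
        # Mirrors the guide's advice: charts need 2+ agents to be meaningful.
        raise ValueError("select at least 2 agents to compare")
    metrics = sorted({m for agent in selected for m in results[agent]})
    rows: List[list] = [["agent"] + metrics]
    for agent in selected:
        rows.append([agent] + [results[agent].get(m) for m in metrics])
    return rows


def metric_series(results: Dict[str, Dict[str, float]], selected: List[str],
                  metric: str) -> List[Tuple[str, float]]:
    """Values of one metric across the selected agents (what a bar chart would plot)."""
    return [(agent, results[agent][metric]) for agent in selected]


if __name__ == "__main__":
    for row in comparison_rows(RESULTS, ["agent_a", "agent_b"]):
        print(row)
    print(metric_series(RESULTS, ["agent_a", "agent_b"], "quality"))
```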

Tab 3: Explorer ✨ ENHANCED

Prompt Browser: Select a benchmark prompt
Result Cards: See how each agent approached it
Search Queries: See what searches each agent performed
Quality Badges: See performance indicators

How to use:

Select a prompt from dropdown
View details on the left
Scroll through agent outputs on the right
Click agent dropdown to see their full outputs

Tab 4: Upload Results (Unchanged)

Submit your agent's results for evaluation

Tab 5: About (Unchanged)

Benchmark methodology
Feature explanations
Submission guide

Key Features Explained

Strategy Heatmap

What it shows: Which architectural strategies each agent uses

Strategy	Meaning
Multi-perspective	Research from multiple angles
Iterative	Repeat steps to refine results
Tool-calling	Coordinate multiple tools/APIs
Web-search	Live search integration
Knowledge-curation	Synthesize information intelligently
Chain-of-thought	Reasoning steps visible
Context-management	Smart token/context handling

How to read it: ✓ = the agent uses the strategy, - = it doesn't
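
The ✓ / - notation can be reproduced with a small text rendering like the sketch below. The strategy names come from the table above; the function name `strategy_row` and the example agents are hypothetical, and the real heatmap is drawn by the app rather than printed as text.

```python
from typing import Dict, List, Set

# Column order follows the strategy table in this guide.
STRATEGIES: List[str] = [
    "Multi-perspective",
    "Iterative",
    "Tool-calling",
    "Web-search",
    "Knowledge-curation",
    "Chain-of-thought",
    "Context-management",
]


def strategy_row(used: Set[str]) -> str:
    """Render one agent's row: a check mark for each strategy used, a dash otherwise."""
    return " ".join("✓" if name in used else "-" for name in STRATEGIES)


def strategy_matrix(agents: Dict[str, Set[str]]) -> List[str]:
    """Render one labeled line per agent (agent names here are illustrative)."""
    return [f"{agent}: {strategy_row(used)}" for agent, used in agents.items()]


if __name__ == "__main__":
    example = {
        "agent_a": {"Iterative", "Web-search"},
        "agent_b": {"Multi-perspective", "Tool-calling", "Chain-of-thought"},
    }
    for line in strategy_matrix(example):
        print(line)
```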

Color Reference

Color	Meaning	Example
🟢 Green	Success/High	High quality score
🟠 Amber	Medium	Mid-range performance
🔴 Red	Low/Failed	Low quality score
🟣 Purple	Primary	Interactive elements
⚫ Gray	Neutral	Text/secondary info
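
As an illustration of the green / amber / red bands, a score-to-color mapping might look like the sketch below. The cutoffs (0.7 and 0.4) and the 0-1 score range are assumptions chosen for the example. This guide does not state the thresholds the app uses, so check css.py or the About tab for the real values.

```python
def score_color(score: float, high: float = 0.7, low: float = 0.4) -> str:
    """Map a score in [0, 1] to a color band. Thresholds are assumed, not the app's."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= high:
        return "green"   # Success/High
    if score >= low:
        return "amber"   # Medium
    return "red"         # Low/Failed


if __name__ == "__main__":
    for value in (0.9, 0.5, 0.1):
        print(value, score_color(value))
```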

Common Questions

Q: Where's the diff view? A: It is planned for Phase 2. For now, Explorer shows each agent's output individually.

Q: How do I replay prompts? A: Replay is planned for Phase 2 and is not available yet.

Q: Can I export the data? A: Tables are view-only in the app. For a data export, contact the team.

Q: Why are some agents faster? A: Check the Comparison tab. Agents that run fewer searches tend to finish faster, sometimes at the cost of lower quality.

Q: How is quality calculated? A: LLM-as-judge coverage scoring - see About tab for details.
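
For intuition about the coverage scoring mentioned in the last answer, one simple reading is "the fraction of reference points that a judge marks as covered by the agent's output". The sketch below follows that reading and is only an assumption; the benchmark's actual prompt, rubric, and aggregation are described in the About tab. The `keyword_judge` function is a deliberately crude stand-in for an LLM judge so that the example runs offline.

```python
from typing import Callable, List


def coverage_score(reference_points: List[str], output: str,
                   judge: Callable[[str, str], bool]) -> float:
    """Fraction of reference points the judge considers covered by the output."""
    if not reference_points:
        raise ValueError("need at least one reference point")
    covered = sum(1 for point in reference_points if judge(point, output))
    return covered / len(reference_points)


def keyword_judge(point: str, output: str) -> bool:
    """Stand-in for an LLM judge: a case-insensitive substring match."""
    return point.lower() in output.lower()


if __name__ == "__main__":
    points = ["alpha", "beta", "gamma", "delta"]  # illustrative reference points
    text = "The report discusses Alpha and BETA in detail."
    print(coverage_score(points, text, keyword_judge))
```

In a real setup the `judge` callable would wrap a model call that returns a covered / not covered verdict for each point.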

Performance Tips

Comparison tab: Select 2-3 agents for the clearest view
Explorer: Results are shown for only the first 5 prompts, to keep loading fast
Mobile: All tabs are fully responsive
Accessibility: Press Tab to navigate and Enter to activate

Keyboard Shortcuts

Key	Action
Tab	Navigate between elements
Enter	Activate button/dropdown
Space	Toggle checkbox
Arrow Keys	Navigate options in dropdown

Troubleshooting

Tab won't load

Refresh the page
Check browser console for errors
Try a different browser

Charts not showing

Make sure you selected 2+ agents
Wait a moment for calculation
Try different agents

Explorer shows blank

Select a prompt from the dropdown
Some prompts may not have results for every agent
Try another prompt

Metrics look wrong

Check the About tab for metric definitions
Compare with main Leaderboard tab
The numbers in both tabs should match

File Organization

/leaderboard/
├── app.py                 ← Main app with tabs
├── css.py                 ← All styling (new: 230+ lines)
├── data_loader.py         ← Data loading (unchanged)
└── tabs/
    ├── leaderboard_tab.py    ← Ranking table (unchanged)
    ├── comparison_tab.py     ← NEW: Charts & heatmap
    ├── explorer_tab.py       ← ENHANCED: Result browser
    ├── upload_tab.py         ← Submissions (unchanged)
    └── about_tab.py          ← Info (unchanged)

Support

For Technical Issues

Check IMPLEMENTATION_CHECKLIST.md
Review code comments
See PHASE_1_ENHANCEMENTS.md

For Usage Questions

Read USER_GUIDE_PHASE_1.md
Check the About tab in the app
See this Quick Start guide

For Bugs

Note the exact steps to reproduce
Check the browser console for error messages
Take a screenshot if the issue is visual
Report to the team

What's Next

Phase 2 (Planned):

Diff highlighting for missed information
Replay & remix interface
Advanced search
Historical trends

Phase 3 (Planned):

Custom views
Export/reporting
More analytics

Version Info

Phase: 1 (Complete)
Date: 2/9/2026
Status: Ready for Testing
Backend: Unchanged (Gradio + Python)
Frontend: Enhanced with new tabs & styling

🚀 Ready to explore? Launch the app and try the new Comparison & Analysis tab!