Phase 1 Quick Start Guide

Launch the App

python -m leaderboard.app
# or, to run the script directly from inside the package directory:
cd leaderboard && python app.py

Then open the Gradio URL in your browser (usually http://localhost:7860).

What's New - At a Glance

Tab 1: Leaderboard (Unchanged)

  • Main ranking table sorted by quality score
  • Agent details selector
  • Metrics overview cards

Tab 2: Comparison & Analysis ✨ NEW

  • Radar Chart: See 5 metrics for 2+ agents simultaneously
  • Bar Chart: Focus on one metric across agents
  • Comparison Table: Exact numbers for all metrics
  • Strategy Heatmap: See which agents use which strategies

How to use:

  1. Check boxes to select agents
  2. Charts update automatically
  3. Scroll down to see strategy heatmap
  4. Use metric dropdown to change bar chart focus
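
As a rough illustration of what the bar chart does with your selection, the sketch below picks one metric for the checked agents and orders them from highest to lowest. The function name, the shape of `results`, and the agent and metric names are all assumptions made for this example; the real logic lives in comparison_tab.py and may differ.

```python
def bar_chart_data(results, selected, metric):
    """Return (agent, value) pairs for one metric, highest value first.

    `results` maps agent name -> {metric name -> value}. All names here are
    hypothetical; this is not the app's actual data structure.
    """
    if len(selected) < 2:
        # The comparison charts expect at least two agents.
        return []
    rows = [
        (agent, results[agent][metric])
        for agent in selected
        if agent in results and metric in results[agent]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


example = {
    "agent_a": {"quality": 0.82, "searches": 12},
    "agent_b": {"quality": 0.74, "searches": 5},
}
print(bar_chart_data(example, ["agent_a", "agent_b"], "searches"))
# [('agent_a', 12), ('agent_b', 5)]
```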

Tab 3: Explorer ✨ ENHANCED

  • Prompt Browser: Select a benchmark prompt
  • Result Cards: See how each agent approached it
  • Search Queries: See what searches each agent performed
  • Quality Badges: See performance indicators

How to use:

  1. Select a prompt from dropdown
  2. View details on the left
  3. Scroll through agent outputs on the right
  4. Open an agent's dropdown to see its full output

Tab 4: Upload Results (Unchanged)

  • Submit your agent's results for evaluation

Tab 5: About (Unchanged)

  • Benchmark methodology
  • Feature explanations
  • Submission guide

Key Features Explained

Strategy Heatmap

What it shows: Which architectural strategies each agent uses

| Strategy | Meaning |
|---|---|
| Multi-perspective | Research from multiple angles |
| Iterative | Repeat steps to refine results |
| Tool-calling | Coordinate multiple tools/APIs |
| Web-search | Live search integration |
| Knowledge-curation | Synthesize information intelligently |
| Chain-of-thought | Reasoning steps visible |
| Context-management | Smart token/context handling |

Read it: ✓ = uses, - = doesn't use
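
To make the ✓ / - notation concrete, here is a small sketch that turns a mapping of agents to strategies into heatmap rows. The strategy names come from the table above. The function name, the input shape, and the agent names are hypothetical and only illustrate the idea, not the app's implementation.

```python
STRATEGIES = [
    "Multi-perspective",
    "Iterative",
    "Tool-calling",
    "Web-search",
    "Knowledge-curation",
    "Chain-of-thought",
    "Context-management",
]


def heatmap_rows(agents):
    """Map each agent to one mark per strategy, in STRATEGIES order.

    `agents` maps agent name -> set of strategy names it uses (assumed shape).
    """
    return {
        agent: ["✓" if strategy in used else "-" for strategy in STRATEGIES]
        for agent, used in agents.items()
    }


# Hypothetical agents, for illustration only.
rows = heatmap_rows({
    "agent_a": {"Web-search", "Iterative"},
    "agent_b": {"Chain-of-thought"},
})
for agent, marks in rows.items():
    print(agent, " ".join(marks))
```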


Color Reference

| Color | Meaning | Example |
|---|---|---|
| 🟢 Green | Success/High | High quality score |
| 🟠 Amber | Medium | Mid-range performance |
| 🔴 Red | Low/Failed | Low quality score |
| 🟣 Purple | Primary | Interactive elements |
| ⚫ Gray | Neutral | Text/secondary info |
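
The guide does not state where the boundaries between green, amber, and red fall. The sketch below shows one way such a banding could work, assuming scores on a 0 to 1 scale and made-up cutoffs of 0.7 and 0.4. Both the function and the thresholds are assumptions, not values taken from the app.

```python
def score_color(score: float, high: float = 0.7, low: float = 0.4) -> str:
    """Return a color band for a score.

    The 0-1 scale and the default cutoffs are illustrative assumptions.
    """
    if score >= high:
        return "green"
    if score >= low:
        return "amber"
    return "red"
```

With these assumed cutoffs, `score_color(0.82)` falls in the green band and `score_color(0.25)` in the red band.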

Common Questions

Q: Where's the diff view? A: Coming in Phase 2! Right now Explorer shows individual outputs.

Q: How do I replay prompts? A: Coming in Phase 2! Replay is not available in Phase 1.

Q: Can I export the data? A: Not from the app yet. Tables are view-only; for a data export, contact the team.

Q: Why are some agents faster? A: Check the Comparison tab. Agents that run fewer searches usually finish faster, sometimes at the cost of lower quality.

Q: How is quality calculated? A: LLM-as-judge coverage scoring - see About tab for details.
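
The About tab is the authoritative description of the scoring. Purely as an illustration of what a coverage-style score usually means, the sketch below assumes a judge has already returned one true/false verdict per reference point and reports the fraction covered. The function name and input format are hypothetical, and the benchmark's actual formula may differ.

```python
def coverage_score(verdicts: list[bool]) -> float:
    """Fraction of reference points the judge marked as covered.

    Assumed input: one boolean per reference point. Not the benchmark's code.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


print(coverage_score([True, True, False, True]))
# 0.75
```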


Performance Tips

  • Comparison tab: Select 2-3 agents for clearest view
  • Explorer: Only the first 5 prompts are shown, to keep loading fast
  • Mobile: All tabs are fully responsive
  • Accessibility: Tab key to navigate, Enter to activate

Keyboard Shortcuts

| Key | Action |
|---|---|
| Tab | Navigate between elements |
| Enter | Activate button/dropdown |
| Space | Toggle checkbox |
| Arrow Keys | Navigate options in dropdown |

Troubleshooting

Tab won't load

  • Refresh the page
  • Check browser console for errors
  • Try a different browser

Charts not showing

  • Make sure you selected 2+ agents
  • Wait a moment for calculation
  • Try different agents

Explorer shows blank

  • Select a prompt from the dropdown
  • Some prompts might not have all results
  • Try another prompt

Metrics look wrong

  • Check the About tab for metric definitions
  • Compare with the main Leaderboard tab; both tabs should show the same numbers
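
If you have copied the values from both tabs, the comparison can be scripted. This is a minimal sketch, assuming each tab's numbers are available as a plain dict of agent name to score. The function name and data shape are hypothetical and the app does not expose such a helper.

```python
import math


def mismatched_agents(leaderboard, comparison, tolerance=1e-6):
    """Return agents whose scores differ, or that appear in only one tab.

    Both arguments map agent name -> score (assumed shape for illustration).
    """
    agents = set(leaderboard) | set(comparison)
    return sorted(
        agent
        for agent in agents
        if agent not in leaderboard
        or agent not in comparison
        or not math.isclose(leaderboard[agent], comparison[agent], abs_tol=tolerance)
    )
```

An empty list means the two tabs agree. Any names returned are worth including in a bug report.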

File Organization

/leaderboard/
├── app.py                 ← Main app with tabs
├── css.py                 ← All styling (new: 230+ lines)
├── data_loader.py         ← Data loading (unchanged)
└── tabs/
    ├── leaderboard_tab.py    ← Ranking table (unchanged)
    ├── comparison_tab.py     ← NEW: Charts & heatmap
    ├── explorer_tab.py       ← ENHANCED: Result browser
    ├── upload_tab.py         ← Submissions (unchanged)
    └── about_tab.py          ← Info (unchanged)

Support

For Technical Issues

  • Check IMPLEMENTATION_CHECKLIST.md
  • Review code comments
  • See PHASE_1_ENHANCEMENTS.md

For Usage Questions

  • Read USER_GUIDE_PHASE_1.md
  • Check About tab in app
  • See this Quick Start guide

For Bugs

  • Note the exact step to reproduce
  • Check console for error messages
  • Screenshot if visual issue
  • Report to team

What's Next

Phase 2 (Planned):

  • Diff highlighting for missed information
  • Replay & remix interface
  • Advanced search
  • Historical trends

Phase 3 (Planned):

  • Custom views
  • Export/reporting
  • More analytics

Version Info

  • Phase: 1 (Complete)
  • Date: 2/9/2026
  • Status: Ready for Testing
  • Backend: Unchanged (Gradio + Python)
  • Frontend: Enhanced with new tabs & styling

🚀 Ready to explore? Launch the app and try the new Comparison & Analysis tab!