Phase 1 Quick Start Guide
Launch the App
python -m leaderboard.app
# or, to run app.py directly as a script:
cd leaderboard && python app.py
Then open the Gradio URL in your browser (usually http://localhost:7860).
What's New - At a Glance
Tab 1: Leaderboard (Unchanged)
- Main ranking table sorted by quality score
- Agent details selector
- Metrics overview cards
Tab 2: Comparison & Analysis ✨ NEW
- Radar Chart: See 5 metrics for 2+ agents simultaneously
- Bar Chart: Focus on one metric across agents
- Comparison Table: Exact numbers for all metrics
- Strategy Heatmap: See which agents use which strategies
How to use:
- Check boxes to select agents
- Charts update automatically
- Scroll down to see strategy heatmap
- Use metric dropdown to change bar chart focus
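A radar chart is only readable when every axis shares a scale, so metrics with different units are usually rescaled before plotting. The sketch below shows one common way to do that (min-max scaling across the selected agents). Whether the app normalizes this way is an assumption, and the metric names `quality` and `latency_s` are hypothetical placeholders, not the app's real fields.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # agent name -> metric name -> raw value


def normalize_for_radar(scores: Scores) -> Scores:
    """Min-max scale each metric to [0, 1] across the selected agents.

    Assumes every agent reports every metric. If all agents tie on a
    metric, each gets 0.5 so the axis still renders.
    """
    metrics = sorted({m for per_agent in scores.values() for m in per_agent})
    scaled: Scores = {agent: {} for agent in scores}
    for metric in metrics:
        values = [scores[agent][metric] for agent in scores]
        low, high = min(values), max(values)
        span = high - low
        for agent in scores:
            if span == 0:
                scaled[agent][metric] = 0.5
            else:
                scaled[agent][metric] = (scores[agent][metric] - low) / span
    return scaled


# Hypothetical data for two selected agents.
selected = {
    "agent_a": {"quality": 0.8, "latency_s": 30.0},
    "agent_b": {"quality": 0.6, "latency_s": 10.0},
}
radar_values = normalize_for_radar(selected)
```

Because the scaling is relative to the current selection, the same agent can look different depending on which other agents are checked.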
Tab 3: Explorer ✨ ENHANCED
- Prompt Browser: Select a benchmark prompt
- Result Cards: See how each agent approached it
- Search Queries: See what searches each agent performed
- Quality Badges: See performance indicators
How to use:
- Select a prompt from dropdown
- View details on the left
- Scroll through agent outputs on the right
- Click agent dropdown to see their full outputs
Tab 4: Upload Results (Unchanged)
- Submit your agent's results for evaluation
Tab 5: About (Unchanged)
- Benchmark methodology
- Feature explanations
- Submission guide
Key Features Explained
Strategy Heatmap
What it shows: Which architectural strategies each agent uses
| Strategy | Meaning |
|---|---|
| Multi-perspective | Research from multiple angles |
| Iterative | Repeat steps to refine results |
| Tool-calling | Coordinate multiple tools/APIs |
| Web-search | Live search integration |
| Knowledge-curation | Synthesize information intelligently |
| Chain-of-thought | Reasoning steps visible |
| Context-management | Smart token/context handling |
Read it: ✓ = uses, - = doesn't use
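As a rough illustration of the data behind the heatmap, the sketch below builds one row per agent with a check mark for each strategy it uses and a dash otherwise. The strategy names come from the table above. The function name, the agent names, and the input shape (a mapping from agent to a set of strategy names) are assumptions for illustration, not the app's actual data model.

```python
from typing import Dict, List, Set

STRATEGIES = [
    "Multi-perspective",
    "Iterative",
    "Tool-calling",
    "Web-search",
    "Knowledge-curation",
    "Chain-of-thought",
    "Context-management",
]


def strategy_rows(agent_strategies: Dict[str, Set[str]]) -> List[List[str]]:
    """Return one row per agent: [agent, mark, mark, ...] in STRATEGIES order."""
    rows = []
    for agent in sorted(agent_strategies):
        used = agent_strategies[agent]
        marks = ["✓" if strategy in used else "-" for strategy in STRATEGIES]
        rows.append([agent] + marks)
    return rows


# Hypothetical agents and strategy sets.
rows = strategy_rows({
    "agent_b": {"Tool-calling"},
    "agent_a": {"Iterative", "Web-search"},
})
for row in rows:
    print(" | ".join(row))
```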
Color Reference
| Color | Meaning | Example |
|---|---|---|
| 🟢 Green | Success/High | High quality score |
| 🟠 Amber | Medium | Mid-range performance |
| 🔴 Red | Low/Failed | Low quality score |
| 🟣 Purple | Primary | Interactive elements |
| ⚫ Gray | Neutral | Text/secondary info |
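The green/amber/red bands amount to a threshold lookup on a score. The sketch below shows that idea only. The cut-offs (0.7 and 0.4) and the assumption that scores lie in the range 0 to 1 are placeholders chosen for illustration; the guide does not state the app's actual thresholds.

```python
def score_color(score: float, high: float = 0.7, low: float = 0.4) -> str:
    """Map a score to a color band.

    The default thresholds are illustrative assumptions, not values
    taken from the app.
    """
    if score >= high:
        return "green"  # Success/High
    if score >= low:
        return "amber"  # Medium
    return "red"        # Low/Failed


bands = [score_color(s) for s in (0.9, 0.5, 0.1)]
```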
Common Questions
Q: Where's the diff view? A: It is planned for Phase 2. For now, the Explorer tab shows each agent's output individually.
Q: How do I replay prompts? A: Replay is planned for Phase 2 and is not available yet.
Q: Can I export the data? A: Tables can be viewed in the app. For a data export, contact the team.
Q: Why are some agents faster? A: Look at the Comparison & Analysis tab. Agents that run fewer searches tend to finish faster, sometimes at the cost of lower quality.
Q: How is quality calculated? A: With LLM-as-judge coverage scoring. See the About tab for details.
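To give a feel for what a coverage-style score can look like, here is a minimal sketch: a judge decides, point by point, whether an answer covers each reference point, and the score is the fraction covered. This is a generic reading of "coverage scoring", not the benchmark's documented formula (the About tab is the authority). `coverage_score`, `keyword_judge`, and the sample key points are all hypothetical, and the keyword matcher merely stands in for a real LLM judge call.

```python
from typing import Callable, Sequence

Judge = Callable[[str, str], bool]  # (answer, key_point) -> covered?


def coverage_score(answer: str, key_points: Sequence[str], judge: Judge) -> float:
    """Fraction of key points the judge marks as covered by the answer."""
    if not key_points:
        return 0.0
    covered = sum(1 for point in key_points if judge(answer, point))
    return covered / len(key_points)


def keyword_judge(answer: str, key_point: str) -> bool:
    """Stand-in for an LLM judge: case-insensitive substring match."""
    return key_point.lower() in answer.lower()


score = coverage_score(
    "Paris is the capital of France.",
    ["paris", "berlin"],
    keyword_judge,
)
```

Passing the judge in as a callable keeps the scoring arithmetic testable without calling a model.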
Performance Tips
- Comparison tab: Select 2-3 agents for the clearest view
- Explorer: Only the first 5 prompts are shown, to keep loading fast
- Mobile: All tabs are fully responsive
- Accessibility: Use the Tab key to navigate and Enter to activate
Keyboard Shortcuts
| Key | Action |
|---|---|
| Tab | Navigate between elements |
| Enter | Activate button/dropdown |
| Space | Toggle checkbox |
| Arrow Keys | Navigate options in dropdown |
Troubleshooting
Tab won't load
- Refresh the page
- Check browser console for errors
- Try a different browser
Charts not showing
- Make sure you have selected 2+ agents
- Wait a moment for the charts to recalculate
- Try different agents
Explorer shows blank
- Select a prompt from the dropdown
- Some prompts might not have all results
- Try another prompt
Metrics look wrong
- Check the About tab for metric definitions
- Compare with the main Leaderboard tab
- The numbers in both places should match
File Organization
/leaderboard/
├── app.py                  ← Main app with tabs
├── css.py                  ← All styling (new: 230+ lines)
├── data_loader.py          ← Data loading (unchanged)
└── tabs/
    ├── leaderboard_tab.py  ← Ranking table (unchanged)
    ├── comparison_tab.py   ← NEW: Charts & heatmap
    ├── explorer_tab.py     ← ENHANCED: Result browser
    ├── upload_tab.py       ← Submissions (unchanged)
    └── about_tab.py        ← Info (unchanged)
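For orientation, the sketch below shows one way a main module can assemble five tabs with Gradio's `gr.Blocks` and `gr.Tab`. It is not the project's actual `app.py`: the `TABS` list, the `build_app` function, and the placeholder Markdown bodies are assumptions. In the real layout each tab's content would presumably come from the matching module under `tabs/`.

```python
# Tab titles follow this guide; the one-line summaries are placeholders.
TABS = [
    ("Leaderboard", "Main ranking table sorted by quality score"),
    ("Comparison & Analysis", "Radar chart, bar chart, comparison table, strategy heatmap"),
    ("Explorer", "Prompt browser and per-agent result cards"),
    ("Upload Results", "Submit your agent's results for evaluation"),
    ("About", "Benchmark methodology and submission guide"),
]


def build_app():
    """Build a Blocks app with one tab per entry in TABS (hypothetical helper)."""
    # Imported lazily so this module can be loaded without Gradio installed.
    import gradio as gr

    with gr.Blocks() as demo:
        for title, summary in TABS:
            with gr.Tab(title):
                # A real tab module would add its tables and charts here.
                gr.Markdown(summary)
    return demo
```

Calling `build_app().launch()` would then serve the app, by default on port 7860.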
Support
For Technical Issues
- Check IMPLEMENTATION_CHECKLIST.md
- Review code comments
- See PHASE_1_ENHANCEMENTS.md
For Usage Questions
- Read USER_GUIDE_PHASE_1.md
- Check About tab in app
- See this Quick Start guide
For Bugs
- Note the exact steps to reproduce
- Check the browser console for error messages
- Take a screenshot if the issue is visual
- Report to the team
What's Next
Phase 2 (Planned):
- Diff highlighting for missed information
- Replay & remix interface
- Advanced search
- Historical trends
Phase 3 (Planned):
- Custom views
- Export/reporting
- More analytics
Version Info
- Phase: 1 (Complete)
- Date: 2/9/2026
- Status: Ready for Testing
- Backend: Unchanged (Gradio + Python)
- Frontend: Enhanced with new tabs & styling
Ready to explore? Launch the app and try the new Comparison & Analysis tab!