agent_unit4

Ashokdll committed on Jun 4, 2025

Commit

3324365

1 Parent(s): f1ad884

Update app.py

app.py +178 -18

@@ -802,7 +802,100 @@ def create_gaia_app():
                 )
             # ===============================
-            # TAB 4: INFORMATION
             # ===============================
             with gr.Tab("ℹ️ Information"):
                 gr.Markdown("""
@@ -815,6 +908,15 @@ def create_gaia_app():
                 - **Web browsing**: Finding and using external information
                 - **Tool use**: Calculator, code execution, etc.
                 ## 🎯 How to Use This Space
                 ### 1. Model Setup
@@ -832,31 +934,89 @@ def create_gaia_app():
                 - Then try "GAIA Test Set" for real benchmark evaluation
                 - Download results in JSONL format for submission
                 ## 📊 Model Recommendations
-                | Model | Best For | Memory | Speed | Quality |
-                |-------|----------|---------|-------|---------|
-                | Fast & Light | Quick testing | Low | Fast | Good |
-                | Balanced | General use | Medium | Medium | Better |
-                | High Quality | Best results | High | Slow | Best |
-                | Instruction Following | Complex reasoning | High | Medium | Excellent |
                 ## 🔗 Resources
-                - [GAIA Paper](https://arxiv.org/abs/2311.12983)
-                - [GAIA Leaderboard](https://huggingface.co/spaces/gaia-benchmark/leaderboard)
-                - [Hugging Face Spaces Documentation](https://huggingface.co/docs/hub/spaces)
-                ## 🚀 Output Format
-                Results are saved in GAIA leaderboard format:
                 ```json
-                {"task_id": "gaia_001", "model_answer": "[FULL RESPONSE]", "reasoning_trace": "[REASONING]"}
                 ```
-                ## ⚡ Tips for Best Results
-                1. **Start Small**: Test with sample questions first
-                2. **Choose Right Model**: Balance speed vs quality for your needs
-                3. **Monitor GPU**: Larger models need GPU acceleration
-                4. **Download Results**: Save JSONL files for leaderboard submission
                 """)
         return app

                 )
             # ===============================
+            # TAB 4: FULL BENCHMARK (NEW)
+            # ===============================
+            with gr.Tab("🏆 Full Benchmark"):
+                gr.Markdown("## Official GAIA Leaderboard Benchmark")
+                with gr.Row():
+                    with gr.Column():
+                        gr.Markdown(get_leaderboard_info())
+                    with gr.Column():
+                        # Test questions preview
+                        test_preview_btn = gr.Button("🔍 Preview Test Questions", variant="secondary")
+                        test_preview_output = gr.Markdown(
+                            value="Click above to preview official test questions"
+                        )
+                        # Full benchmark
+                        gr.Markdown("### 🚀 Run Complete Benchmark")
+                        gr.Markdown("""
+                        **Warning**: This will evaluate your model on all ~300 official GAIA test questions.
+                        This process may take 1-3 hours depending on your model and hardware.
+                        """)
+                        full_benchmark_btn = gr.Button(
+                            "🏆 Start Full Benchmark (300 Questions)",
+                            variant="primary",
+                            size="lg"
+                        )
+                # Benchmark results
+                benchmark_status = gr.Textbox(
+                    label="📊 Benchmark Status",
+                    value="Ready to run benchmark",
+                    interactive=False
+                )
+                with gr.Row():
+                    with gr.Column():
+                        benchmark_report = gr.Markdown(
+                            label="📈 Benchmark Report",
+                            value="Run benchmark to see detailed results"
+                        )
+                    with gr.Column():
+                        # Download files
+                        submission_file = gr.File(
+                            label="💾 Download Submission File (JSONL)",
+                            visible=False
+                        )
+                        metadata_file = gr.File(
+                            label="📋 Download Metadata File",
+                            visible=False
+                        )
+                        gr.Markdown("""
+                        ### 📤 Leaderboard Submission
+                        1. Download the JSONL file above
+                        2. Visit [GAIA Leaderboard](https://huggingface.co/spaces/gaia-benchmark/leaderboard)
+                        3. Upload your submission file
+                        4. View your model's ranking!
+                        """)
+                # Event handlers
+                test_preview_btn.click(
+                    fn=load_test_questions_interface,
+                    outputs=[test_preview_output]
+                )
+                def full_benchmark_with_files(*args):
+                    status, report, sub_file, meta_file = run_leaderboard_benchmark_interface(*args)
+                    # Set each file's value and reveal it in a single update,
+                    # so every component appears only once in `outputs`.
+                    return (
+                        status,
+                        report,
+                        gr.update(value=sub_file, visible=True),
+                        gr.update(value=meta_file, visible=True)
+                    )
+                full_benchmark_btn.click(
+                    fn=full_benchmark_with_files,
+                    outputs=[
+                        benchmark_status,
+                        benchmark_report,
+                        submission_file,
+                        metadata_file
+                    ]
+                )
+            # ===============================
+            # TAB 5: INFORMATION (UPDATED)
             # ===============================
             with gr.Tab("ℹ️ Information"):
                 gr.Markdown("""
                 - **Web browsing**: Finding and using external information
                 - **Tool use**: Calculator, code execution, etc.
+                ## 🏆 GAIA Public Leaderboard
+                GAIA provides a **public leaderboard hosted on Hugging Face** where you can:
+                - Test your models against **300 official testing questions**
+                - Compare performance with state-of-the-art systems
+                - Track progress in AI reasoning capabilities
+                - Contribute to research community benchmarks
+                **Leaderboard URL**: [https://huggingface.co/spaces/gaia-benchmark/leaderboard](https://huggingface.co/spaces/gaia-benchmark/leaderboard)
                 ## 🎯 How to Use This Space
                 ### 1. Model Setup
                 - Then try "GAIA Test Set" for real benchmark evaluation
                 - Download results in JSONL format for submission
+                ### 4. Full Benchmark (NEW!)
+                - Run complete evaluation on all 300 official test questions
+                - Get leaderboard-ready submission files
+                - Upload directly to GAIA leaderboard for ranking
                 ## 📊 Model Recommendations
+                | Model | Best For | Memory | Speed | Quality | Leaderboard Ready |
+                |-------|----------|---------|-------|---------|------------------|
+                | Fast & Light | Quick testing | Low | Fast | Good | ✅ |
+                | Balanced | General use | Medium | Medium | Better | ✅ |
+                | High Quality | Best results | High | Slow | Best | ✅ |
+                | Instruction Following | Complex reasoning | High | Medium | Excellent | ✅ |
+                ## 🏅 Benchmark Performance Expectations
+                Based on current leaderboard standings, expect these performance ranges:
+                | Difficulty Level | Top Models | Good Models | Baseline Models |
+                |------------------|------------|-------------|-----------------|
+                | **Level 1** (Basic) | 85-95% | 70-85% | 50-70% |
+                | **Level 2** (Intermediate) | 65-80% | 45-65% | 25-45% |
+                | **Level 3** (Advanced) | 35-60% | 20-35% | 10-20% |
+                | **Overall Average** | 65-75% | 45-65% | 30-45% |
+                ## 🚀 Continuous Benchmarking Workflow
+                1. **Development**: Test with sample questions
+                2. **Validation**: Run batch evaluation (10-50 questions)
+                3. **Benchmarking**: Full evaluation (300 questions)
+                4. **Submission**: Upload to leaderboard
+                5. **Analysis**: Compare with other models
+                6. **Iteration**: Improve and re-benchmark
                 ## 🔗 Resources
+                - [GAIA Paper](https://arxiv.org/abs/2311.12983) - Original research paper
+                - [GAIA Leaderboard](https://huggingface.co/spaces/gaia-benchmark/leaderboard) - Official rankings
+                - [GAIA Dataset](https://huggingface.co/datasets/gaia-benchmark/GAIA) - Validation and test splits
+                - [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces) - Deployment documentation
+                ## 📋 Submission Format
+                Results are saved in official GAIA leaderboard format:
                 ```json
+                {"task_id": "gaia_001", "model_answer": "[FULL RESPONSE]", "reasoning_trace": "[STEP-BY-STEP REASONING]"}
+                {"task_id": "gaia_002", "model_answer": "[FULL RESPONSE]", "reasoning_trace": "[STEP-BY-STEP REASONING]"}
                 ```
+                ## ⚡ Pro Tips for Best Results
+                ### Performance Optimization
+                1. **Start Small**: Always test with sample questions first
+                2. **Choose Wisely**: Balance speed vs quality based on your goals
+                3. **Monitor Resources**: Use GPU acceleration for larger models
+                4. **Validate Format**: Ensure JSONL files are properly formatted
+                ### Leaderboard Strategy
+                1. **Baseline First**: Get initial results with a fast model
+                2. **Iterate Quickly**: Test improvements on small batches
+                3. **Full Benchmark**: Run complete evaluation when ready
+                4. **Compare Results**: Analyze performance across difficulty levels
+                5. **Document Approach**: Include model details and methodology
+                ### Common Pitfalls to Avoid
+                - Don't run the full benchmark on untested models
+                - Ensure a stable internet connection for long evaluations
+                - Verify submission file format before uploading
+                - Check GPU memory usage for large models
+                - Save intermediate results during long runs
+                ## 🎯 Getting Started Checklist
+                - [ ] Load and test a model in "Model Setup"
+                - [ ] Try example questions in "Single Question"
+                - [ ] Run a small batch in "Batch Evaluation"
+                - [ ] Review test questions in "Full Benchmark"
+                - [ ] Run the complete benchmark when ready
+                - [ ] Download submission files
+                - [ ] Upload to GAIA leaderboard
+                - [ ] Compare your results with others!
+                ---
+                **Ready to start benchmarking?** Begin with the Model Setup tab and work your way through each stage. Good luck! 🚀
                 """)
         return app
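
The new Information tab advises validating the JSONL submission file before uploading, but the diff does not show how. The sketch below is one possible pre-upload check and is not part of `app.py`. The function name `validate_submission_lines` is hypothetical. Treating `task_id` and `model_answer` as required and `reasoning_trace` as optional is an assumption based on the sample records in the "Submission Format" section, not on the leaderboard's own validation rules.

```python
import json

# Assumption: these two keys are mandatory; "reasoning_trace" is treated as optional.
REQUIRED_KEYS = ("task_id", "model_answer")


def validate_submission_lines(lines):
    """Check JSONL records; return a list of (line_number, problem) tuples."""
    problems = []
    seen_ids = set()
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # blank lines are skipped rather than reported
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            if key not in record:
                problems.append((number, f"missing key: {key}"))
        task_id = record.get("task_id")
        if isinstance(task_id, str):
            if task_id in seen_ids:
                problems.append((number, f"duplicate task_id: {task_id}"))
            seen_ids.add(task_id)
    return problems


if __name__ == "__main__":
    sample = [
        '{"task_id": "gaia_001", "model_answer": "42", "reasoning_trace": "steps"}',
        '{"task_id": "gaia_002", "model_answer": "Paris"}',
    ]
    print(validate_submission_lines(sample))
```

To check a downloaded file, pass an open file handle, for example `validate_submission_lines(open(path, encoding="utf-8"))`. An empty list means none of these particular checks failed. It does not guarantee that the leaderboard will accept the file.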