
🤖 GAIA Benchmark Agent (LangGraph)

This is a LangGraph-powered agent for the HuggingFace Agents Course Final Assignment. It is designed to solve GAIA benchmark questions and score 30% or higher on Level 1 to earn the course certificate.

🎯 Goal

Score 30% or higher (6+ correct out of 20 questions) on the GAIA Level 1 benchmark to earn your Certificate of Completion.

Architecture

┌──────────────────────────────────────────────────────────┐
│                    LangGraph Workflow                    │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   ┌─────────┐     ┌─────────┐     ┌──────────────┐       │
│   │  START  │────▶│  Agent  │────▶│   Should     │       │
│   └─────────┘     │  Node   │     │  Continue?   │       │
│                   └─────────┘     └──┬────────┬──┘       │
│                        ▲         Yes │        │ No       │
│                        │             │        ▼          │
│                   ┌────┴────┐        │ ┌──────────────┐  │
│                   │  Tool   │◀───────┘ │   Extract    │  │
│                   │  Node   │          │   Answer     │  │
│                   └─────────┘          └──────┬───────┘  │
│                                               │          │
│                                               ▼          │
│                                          ┌─────────┐     │
│                                          │   END   │     │
│                                          └─────────┘     │
└──────────────────────────────────────────────────────────┘
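The control flow above can be sketched in plain Python. This is a minimal stand-in for the compiled LangGraph graph, not the actual `agent_enhanced.py` code; `llm` and `tools` are hypothetical stubs:

```python
def extract_answer(text):
    # Strip any "FINAL ANSWER:" prefix so only the bare answer remains.
    return text.split("FINAL ANSWER:")[-1].strip()

def run_agent(question, llm, tools, max_iterations=15):
    """Agent node proposes a tool call or a final answer; 'Should
    Continue?' routes back through the Tool node until no tool call
    remains or the iteration cap is hit."""
    messages = [("user", question)]
    for _ in range(max_iterations):
        action = llm(messages)               # Agent node
        if action.get("tool") is None:       # Should Continue? -> No
            return extract_answer(action["content"])
        tool_fn = tools[action["tool"]]      # Should Continue? -> Yes
        result = tool_fn(action["args"])     # Tool node
        messages.append(("tool", result))
    return extract_answer(messages[-1][1])   # fall back after the cap
```

The real agent wires the same loop as a `StateGraph` with a conditional edge between the agent node and the tool node.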

Available Tools

| Tool | Description | Use Case |
|------|-------------|----------|
| 🔍 `web_search` | DuckDuckGo web search | Current information, recent events, facts |
| 📚 `wikipedia_search` | Wikipedia API | Historical facts, biographies, definitions |
| 🐍 `python_executor` | Python REPL | Calculations, data processing, analysis |
| 📄 `read_file` | File reader | PDFs, text files, Excel spreadsheets |
| 🔒 `calculator` | Math evaluator | Quick mathematical calculations |
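The calculator tool, for instance, might be implemented as a restricted arithmetic evaluator along these lines (a sketch, not the shipped implementation):

```python
import ast
import operator

# Whitelist of arithmetic operators; anything outside it is rejected,
# so arbitrary Python can't be executed through this tool.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    """Safely evaluate a basic arithmetic expression."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval").body)
```

Walking the parsed AST instead of calling `eval()` keeps the tool safe even when the LLM produces unexpected input.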

🚀 Setup

Option 1: HuggingFace Spaces (Recommended for Certification)

  1. Fork/Duplicate this Space to your HuggingFace account

    • Go to the Space and click "Duplicate this Space"
    • Choose a name and make it Public (required for certification)
  2. Add API Key

    • Go to Space Settings > Secrets
    • Add a new secret: OPENAI_API_KEY with your OpenAI API key value
    • Click "Save secrets"
  3. Deploy

    • The Space will automatically build and deploy
    • Wait for the build to complete (usually 2-5 minutes)
  4. Test and Submit

    • Open the Space and test with a single question
    • Run the full benchmark
    • Submit to the leaderboard

Option 2: Local Development

# Clone the repository
git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE
cd YOUR_SPACE

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set environment variable
export OPENAI_API_KEY="sk-..."  # On Windows: set OPENAI_API_KEY=sk-...

# Run the app
python app.py

The app will be available at http://localhost:7860
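Before launching locally, it can help to confirm the key is actually visible to the process. A tiny sanity-check helper (hypothetical, not part of app.py):

```python
import os

def check_api_key(env=os.environ) -> bool:
    """Return True if OPENAI_API_KEY is set and non-empty in the
    given environment mapping (defaults to the real environment)."""
    return bool(env.get("OPENAI_API_KEY"))

if __name__ == "__main__":
    if not check_api_key():
        raise SystemExit("OPENAI_API_KEY is not set; export it before running app.py")
```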

📖 Usage

1. Test Single Question

  • Click "Fetch & Solve Random Question" to test the agent on one question
  • Review the answer and validation status
  • This helps verify the agent is working correctly before running the full benchmark

2. Run Full Benchmark

  • Click "Run Agent on All Questions"
  • The process takes approximately 10-15 minutes
  • Progress is shown in real-time
  • Results are displayed in a table
  • Answers are automatically formatted for submission

3. Submit to Leaderboard

  • After running the benchmark, go to the "Submit to Leaderboard" tab
  • Enter your HuggingFace username
  • Enter your Space URL (must be public and end with /tree/main)
  • Answers JSON is auto-filled
  • Click "Submit to Leaderboard"
  • View your score and ranking
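The `/tree/main` requirement can be handled with a small helper like this (illustrative; the actual app may normalize the URL differently):

```python
def normalize_space_url(url: str) -> str:
    """Ensure a Space URL ends with /tree/main, as the leaderboard expects."""
    url = url.rstrip("/")
    if not url.endswith("/tree/main"):
        url += "/tree/main"
    return url
```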

🎓 Tips for Better Scores

Answer Formatting (Critical!)

The GAIA benchmark uses exact string matching. Your answers must match the ground truth character-for-character.

✅ DO:

  • Give just the number: "42"
  • Use exact spelling: "John Smith"
  • Comma-separated lists with NO spaces: "apple,banana,cherry"
  • Just "Yes" or "No" (capitalized)
  • Follow the date format specified in the question

❌ DON'T:

  • Include prefixes like "FINAL ANSWER:" or "The answer is:"
  • Add explanations or context
  • Use different capitalization or spelling
  • Add spaces in comma-separated lists
  • Include units unless specifically requested
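These rules can also be enforced programmatically before submission. A sketch of such a post-processing step (a hypothetical helper, not part of the shipped code):

```python
import re

def clean_answer(raw: str) -> str:
    """Strip common prefixes and normalize list spacing so the answer
    survives GAIA's exact string matching."""
    answer = raw.strip()
    # Drop "FINAL ANSWER:" / "The answer is:" style prefixes.
    answer = re.sub(r"^(final answer|the answer is)\s*:\s*", "", answer,
                    flags=re.IGNORECASE)
    # Remove spaces after commas in comma-separated lists.
    answer = re.sub(r",\s+", ",", answer)
    return answer.strip()
```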

Agent Strategy

  1. File Priority: If a file is available, the agent reads it first - answers are often in the file
  2. Tool Selection: The agent automatically chooses the best tool for each task
  3. Iteration Limit: The agent has up to 15 iterations to solve each question
  4. Error Handling: The agent gracefully handles errors and tries alternative approaches

Best Practices

  1. Test First: Always test with a single question before running the full benchmark
  2. Review Answers: Check the validation status for each answer
  3. Verify Format: Ensure answers don't contain prefixes or explanations
  4. Public Space: Keep your Space public so the code link works for verification
  5. API Key: Ensure your OpenAI API key has sufficient credits

βš™οΈ Configuration

Modifying the Agent

The agent can be customized in agent_enhanced.py:

  • Model: Change model_name in GAIAAgent.__init__() (default: "gpt-4o")
  • Temperature: Adjust temperature (default: 0 for deterministic)
  • Max Iterations: Change max_iterations (default: 15)
  • System Prompt: Modify SYSTEM_PROMPT for different instructions
  • Tools: Add or remove tools from the TOOLS list
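The defaults above could be captured in a single config object, for example (illustrative only; the real `GAIAAgent.__init__()` signature may differ):

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    """Defaults mirroring the documented knobs in agent_enhanced.py."""
    model_name: str = "gpt-4o"   # OpenAI model to use
    temperature: float = 0.0     # 0 for deterministic output
    max_iterations: int = 15     # per-question tool-call budget
```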

Environment Variables

  • OPENAI_API_KEY: Required - Your OpenAI API key

πŸ› Troubleshooting

Common Issues

"Please provide your OpenAI API key"

  • Ensure OPENAI_API_KEY is set in Space Secrets (for HF Spaces) or environment variables (for local)

"Failed to fetch questions from API"

  • Check your internet connection
  • Verify the API URL is accessible: https://agents-course-unit4-scoring.hf.space
  • The API may be temporarily unavailable - try again later

"Agent error: ..."

  • Check that your OpenAI API key is valid and has credits
  • Verify the model name is correct (e.g., "gpt-4o")
  • Review the error message for specific issues

"Submission error: ..."

  • Ensure your Space URL is correct and public
  • Verify the URL ends with /tree/main (auto-added if missing)
  • Check that answers JSON is properly formatted
  • Ensure your HuggingFace username is correct

Low Scores (< 30%)

  • Review answer formatting - exact matching is critical
  • Check that answers don't contain prefixes or explanations
  • Verify file reading is working (some questions require file analysis)
  • Consider increasing max_iterations for complex questions
  • Test with single questions to identify patterns

Getting Help

πŸ“ Project Structure

certification/
├── app.py                 # Gradio interface and main entry point
├── agent_enhanced.py      # LangGraph agent implementation
├── requirements.txt       # Python dependencies
├── README.md              # This file
└── .gitignore             # Git ignore rules

🔗 Important Links

📊 Scoring

  • Target: 30%+ (6+ correct out of 20 questions)
  • Evaluation: Exact string matching
  • Questions: 20 Level 1 questions from GAIA validation set
  • Submission: Via the API endpoint /submit
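Exact matching and the 30% threshold boil down to a simple check (a sketch; the actual scoring happens server-side):

```python
def score(predictions: dict, ground_truth: dict) -> float:
    """Fraction of exact string matches across all questions."""
    correct = sum(predictions.get(task_id, "") == answer
                  for task_id, answer in ground_truth.items())
    return correct / len(ground_truth)

def passed(predictions: dict, ground_truth: dict) -> bool:
    """30% of 20 questions means 6 or more exact matches."""
    return score(predictions, ground_truth) >= 0.30
```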

πŸ† Certification

Once you achieve 30% or higher:

  1. Your score will appear on the Student Leaderboard
  2. You'll earn the Certificate of Completion
  3. Share your achievement!

πŸ“ License

MIT License

πŸ™ Acknowledgments