---
title: ReproAgent
emoji: 🔬
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.12.0
python_version: 3.12
app_file: server/app.py
pinned: false
---

<p align="center">
  <img src="assets/banner.png" alt="ReproAgent Banner" width="100%"/>
</p>

<h1 align="center">🔬 ReproAgent</h1>

<p align="center">
  <strong>An AI-powered agent that automatically reproduces machine learning research papers.</strong>
</p>

<p align="center">
  <a href="#-features"><img src="https://img.shields.io/badge/Features-8-blue?style=for-the-badge" alt="Features"/></a>
  <a href="#-quick-start"><img src="https://img.shields.io/badge/Python-3.10+-green?style=for-the-badge&logo=python&logoColor=white" alt="Python"/></a>
  <a href="#-license"><img src="https://img.shields.io/badge/License-MIT-orange?style=for-the-badge" alt="License"/></a>
  <a href="https://huggingface.co/spaces"><img src="https://img.shields.io/badge/🤗-HuggingFace_Spaces-yellow?style=for-the-badge" alt="HF Spaces"/></a>
</p>

<p align="center">
  Upload a research paper PDF → ReproAgent reads it → finds the repo → clones the code → sets up the environment → runs it → debugs errors → tunes hyperparameters → compares results.
</p>

---

## 🏆 OpenEnv Hackathon Submission

This project is submitted to the **OpenEnv Hackathon**. It is a fully OpenEnv-compliant environment built on top of the OpenEnv framework.

### Required Materials

- **Hugging Face Space**: [ReproAgent Live Demo](https://huggingface.co/spaces/username/reproagent)
- **Training Script (TRL/PPO)**: [Colab Notebook](training/train_reproagent.ipynb)
- **Evidence of Training**: We trained the agent with Proximal Policy Optimization (PPO) for 50 episodes.
  <br><img src="assets/reward_plot.png" alt="Reward Plot" width="400"/> <img src="assets/loss_plot.png" alt="Loss Plot" width="400"/>
- **Presentation**: [Mini-Blog on HuggingFace](https://huggingface.co/blog/reproagent-openenv) / [YouTube Demo (< 2 minutes)](https://youtube.com/watch?v=demo_link)

---

## 📑 Table of Contents

- [Overview](#-overview)
- [Features](#-features)
- [Architecture](#-architecture)
- [Quick Start](#-quick-start)
- [Usage](#-usage)
- [Project Structure](#-project-structure)
- [Configuration](#-configuration)
- [How It Works](#-how-it-works)
- [Validation](#-validation)
- [Docker Deployment](#-docker-deployment)
- [Contributing](#-contributing)
- [License](#-license)

---

## 📖 Overview

**ReproAgent** is an AI-driven framework built on [Gymnasium](https://gymnasium.farama.org/) (the Farama Foundation's maintained fork of OpenAI Gym) that automates the end-to-end reproduction of machine learning research papers. Given a PDF, it autonomously:

1. **Parses** the paper to extract title, metrics, datasets, and GitHub links
2. **Clones** the linked repository
3. **Sets up** the environment (conda/venv) and installs dependencies
4. **Runs** inference or training scripts
5. **Debugs** errors using real traceback analysis
6. **Tunes** hyperparameters to close the gap between reproduced and claimed results
7. **Compares** final metrics against the paper's claims

It supports both a **Simulation** mode (safe, no system changes) and a **Real Execution** mode (actually clones repos, creates environments, and runs code on your machine).
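
To make the parsing step concrete, here is a minimal sketch of pulling GitHub links out of extracted paper text and cleaning trailing punctuation. The regex, the `extract_github_links` name, and the sample text are illustrative assumptions, not ReproAgent's actual implementation (which also uses an LLM before falling back to regex).

```python
import re

# Illustrative pattern (an assumption, not ReproAgent's actual regex):
# matches https://github.com/<owner>/<repo>.
GITHUB_URL = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")


def extract_github_links(text: str) -> list[str]:
    """Return unique GitHub repository URLs in order of appearance."""
    links: list[str] = []
    for match in GITHUB_URL.finditer(text):
        # The repo-name character class allows dots, so a sentence-ending
        # period can be captured; strip it, then drop a ".git" suffix.
        url = match.group(0).rstrip(".")
        if url.endswith(".git"):
            url = url[: -len(".git")]
        if url not in links:
            links.append(url)
    return links


sample = (
    "Code is available at https://github.com/example-org/example-repo. "
    "Mirror: (https://github.com/example-org/example-repo.git)."
)
print(extract_github_links(sample))
```

Both mentions in the sample resolve to the same repository, so the function returns a single cleaned URL.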

---

## ✨ Features

| Feature | Description |
|---------|-------------|
| 📄 **PDF Parsing** | Extracts metadata using Groq LLM (llama-3.3-70b) with regex fallback |
| 🔍 **Repo Discovery** | Finds GitHub links in the paper text and cleans trailing punctuation |
| 📦 **Smart Environment Setup** | Auto-detects `requirements.txt`, `environment.yml`, or `pyproject.toml` and creates the matching environment (pip venv or conda) |
| 🧠 **Intelligent Entry Point** | Scans for `inference.py`, `eval.py`, `main.py`, or `train.py`, or extracts scripts from README bash blocks |
| 🐛 **Real Error Debugging** | Captures actual `stderr` tracebacks and feeds them into the debugging pipeline |
| 🧪 **Hyperparameter Tuning** | Modifies learning rate, batch size, optimizer, and epochs to reproduce paper metrics |
| 📊 **Dynamic Metric Extraction** | Extracts the actual evaluation metric (FID, BLEU, accuracy, PSNR, etc.) from the paper rather than hardcoding it |
| 🖥️ **Gradio Web UI** | Web interface with live logs, state tracking, and result visualization |
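
As an illustration of the environment-setup feature, the sketch below picks a setup strategy from whichever dependency file a cloned repository contains. The function name, the priority order (conda first), and the returned commands are assumptions made for the example; the project's own detection logic may differ.

```python
import tempfile
from pathlib import Path


def detect_env_strategy(repo_dir: Path) -> tuple[str, list[str]]:
    """Return (env_type, install_command) based on files in repo_dir.

    The priority order below is an assumption for illustration.
    """
    if (repo_dir / "environment.yml").exists():
        return "conda", ["conda", "env", "create", "-f", "environment.yml"]
    if (repo_dir / "requirements.txt").exists():
        return "venv", ["pip", "install", "-r", "requirements.txt"]
    if (repo_dir / "pyproject.toml").exists():
        return "venv", ["pip", "install", "."]
    # No dependency file found: fall back to a bare virtual environment.
    return "venv", []


# Demo on a throwaway directory that mimics a cloned repo.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "requirements.txt").write_text("numpy\n")
    print(detect_env_strategy(repo))
```

The function only inspects the file system and returns a command; actually running that command is left to the caller, which keeps the detection step safe to use in Simulation mode.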

---

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          Gradio Web UI                          │
│                         (server/app.py)                         │
└──────────────────────────┬──────────────────────────────────────┘
                           │
              ┌────────────▼────────────┐
              │    Reasoning Agent      │
              │  (agents/reasoning_     │
              │       agent.py)         │
              └────────────┬────────────┘
                           │ select_action()
              ┌────────────▼────────────┐
              │  Gymnasium Environment  │
              │  (reproagent/           │
              │   environment.py)       │
              │                         │
              │  ┌─────────────────┐    │
              │  │  State Machine  │    │
              │  │  ┌───────────┐  │    │
              │  │  │ Parsing   │  │    │
              │  │  │ RepoAnalys│  │    │
              │  │  │ Setup     │  │    │
              │  │  │ Execution │  │    │
              │  │  │ Debugging │  │    │
              │  │  │ Experiment│  │    │
              │  │  │ Comparison│  │    │
              │  │  └───────────┘  │    │
              │  └─────────────────┘    │
              └─────────────────────────┘
                    │             │
         ┌──────────┘             └──────────┐
         ▼                                   ▼
 ┌───────────────┐                  ┌────────────────┐
 │  Simulation   │                  │ Real Execution │
 │  (mock state  │                  │  (subprocess,  │
 │  transitions) │                  │   git clone,   │
 │               │                  │   conda/venv)  │
 └───────────────┘                  └────────────────┘
```

---

## 🚀 Quick Start

### Prerequisites

- **Python** 3.10+
- **Git** (for real execution mode)
- **Conda** (optional, for repos that use `environment.yml`)
- A **Groq API key** (free at [console.groq.com](https://console.groq.com))

### Installation

```bash
# 1. Clone the repository
git clone https://github.com/your-username/ReproAgent.git
cd ReproAgent

# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate      # macOS/Linux
# .\venv\Scripts\activate     # Windows: run this line instead

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set up environment variables
cp .env.example .env
# Edit .env and add your GROQ_API_KEY
```

### Run

```bash
# Launch the Gradio web interface
python server/app.py
```

The UI will be available at `http://localhost:7860`, along with a public share link.

---

## 💻 Usage

### Web Interface (Recommended)

1. Open the Gradio UI at `http://localhost:7860`
2. **Upload** a research paper PDF (or paste a URL)
3. Choose **Execution Mode**:
   - `Simulation`: safe demo, no system changes
   - `Real Execution`: actually clones repos and runs code
4. Set **Clone Directory** (where repos will be cloned, e.g. `D:\reproductions`)
5. Click **Start Reproduction** and watch the agent work in real time

### Command Line

```bash
# Run validation to ensure everything works
python validate.py

# Run a quick inference test
python inference.py
```

### Programmatic API

```python
from reproagent.environment import ReproAgentEnv
from agents.reasoning_agent import create_agent

# Create environment
env = ReproAgentEnv(
    difficulty="easy",
    max_steps=100,
    use_llm=True,
    exec_mode="Real Execution",
    workspace_dir="./workspace",
)

# Create agent
agent = create_agent(env, agent_type="reasoning", use_llm=True)

# Run episode
obs, info = env.reset()
agent.reset()
for step in range(100):
    action = agent.select_action(obs, info)
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"Step {step}: {info['action_type']} | reward={reward:.2f}")
    if terminated or truncated:
        break
```

---

## 📁 Project Structure

```
ReproAgent/
├── reproagent/                  # Core Gymnasium environment
│   ├── __init__.py
│   ├── environment.py           # Main env with action implementations
│   ├── state.py                 # Dataclasses for full reproduction state
│   ├── actions.py               # Action space definition (30+ actions)
│   ├── reward.py                # Multi-component reward function
│   ├── models.py                # LLM client (Groq, OpenAI, HuggingFace)
│   └── papers.py                # Paper dataset loader
│
├── agents/                      # Agent implementations
│   ├── reasoning_agent.py       # Phase-based reasoning agent
│   ├── paper_parser.py          # PDF text extraction + LLM analysis
│   ├── repo_analyzer.py         # Repository structure analysis
│   └── debugger.py              # Error traceback analysis
│
├── server/
│   └── app.py                   # Gradio web interface (900+ lines)
│
├── utils/
│   ├── pdf_reader.py            # PDF extraction (PyPDF2 + pdfplumber)
│   └── github_utils.py          # GitHub API utilities
│
├── graders/                     # Reproduction quality grading
├── data/papers/                 # Sample paper configs (easy/medium/hard)
├── baseline/                    # Baseline agent implementations
├── static/                      # Static assets for UI
│
├── validate.py                  # Full validation suite
├── inference.py                 # CLI inference entry point
├── openenv.yaml                 # OpenEnv compatibility spec
├── pyproject.toml               # Python project metadata
├── requirements.txt             # pip dependencies
├── Dockerfile                   # Container deployment
├── run.bat / run.sh / run.ps1   # Platform-specific launchers
└── .env.example                 # Environment variable template
```

---

## ⚙️ Configuration

### Environment Variables

Create a `.env` file from the template:

```bash
cp .env.example .env
```

| Variable | Required | Description |
|----------|----------|-------------|
| `GROQ_API_KEY` | **Yes** | Groq API key for LLM-powered extraction ([get one free](https://console.groq.com)) |
| `OPENAI_API_KEY` | No | OpenAI API key (alternative LLM backend) |
| `HF_TOKEN` | No | HuggingFace token for model downloads |
| `GITHUB_TOKEN` | No | GitHub API token for higher rate limits |
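
A startup check along the following lines can fail fast when the required key is missing. This is a sketch under assumptions: the `check_env` helper is hypothetical, and it reads variables that are already in the process environment (loading the `.env` file itself is not shown).

```python
import os
from typing import Mapping, Optional

REQUIRED = ("GROQ_API_KEY",)
OPTIONAL = ("OPENAI_API_KEY", "HF_TOKEN", "GITHUB_TOKEN")


def check_env(environ: Optional[Mapping[str, str]] = None) -> dict[str, bool]:
    """Raise if a required variable is unset; report which optional ones are set."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: bool(environ.get(name)) for name in OPTIONAL}


# Demo with an explicit mapping instead of the real process environment.
print(check_env({"GROQ_API_KEY": "placeholder-key", "HF_TOKEN": "placeholder-token"}))
```

Passing a mapping explicitly keeps the check easy to test without touching the real environment.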

### Execution Modes

| Mode | What it does | Use case |
|------|--------------|----------|
| **Simulation** | Simulates all actions with mock state transitions | Safe demos, hackathons, testing |
| **Real Execution** | Runs `git clone`, `conda env create`, `pip install`, `python script.py` on your system | Actually reproducing papers |
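
The difference between the two modes can be sketched as a single dispatch function: in Simulation mode nothing is executed, while in Real Execution mode the command runs in a subprocess and its output is captured. The `run_step` helper and its return shape are assumptions for illustration; only `subprocess.run` and its parameters are standard-library API.

```python
import subprocess
import sys


def run_step(cmd: list[str], mode: str = "Simulation", timeout: float = 60.0) -> dict:
    """Run cmd for real, or return a mock result when simulating."""
    if mode != "Real Execution":
        # Simulation: never touch the system, just echo what would run.
        return {"returncode": 0, "stdout": "[simulated] " + " ".join(cmd), "stderr": ""}
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return {"returncode": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}


# A deliberately failing script, to show that stderr carries the traceback.
failing = [sys.executable, "-c", "raise ValueError('boom')"]
print(run_step(failing)["stdout"])
result = run_step(failing, mode="Real Execution")
print(result["returncode"], result["stderr"].strip().splitlines()[-1])
```

Keeping both modes behind one function means the rest of the pipeline does not need to know which mode is active.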

---

## 🔄 How It Works

The agent follows a **phase-based state machine** with seven phases:

```
PARSING → REPO_ANALYSIS → SETUP → EXECUTION → DEBUGGING → EXPERIMENTATION → COMPARISON
```
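
One simple way to model this sequence is an ordered enum with a helper that returns the following phase. The sketch below assumes a strictly linear order; the real agent may, for example, skip debugging when a run succeeds, and the `Phase` and `next_phase` names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    PARSING = "parsing"
    REPO_ANALYSIS = "repo_analysis"
    SETUP = "setup"
    EXECUTION = "execution"
    DEBUGGING = "debugging"
    EXPERIMENTATION = "experimentation"
    COMPARISON = "comparison"


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase after current, or None once COMPARISON is reached."""
    order = list(Phase)  # Enum iteration preserves definition order
    position = order.index(current)
    return order[position + 1] if position + 1 < len(order) else None


# Walk the whole pipeline from the first phase to the last.
phase: Optional[Phase] = Phase.PARSING
while phase is not None:
    print(phase.name)
    phase = next_phase(phase)
```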

### Phase Details

| Phase | Actions | What Happens |
|-------|---------|--------------|
| **Parsing** | `PARSE_PDF`, `EXTRACT_GITHUB`, `EXTRACT_METRICS` | LLM reads paper, extracts title, GitHub URL, target metric (e.g., FID=7.5) |
| **Repo Analysis** | `CLONE_REPO`, `READ_README`, `FIND_ENTRY_POINT`, `EXTRACT_DEPS` | Clones repo, reads README, finds scripts from bash blocks, detects `environment.yml` |
| **Setup** | `CREATE_VENV`, `INSTALL_REQUIREMENTS`, `VERIFY_SETUP` | Creates conda/venv env, installs deps, verifies setup |
| **Execution** | `RUN_TRAINING`, `RUN_EVAL`, `CHECK_LOGS` | Runs the entry point script via subprocess, captures stdout/stderr |
| **Debugging** | `ANALYZE_ERROR`, `SEARCH_SOLUTION`, `APPLY_FIX` | Parses real Python tracebacks, proposes and applies fixes |
| **Experimentation** | `MODIFY_LR`, `MODIFY_BATCH`, `RUN_EXPERIMENT` | Tunes hyperparameters to close the metric gap |
| **Comparison** | `COMPARE_RESULTS`, `GENERATE_REPORT` | Compares reproduced metric vs. paper claim, generates summary |
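
For the Debugging phase, the traceback-parsing step might look roughly like the sketch below, which pulls the innermost frame and the final exception line out of captured `stderr`. The regexes and the `parse_traceback` name are assumptions for illustration; they handle only standard CPython tracebacks whose exception names end in `Error` or `Exception`.

```python
import re
from typing import Optional

# Frame lines look like:   File "train.py", line 12, in <module>
FRAME = re.compile(r'^\s*File "(?P<file>.+?)", line (?P<line>\d+)')
# Final line looks like:  ModuleNotFoundError: No module named 'torch'
ERROR = re.compile(r"^(?P<type>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<message>.*)$")


def parse_traceback(stderr: str) -> Optional[dict]:
    """Return the last exception and innermost frame found, or None."""
    frame = None
    error = None
    for line in stderr.splitlines():
        frame_match = FRAME.match(line)
        if frame_match:
            frame = frame_match  # keep the most recent (innermost) frame
            continue
        error_match = ERROR.match(line)
        if error_match:
            error = error_match
    if error is None:
        return None
    return {
        "type": error["type"],
        "message": error["message"],
        "file": frame["file"] if frame else None,
        "line": int(frame["line"]) if frame else None,
    }


sample_stderr = '''Traceback (most recent call last):
  File "train.py", line 12, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
'''
print(parse_traceback(sample_stderr))
```

A structured result like this is what a fix-proposal step could key on, for instance mapping `ModuleNotFoundError` to a `pip install` action.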

### Reward Function

The environment provides a multi-component reward signal:

- **Phase progress** (+10 for advancing through phases)
- **Code execution** (+20 for successful script runs)
- **Error fixing** (+15 per resolved error)
- **Metric improvement** (scaled by how close the reproduced result is to the paper's claim)
- **Time penalty** (-0.01 per step to encourage efficiency)
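
The components above could be combined per step roughly as follows. The bonus and penalty constants come from the list; the metric term is an assumption, since the README does not give its formula: here it is a hypothetical weight of 10 multiplied by a closeness score of `1 - |reproduced - target| / |target|`, clipped at zero.

```python
from typing import Optional

PHASE_BONUS = 10.0      # from the list above
EXECUTION_BONUS = 20.0  # from the list above
FIX_BONUS = 15.0        # per resolved error, from the list above
STEP_PENALTY = 0.01     # from the list above
METRIC_WEIGHT = 10.0    # assumption: the README does not state this scale


def metric_closeness(reproduced: float, target: float) -> float:
    """Assumed closeness score in [0, 1]; 1.0 means an exact match."""
    if target == 0:
        return 1.0 if reproduced == 0 else 0.0
    return max(0.0, 1.0 - abs(reproduced - target) / abs(target))


def step_reward(
    phase_advanced: bool = False,
    script_succeeded: bool = False,
    errors_fixed: int = 0,
    reproduced: Optional[float] = None,
    target: Optional[float] = None,
) -> float:
    """Sum the reward components for a single environment step."""
    reward = -STEP_PENALTY
    if phase_advanced:
        reward += PHASE_BONUS
    if script_succeeded:
        reward += EXECUTION_BONUS
    reward += FIX_BONUS * errors_fixed
    if reproduced is not None and target is not None:
        reward += METRIC_WEIGHT * metric_closeness(reproduced, target)
    return reward


# Example: the run succeeded and reproduced FID 8.0 against a claimed 7.5.
print(round(step_reward(script_succeeded=True, reproduced=8.0, target=7.5), 3))
```

A relative-error closeness score treats metrics on different scales uniformly, but it does not know whether lower or higher is better for a given metric; a real implementation would likely need that distinction.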

---

## ✅ Validation

Run the full validation suite to confirm everything works:

```bash
python validate.py
```

This tests:

| Test | What it validates |
|------|-------------------|
| Environment | `ReproAgentEnv` creates, resets, and steps correctly |
| Spaces | Observation and action spaces match the Gymnasium spec |
| Episodes | Full multi-step episodes run without crashes |
| Agents | `ReasoningAgent` and `RandomAgent` interact with the env |
| Demo | Gradio app imports successfully |
| Graders | Reproduction quality grader loads |
| OpenEnv | `openenv.yaml` is present and well-formed |

Expected output:

```
ENVIRONMENT ✅ PASSED
AGENTS ✅ PASSED
DEMO ✅ PASSED
GRADERS ✅ PASSED
OPENENV_YAML ✅ PASSED
🎉 ALL VALIDATIONS PASSED!
✅ System is ready for deployment
```

---

## 🐳 Docker Deployment

```bash
# Build the image
docker build -t reproagent .

# Run with your API key
docker run -p 7860:7860 -e GROQ_API_KEY=your_key_here reproagent
```

Or deploy to **HuggingFace Spaces**:

```bash
pip install gradio
gradio deploy
```

---

## 🛣️ Roadmap

- [x] Gymnasium-compatible environment with 30+ actions
- [x] Groq LLM integration with regex fallback
- [x] Gradio web interface with live logs
- [x] Real Execution mode (git clone, conda/venv, subprocess)
- [x] Dynamic metric extraction (FID, BLEU, accuracy, PSNR, etc.)
- [x] Bash block parsing from README for entry point discovery
- [ ] Multi-script sequential execution (run 5 scripts in order per README)
- [ ] Automatic checkpoint downloading from HuggingFace
- [ ] GPU-aware execution scheduling
- [ ] Result visualization and plot generation
- [ ] Support for Jupyter notebook-based repos

---

## 🤝 Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

---

## 📄 License

This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.

---

<p align="center">
  Built with ❤️ for the ML research community
</p>