---
title: ReproAgent
emoji: 🔬
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.12.0
python_version: 3.12
app_file: server/app.py
pinned: false
---

An AI-powered agent that automatically reproduces machine learning research papers.
Upload a research paper PDF → ReproAgent reads it → finds the repo → clones the code → sets up the environment → runs it → debugs errors → tunes hyperparameters → compares results.

---

## OpenEnv Hackathon Submission

This project is submitted to the **OpenEnv Hackathon**. It is a fully compliant environment built on top of the framework.

### Required Materials

- **Hugging Face Space**: [ReproAgent Live Demo](https://huggingface.co/spaces/username/reproagent)
- **Training Script (TRL/PPO)**: [Colab Notebook](training/train_reproagent.ipynb)
- **Evidence of Training**: We trained the agent using Proximal Policy Optimization (PPO) over 50 episodes.
- **Presentation**: [Mini-Blog on HuggingFace](https://huggingface.co/blog/reproagent-openenv) / [YouTube Demo (< 2 minutes)](https://youtube.com/watch?v=demo_link)
---
## Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Architecture](#architecture)
- [Quick Start](#quick-start)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Configuration](#configuration)
- [How It Works](#how-it-works)
- [Validation](#validation)
- [Docker Deployment](#docker-deployment)
- [Roadmap](#roadmap)
- [Contributing](#contributing)
- [License](#license)
---
## Overview
**ReproAgent** is an AI-driven framework built on [Gymnasium](https://gymnasium.farama.org/) (the Farama Foundation's successor to OpenAI Gym) that automates the end-to-end reproduction of machine learning research papers. Given a PDF, it autonomously:
1. **Parses** the paper to extract title, metrics, datasets, and GitHub links
2. **Clones** the linked repository
3. **Sets up** the environment (conda/venv) and installs dependencies
4. **Runs** inference or training scripts
5. **Debugs** errors using real traceback analysis
6. **Tunes** hyperparameters to close the gap between reproduced and claimed results
7. **Compares** final metrics against the paper's claims

It supports both a **Simulation** mode (safe, no system changes) and a **Real Execution** mode (clones repos, creates environments, and runs code on your machine).

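
Step 1 above mentions pulling GitHub links out of the paper text. The snippet below is a minimal sketch of how that could work with a regular expression, not the project's actual parser: the pattern `GITHUB_URL` and the function `extract_github_links` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical pattern: github.com/<owner>/<repo>, with an optional scheme.
GITHUB_URL = re.compile(r"(?:https?://)?github\.com/[\w.-]+/[\w.-]+")


def extract_github_links(text: str) -> list[str]:
    """Return de-duplicated GitHub repository URLs found in `text`, in order."""
    links = []
    for match in GITHUB_URL.finditer(text):
        # Sentences often end right after a URL, so drop trailing punctuation.
        url = match.group(0).rstrip(".,;:)")
        if url.endswith(".git"):
            url = url[: -len(".git")]
        if not url.startswith("http"):
            url = "https://" + url
        if url not in links:
            links.append(url)
    return links


paper_text = "Code is available at https://github.com/example-org/example-repo."
print(extract_github_links(paper_text))
```

The trailing period in `paper_text` would otherwise end up inside the URL and make `git clone` fail, which is why the cleanup step matters.
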
---
## Features
| Feature | Description |
|---------|-------------|
| **PDF Parsing** | Extracts metadata using a Groq LLM (llama-3.3-70b) with a regex fallback |
| **Repo Discovery** | Finds GitHub links in the paper text and strips trailing punctuation |
| **Smart Environment Setup** | Auto-detects `requirements.txt`, `environment.yml`, or `pyproject.toml` and creates the correct env (pip venv or conda) |
| **Intelligent Entry Point** | Scans for `inference.py`, `eval.py`, `main.py`, `train.py`, or extracts scripts from README bash blocks |
| **Real Error Debugging** | Captures actual `stderr` tracebacks and feeds them into the debugging pipeline |
| **Hyperparameter Tuning** | Modifies learning rate, batch size, optimizer, and epochs to reproduce paper metrics |
| **Dynamic Metric Extraction** | Extracts the actual evaluation metric (FID, BLEU, accuracy, PSNR, etc.) from the paper instead of hardcoding one |
| **Gradio Web UI** | Web interface with live logs, state tracking, and result visualization |
---
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                          Gradio Web UI                          │
│                         (server/app.py)                         │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │     Reasoning Agent     │
                    │   (agents/reasoning_    │
                    │        agent.py)        │
                    └────────────┬────────────┘
                                 │ select_action()
                    ┌────────────▼────────────┐
                    │  Gymnasium Environment  │
                    │      (reproagent/       │
                    │     environment.py)     │
                    │                         │
                    │   ┌─────────────────┐   │
                    │   │  State Machine  │   │
                    │   │  ┌───────────┐  │   │
                    │   │  │ Parsing   │  │   │
                    │   │  │ RepoAnalys│  │   │
                    │   │  │ Setup     │  │   │
                    │   │  │ Execution │  │   │
                    │   │  │ Debugging │  │   │
                    │   │  │ Experiment│  │   │
                    │   │  │ Comparison│  │   │
                    │   │  └───────────┘  │   │
                    │   └─────────────────┘   │
                    └─────────────────────────┘
                           │           │
                ┌──────────┘           └──────────┐
                ▼                                 ▼
        ┌───────────────┐                 ┌────────────────┐
        │  Simulation   │                 │ Real Execution │
        │  (mock state  │                 │ (subprocess,   │
        │  transitions) │                 │  git clone,    │
        │               │                 │  conda/venv)   │
        └───────────────┘                 └────────────────┘
```
---
## Quick Start
### Prerequisites
- **Python** 3.10+
- **Git** (for real execution mode)
- **Conda** (optional, for repos that use `environment.yml`)
- A **Groq API key** (free at [console.groq.com](https://console.groq.com))
### Installation
```bash
# 1. Clone the repository
git clone https://github.com/your-username/ReproAgent.git
cd ReproAgent
# 2. Create a virtual environment
python -m venv venv
# Windows
.\venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set up environment variables
cp .env.example .env
# Edit .env and add your GROQ_API_KEY
```
### Run
```bash
# Launch the Gradio web interface
python server/app.py
```

The UI will be available at `http://localhost:7860`, along with a public share link.

---
## Usage
### Web Interface (Recommended)
1. Open the Gradio UI at `http://localhost:7860`
2. **Upload** a research paper PDF (or paste a URL)
3. Choose **Execution Mode**:
   - `Simulation`: safe demo, no system changes
   - `Real Execution`: clones repos and runs code on your machine
4. Set **Clone Directory** (where repos will be cloned, e.g. `D:\reproductions`)
5. Click **Start Reproduction** and watch the agent work in real time
### Command Line
```bash
# Run validation to ensure everything works
python validate.py
# Run a quick inference test
python inference.py
```
### Programmatic API
```python
from reproagent.environment import ReproAgentEnv
from agents.reasoning_agent import create_agent

# Create environment
env = ReproAgentEnv(
    difficulty="easy",
    max_steps=100,
    use_llm=True,
    exec_mode="Real Execution",
    workspace_dir="./workspace"
)

# Create agent
agent = create_agent(env, agent_type="reasoning", use_llm=True)

# Run episode
obs, info = env.reset()
agent.reset()
for step in range(100):
    action = agent.select_action(obs, info)
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"Step {step}: {info['action_type']} | reward={reward:.2f}")
    if terminated or truncated:
        break
```
---
## Project Structure
```
ReproAgent/
├── reproagent/                 # Core Gymnasium environment
│   ├── __init__.py
│   ├── environment.py          # Main env with action implementations
│   ├── state.py                # Dataclasses for full reproduction state
│   ├── actions.py              # Action space definition (30+ actions)
│   ├── reward.py               # Multi-component reward function
│   ├── models.py               # LLM client (Groq, OpenAI, HuggingFace)
│   └── papers.py               # Paper dataset loader
│
├── agents/                     # Agent implementations
│   ├── reasoning_agent.py      # Phase-based reasoning agent
│   ├── paper_parser.py         # PDF text extraction + LLM analysis
│   ├── repo_analyzer.py        # Repository structure analysis
│   └── debugger.py             # Error traceback analysis
│
├── server/
│   └── app.py                  # Gradio web interface (900+ lines)
│
├── utils/
│   ├── pdf_reader.py           # PDF extraction (PyPDF2 + pdfplumber)
│   └── github_utils.py         # GitHub API utilities
│
├── graders/                    # Reproduction quality grading
├── data/papers/                # Sample paper configs (easy/medium/hard)
├── baseline/                   # Baseline agent implementations
├── static/                     # Static assets for UI
│
├── validate.py                 # Full validation suite
├── inference.py                # CLI inference entry point
├── openenv.yaml                # OpenEnv compatibility spec
├── pyproject.toml              # Python project metadata
├── requirements.txt            # pip dependencies
├── Dockerfile                  # Container deployment
├── run.bat / run.sh / run.ps1  # Platform-specific launchers
└── .env.example                # Environment variable template
```
---
## Configuration
### Environment Variables
Create a `.env` file from the template:
```bash
cp .env.example .env
```
| Variable | Required | Description |
|----------|----------|-------------|
| `GROQ_API_KEY` | **Yes** | Groq API key for LLM-powered extraction ([get one free](https://console.groq.com)) |
| `OPENAI_API_KEY` | No | OpenAI API key (alternative LLM backend) |
| `HF_TOKEN` | No | HuggingFace token for model downloads |
| `GITHUB_TOKEN` | No | GitHub API token for higher rate limits |
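
A startup check for the variables in the table could look like the sketch below. The function `check_env` is a hypothetical helper, not part of the project; it only inspects a mapping (by default `os.environ`) and assumes the `.env` file has already been loaded into the process environment by whatever mechanism the project uses.

```python
import os

REQUIRED = ["GROQ_API_KEY"]
OPTIONAL = ["OPENAI_API_KEY", "HF_TOKEN", "GITHUB_TOKEN"]


def check_env(env=None) -> dict[str, bool]:
    """Report which variables are set; raise if a required one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required variables: {', '.join(missing)}")
    return {name: bool(env.get(name)) for name in REQUIRED + OPTIONAL}


# Example with a plain dict standing in for the real environment.
print(check_env({"GROQ_API_KEY": "dummy-key", "HF_TOKEN": "dummy-token"}))
```
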
### Execution Modes
| Mode | What it does | Use case |
|------|-------------|----------|
| **Simulation** | Simulates all actions with mock state transitions | Safe demos, hackathons, testing |
| **Real Execution** | Runs `git clone`, `conda env create`, `pip install`, `python script.py` on your system | Actually reproducing papers |
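
To make the Real Execution row more concrete, here is a hedged sketch of how the setup commands might be chosen from the dependency files a repo contains. `plan_setup`, the environment name `repro-env`, the `.venv` directory, and the precedence of `environment.yml` over `requirements.txt` over `pyproject.toml` are all assumptions for illustration. The function only plans commands and does not run them.

```python
import tempfile
from pathlib import Path


def plan_setup(repo: Path, env_name: str = "repro-env") -> list[list[str]]:
    """Return the commands (as argv lists) that would build an env for `repo`."""
    if (repo / "environment.yml").is_file():
        return [["conda", "env", "create", "-f", "environment.yml", "-n", env_name]]
    commands = [["python", "-m", "venv", ".venv"]]
    pip = ".venv/bin/pip"  # POSIX layout; on Windows this is .venv\Scripts\pip.exe
    if (repo / "requirements.txt").is_file():
        commands.append([pip, "install", "-r", "requirements.txt"])
    elif (repo / "pyproject.toml").is_file():
        commands.append([pip, "install", "."])
    return commands


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "requirements.txt").write_text("numpy\n", encoding="utf-8")
    for command in plan_setup(repo):
        print(" ".join(command))
```

Returning argv lists rather than shell strings keeps the plan easy to log in Simulation mode and safe to pass to `subprocess.run` without `shell=True` in Real Execution mode.
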
---
## How It Works
The agent follows a **phase-based state machine** with 7 phases:
```
PARSING → REPO_ANALYSIS → SETUP → EXECUTION → DEBUGGING → EXPERIMENTATION → COMPARISON
```
### Phase Details
| Phase | Actions | What Happens |
|-------|---------|--------------|
| **Parsing** | `PARSE_PDF`, `EXTRACT_GITHUB`, `EXTRACT_METRICS` | LLM reads paper, extracts title, GitHub URL, target metric (e.g., FID=7.5) |
| **Repo Analysis** | `CLONE_REPO`, `READ_README`, `FIND_ENTRY_POINT`, `EXTRACT_DEPS` | Clones repo, reads README, finds scripts from bash blocks, detects `environment.yml` |
| **Setup** | `CREATE_VENV`, `INSTALL_REQUIREMENTS`, `VERIFY_SETUP` | Creates conda/venv env, installs deps, verifies setup |
| **Execution** | `RUN_TRAINING`, `RUN_EVAL`, `CHECK_LOGS` | Runs the entry point script via subprocess, captures stdout/stderr |
| **Debugging** | `ANALYZE_ERROR`, `SEARCH_SOLUTION`, `APPLY_FIX` | Parses real Python tracebacks, proposes and applies fixes |
| **Experimentation** | `MODIFY_LR`, `MODIFY_BATCH`, `RUN_EXPERIMENT` | Tunes hyperparameters to close the metric gap |
| **Comparison** | `COMPARE_RESULTS`, `GENERATE_REPORT` | Compares reproduced metric vs. paper claim, generates summary |
### Reward Function
The environment provides a multi-component reward signal:
- **Phase progress** (+10 for advancing through phases)
- **Code execution** (+20 for successful script runs)
- **Error fixing** (+15 per resolved error)
- **Metric improvement** (scaled by how close the reproduced result is to the paper's claim)
- **Time penalty** (-0.01 per step to encourage efficiency)
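
The components above could be combined per step roughly as follows. The fixed bonuses (+10, +20, +15, -0.01) come from the list; the closeness formula and the `metric_weight` of 50 are assumptions, since the exact metric scaling is not specified here. `step_reward` and `metric_closeness` are hypothetical names.

```python
STEP_PENALTY = -0.01


def metric_closeness(reproduced: float, claimed: float) -> float:
    """Map the relative gap to [0, 1]; 1.0 means an exact match (assumed form)."""
    if claimed == 0:
        return 1.0 if reproduced == 0 else 0.0
    gap = abs(reproduced - claimed) / abs(claimed)
    return max(0.0, 1.0 - gap)


def step_reward(*, advanced_phase=False, script_succeeded=False, errors_fixed=0,
                reproduced=None, claimed=None, metric_weight=50.0) -> float:
    """Sum the reward components for a single environment step."""
    reward = STEP_PENALTY                 # time penalty applies to every step
    if advanced_phase:
        reward += 10.0                    # phase progress
    if script_succeeded:
        reward += 20.0                    # successful script run
    reward += 15.0 * errors_fixed         # per resolved error
    if reproduced is not None and claimed is not None:
        reward += metric_weight * metric_closeness(reproduced, claimed)
    return reward


# A step that advances a phase and reproduces FID 9.0 against a claimed 7.5.
print(round(step_reward(advanced_phase=True, reproduced=9.0, claimed=7.5), 2))
```
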
---
## Validation
Run the full validation suite to confirm everything works:
```bash
python validate.py
```
This tests:
| Test | What it validates |
|------|-------------------|
| Environment | `ReproAgentEnv` creates, resets, steps correctly |
| Spaces | Observation and action spaces match the Gymnasium spec |
| Episodes | Full multi-step episodes run without crashes |
| Agents | `ReasoningAgent` and `RandomAgent` interact with the env |
| Demo | Gradio app imports successfully |
| Graders | Reproduction quality grader loads |
| OpenEnv | `openenv.yaml` is present and well-formed |
Expected output:
```
ENVIRONMENT ✅ PASSED
AGENTS ✅ PASSED
DEMO ✅ PASSED
GRADERS ✅ PASSED
OPENENV_YAML ✅ PASSED
ALL VALIDATIONS PASSED!
✅ System is ready for deployment
```
---
## Docker Deployment
```bash
# Build the image
docker build -t reproagent .
# Run with your API key
docker run -p 7860:7860 -e GROQ_API_KEY=your_key_here reproagent
```
Or deploy to **HuggingFace Spaces**:
```bash
pip install gradio
gradio deploy
```
---
## Roadmap
- [x] Gymnasium-compatible environment with 30+ actions
- [x] Groq LLM integration with regex fallback
- [x] Gradio web interface with live logs
- [x] Real Execution mode (git clone, conda/venv, subprocess)
- [x] Dynamic metric extraction (FID, BLEU, accuracy, PSNR, etc.)
- [x] Bash block parsing from README for entry point discovery
- [ ] Multi-script sequential execution (run 5 scripts in order per README)
- [ ] Automatic checkpoint downloading from HuggingFace
- [ ] GPU-aware execution scheduling
- [ ] Result visualization and plot generation
- [ ] Support for Jupyter notebook-based repos
---
## Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
---
## License
This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.

---
Built with ❤️ for the ML research community