title: ReproAgent
emoji: 🔬
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.12.0
python_version: 3.12
app_file: server/app.py
pinned: false
🔬 ReproAgent
An AI-powered agent that automatically reproduces machine learning research papers.
Upload a research paper PDF → ReproAgent reads it → finds the repo → clones the code → sets up the environment → runs it → debugs errors → tunes hyperparameters → compares results.
OpenEnv Hackathon Submission
This project is submitted to the OpenEnv Hackathon. It is a fully compliant environment built on top of the OpenEnv framework.
Required Materials
- Hugging Face Space: ReproAgent Live Demo
- Training Script (TRL/PPO): Colab Notebook
- Evidence of Training: We trained the agent using Proximal Policy Optimization (PPO) over 50 episodes.

- Presentation: Mini-Blog on HuggingFace / YouTube Demo (< 2 minutes)
Table of Contents
- Overview
- Features
- Architecture
- Quick Start
- Usage
- Project Structure
- Configuration
- How It Works
- Validation
- Docker Deployment
- Contributing
- License
Overview
ReproAgent is an AI-driven framework built on Gymnasium (the maintained successor to OpenAI Gym) that automates the end-to-end reproduction of machine learning research papers. Given a PDF, it autonomously:
- Parses the paper to extract title, metrics, datasets, and GitHub links
- Clones the linked repository
- Sets up the environment (conda/venv) and installs dependencies
- Runs inference or training scripts
- Debugs errors using real traceback analysis
- Tunes hyperparameters to close the gap between reproduced and claimed results
- Compares final metrics against the paper's claims
It supports both a Simulation mode (safe, no system changes) and a Real Execution mode (actually clones repos, creates envs, runs code on your machine).
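As a rough illustration of the traceback analysis step listed above, the sketch below pulls the exception type, message, and innermost frame out of captured stderr. The helper name `summarize_traceback` is hypothetical and not part of ReproAgent's documented API. It also assumes the exception line is the last non-empty line of stderr, which will not hold if a script prints more output after crashing.

```python
import re
from typing import Optional

# Hypothetical helper for illustration; not ReproAgent's actual debugger.
_FRAME_RE = re.compile(r'File "(?P<file>[^"]+)", line (?P<line>\d+)')


def summarize_traceback(stderr: str) -> Optional[dict]:
    """Return exception type, message, and innermost frame, or None if no traceback."""
    if "Traceback (most recent call last):" not in stderr:
        return None
    lines = [ln for ln in stderr.strip().splitlines() if ln.strip()]
    # Assumption: the last non-empty line looks like "ExceptionType: message".
    exc_type, _, message = lines[-1].partition(":")
    frames = _FRAME_RE.findall(stderr)  # list of (file, line) tuples
    file, line = frames[-1] if frames else (None, None)
    return {
        "type": exc_type.strip(),
        "message": message.strip(),
        "file": file,
        "line": int(line) if line else None,
    }
```

A summary like this could then be handed to an LLM prompt or a rule table that maps, say, `ModuleNotFoundError` to a `pip install` fix.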
✨ Features
| Feature | Description |
|---|---|
| PDF Parsing | Extracts metadata using Groq LLM (llama-3.3-70b) with regex fallback |
| Repo Discovery | Finds GitHub links in the paper text and strips trailing punctuation |
| 📦 Smart Environment Setup | Auto-detects requirements.txt, environment.yml, or pyproject.toml and creates the correct env (pip venv or conda) |
| Intelligent Entry Point | Scans for inference.py, eval.py, main.py, train.py, or extracts scripts from README bash blocks |
| Real Error Debugging | Captures actual stderr tracebacks and feeds them into the debugging pipeline |
| 🧪 Hyperparameter Tuning | Modifies learning rate, batch size, optimizer, and epochs to reproduce paper metrics |
| Dynamic Metric Extraction | Extracts the actual evaluation metric (FID, BLEU, accuracy, PSNR, etc.) from the paper (not hardcoded) |
| 🖥️ Gradio Web UI | Beautiful web interface with live logs, state tracking, and result visualization |
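The repo discovery row can be sketched with a plain regex pass. This is a minimal sketch, not the project's actual parser: the function name `find_github_repos` and the exact pattern are assumptions made for illustration.

```python
import re

# Assumed pattern: owner/repo made of word characters, dots, and hyphens.
GITHUB_URL_RE = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")


def find_github_repos(text: str) -> list:
    """Return unique GitHub repo URLs found in text, in order of appearance."""
    repos = []
    for match in GITHUB_URL_RE.finditer(text):
        # A sentence-ending period is matched by the pattern, so strip it here.
        url = match.group(0).rstrip(".,;:")
        if url.endswith(".git"):
            url = url[: -len(".git")]
        if url not in repos:
            repos.append(url)
    return repos
```

Because the pattern stops after `owner/repo`, deeper links such as `.../tree/main` collapse to the repository root, which is what `git clone` needs.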
🏗️ Architecture
┌───────────────────────────────────────┐
│             Gradio Web UI             │
│            (server/app.py)            │
└───────────────────┬───────────────────┘
                    │
        ┌───────────▼───────────┐
        │    Reasoning Agent    │
        │  (agents/reasoning_   │
        │   agent.py)           │
        └───────────┬───────────┘
                    │ select_action()
        ┌───────────▼───────────┐
        │ Gymnasium Environment │
        │  (reproagent/         │
        │   environment.py)     │
        │                       │
        │  ┌─────────────────┐  │
        │  │  State Machine  │  │
        │  ├─────────────────┤  │
        │  │ Parsing         │  │
        │  │ Repo Analysis   │  │
        │  │ Setup           │  │
        │  │ Execution       │  │
        │  │ Debugging       │  │
        │  │ Experimentation │  │
        │  │ Comparison      │  │
        │  └─────────────────┘  │
        └────────┬─────┬────────┘
                 │     │
         ┌───────┘     └───────┐
         ▼                     ▼
┌─────────────────┐   ┌─────────────────┐
│   Simulation    │   │ Real Execution  │
│  (mock state    │   │  (subprocess,   │
│   transitions)  │   │   git clone,    │
│                 │   │   conda/venv)   │
└─────────────────┘   └─────────────────┘
Quick Start
Prerequisites
- Python 3.10+
- Git (for real execution mode)
- Conda (optional, for repos that use environment.yml)
- A Groq API key (free at console.groq.com)
Installation
# 1. Clone the repository
git clone https://github.com/your-username/ReproAgent.git
cd ReproAgent
# 2. Create a virtual environment
python -m venv venv
# Windows
.\venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set up environment variables
cp .env.example .env
# Edit .env and add your GROQ_API_KEY
Run
# Launch the Gradio web interface
python server/app.py
The UI will be available at http://localhost:7860 with a public share link.
💻 Usage
Web Interface (Recommended)
- Open the Gradio UI at http://localhost:7860
- Upload a research paper PDF (or paste a URL)
- Choose Execution Mode:
  - Simulation: safe demo, no system changes
  - Real Execution: actually clones repos and runs code
- Set Clone Directory (where repos will be cloned, e.g. D:\reproductions)
- Click Start Reproduction and watch the agent work in real-time
Command Line
# Run validation to ensure everything works
python validate.py
# Run a quick inference test
python inference.py
Programmatic API
from reproagent.environment import ReproAgentEnv
from agents.reasoning_agent import create_agent

# Create environment
env = ReproAgentEnv(
    difficulty="easy",
    max_steps=100,
    use_llm=True,
    exec_mode="Real Execution",
    workspace_dir="./workspace",
)

# Create agent
agent = create_agent(env, agent_type="reasoning", use_llm=True)

# Run episode
obs, info = env.reset()
agent.reset()
for step in range(100):
    action = agent.select_action(obs, info)
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"Step {step}: {info['action_type']} | reward={reward:.2f}")
    if terminated or truncated:
        break
Project Structure
ReproAgent/
├── reproagent/                 # Core Gymnasium environment
│   ├── __init__.py
│   ├── environment.py          # Main env with action implementations
│   ├── state.py                # Dataclasses for full reproduction state
│   ├── actions.py              # Action space definition (30+ actions)
│   ├── reward.py               # Multi-component reward function
│   ├── models.py               # LLM client (Groq, OpenAI, HuggingFace)
│   └── papers.py               # Paper dataset loader
│
├── agents/                     # Agent implementations
│   ├── reasoning_agent.py      # Phase-based reasoning agent
│   ├── paper_parser.py         # PDF text extraction + LLM analysis
│   ├── repo_analyzer.py        # Repository structure analysis
│   └── debugger.py             # Error traceback analysis
│
├── server/
│   └── app.py                  # Gradio web interface (900+ lines)
│
├── utils/
│   ├── pdf_reader.py           # PDF extraction (PyPDF2 + pdfplumber)
│   └── github_utils.py         # GitHub API utilities
│
├── graders/                    # Reproduction quality grading
├── data/papers/                # Sample paper configs (easy/medium/hard)
├── baseline/                   # Baseline agent implementations
├── static/                     # Static assets for UI
│
├── validate.py                 # Full validation suite
├── inference.py                # CLI inference entry point
├── openenv.yaml                # OpenEnv compatibility spec
├── pyproject.toml              # Python project metadata
├── requirements.txt            # pip dependencies
├── Dockerfile                  # Container deployment
├── run.bat / run.sh / run.ps1  # Platform-specific launchers
└── .env.example                # Environment variable template
⚙️ Configuration
Environment Variables
Create a .env file from the template:
cp .env.example .env
| Variable | Required | Description |
|---|---|---|
| GROQ_API_KEY | Yes | Groq API key for LLM-powered extraction (free at console.groq.com) |
| OPENAI_API_KEY | No | OpenAI API key (alternative LLM backend) |
| HF_TOKEN | No | HuggingFace token for model downloads |
| GITHUB_TOKEN | No | GitHub API token for higher rate limits |
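A startup check along the following lines can surface a missing key early. It is only a sketch: the function name `check_api_keys` is hypothetical, and it reads variables that are already in the process environment (loading `.env` itself, for example with python-dotenv, is assumed to happen elsewhere).

```python
import os

REQUIRED_VARS = ("GROQ_API_KEY",)
OPTIONAL_VARS = ("OPENAI_API_KEY", "HF_TOKEN", "GITHUB_TOKEN")


def check_api_keys(environ=os.environ) -> dict:
    """Raise if a required variable is unset; report which variables are set."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    # Empty strings count as unset, matching a blank line in .env.
    return {name: bool(environ.get(name)) for name in REQUIRED_VARS + OPTIONAL_VARS}
```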
Execution Modes
| Mode | What it does | Use case |
|---|---|---|
| Simulation | Simulates all actions with mock state transitions | Safe demos, hackathons, testing |
| Real Execution | Runs git clone, conda env create, pip install, python script.py on your system | Actually reproducing papers |
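To make the Real Execution row concrete, here is a minimal sketch of running a script in a subprocess and capturing its output. The helper `run_script` and its default timeout are assumptions for illustration. A real run would presumably call the interpreter inside the repo's own conda or venv environment instead of `sys.executable`.

```python
import subprocess
import sys


def run_script(args, cwd=None, timeout=600):
    """Run the current Python interpreter with args; return (exit_code, stdout, stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, *args],
            cwd=cwd,
            capture_output=True,  # collect stdout and stderr instead of inheriting them
            text=True,            # decode bytes to str
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Assumption: report a timeout as exit code -1 with a short message.
        return -1, "", f"timed out after {timeout} seconds"
    return proc.returncode, proc.stdout, proc.stderr
```

For example, `run_script(["train.py", "--epochs", "1"], cwd="./workspace/repo")` would return the exit code along with whatever the script wrote to stdout and stderr.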
How It Works
The agent follows a phase-based state machine with 7 phases:
PARSING → REPO_ANALYSIS → SETUP → EXECUTION → DEBUGGING → EXPERIMENTATION → COMPARISON
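The phase order above can be written down as an enum. This sketch covers only the linear ordering; the names `Phase` and `next_phase` are hypothetical, and the real environment may loop between phases (for example from debugging back to execution), which is not modeled here.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    PARSING = 0
    REPO_ANALYSIS = 1
    SETUP = 2
    EXECUTION = 3
    DEBUGGING = 4
    EXPERIMENTATION = 5
    COMPARISON = 6


def next_phase(phase: Phase) -> Optional[Phase]:
    """Return the phase that follows, or None after the final phase."""
    members = list(Phase)  # Enum iteration preserves definition order
    idx = members.index(phase)
    return members[idx + 1] if idx + 1 < len(members) else None
```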
Phase Details
| Phase | Actions | What Happens |
|---|---|---|
| Parsing | PARSE_PDF, EXTRACT_GITHUB, EXTRACT_METRICS | LLM reads paper, extracts title, GitHub URL, target metric (e.g., FID=7.5) |
| Repo Analysis | CLONE_REPO, READ_README, FIND_ENTRY_POINT, EXTRACT_DEPS | Clones repo, reads README, finds scripts from bash blocks, detects environment.yml |
| Setup | CREATE_VENV, INSTALL_REQUIREMENTS, VERIFY_SETUP | Creates conda/venv env, installs deps, verifies setup |
| Execution | RUN_TRAINING, RUN_EVAL, CHECK_LOGS | Runs the entry point script via subprocess, captures stdout/stderr |
| Debugging | ANALYZE_ERROR, SEARCH_SOLUTION, APPLY_FIX | Parses real Python tracebacks, proposes and applies fixes |
| Experimentation | MODIFY_LR, MODIFY_BATCH, RUN_EXPERIMENT | Tunes hyperparameters to close the metric gap |
| Comparison | COMPARE_RESULTS, GENERATE_REPORT | Compares reproduced metric vs. paper claim, generates summary |
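The Repo Analysis and Setup rows rely on spotting a few well-known files. The sketch below shows one way this could look. The function names are hypothetical, and the precedence (environment.yml wins over pip-style files, and entry points are tried in the order listed under Features) is an assumption, not confirmed behavior.

```python
from pathlib import Path
from typing import Optional

# Candidate order taken from the Features table; the precedence is assumed.
ENTRY_POINT_CANDIDATES = ("inference.py", "eval.py", "main.py", "train.py")


def detect_setup_method(repo: Path) -> Optional[str]:
    """Return "conda", "pip", or None depending on which dependency file exists."""
    if (repo / "environment.yml").exists():
        return "conda"
    if (repo / "requirements.txt").exists() or (repo / "pyproject.toml").exists():
        return "pip"
    return None


def find_entry_point(repo: Path) -> Optional[Path]:
    """Return the first candidate script found at the repo root, if any."""
    for name in ENTRY_POINT_CANDIDATES:
        candidate = repo / name
        if candidate.is_file():
            return candidate
    return None
```

When neither function finds anything, the README bash-block parsing described above would be the natural fallback.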
Reward Function
The environment provides a multi-component reward signal:
- Phase progress (+10 for advancing through phases)
- Code execution (+20 for successful script runs)
- Error fixing (+15 per resolved error)
- Metric improvement (scaled by how close the reproduced result is to the paper's claim)
- Time penalty (-0.01 per step to encourage efficiency)
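Put together, the components above might combine as in the sketch below. The fixed values (+10, +20, +15, -0.01) come from the list. The metric term is an assumption: the source only says it is scaled by closeness to the paper's claim, so the relative-error formula and the `metric_weight` parameter here are hypothetical.

```python
def step_reward(advanced_phase, script_succeeded, errors_fixed,
                reproduced=None, target=None, metric_weight=50.0):
    """Combine the reward components for a single environment step."""
    reward = -0.01  # time penalty applied on every step
    if advanced_phase:
        reward += 10.0
    if script_succeeded:
        reward += 20.0
    reward += 15.0 * errors_fixed
    if reproduced is not None and target:
        # Assumed closeness: 1.0 at an exact match, falling to 0.0 at 100% relative error.
        closeness = max(0.0, 1.0 - abs(reproduced - target) / abs(target))
        reward += metric_weight * closeness
    return reward
```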
✅ Validation
Run the full validation suite to confirm everything works:
python validate.py
This tests:
| Test | What it validates |
|---|---|
| Environment | ReproAgentEnv creates, resets, steps correctly |
| Spaces | Observation and action spaces match the Gymnasium spec |
| Episodes | Full multi-step episodes run without crashes |
| Agents | ReasoningAgent and RandomAgent interact with the env |
| Demo | Gradio app imports successfully |
| Graders | Reproduction quality grader loads |
| OpenEnv | openenv.yaml is present and well-formed |
Expected output:
ENVIRONMENT ✅ PASSED
AGENTS ✅ PASSED
DEMO ✅ PASSED
GRADERS ✅ PASSED
OPENENV_YAML ✅ PASSED
ALL VALIDATIONS PASSED!
✅ System is ready for deployment
🐳 Docker Deployment
# Build the image
docker build -t reproagent .
# Run with your API key
docker run -p 7860:7860 -e GROQ_API_KEY=your_key_here reproagent
Or deploy to HuggingFace Spaces:
pip install gradio
gradio deploy
🛣️ Roadmap
- Gymnasium-compatible environment with 30+ actions
- Groq LLM integration with regex fallback
- Gradio web interface with live logs
- Real Execution mode (git clone, conda/venv, subprocess)
- Dynamic metric extraction (FID, BLEU, accuracy, PSNR, etc.)
- Bash block parsing from README for entry point discovery
- Multi-script sequential execution (run 5 scripts in order per README)
- Automatic checkpoint downloading from HuggingFace
- GPU-aware execution scheduling
- Result visualization and plot generation
- Support for Jupyter notebook-based repos
🤝 Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License; see the LICENSE file for details.
Built with ❤️ for the ML research community