Spaces: Pandaisop/codesensei-env


vineetshukla.work@gmail.com committed on Apr 8

Commit f3f5cb0 (1 parent: 1caebb9)

docs: rewrite README, clean up repo structure

Files changed (2)

.gitignore +1 -0
README.md +63 -123

.gitignore

@@ -9,3 +9,4 @@ build/
 *.log
 .DS_Store
 Thumbs.db
+codesensei_unwanted/

README.md

@@ -6,162 +6,102 @@ colorTo: blue
 sdk: docker
 app_port: 7860
 license: mit
-short_description: GRPO-trained LLM code debugging environment (OpenEnv)
 ---
-# 🧠 CodeSensei — GRPO-Trained Code Debugger
-> **Teaching an LLM to think like a debugger through Reinforcement Learning.**
-[![OpenEnv](https://img.shields.io/badge/Built%20with-OpenEnv-blue)](https://github.com/meta-pytorch/OpenEnv)
-[![TRL](https://img.shields.io/badge/Training-TRL%20GRPO-green)](https://huggingface.co/docs/trl)
-[![HF Spaces](https://img.shields.io/badge/Deploy-HF%20Spaces-yellow)](https://huggingface.co/spaces)
-[![License](https://img.shields.io/badge/License-MIT-purple)](LICENSE)
----
-## 🎯 What is CodeSensei?
-CodeSensei is a **custom OpenEnv RL environment** that teaches a language model to debug Python code using **GRPO (Group Relative Policy Optimization)** from HuggingFace TRL.
-The LLM receives buggy Python functions, proposes fixes, and gets rewarded based on test results — learning to debug through trial and error.
-### ✨ Key Features
-- 🏗️ **Custom OpenEnv Integration** — Full 3-method environment (`reset`, `step`, `state`)
-- 🎯 **4-Signal Reward System** — Correctness, progress, syntax, repetition
-- 🔒 **Sandboxed Execution** — LLM-generated code runs in restricted subprocesses
-- 🌐 **WebSocket First** — Designed for HF Spaces deployment
-- 💰 **100% Free** — Colab T4 + HF Spaces free tier
-- 📊 **Live Demo** — Gradio app with baseline vs fine-tuned comparison
----
-## 🏗️ Architecture
-```
-┌──────────────────────────────────┐     ┌────────────────────────────┐
-│  Google Colab (Free T4 GPU)      │     │  HF Space (codesensei-env) │
-│                                  │ WS  │                            │
-│  GRPOTrainer → rollout_func() ───┼────►│  FastAPI + CodeDebugEnv    │
-│  Qwen3-1.7B + vLLM              │     │  Sandbox + Test Runner     │
-│                                  │     │                            │
-│  → push checkpoint every 5 steps │     └────────────────────────────┘
-└──────────────┬───────────────────┘
-               │
-               ▼
-     🤗 HF Hub (model + checkpoints)
-               │
-               ▼
-     ┌─────────────────────────────┐
-     │  HF Space (codesensei-demo) │
-     │  Gradio: baseline vs GRPO   │
-     └─────────────────────────────┘
-```
----
-## 📁 Project Structure
 ```
-codesensei/
-├── env/                         # OpenEnv Environment
-│   ├── models.py                # Typed Action/Observation/State
-│   ├── client.py                # WebSocket client
 │   └── server/
-│       ├── environment.py       # Core reset/step/state logic
-│       ├── sandbox.py           # Restricted Python execution
-│       ├── test_runner.py       # Test evaluation
-│       └── app.py               # FastAPI server
 ├── training/
-│   └── colab_train.py           # GRPO training notebook
-├── demo/
-│   └── app.py                   # Gradio comparison demo
-├── Dockerfile                   # HF Spaces deployment
-├── requirements.txt             # Server dependencies
-└── README.md
 ```
----
-## 🚀 Quick Start
-### 1. Run Environment Locally
 ```bash
 pip install -r requirements.txt
 uvicorn env.server.app:app --host 0.0.0.0 --port 7860
 ```
-### 2. Deploy to HF Spaces
-```bash
-# Push to HF Spaces (Docker-based)
-huggingface-cli repo create codesensei-env --type space --space-sdk docker
-git remote add hf https://huggingface.co/spaces/YOUR-USERNAME/codesensei-env
-git push hf main
-```
-### 3. Train on Colab
-1. Open `training/colab_train.py` in Google Colab
-2. Set GPU runtime → T4
-3. Update `CODESENSEI_ENV_URL` to your HF Space
-4. Run all cells
-5. If session drops → re-run cell 10, it resumes from checkpoint
-### 4. Run Demo
 ```bash
-cd demo
-pip install -r requirements.txt
-python app.py
 ```
----
-## 🎯 Reward System
-| Signal | Condition | Value | Purpose |
-|---|---|---|---|
-| Correctness | All tests pass | +2.0 | Primary goal |
-| Progress | More tests pass than before | +0.5 | Incremental improvement |
-| Stagnation | No improvement | -0.3 | Prevent plateaus |
-| Runtime Error | Code crashes | -0.5 | Penalize regressions |
-| Syntax Error | Invalid Python | -1.0 | Force valid output |
-| Repetition | Same fix submitted | -0.5 | Force exploration |
----
-## 🛠️ Tech Stack
-| Component | Technology | Cost |
 |---|---|---|
-| Environment | OpenEnv + FastAPI | Free |
-| Training | TRL + GRPO + vLLM | Free |
-| GPU | Google Colab T4 | Free |
-| Model | Qwen3-1.7B | Free |
-| Deployment | HF Spaces | Free |
-| Demo | Gradio | Free |
-| **Total** | | **$0** |
----
-## 📈 Training Details
-- **Model:** Qwen/Qwen3-1.7B
-- **Algorithm:** GRPO (Group Relative Policy Optimization)
-- **Dataset:** 500 buggy Python functions
-- **Max Attempts:** 6 per episode
-- **Checkpoint:** Every 5 steps → pushed to HF Hub
-- **Session Resilience:** Auto-resume from checkpoint on Colab crash
----
-## 📄 License
-MIT License — see [LICENSE](LICENSE) for details.
----
-Built for the **OpenEnv Hackathon** 🏆

 sdk: docker
 app_port: 7860
 license: mit
+short_description: RL environment for teaching LLMs to debug Python code
 ---
+# CodeSensei
+An RL environment built on OpenEnv that trains LLMs to fix buggy Python code. The model gets a broken function and proposes a fix, the environment runs the tests, and the model learns from the results: the same loop a developer goes through when debugging, automated with reinforcement learning.
+## How it works
+1. The environment picks a buggy Python function from the dataset
+2. The LLM reads the code + failing test output
+3. It proposes a corrected version
+4. We run the tests in a sandboxed subprocess
+5. A multi-signal reward tells the model what went well (or didn't)
+6. Repeat for up to 6 attempts per bug
+The reward isn't just pass/fail: it accounts for partial progress, syntax validity, and whether the model is actually improving or just resubmitting the same fix.
+## Reward breakdown
+| Signal | When | Value |
+|---|---|---|
+| All tests pass | Bug fully fixed | +2.0 |
+| More tests pass than before | Making progress | +0.5 |
+| No improvement over previous best | Stuck | -0.3 |
+| Code crashes at runtime | Regression | -0.5 |
+| Syntax error | Invalid Python | -1.0 |
+| Duplicate submission | Same fix as before | -0.5 |
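
The reward table above lists values but not how the signals combine, so the following is only a minimal sketch of one plausible reading. It assumes the outcome signals (syntax error, runtime error, all pass, progress, stagnation) are mutually exclusive and that the duplicate penalty stacks on top of them. `AttemptResult`, `compute_reward`, and the parameter names are hypothetical and are not taken from `environment.py`.

```python
from dataclasses import dataclass

# Values copied from the reward table in the new README.
REWARD_ALL_PASS = 2.0
REWARD_PROGRESS = 0.5
PENALTY_STAGNATION = -0.3
PENALTY_RUNTIME_ERROR = -0.5
PENALTY_SYNTAX_ERROR = -1.0
PENALTY_DUPLICATE = -0.5


@dataclass
class AttemptResult:
    """Hypothetical summary of one proposed fix after the tests ran."""
    passed: int
    total: int
    syntax_error: bool = False
    runtime_error: bool = False


def compute_reward(result: AttemptResult, best_passed_so_far: int, is_duplicate: bool) -> float:
    """Combine the table's signals into one scalar.

    ASSUMPTION: exactly one outcome signal applies per attempt, and the
    duplicate penalty is added on top. The real environment may differ.
    """
    reward = PENALTY_DUPLICATE if is_duplicate else 0.0
    if result.syntax_error:
        return reward + PENALTY_SYNTAX_ERROR
    if result.runtime_error:
        return reward + PENALTY_RUNTIME_ERROR
    if result.total > 0 and result.passed == result.total:
        return reward + REWARD_ALL_PASS
    if result.passed > best_passed_so_far:
        return reward + REWARD_PROGRESS
    return reward + PENALTY_STAGNATION
```

Under these assumptions, a first-time submission that passes 2 of 3 tests after a previous best of 1 earns the progress reward, while resubmitting code that fails to parse collects both the syntax and duplicate penalties.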
+## Project layout
 ```
+├── inference.py             # main inference script (OpenEnv submission)
+├── openenv.yaml             # environment spec
+├── Dockerfile
+├── requirements.txt
+├── env/
+│   ├── client.py            # async client with from_docker_image()
+│   ├── models.py            # Action, Observation, State dataclasses
+│   ├── data/
+│   │   └── bug_dataset.json # 10 bugs with test suites
 │   └── server/
+│       ├── app.py           # FastAPI — /reset, /step, /health, /ws
+│       ├── environment.py   # core logic (reset/step/state)
+│       ├── sandbox.py       # restricted code execution
+│       └── test_runner.py   # runs tests against proposed fixes
+├── server/
+│   └── app.py               # entry point for openenv validate
 ├── training/
+│   └── colab_train.py       # GRPO training (Colab T4)
+└── demo/
+    └── app.py               # Gradio demo
 ```
+## Running locally
 ```bash
 pip install -r requirements.txt
 uvicorn env.server.app:app --host 0.0.0.0 --port 7860
 ```
+Then hit `POST /reset` with `{}` to start an episode, and `POST /step` with your fix to iterate.
+## Inference
+The inference script uses the OpenAI-compatible client pointed at HuggingFace's inference router. It connects to the environment via `from_docker_image()`, runs the debug loop, and logs everything in the required `[START]`/`[STEP]`/`[END]` format.
 ```bash
+export HF_TOKEN="your_token"
+python inference.py
 ```
+Default model is `Qwen/Qwen2.5-Coder-32B-Instruct` (free via HF router). You can swap it by setting `MODEL_NAME`.
+## Training
+Open `training/colab_train.py` in Google Colab with a T4 runtime. It uses GRPO from HuggingFace TRL with QLoRA (4-bit quantization + LoRA adapters) so the whole thing fits in 15GB VRAM. Checkpoints get pushed to HF Hub automatically.
+## API endpoints
+| Method | Path | What it does |
 |---|---|---|
+| POST | `/reset` | Start a new debugging episode |
+| POST | `/step` | Submit a proposed fix |
+| GET | `/state?session_id=X` | Get current episode state |
+| GET | `/health` | Health check |
+| WS | `/ws` | WebSocket interface |
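
As a rough illustration of the `/reset` and `/step` endpoints in the table above, here is a standard-library client sketch. The README only states that `/reset` accepts `{}`; the `/step` payload fields (`session_id`, `code`) and the presence of `session_id` in the reset response are assumptions made for illustration, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # port taken from the uvicorn command in the README


def build_request(base_url, path, payload):
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(base_url, path, payload, timeout_s=30.0):
    """Send the request and decode the JSON response body."""
    request = build_request(base_url, path, payload)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


def try_one_fix(base_url, fixed_code):
    """Start an episode and submit a single proposed fix."""
    observation = post_json(base_url, "/reset", {})
    # ASSUMPTION: these field names are hypothetical; check env/models.py
    # for the actual Action and Observation schemas.
    step_payload = {
        "session_id": observation.get("session_id"),
        "code": fixed_code,
    }
    return post_json(base_url, "/step", step_payload)
```

With the server running locally, `try_one_fix(BASE_URL, "def add(a, b):\n    return a + b\n")` would perform one reset and one step, provided the payload assumptions hold.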
+## Tech used
+- **Environment:** FastAPI + OpenEnv protocol
+- **Training:** TRL GRPO + QLoRA on Qwen2.5-Coder-32B-Instruct
+- **Inference:** OpenAI Python client → HuggingFace router (free tier)
+- **Deployment:** Docker on HF Spaces
+- **Security:** Code execution in sandboxed subprocesses with restricted builtins
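
The "sandboxed subprocesses with restricted builtins" bullet above could look roughly like the sketch below. This is not the contents of `sandbox.py`: the whitelist, the `run_sandboxed` name, and the timeout are all assumptions. Restricting builtins in this way is a guard against accidents, not a hardened security boundary on its own.

```python
import subprocess
import sys

# Bootstrap executed in the child process: it reads candidate code from stdin
# and runs it with a small whitelist of builtins (hypothetical selection).
_RUNNER = r"""
import builtins, sys
allowed = ("abs", "all", "any", "bool", "dict", "enumerate", "float", "int",
           "len", "list", "max", "min", "print", "range", "set", "sorted",
           "str", "sum", "tuple", "zip")
safe = {name: getattr(builtins, name) for name in allowed}
exec(sys.stdin.read(), {"__builtins__": safe})
"""


def run_sandboxed(code, timeout_s=5.0):
    """Run untrusted code in a separate interpreter with a wall-clock limit."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", _RUNNER],  # -I: isolated mode
            input=code,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timeout"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
```

In this sketch, names outside the whitelist (such as `open`) and `import` statements fail inside the child, and the failure is reported through `stderr` with `ok` set to `False`.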
+## License
+MIT