---
title: ADAPT DSA Tutor OpenEnv
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
- openenv
- reinforcement-learning
- code-generation
---
# ADAPT DSA Tutor OpenEnv
ADAPT, the Adversarial DSA Tutor, is an OpenEnv-compliant RLVR (reinforcement learning with verifiable rewards) environment for training code-generation agents on small data structures and algorithms (DSA) tasks. The agent receives a problem prompt, examples, and visible tests, then submits Python code. The environment runs the code against visible and hidden tests and returns a reward, pass-rate metrics, execution status, and feedback.
This repo now focuses on the environment layer only. Verifier work and training scripts are owned separately.
## Why This Environment
The hackathon asks for OpenEnv environments that can improve LLM behavior through verifiable interaction. ADAPT targets a simple but useful skill loop:
```text
agent writes code -> environment executes it -> hidden tests and reward signals score it -> trainer improves the agent
```
The differentiator is curriculum-ready DSA practice: each episode carries a problem id and difficulty tier so training can track per-tier success instead of only aggregate reward.
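As a minimal sketch of per-tier tracking, a trainer could accumulate episode outcomes keyed by difficulty tier. `TierTracker` and its method names are hypothetical and not part of this repo; the rule that an episode counts as solved only at a pass rate of 1.0 is also an assumption.

```python
from collections import defaultdict


class TierTracker:
    """Hypothetical helper: accumulates episode outcomes per difficulty tier."""

    def __init__(self):
        self._episodes = defaultdict(int)
        self._solved = defaultdict(int)

    def record(self, difficulty: str, pass_rate: float) -> None:
        # Assumption: an episode is "solved" only when every test passed.
        self._episodes[difficulty] += 1
        if pass_rate >= 1.0:
            self._solved[difficulty] += 1

    def success_rates(self) -> dict:
        # Fraction of solved episodes for each tier seen so far.
        return {
            tier: self._solved[tier] / count
            for tier, count in self._episodes.items()
        }


tracker = TierTracker()
tracker.record("easy", 1.0)
tracker.record("easy", 0.5)
tracker.record("hard", 0.0)
rates = tracker.success_rates()
```

Reporting `rates` next to the aggregate reward makes it visible when gains come only from the easy tier.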
## OpenEnv Interface
The environment follows the current OpenEnv `Environment` API shape:
- `AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState])`
- `reset()` returns a typed observation.
- `step(action)` accepts an `AdaptAction` with a Python `code` string.
- `state` exposes episode id, step count, current problem id, difficulty, and recent metrics.
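The call shape above can be illustrated with a stand-in that does not depend on OpenEnv. `StubEnv` and `StubObservation` are hypothetical placeholders, not the real `AdaptEnvironment` or `AdaptObservation`, and the `done` and `feedback` fields are assumptions. The stub does not execute any code; it only shows the order of `reset()`, `step(action)`, and `state`.

```python
from dataclasses import dataclass


@dataclass
class AdaptAction:
    # Local stand-in mirroring the documented `code` field.
    code: str


@dataclass
class StubObservation:
    # Hypothetical subset of the observation fields.
    problem_id: str
    feedback: str
    done: bool


class StubEnv:
    """Stand-in with the same call shape; NOT the real AdaptEnvironment."""

    def __init__(self):
        self._step_count = 0
        self._problem_id = "easy_double"

    def reset(self) -> StubObservation:
        self._step_count = 0
        return StubObservation(self._problem_id, "new episode", done=False)

    def step(self, action: AdaptAction) -> StubObservation:
        # The real environment would run action.code against its tests here.
        self._step_count += 1
        return StubObservation(self._problem_id, "stub: code not executed", done=True)

    @property
    def state(self) -> dict:
        return {"step_count": self._step_count, "problem_id": self._problem_id}


env = StubEnv()
obs = env.reset()
obs = env.step(AdaptAction(code="n = int(input())\nprint(n * 2)"))
```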
`openenv.yaml` points to:
```yaml
app: server.app:app
port: 7860
```
## Action
```python
{
    "code": "n = int(input())\nprint(n * 2)"
}
```
## Observation
Reset and step observations include:
- problem statement
- input format
- constraints
- examples
- visible tests
- problem id
- difficulty tier
- feedback
- pass rate, visible pass rate, and hidden pass rate
- syntax/runtime/timeout status
- reward components
Hidden test inputs and expected outputs are never returned in observations.
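One way to enforce this is an allow-list: only named public fields are copied into the observation, so a newly added private field cannot leak by accident. The sketch below assumes a plain dict representation; the key names, the `public_view` helper, and the sample problem text are illustrative, not the repo's actual schema.

```python
# Hypothetical field names; the real observation model may differ.
PUBLIC_FIELDS = ("problem_id", "difficulty", "statement", "visible_tests")


def public_view(problem: dict) -> dict:
    """Copy only allow-listed fields, so hidden tests never reach the agent."""
    return {key: problem[key] for key in PUBLIC_FIELDS if key in problem}


problem = {
    "problem_id": "easy_double",
    "difficulty": "easy",
    "statement": "Read an integer n and print 2 * n.",  # assumed wording
    "visible_tests": [{"input": "3\n", "expected": "6\n"}],
    "hidden_tests": [{"input": "10\n", "expected": "20\n"}],
}
view = public_view(problem)
```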
## Reward
Reward is clipped to `[0.0, 1.0]` and combines multiple environment-level signals:
- correctness from visible and hidden pass rate
- syntax validity
- clean execution
- output format compliance
- timeout penalty
- runtime error penalty
- static safety rejection for dangerous imports such as `os`, `subprocess`, `socket`, `pathlib`, and `shutil`
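The signals above could be combined roughly as follows. This is a sketch under stated assumptions: the weights, the function names, and the rule that an unsafe submission scores zero are illustrative, not the values used in this repo. The import check uses the standard `ast` module and the blocked-module list from the bullet above.

```python
import ast

BLOCKED_MODULES = {"os", "subprocess", "socket", "pathlib", "shutil"}


def find_blocked_imports(code: str) -> set:
    """Return blocked top-level modules imported by `code`.

    Raises SyntaxError if the code does not parse.
    """
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in BLOCKED_MODULES:
                found.add(root)
    return found


def combine_reward(visible_rate, hidden_rate, syntax_ok, ran_cleanly,
                   format_ok, timed_out, runtime_error, unsafe):
    """Combine environment signals into one reward clipped to [0.0, 1.0]."""
    if unsafe:
        return 0.0  # assumption: rejected submissions earn nothing
    # Hypothetical weights; hidden tests count more than visible ones.
    reward = 0.2 * visible_rate + 0.5 * hidden_rate
    reward += 0.1 if syntax_ok else 0.0
    reward += 0.1 if ran_cleanly else 0.0
    reward += 0.1 if format_ok else 0.0
    if timed_out:
        reward -= 0.2
    if runtime_error:
        reward -= 0.1
    return max(0.0, min(1.0, reward))


blocked = find_blocked_imports("import os.path\nfrom subprocess import run\nimport math")
```

A static import scan like this does not catch dynamic imports such as `__import__("os")`, so it complements sandboxed execution rather than replacing it.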
If `verifier.verifier.verify(code, test_cases)` exists, the environment can use it as an optional reward augmentation. If the verifier is absent, the environment still works using executor-derived reward.
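The optional dependency can be handled with a guarded import, sketched below. The return type of `verify` is not documented here, so the sketch assumes it returns a score in `[0.0, 1.0]`; the `augmented_reward` helper and the `weight` parameter are hypothetical.

```python
try:
    # Optional module owned separately; may be absent from the checkout.
    from verifier.verifier import verify
except ImportError:
    verify = None


def augmented_reward(base_reward, code, test_cases, verifier=verify, weight=0.2):
    """Add a weighted verifier bonus when a verifier is available."""
    if verifier is None:
        # Fall back to the executor-derived reward unchanged.
        return base_reward
    # Assumption: the verifier returns a number in [0.0, 1.0].
    bonus = float(verifier(code, test_cases))
    return max(0.0, min(1.0, base_reward + weight * bonus))


fallback = augmented_reward(0.6, "print(1)", [], verifier=None)
boosted = augmented_reward(0.9, "print(1)", [], verifier=lambda code, tests: 1.0)
```

Passing the verifier as a parameter keeps the fallback path easy to test without installing the verifier package.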
## Local Setup
Use Python `3.10+`.
```powershell
cd C:\Users\kaust\PycharmProjects\meta-rl-dsa-solver
python -m venv .venv
.\.venv\Scripts\pip install -e .
```
During development, an existing local OpenEnv checkout can be used instead by adding it to `PYTHONPATH` (adjust the path to your own checkout):
```powershell
$env:PYTHONPATH="C:\Users\kaust\PycharmProjects\OpenEnv\src;$PWD"
```
## Smoke Tests
Run the local smoke test:
```powershell
python test.py
```
Check syntax:
```powershell
python -m py_compile models.py env\adapt_env.py env\executor.py env\test_cases.py server\app.py
```
Start the OpenEnv server:
```powershell
uvicorn server.app:app --host 0.0.0.0 --port 7860
```
Useful endpoints:
- `GET /health`
- `GET /schema`
- `POST /reset`
- `POST /step`
- `GET /state`
Example step request:
```powershell
Invoke-RestMethod -Method Post -Uri http://localhost:7860/step -ContentType "application/json" -Body '{"action":{"code":"n=int(input())\nprint(n*2)"}}'
```
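The same request can be built from Python with only the standard library, which avoids shell quoting issues for multi-line code. This is a sketch: `build_step_request` is a hypothetical helper, and the payload shape is copied from the example request above. The send step is left commented out because it needs a running server.

```python
import json
import urllib.request


def build_step_request(code: str, base_url: str = "http://localhost:7860"):
    """Build a POST /step request; json.dumps escapes newlines in `code`."""
    body = json.dumps({"action": {"code": code}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_step_request("n=int(input())\nprint(n*2)")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     observation = json.load(response)
```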
Validate with OpenEnv once dependencies are installed:
```powershell
openenv validate .
```
## Hugging Face Spaces
The repo is ready to deploy as a Docker Space:
```powershell
openenv push --repo-id <your-hf-username>/adapt-dsa-tutor
```
Before final submission, add:
- live Hugging Face Space link
- training reward/loss plots from Disha's run
- before/after code example showing a problem the model failed before training and solved after training
- mini-blog or short video link
## Current Problem Bank
The environment includes a lightweight curated bank:
- `easy_double`
- `easy_sum_two`
- `medium_maximum`
- `medium_count_even`
- `hard_reverse_words`
The bank is intentionally small to keep the minimum submission stable. Later work can expand it to 30-50 tiered problems without changing the OpenEnv API.
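For curriculum sampling, the bank can be grouped by tier. The sketch below assumes the id prefix (`easy_`, `medium_`, `hard_`) encodes the difficulty tier; the environment may store the tier as a separate field instead, and `group_by_tier` is a hypothetical helper.

```python
from collections import defaultdict

PROBLEM_IDS = [
    "easy_double",
    "easy_sum_two",
    "medium_maximum",
    "medium_count_even",
    "hard_reverse_words",
]


def group_by_tier(problem_ids):
    """Group ids by the text before the first underscore (assumed tier)."""
    tiers = defaultdict(list)
    for problem_id in problem_ids:
        tier, _, _name = problem_id.partition("_")
        tiers[tier].append(problem_id)
    return dict(tiers)


bank = group_by_tier(PROBLEM_IDS)
```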