Dishaaa25
/

meta-rl-dsa-solver

Reinforcement Learning

openenv

code-generation

Model card Files Files and versions

xet

Community

s-shah4 commited on Apr 25

Commit

1b7b2a4

1 Parent(s): c790f8d

Add ADAPT V0 environment

Browse files

Files changed (2) hide show

README.md +81 -1
environment.py +80 -0

README.md CHANGED Viewed

	@@ -1 +1,81 @@
1	- # meta-rl-dsa-solver

+# meta-rl-dsa-solver
+ADAPT (Adversarial DSA Tutor) is a minimal reinforcement learning environment for coding tasks. The current V0 environment is a pure Python class with no API dependency, so it can be used directly from a training loop with `env.reset()` and `env.step(...)`.
+## Current V0
+- Fixed DSA problem: given an integer `n`, return `n * 2`
+- Single test input: `5`
+- Expected output: `10`
+- Binary reward: `1.0` for correct output, `0.0` otherwise
+- Subprocess execution with a 2 second timeout
+## Run a Smoke Test
+From this directory:
+```powershell
+cd C:\Users\kaust\PycharmProjects\meta-rl-dsa-solver
+python3 -c "from environment import AdaptEnv; env=AdaptEnv(); print(env.reset()); print(env.step('n=int(input()); print(n*2)'))"
+```
+Expected reward:
+```text
+1.0
+```
+## Use in Python
+```python
+from environment import AdaptEnv
+env = AdaptEnv()
+obs = env.reset()
+print(obs)
+code = "n=int(input()); print(n*2)"
+result = env.step(code)
+print(result)
+assert result["reward"] == 1.0
+```
+## Check Failure Cases
+Wrong answer:
+```powershell
+python3 -c "from environment import AdaptEnv; env=AdaptEnv(); env.reset(); print(env.step('print(0)'))"
+```
+Timeout:
+```powershell
+python3 -c "from environment import AdaptEnv; env=AdaptEnv(); env.reset(); print(env.step('while True: pass'))"
+```
+## Environment Contract
+`reset()` returns:
+```python
+{
+    "problem": "Given an integer n, return n * 2",
+    "input": "5",
+}
+```
+`step(action: str)` returns:
+```python
+{
+    "observation": "<program output or error>",
+    "reward": 1.0,
+    "done": True,
+    "info": {},
+}
+```
+The implementation keeps the verifier pluggable so later versions can replace the single expected-output check with hidden tests, randomized inputs, or adaptive curriculum logic.

environment.py ADDED Viewed

	@@ -0,0 +1,80 @@

+from __future__ import annotations
+import subprocess
+import tempfile
+from pathlib import Path
+from typing import Any, Callable
+class AdaptEnv:
+    def __init__(
+        self,
+        verifier: Callable[[str, str], tuple[float, dict[str, Any]]] | None = None,
+    ):
+        self.verifier = verifier
+        self.problem = ""
+        self.current_input = ""
+        self.expected_output = ""
+        self.step_count = 0
+        self.last_output = ""
+    def reset(self) -> dict[str, str]:
+        self.problem = "Given an integer n, return n * 2"
+        self.current_input = "5"
+        self.expected_output = "10"
+        self.step_count = 0
+        self.last_output = ""
+        return {
+            "problem": self.problem,
+            "input": self.current_input,
+        }
+    def step(self, action: str) -> dict[str, Any]:
+        if not self.problem:
+            self.reset()
+        self.step_count += 1
+        output = self._run_code(action)
+        reward = self._compute_reward(output)
+        self.last_output = output
+        return {
+            "observation": output,
+            "reward": reward,
+            "done": True,
+            "info": {},
+        }
+    def _run_code(self, code: str) -> str:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            file_path = Path(tmpdir) / "submission.py"
+            file_path.write_text(code, encoding="utf-8")
+            try:
+                result = subprocess.run(
+                    ["python3", str(file_path)],
+                    input=self.current_input,
+                    text=True,
+                    capture_output=True,
+                    timeout=2,
+                )
+            except subprocess.TimeoutExpired:
+                return "ERROR: timeout"
+            except Exception as exc:
+                return f"ERROR: {exc}"
+            if result.returncode != 0:
+                stderr = result.stderr.strip()
+                return f"ERROR: {stderr or 'runtime error'}"
+            return result.stdout.strip()
+    def _compute_reward(self, output: str) -> float:
+        if self.verifier is not None:
+            reward, _info = self.verifier(output, self.expected_output)
+            return reward
+        if output == self.expected_output:
+            return 1.0
+        return 0.0