Your Name committed on
Commit 0786522 · 1 Parent(s): 66bc92c
fix(OpenEnv): global overhaul to ensure every sub-score, reward, and model default is strictly within (0.01, 0.99) to satisfy Phase 2 validation
- README.md +11 -14
- environment.py +1 -1
- grader.py +12 -7
- models.py +9 -9
- openenv.yaml +6 -7
README.md
CHANGED
@@ -41,9 +41,9 @@ Every mandatory requirement is implemented and verified:
 | `openenv.yaml` spec file | ✅ | `openenv.yaml` |
 | Typed Pydantic models | ✅ | `models.py` - 8 action types + Observation |
 | Minimum 3 tasks (easy → medium → hard) | ✅ | 4 tasks (3 scored + 1 bonus) |
-| Graders return score in ` […]
-| Deterministic, reproducible […]
-| Dense reward with […]
+| Graders return score in `(0, 1)` | ✅ | `grader.py` - strictly 0.01 to 0.99 |
+| Deterministic, reproducible | ✅ | Anti-exploit guards included |
+| Dense reward with strictly `(0, 1)` range | ✅ | `reward.py` - delta-based per step |
 | Baseline inference script named `inference.py` | ✅ | `inference.py` |
 | `[START]` / `[STEP]` / `[END]` exact stdout format | ✅ | `inference.py` lines 100-140 |
 | `API_BASE_URL` env var | ✅ | `inference.py` + `openenv.yaml` |
@@ -98,7 +98,7 @@ Current benchmarks (HumanEval, SWE-bench, MBPP) treat code generation as a **sin
 **Real-world analog:** Junior developer fixing a reported production bug
 - Off-by-one in `range()` silently drops the final chunk
 - `chunk_list([1,2,3,4,5], 2)` returns `[[1,2],[3,4]]` instead of `[[1,2],[3,4],[5]]`
-- **1 file · 7 tests · 20 step limit · grader score 0. […]
+- **1 file · 7 tests · 20 step limit · grader score 0.01-0.99**
 
 ### 🟡 Medium - `medium_refactor_stats`
 **Real-world analog:** Mid-level developer splitting a growing module
@@ -164,7 +164,7 @@ TeamForgeEnv (environment.py)
 │
 ├── state() → plain dict (JSON-serialisable)
 │
-├── grade() → EpisodeResult (score ∈ [0. […]
+├── grade() → EpisodeResult (score ∈ [0.01, 0.99])
 ├── _detect_test_tampering() → AST anti-exploit
 ├── _implementation_exists() → stub-detection guard
 ├── score_tests() → subprocess pytest
@@ -196,17 +196,14 @@ TeamForge Score (aggregate) =
 
 ## ⚡ Dense Reward Function
 
-[…]
-[…]
-     + […]
-     + Δpassing_tests × 0.04 # each newly-passing test (delta-based)
+r(t) = 0.01 # step baseline reward; must be > 0
+     + action_type_bonus # +0.05 edit / +0.10 review / +0.10 commit
+     + Δpassing_tests × 0.05 # each newly-passing test (delta-based)
      + 0.05 × (lint_violations == 0) # clean code bonus
-[…]
-     - 0.05 × repeated_failure # same action type failed twice
-     - 0.30 × test_file_modified # SEVERE: never touch test files
+# Penalties (failures) now return a minimal baseline (0.01) rather than negative
 ```
 
-The delta-based test bonus provides a smooth gradient toward correctness […]
+The delta-based test bonus provides a smooth gradient toward correctness. All values are strictly clamped between 0.01 and 0.99 to satisfy Phase 2 validator constraints.
 
 ---
 
@@ -235,7 +232,7 @@ The delta-based test bonus provides a smooth gradient toward correctness - cri
 [STEP] step=6 action=generate_review reward=0.08 done=false error=null
 [STEP] step=7 action=self_reflect reward=0.06 done=false error=null
 [STEP] step=8 action=commit reward=0.05 done=true error=null
-[END] success=true steps=8 score=0.97 rewards=0. […]
+[END] success=true steps=8 score=0.97 rewards=0.05,0.05,0.15,0.40,0.10,0.15,0.10,0.20
 ```
 
 ---
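The revised `r(t)` formula in the README hunk above can be read as the following minimal sketch. The function name `step_reward`, the `failed` flag, and the choice to give unknown action types no bonus are assumptions made for illustration; only the constants (0.01 baseline, the +0.05 / +0.10 / +0.10 action bonuses, 0.05 per newly passing test, 0.05 lint bonus, and the 0.01 to 0.99 clamp) come from the diff.

```python
# Hypothetical sketch of the README's r(t) formula; names are illustrative.
REWARD_MIN, REWARD_MAX = 0.01, 0.99

# Bonuses quoted in the README: +0.05 edit / +0.10 review / +0.10 commit.
ACTION_BONUS = {"edit": 0.05, "review": 0.10, "commit": 0.10}


def step_reward(action_type: str, newly_passing: int, lint_violations: int,
                failed: bool = False) -> float:
    """Return a per-step reward clamped to [REWARD_MIN, REWARD_MAX]."""
    if failed:
        # Failures fall back to the baseline instead of going negative.
        return REWARD_MIN
    reward = REWARD_MIN
    reward += ACTION_BONUS.get(action_type, 0.0)      # assumed: unknown types get no bonus
    reward += newly_passing * 0.05                    # delta-based test bonus
    reward += 0.05 if lint_violations == 0 else 0.0   # clean code bonus
    return max(REWARD_MIN, min(REWARD_MAX, reward))


print(step_reward("edit", newly_passing=2, lint_violations=0))
print(step_reward("commit", newly_passing=30, lint_violations=0))   # clamped to 0.99
print(step_reward("edit", newly_passing=0, lint_violations=3, failed=True))  # 0.01
```

Under this reading, a step that breaks previously passing tests (a negative delta) is also clamped up to 0.01, so the reward can no longer distinguish a regression from a no-op.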
environment.py
CHANGED
@@ -329,7 +329,7 @@ class TeamForgeEnv:
             ln for ln in output.splitlines()
             if re.match(r".+:\d+:\d+:", ln)
         ])
-        score = max(0.0, 1.0 - violations * 0.05)
+        score = max(0.01, min(0.99, 1.0 - violations * 0.05))
         self._last_lint_result = LintResult(
             violations=violations,
             output=output[:2000],
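Note that this hunk keeps a penalty of 0.05 per violation in `environment.py`, while `score_lint` in `grader.py` uses 0.07. The sketch below compares the two rates under the same clamp; `lint_score` is a hypothetical helper written for this comparison, not a function from the repository.

```python
def lint_score(violations: int, penalty: float) -> float:
    """Linear lint penalty, clamped to [0.01, 0.99]."""
    return max(0.01, min(0.99, 1.0 - violations * penalty))


for n in (0, 3, 14, 25):
    env = lint_score(n, 0.05)     # per-violation rate seen in environment.py
    grader = lint_score(n, 0.07)  # per-violation rate seen in grader.py
    print(f"{n:>2} violations: environment={env:.2f} grader={grader:.2f}")
```

With these rates the step-level lint score and the final graded lint score diverge for the same output, which may or may not be intended.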
grader.py
CHANGED
@@ -96,9 +96,11 @@ def score_tests(repo_path: str, timeout: int = 60) -> tuple[float, str]:
 
     total = passed + failed + errors
     if total == 0:
-        return 0. […]
+        return 0.01, output
 
     pass_rate = passed / total
+    # Strictly (0, 1)
+    pass_rate = max(0.01, min(0.99, pass_rate))
     return pass_rate, output
 
 
@@ -113,14 +115,15 @@ def score_lint(repo_path: str) -> tuple[float, str]:
     output = result.stdout + result.stderr
 
     if result.returncode == 0:
-        return […]
+        return 0.99, "No lint violations."
 
     violations = len([
         ln for ln in output.splitlines()
         if re.match(r".+:\d+:\d+:", ln)
     ])
     # Stricter: -0.07 per violation (was 0.05), floor at 0.2 not 0
-    […]
+    # Strictly (0, 1)
+    score = max(0.01, min(0.99, 1.0 - violations * 0.07))
     return score, output
 
 
@@ -130,7 +133,7 @@ def score_review_quality(
 ) -> float:
     """Keyword-based review quality with minimum length requirement."""
     if not reviews:
-        return 0. […]
+        return 0.01
 
     combined = " ".join(r.text.lower() for r in reviews)
 
@@ -153,13 +156,14 @@ def score_review_quality(
     code_words = re.findall(r'\b[a-z_]{3,}\(\)', combined)
     specificity = min(0.1, len(set(code_words)) * 0.025)
 
-    […]
+    # Strictly (0, 1)
+    return max(0.01, min(0.99, kw_score * 0.7 + length_bonus + specificity))
 
 
 def score_reflection_quality(reflections: List[ReflectionArtifact]) -> float:
     """Score reflections on depth and actionability."""
     if not reflections:
-        return 0. […]
+        return 0.01
 
     total = 0.0
     for ref in reflections:
@@ -173,7 +177,8 @@ def score_reflection_quality(reflections: List[ReflectionArtifact]) -> float:
         depth = min(1.0, depth + 0.2)
         total += depth
 
-    […]
+    # Strictly (0, 1)
+    return max(0.01, min(0.99, total / max(1, len(reflections))))
 
 
 def score_efficiency(total_steps: int, max_steps: int) -> float:
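The clamping pattern repeated across the `grader.py` hunks can be factored into one helper. The sketch below is a simplified illustration: `clamp`, `pass_rate`, `specificity`, and `review_quality` are hypothetical names, and `kw_score` and `length_bonus` are taken as plain inputs because the code that computes them lies outside the hunks shown. The regular expression and the 0.7 / 0.025 / 0.1 constants are copied from the diff.

```python
import re


def clamp(score: float) -> float:
    """Keep a sub-score inside [0.01, 0.99]."""
    return max(0.01, min(0.99, score))


def pass_rate(passed: int, failed: int, errors: int) -> float:
    """Fraction of passing tests; an empty run scores the floor value."""
    total = passed + failed + errors
    if total == 0:
        return 0.01
    return clamp(passed / total)


def specificity(review_text: str) -> float:
    """Bonus for distinct `name()` mentions, capped at 0.1."""
    code_words = re.findall(r'\b[a-z_]{3,}\(\)', review_text.lower())
    return min(0.1, len(set(code_words)) * 0.025)


def review_quality(kw_score: float, length_bonus: float, review_text: str) -> float:
    """Weighted review score; kw_score and length_bonus are assumed inputs."""
    return clamp(kw_score * 0.7 + length_bonus + specificity(review_text))


print(pass_rate(7, 0, 0))   # 0.99: a perfect run no longer reports 1.0
print(pass_rate(0, 0, 0))   # 0.01: no tests collected
print(review_quality(0.5, 0.1, "call chunk_list() then range() and chunk_list()"))
```

One consequence worth checking in the rest of the grader: since a perfect run now yields 0.99, any pass criterion written as an equality with 1.0 would never hold.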
models.py
CHANGED
@@ -129,7 +129,7 @@ class TestResult(BaseModel):
 class LintResult(BaseModel):
     violations: int = 0
     output: str = ""
-    score: float = […]
+    score: float = 0.99  # 0.99 = clean
 
 
 class ReviewArtifact(BaseModel):
@@ -175,8 +175,8 @@ class Observation(BaseModel):
     reflections: List[ReflectionArtifact] = Field(default_factory=list)
 
     # Signals
-    reward: float = 0. […]
-    cumulative_reward: float = 0. […]
+    reward: float = 0.01
+    cumulative_reward: float = 0.01
     done: bool = False
     info: Dict[str, Any] = Field(default_factory=dict)
 
@@ -188,11 +188,11 @@ class Observation(BaseModel):
 class EpisodeResult(BaseModel):
     task_id: str
     total_steps: int
-    test_pass_rate: float = 0. […]
-    lint_score: float = 0. […]
-    efficiency_score: float = 0. […]
-    review_quality: float = 0. […]
-    reflection_quality: float = 0. […]
-    final_score: float = 0. […]
+    test_pass_rate: float = 0.01
+    lint_score: float = 0.01
+    efficiency_score: float = 0.01
+    review_quality: float = 0.01
+    reflection_quality: float = 0.01
+    final_score: float = 0.01
     passed: bool = False
     log: List[str] = Field(default_factory=list)
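Changing the defaults only guarantees the range for fields that are never assigned. A complementary option is to reject out-of-range values when a result object is built. The sketch below illustrates that idea with a standard-library dataclass so it runs without dependencies; `EpisodeScores` is a hypothetical stand-in for the score fields of `EpisodeResult`, not the project's model. In the real Pydantic models, numeric bounds on `Field` (for example `ge` and `le`) could likely serve the same purpose.

```python
from dataclasses import dataclass, fields

SCORE_MIN, SCORE_MAX = 0.01, 0.99


@dataclass
class EpisodeScores:
    """Hypothetical stand-in for the score fields of EpisodeResult."""
    test_pass_rate: float = SCORE_MIN
    lint_score: float = SCORE_MIN
    efficiency_score: float = SCORE_MIN
    review_quality: float = SCORE_MIN
    reflection_quality: float = SCORE_MIN
    final_score: float = SCORE_MIN

    def __post_init__(self) -> None:
        # Reject any score outside the validator's accepted range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not SCORE_MIN <= value <= SCORE_MAX:
                raise ValueError(
                    f"{f.name}={value} is outside [{SCORE_MIN}, {SCORE_MAX}]"
                )


print(EpisodeScores(final_score=0.97))
try:
    EpisodeScores(final_score=1.0)
except ValueError as exc:
    print("rejected:", exc)
```

Validation at construction time would surface any scorer that was missed by the clamping overhaul, rather than letting it reach the Phase 2 validator.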
openenv.yaml
CHANGED
@@ -117,12 +117,11 @@ observation_space:
 
 # ── Reward ──────────────────────────────────────────────────────────────────
 reward:
-  range: [ […]
+  range: [0.01, 0.99]
   type: dense
   description: >
     Dense shaped reward. Positive for: correct plan steps, edits, passing tests,
-    clean lint, reviews, reflections, commits. […]
-    repeated failures, and test file modification (severe: -0.30).
+    clean lint, reviews, reflections, commits. Always strictly between 0 and 1.
 
 # ── Tasks ───────────────────────────────────────────────────────────────────
 tasks:
@@ -131,28 +130,28 @@ tasks:
     max_steps: 20
     description: "Fix an off-by-one bug in utils/list_ops.py. All 7 tests must pass."
     grader: grader.grade_episode
-    score_range: [0. […]
+    score_range: [0.01, 0.99]
 
   - id: medium_refactor_stats
     difficulty: medium
     max_steps: 30
     description: "Refactor monolithic stats.py into a stats/ package. 15 tests must pass with full backward compatibility."
     grader: grader.grade_episode
-    score_range: [0. […]
+    score_range: [0.01, 0.99]
 
   - id: hard_lru_cache_performance
     difficulty: hard
     max_steps: 40
     description: "Implement O(1) LRU cache from a stub. 15 correctness tests + 1 performance test (10k ops < 200ms)."
     grader: grader.grade_episode
-    score_range: [0. […]
+    score_range: [0.01, 0.99]
 
   - id: bonus_perf_regression_merge
     difficulty: hard
     max_steps: 40
     description: "Diagnose and fix O(n²) regression hidden inside a bad merge conflict resolution. Perf test: 50k docs < 500ms."
     grader: grader.grade_episode
-    score_range: [0. […]
+    score_range: [0.01, 0.99]
 
 # ── Infrastructure ──────────────────────────────────────────────────────────
 runtime:
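A small consistency check can confirm that every declared range in the spec sits strictly inside (0, 1). The sketch below operates on a plain dict that mirrors the `reward` and `tasks` keys shown in the hunks above; loading the YAML file itself is left out, and `find_bad_ranges` plus the `legacy_task` entry are hypothetical, added only to show a failing case.

```python
def find_bad_ranges(spec: dict) -> list[str]:
    """List every declared range that is not strictly inside (0, 1)."""
    ranges = {"reward": spec.get("reward", {}).get("range")}
    for task in spec.get("tasks", []):
        ranges[task["id"]] = task.get("score_range")

    problems = []
    for name, rng in ranges.items():
        if rng is None or len(rng) != 2 or not (0.0 < rng[0] <= rng[1] < 1.0):
            problems.append(f"{name}: {rng}")
    return problems


spec = {
    "reward": {"range": [0.01, 0.99], "type": "dense"},
    "tasks": [
        {"id": "medium_refactor_stats", "score_range": [0.01, 0.99]},
        {"id": "hard_lru_cache_performance", "score_range": [0.01, 0.99]},
        {"id": "bonus_perf_regression_merge", "score_range": [0.01, 0.99]},
        {"id": "legacy_task", "score_range": [0.0, 1.0]},  # hypothetical bad entry
    ],
}
print(find_bad_ranges(spec))  # ['legacy_task: [0.0, 1.0]']
```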