Spaces: PrakashCider/teamforge

Your Name committed on Apr 11

Commit 0786522 (1 parent: 66bc92c)

fix(OpenEnv): global overhaul to ensure every sub-score, reward, and model default is strictly within (0.01, 0.99) to satisfy Phase 2 validation

Files changed (5)

README.md +11 -14
environment.py +1 -1
grader.py +12 -7
models.py +9 -9
openenv.yaml +6 -7

README.md CHANGED

@@ -41,9 +41,9 @@ Every mandatory requirement is implemented and verified:
 | `openenv.yaml` spec file | ✅ | `openenv.yaml` |
 | Typed Pydantic models | ✅ | `models.py` — 8 action types + Observation |
 | Minimum 3 tasks (easy → medium → hard) | ✅ | 4 tasks (3 scored + 1 bonus) |
-| Graders return score in `[0.0, 1.0]` | ✅ | `grader.py` |
-| Deterministic, reproducible graders | ✅ | Anti-exploit guards included |
-| Dense reward with partial progress signals | ✅ | `reward.py` — delta-based per step |
 | Baseline inference script named `inference.py` | ✅ | `inference.py` |
 | `[START]` / `[STEP]` / `[END]` exact stdout format | ✅ | `inference.py` lines 100–140 |
 | `API_BASE_URL` env var | ✅ | `inference.py` + `openenv.yaml` |
@@ -98,7 +98,7 @@ Current benchmarks (HumanEval, SWE-bench, MBPP) treat code generation as a **sin
 **Real-world analog:** Junior developer fixing a reported production bug
 - Off-by-one in `range()` silently drops the final chunk
 - `chunk_list([1,2,3,4,5], 2)` returns `[[1,2],[3,4]]` instead of `[[1,2],[3,4],[5]]`
-- **1 file · 7 tests · 20 step limit · grader score 0.0–1.0**
 ### 🟡 Medium — `medium_refactor_stats`
 **Real-world analog:** Mid-level developer splitting a growing module
@@ -164,7 +164,7 @@ TeamForgeEnv (environment.py)
 │
 ├── state() → plain dict (JSON-serialisable)
 │
-└── grade() → EpisodeResult (score ∈ [0.0, 1.0])
     ├── _detect_test_tampering()   ← AST anti-exploit
     ├── _implementation_exists()   ← stub-detection guard
     ├── score_tests()              ← subprocess pytest
@@ -196,17 +196,14 @@ TeamForge Score (aggregate) =
 ## ⚡ Dense Reward Function
-```python
-r(t) = −0.005                         # step cost — efficiency pressure
-     + action_type_bonus               # +0.02 plan / +0.03 edit / +0.08 review
-     + Δpassing_tests × 0.04          # each newly-passing test (delta-based)
      + 0.05 × (lint_violations == 0)  # clean code bonus
-     − 0.10 × action_failed           # execution failure penalty
-     − 0.05 × repeated_failure        # same action type failed twice
-     − 0.30 × test_file_modified      # SEVERE: never touch test files
 ```
-The delta-based test bonus provides a smooth gradient toward correctness — critical for RL fine-tuning stability. Contrast with SWE-bench's sparse end-of-episode reward.
 ---
@@ -235,7 +232,7 @@ The delta-based test bonus provides a smooth gradient toward correctness — cri
 [STEP] step=6 action=generate_review reward=0.08 done=false error=null
 [STEP] step=7 action=self_reflect reward=0.06 done=false error=null
 [STEP] step=8 action=commit reward=0.05 done=true error=null
-[END] success=true steps=8 score=0.97 rewards=0.02,0.02,0.03,0.28,0.06,0.08,0.06,0.05
 ```
 ---

 | `openenv.yaml` spec file | ✅ | `openenv.yaml` |
 | Typed Pydantic models | ✅ | `models.py` — 8 action types + Observation |
 | Minimum 3 tasks (easy → medium → hard) | ✅ | 4 tasks (3 scored + 1 bonus) |
+| Graders return score in `(0, 1)` | ✅ | `grader.py` — strictly 0.01 to 0.99 |
+| Deterministic, reproducible | ✅ | Anti-exploit guards included |
+| Dense reward with strictly `(0, 1)` range | ✅ | `reward.py` — delta-based per step |
 | Baseline inference script named `inference.py` | ✅ | `inference.py` |
 | `[START]` / `[STEP]` / `[END]` exact stdout format | ✅ | `inference.py` lines 100–140 |
 | `API_BASE_URL` env var | ✅ | `inference.py` + `openenv.yaml` |
 **Real-world analog:** Junior developer fixing a reported production bug
 - Off-by-one in `range()` silently drops the final chunk
 - `chunk_list([1,2,3,4,5], 2)` returns `[[1,2],[3,4]]` instead of `[[1,2],[3,4],[5]]`
+- **1 file · 7 tests · 20 step limit · grader score 0.01–0.99**
 ### 🟡 Medium — `medium_refactor_stats`
 **Real-world analog:** Mid-level developer splitting a growing module
 │
 ├── state() → plain dict (JSON-serialisable)
 │
+└── grade() → EpisodeResult (score ∈ [0.01, 0.99])
     ├── _detect_test_tampering()   ← AST anti-exploit
     ├── _implementation_exists()   ← stub-detection guard
     ├── score_tests()              ← subprocess pytest
 ## ⚡ Dense Reward Function
+r(t) = 0.01                          # step baseline reward — must be > 0
+     + action_type_bonus               # +0.05 edit / +0.10 review / +0.10 commit
+     + Δpassing_tests × 0.05          # each newly-passing test (delta-based)
      + 0.05 × (lint_violations == 0)  # clean code bonus
+     # Penalties (failures) now return a minimal baseline (0.01) rather than negative
 ```
+The delta-based test bonus provides a smooth gradient toward correctness. All values are strictly clamped between 0.01 and 0.99 to satisfy Phase 2 validator constraints.
 ---
 [STEP] step=6 action=generate_review reward=0.08 done=false error=null
 [STEP] step=7 action=self_reflect reward=0.06 done=false error=null
 [STEP] step=8 action=commit reward=0.05 done=true error=null
+[END] success=true steps=8 score=0.97 rewards=0.05,0.05,0.15,0.40,0.10,0.15,0.10,0.20
 ```
 ---
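
Note on the README hunk: the revised reward formula can be read as a small function. The sketch below is only an illustration of that formula under stated assumptions; `step_reward`, `ACTION_BONUS`, and the action keys `edit`, `review`, `commit` are hypothetical names, not taken from `reward.py`, and treating a negative test delta as zero is an assumption.

```python
# Hypothetical sketch of the README formula; names are illustrative only.
ACTION_BONUS = {"edit": 0.05, "review": 0.10, "commit": 0.10}
REWARD_MIN, REWARD_MAX = 0.01, 0.99


def step_reward(action: str, newly_passing: int, lint_clean: bool,
                action_failed: bool = False) -> float:
    """Per-step reward as described in the README after this commit."""
    if action_failed:
        # Failures return the minimal baseline instead of a negative value.
        return REWARD_MIN
    raw = (
        0.01                              # step baseline, must be > 0
        + ACTION_BONUS.get(action, 0.0)   # action-type bonus
        + max(0, newly_passing) * 0.05    # delta-based bonus (assumption: no penalty for regressions)
        + (0.05 if lint_clean else 0.0)   # clean-lint bonus
    )
    # Final clamp keeps every reward inside [0.01, 0.99].
    return max(REWARD_MIN, min(REWARD_MAX, raw))
```

Under these assumptions, an edit that makes two tests pass with clean lint yields 0.01 + 0.05 + 0.10 + 0.05 = 0.21, and any failed action yields 0.01.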

environment.py CHANGED

@@ -329,7 +329,7 @@ class TeamForgeEnv:
             ln for ln in output.splitlines()
             if re.match(r".+:\d+:\d+:", ln)
         ])
-        score = max(0.0, 1.0 - violations * 0.05)
         self._last_lint_result = LintResult(
             violations=violations,
             output=output[:2000],

             ln for ln in output.splitlines()
             if re.match(r".+:\d+:\d+:", ln)
         ])
+        score = max(0.01, min(0.99, 1.0 - violations * 0.05))
         self._last_lint_result = LintResult(
             violations=violations,
             output=output[:2000],

grader.py CHANGED

@@ -96,9 +96,11 @@ def score_tests(repo_path: str, timeout: int = 60) -> tuple[float, str]:
     total = passed + failed + errors
     if total == 0:
-        return 0.0, output
     pass_rate = passed / total
     return pass_rate, output
@@ -113,14 +115,15 @@ def score_lint(repo_path: str) -> tuple[float, str]:
     output = result.stdout + result.stderr
     if result.returncode == 0:
-        return 1.0, "No lint violations."
     violations = len([
         ln for ln in output.splitlines()
         if re.match(r".+:\d+:\d+:", ln)
     ])
     # Stricter: -0.07 per violation (was 0.05), floor at 0.2 not 0
-    score = max(0.2, 1.0 - violations * 0.07)
     return score, output
@@ -130,7 +133,7 @@ def score_review_quality(
 ) -> float:
     """Keyword-based review quality with minimum length requirement."""
     if not reviews:
-        return 0.0
     combined = " ".join(r.text.lower() for r in reviews)
@@ -153,13 +156,14 @@ def score_review_quality(
     code_words   = re.findall(r'\b[a-z_]{3,}\(\)', combined)
     specificity  = min(0.1, len(set(code_words)) * 0.025)
-    return min(1.0, kw_score * 0.7 + length_bonus + specificity)
 def score_reflection_quality(reflections: List[ReflectionArtifact]) -> float:
     """Score reflections on depth and actionability."""
     if not reflections:
-        return 0.0
     total = 0.0
     for ref in reflections:
@@ -173,7 +177,8 @@ def score_reflection_quality(reflections: List[ReflectionArtifact]) -> float:
             depth = min(1.0, depth + 0.2)
         total += depth
-    return min(1.0, total / max(1, len(reflections)))
 def score_efficiency(total_steps: int, max_steps: int) -> float:

     total = passed + failed + errors
     if total == 0:
+        return 0.01, output
     pass_rate = passed / total
+    # Strictly (0, 1)
+    pass_rate = max(0.01, min(0.99, pass_rate))
     return pass_rate, output
     output = result.stdout + result.stderr
     if result.returncode == 0:
+        return 0.99, "No lint violations."
     violations = len([
         ln for ln in output.splitlines()
         if re.match(r".+:\d+:\d+:", ln)
     ])
     # Stricter: -0.07 per violation (was 0.05), floor at 0.2 not 0
+    # Strictly (0, 1)
+    score = max(0.01, min(0.99, 1.0 - violations * 0.07))
     return score, output
 ) -> float:
     """Keyword-based review quality with minimum length requirement."""
     if not reviews:
+        return 0.01
     combined = " ".join(r.text.lower() for r in reviews)
     code_words   = re.findall(r'\b[a-z_]{3,}\(\)', combined)
     specificity  = min(0.1, len(set(code_words)) * 0.025)
+    # Strictly (0, 1)
+    return max(0.01, min(0.99, kw_score * 0.7 + length_bonus + specificity))
 def score_reflection_quality(reflections: List[ReflectionArtifact]) -> float:
     """Score reflections on depth and actionability."""
     if not reflections:
+        return 0.01
     total = 0.0
     for ref in reflections:
             depth = min(1.0, depth + 0.2)
         total += depth
+    # Strictly (0, 1)
+    return max(0.01, min(0.99, total / max(1, len(reflections))))
 def score_efficiency(total_steps: int, max_steps: int) -> float:
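
Note on the grader hunks: every sub-score now ends in the same `max(0.01, min(0.99, ...))` expression, so the clamp could plausibly live in one helper. The sketch below assumes hypothetical names `clamp_score` and `lint_score`; neither exists in `grader.py`. It also shows that the retained context comment in `score_lint` ("floor at 0.2 not 0") no longer matches the new expression, which floors at 0.01.

```python
# Hypothetical helpers; grader.py inlines the clamp instead.
def clamp_score(value: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Clamp a raw score into the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def lint_score(violations: int, penalty: float = 0.07) -> float:
    """Lint score with a fixed penalty per violation, clamped like the diff."""
    return clamp_score(1.0 - violations * penalty)


for n in (0, 5, 12, 15):
    # 12 violations gives about 0.16, below the 0.2 floor the old comment describes.
    print(n, round(lint_score(n), 2))
```

Passing `penalty=0.05` reproduces the expression used in `environment.py`.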

models.py CHANGED

@@ -129,7 +129,7 @@ class TestResult(BaseModel):
 class LintResult(BaseModel):
     violations: int = 0
     output: str = ""
-    score: float = 1.0  # 1.0 = clean
 class ReviewArtifact(BaseModel):
@@ -175,8 +175,8 @@ class Observation(BaseModel):
     reflections: List[ReflectionArtifact] = Field(default_factory=list)
     # Signals
-    reward: float = 0.0
-    cumulative_reward: float = 0.0
     done: bool = False
     info: Dict[str, Any] = Field(default_factory=dict)
@@ -188,11 +188,11 @@ class Observation(BaseModel):
 class EpisodeResult(BaseModel):
     task_id: str
     total_steps: int
-    test_pass_rate: float = 0.0
-    lint_score: float = 0.0
-    efficiency_score: float = 0.0
-    review_quality: float = 0.0
-    reflection_quality: float = 0.0
-    final_score: float = 0.0
     passed: bool = False
     log: List[str] = Field(default_factory=list)

 class LintResult(BaseModel):
     violations: int = 0
     output: str = ""
+    score: float = 0.99  # 0.99 = clean
 class ReviewArtifact(BaseModel):
     reflections: List[ReflectionArtifact] = Field(default_factory=list)
     # Signals
+    reward: float = 0.01
+    cumulative_reward: float = 0.01
     done: bool = False
     info: Dict[str, Any] = Field(default_factory=dict)
 class EpisodeResult(BaseModel):
     task_id: str
     total_steps: int
+    test_pass_rate: float = 0.01
+    lint_score: float = 0.01
+    efficiency_score: float = 0.01
+    review_quality: float = 0.01
+    reflection_quality: float = 0.01
+    final_score: float = 0.01
     passed: bool = False
     log: List[str] = Field(default_factory=list)
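
Note on the model defaults: the Phase 2 validator itself is not shown in this commit, so the check below is only a guess at what it enforces, namely that each score field lies strictly between 0 and 1. `out_of_range_fields` is a hypothetical helper; the field names are the ones declared on `EpisodeResult`.

```python
# Hypothetical check, assuming the validator rejects exactly 0.0 and 1.0.
SCORE_FIELDS = (
    "test_pass_rate",
    "lint_score",
    "efficiency_score",
    "review_quality",
    "reflection_quality",
    "final_score",
)


def out_of_range_fields(result: dict) -> list:
    """Return the score fields whose value is not strictly inside (0, 1)."""
    return [
        name for name in SCORE_FIELDS
        if not (0.0 < float(result[name]) < 1.0)
    ]


# The new defaults (all 0.01) pass; the old defaults (all 0.0) would not.
new_defaults = {name: 0.01 for name in SCORE_FIELDS}
old_defaults = {name: 0.0 for name in SCORE_FIELDS}
print(out_of_range_fields(new_defaults))
print(len(out_of_range_fields(old_defaults)))
```

A check like this could run over the dict form of an `EpisodeResult` in a unit test, so a future default that slips back to 0.0 or 1.0 is caught before submission.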

openenv.yaml CHANGED

@@ -117,12 +117,11 @@ observation_space:
 # ── Reward ─────────────────────────────────────────────────────────────────────
 reward:
-  range: [-0.40, 0.38]
   type: dense
   description: >
     Dense shaped reward. Positive for: correct plan steps, edits, passing tests,
-    clean lint, reviews, reflections, commits. Negative for: action failures,
-    repeated failures, and test file modification (severe: -0.30).
 # ── Tasks ──────────────────────────────────────────────────────────────────────
 tasks:
@@ -131,28 +130,28 @@ tasks:
     max_steps: 20
     description: "Fix an off-by-one bug in utils/list_ops.py. All 7 tests must pass."
     grader: grader.grade_episode
-    score_range: [0.0, 1.0]
   - id: medium_refactor_stats
     difficulty: medium
     max_steps: 30
     description: "Refactor monolithic stats.py into a stats/ package. 15 tests must pass with full backward compatibility."
     grader: grader.grade_episode
-    score_range: [0.0, 1.0]
   - id: hard_lru_cache_performance
     difficulty: hard
     max_steps: 40
     description: "Implement O(1) LRU cache from a stub. 15 correctness tests + 1 performance test (10k ops < 200ms)."
     grader: grader.grade_episode
-    score_range: [0.0, 1.0]
   - id: bonus_perf_regression_merge
     difficulty: hard
     max_steps: 40
     description: "Diagnose and fix O(n²) regression hidden inside a bad merge conflict resolution. Perf test: 50k docs < 500ms."
     grader: grader.grade_episode
-    score_range: [0.0, 1.0]
 # ── Infrastructure ─────────────────────────────────────────────────────────────
 runtime:

 # ── Reward ─────────────────────────────────────────────────────────────────────
 reward:
+  range: [0.01, 0.99]
   type: dense
   description: >
     Dense shaped reward. Positive for: correct plan steps, edits, passing tests,
+    clean lint, reviews, reflections, commits. Always strictly between 0 and 1.
 # ── Tasks ──────────────────────────────────────────────────────────────────────
 tasks:
     max_steps: 20
     description: "Fix an off-by-one bug in utils/list_ops.py. All 7 tests must pass."
     grader: grader.grade_episode
+    score_range: [0.01, 0.99]
   - id: medium_refactor_stats
     difficulty: medium
     max_steps: 30
     description: "Refactor monolithic stats.py into a stats/ package. 15 tests must pass with full backward compatibility."
     grader: grader.grade_episode
+    score_range: [0.01, 0.99]
   - id: hard_lru_cache_performance
     difficulty: hard
     max_steps: 40
     description: "Implement O(1) LRU cache from a stub. 15 correctness tests + 1 performance test (10k ops < 200ms)."
     grader: grader.grade_episode
+    score_range: [0.01, 0.99]
   - id: bonus_perf_regression_merge
     difficulty: hard
     max_steps: 40
     description: "Diagnose and fix O(n²) regression hidden inside a bad merge conflict resolution. Perf test: 50k docs < 500ms."
     grader: grader.grade_episode
+    score_range: [0.01, 0.99]
 # ── Infrastructure ─────────────────────────────────────────────────────────────
 runtime: