docs: update reward signal documentation with structured tables and weight breakdowns
Blog.MD
CHANGED
@@ -66,13 +66,24 @@ The document is a rigorous JSON schema defining the exact state of the enterpris
 
 ## 🏆 A Reward Signal That Actually Teaches
 
-A great environment
-
-
-
-
-
-
+A great environment has a reward function that:
+
+| Requirement | How CORP-ENV Delivers |
+|-------------|----------------------|
+| **Provides a rich, informative signal** | Blends Phase Transitions, Conflict Identification, Resolution Logging, and Iterative Validation rather than emitting a single 0/1 final score. |
+| **Captures something hard to measure** | Captures *how well* the model organizes chaos. The strict structure of the SWD and the documented reasoning phases provide dense intermediate signals. |
+| **Uses the rubric system thoughtfully** | The reward is a composable rubric that scores far more than a monolithic `success/failure`. We prioritize programmatic validation of corporate rigor over ambiguous LLM judges. |
+| **Is hard to game** | An agent that tries to exploit the reward by skipping directly to `finalize`, missing milestones, or submitting malformed JSON patches is severely penalized and has its reward clamped. |
+
+### Reward Breakdown (Terminal, at `finalize`)
+
+| Component | Weight | Evaluation Method |
+| :--- | :--- | :--- |
+| **Completion** | 35% | Verifier |
+| **SWD Coherence** | 25% | Structural |
+| **Milestones** | 20% | On-time |
+| **Reasoning** | 10% | Log entries |
+| **LLM Judge** | 10% | 3 YES/NO Qs |
 
 ---
 
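The weight breakdown added in this hunk can be read as a weighted sum of per-component scores. The sketch below is illustrative only and is not taken from the CORP-ENV code: the names `WEIGHTS`, `judge_score`, and `terminal_reward` are hypothetical, each component score is assumed to be normalized to the range 0 to 1, and the LLM judge's three YES/NO answers are assumed to be averaged. The anti-gaming penalties mentioned in the table are not modeled, because the diff does not state their values.

```python
# Weights copied from the "Reward Breakdown" table; they sum to 1.0.
WEIGHTS = {
    "completion": 0.35,     # Verifier
    "swd_coherence": 0.25,  # Structural checks
    "milestones": 0.20,     # On-time milestones
    "reasoning": 0.10,      # Log entries
    "llm_judge": 0.10,      # 3 YES/NO questions
}


def judge_score(answers: list[bool]) -> float:
    """Assumption: the judge score is the fraction of YES answers."""
    if not answers:
        return 0.0
    return sum(answers) / len(answers)


def terminal_reward(scores: dict[str, float]) -> float:
    """Combine per-component scores (assumed to lie in [0, 1]) into one reward."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    # Clamp each component so an out-of-range score cannot inflate the total.
    return sum(
        weight * min(max(scores[name], 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )


example = {
    "completion": 1.0,
    "swd_coherence": 0.8,
    "milestones": 0.5,
    "reasoning": 1.0,
    "llm_judge": judge_score([True, True, False]),
}
print(round(terminal_reward(example), 4))  # prints 0.8167
```

Under these assumptions, a perfect episode scores 1.0, and the LLM judge can move the total by at most 0.10, which matches the stated preference for programmatic validation over judge opinions.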
README.md
CHANGED
@@ -96,6 +96,27 @@ The SWD is a rigorous JSON schema defining the exact state of the enterprise epi
 | `m1_budget_reallocation` | Medium | Budget conflict across dev / HR / finance. |
 | `h1_acquisition_defence` | Hard | Acquisition defence with injected contradictory intel. |
 
+## 🏆 A Reward Signal That Actually Teaches
+
+A great environment has a reward function that:
+
+| Requirement | How CORP-ENV Delivers |
+|-------------|----------------------|
+| **Provides a rich, informative signal** | Blends Phase Transitions, Conflict Identification, Resolution Logging, and Iterative Validation rather than emitting a single 0/1 final score. |
+| **Captures something hard to measure** | Evaluates *how well* the model organizes chaos. The strict structure of the SWD and the documented reasoning phases provide dense intermediate signals. |
+| **Uses the rubric system thoughtfully** | The final reward relies on programmatic validation of corporate rigor and is composed of granular rubric items rather than a monolithic `success/failure`. |
+| **Is hard to game** | Skipping straight to `finalize`, missing milestones, or submitting malformed JSON patches causes the reward to be aggressively clamped, so agents gain nothing from trying to exploit it. |
+
+### Reward Breakdown (Terminal, at `finalize`)
+
+| Component | Weight | Evaluation Method |
+| :--- | :--- | :--- |
+| **Completion** | 35% | Verifier |
+| **SWD Coherence** | 25% | Structural |
+| **Milestones** | 20% | On-time |
+| **Reasoning** | 10% | Log entries |
+| **LLM Judge** | 10% | 3 YES/NO Qs |
+
 ## Quick Start
 
```bash