Navigam committed on
Commit c829526 · 1 Parent(s): 5ee1234

docs: update reward signal documentation with structured tables and weight breakdowns

Files changed (2)
  1. Blog.MD +18 -7
  2. README.md +21 -0
Blog.MD CHANGED
@@ -66,13 +66,24 @@ The document is a rigorous JSON schema defining the exact state of the enterpris
 
  ## 🏆 A Reward Signal That Actually Teaches
 
- A great environment relies on a reward function that provides rich signals, not just an arbitrary 0 or 1 at the end of the tunnel. In CORP-ENV, if an agent just blindly guesses the final outcome without doing the work, it fails. The reward is meticulously designed:
-
- - **Rich, Informative, and Composable**: We use OpenEnv's Rubric system. The reward blends Phase Transitions, Conflict Identification, Resolution Logging, and Iterative Validation. It's a composable rubric scoring far beyond monolithic `success/failure`.
- - **Captures Something Hard**: It captures *how well* the model organizes chaos. The structure of the SWD and the presence of documented reasoning phases provide dense intermediate signals.
- - **Hard to Game**: An agent attempting to exploit the reward by skipping to `finalize` gets severely penalized. If it missed milestones, didn't talk to HR, or submitted malformed JSON patches, the reward is aggressively clamped.
-
- We explicitly removed any reliance on ambiguous LLM judges. Our rewards are pure, programmatic validations of corporate rigor.
+ A great environment has a reward function that:
+
+ | Requirement | How CORP-ENV Delivers |
+ |-------------|----------------------|
+ | **Provides a rich, informative signal** | Blends Phase Transitions, Conflict Identification, Resolution Logging, and Iterative Validation rather than a 0/1 final score. |
+ | **Captures something hard to measure** | Measures *how well* the model organizes chaos. The strict structure of the SWD and the documented reasoning phases provide dense intermediate signals. |
+ | **Uses the Rubric system thoughtfully** | The reward is a composable rubric rather than a monolithic `success/failure` check. We prioritize programmatic validations of corporate rigor over ambiguous LLM judges, which contribute only 10% of the terminal reward. |
+ | **Is hard to game** | An agent that tries to exploit the reward by skipping directly to `finalize`, missing milestones, or submitting malformed JSON patches has its reward severely penalized and clamped. |
+
+ ### Reward Breakdown (Terminal, at `finalize`)
+
+ | Component | Weight | Evaluation Method |
+ | :--- | :--- | :--- |
+ | **Completion** | 35% | Verifier |
+ | **SWD Coherence** | 25% | Structural |
+ | **Milestones** | 20% | On-time |
+ | **Reasoning** | 10% | Log entries |
+ | **LLM Judge** | 10% | 3 yes/no questions |
 
  ---
 
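The reward breakdown added in this hunk amounts to a weighted sum of five component scores, with a clamp for episodes that try to game the signal. The following is a minimal sketch of that idea, not CORP-ENV's actual implementation: it assumes every component is already normalized to [0, 1], and the names `WEIGHTS`, `terminal_reward`, `gate_ok`, and the clamp ceiling of 0.1 are hypothetical. Only the weights come from the table.

```python
# Weights copied from the "Reward Breakdown" table; everything else is illustrative.
WEIGHTS = {
    "completion": 0.35,
    "swd_coherence": 0.25,
    "milestones": 0.20,
    "reasoning": 0.10,
    "llm_judge": 0.10,
}


def terminal_reward(scores, gate_ok=True, clamp_ceiling=0.1):
    """Blend per-component scores (each expected in [0, 1]) into one reward.

    gate_ok=False stands in for a failed integrity check, such as skipped
    milestones or a malformed JSON patch (a hypothetical gate).
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    # Clip each component to [0, 1] before weighting so no single score can dominate.
    total = sum(
        WEIGHTS[name] * min(max(scores[name], 0.0), 1.0) for name in WEIGHTS
    )
    # A gamed episode cannot exceed the ceiling, whatever the components say.
    return total if gate_ok else min(total, clamp_ceiling)


# Example episode; llm_judge = 2/3 assumes the three yes/no answers are averaged.
honest = {
    "completion": 1.0,
    "swd_coherence": 0.8,
    "milestones": 0.5,
    "reasoning": 1.0,
    "llm_judge": 2 / 3,
}
print(round(terminal_reward(honest), 3))
print(round(terminal_reward(honest, gate_ok=False), 3))
```

Because the weights sum to 1, the blended reward stays in [0, 1], and the LLM judge can move it by at most 0.10 in this sketch.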
README.md CHANGED
@@ -96,6 +96,27 @@ The SWD is a rigorous JSON schema defining the exact state of the enterprise epi
  | `m1_budget_reallocation` | Medium | Budget conflict across dev / HR / finance. |
  | `h1_acquisition_defence` | Hard | Acquisition defence with injected contradictory intel. |
 
+ ## 🏆 A Reward Signal That Actually Teaches
+
+ A great environment has a reward function that:
+
+ | Requirement | How CORP-ENV Delivers |
+ |-------------|----------------------|
+ | **Provides a rich, informative signal** | Blends Phase Transitions, Conflict Identification, Resolution Logging, and Iterative Validation rather than a 0/1 final score. |
+ | **Captures something hard to measure** | Evaluates *how well* the model organizes chaos. The strict structure of the SWD and the documented reasoning phases provide dense intermediate signals. |
+ | **Uses the Rubric system thoughtfully** | The final reward composes granular rubric items rather than a monolithic `success/failure` check, and relies mainly on programmatic validations of corporate rigor. |
+ | **Is hard to game** | Skipping straight to `finalize`, missing milestones, or submitting malformed JSON patches causes the reward to be aggressively clamped. |
+
+ ### Reward Breakdown (Terminal, at `finalize`)
+
+ | Component | Weight | Evaluation Method |
+ | :--- | :--- | :--- |
+ | **Completion** | 35% | Verifier |
+ | **SWD Coherence** | 25% | Structural |
+ | **Milestones** | 20% | On-time |
+ | **Reasoning** | 10% | Log entries |
+ | **LLM Judge** | 10% | 3 yes/no questions |
+
  ## Quick Start
 
  ```bash