Commit 8cd6af2 by 77ethers · verified · 1 Parent(s): a0bb273

Add CarbonAlpha model card and training evidence

  1. README.md +284 -0
README.md ADDED
@@ -0,0 +1,284 @@
+ # CarbonAlpha Model Card
+
+ ## Model Summary
+
+ CarbonAlpha is a climate-aware portfolio reasoning agent for the
+ `portfolio_env` OpenEnv environment. It reads one macro-news event, reasons
+ through first-order and second-order effects, and emits a constrained
+ `PortfolioAction`:
+
+ ```json
+ {
+   "weights": [w_tech, w_oil, w_green, w_real_estate, w_bonds],
+   "infra_commit": 0.0,
+   "carbon_offset_buy": 0.0,
+   "put_hedge": 0.0,
+   "tech_bet": "status_quo"
+ }
+ ```
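
The action contract above can be checked mechanically. The following is a minimal sketch, not the environment's own validator: it assumes long-only weights that sum to 1 and treats each intervention as a number in [0, 1]. Those bounds, the tolerance, and the `validate_action` helper are illustrative assumptions; the card only says that weights must be valid and interventions bounded.

```python
import json
import math

# Asset order as listed in the card's PortfolioAction example.
ASSETS = ["tech", "oil", "green", "real_estate", "bonds"]


def _is_number(value) -> bool:
    """True for finite ints/floats, excluding booleans."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )


def validate_action(raw: str, tol: float = 1e-6) -> list:
    """Return a list of problems; an empty list means the action passed."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(action, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    weights = action.get("weights")
    if not isinstance(weights, list) or len(weights) != len(ASSETS):
        problems.append(f"weights must be a list of {len(ASSETS)} numbers")
    elif not all(_is_number(w) for w in weights):
        problems.append("weights must be finite numbers")
    else:
        # Assumption: long-only portfolio that is fully invested.
        if any(w < 0 for w in weights):
            problems.append("weights must be non-negative")
        if abs(sum(weights) - 1.0) > tol:
            problems.append("weights must sum to 1")

    # Assumption: each intervention is a fraction in [0, 1].
    for key in ("infra_commit", "carbon_offset_buy", "put_hedge"):
        value = action.get(key)
        if not _is_number(value) or not 0.0 <= value <= 1.0:
            problems.append(f"{key} must be a number in [0, 1]")

    if not isinstance(action.get("tech_bet"), str):
        problems.append("tech_bet must be a string")
    return problems
```

Returning a list of problems rather than a boolean makes it easier to see which part of the contract a completion violated.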
+
+ Current best research model:
+
+ ```text
+ 77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
+ ```
+
+ Base model:
+
+ ```text
+ unsloth/Qwen2.5-7B-Instruct
+ ```
+
+ Adapter lineage:
+
+ 1. SFT warm-start on 400 curriculum traces.
+ 2. GRPO Phase 1 for 100 steps.
+ 3. Holdout and manual macro-eval checks before promotion.
+
+ The live Space, linked below, can load this adapter through the
+ `MODEL_SUBFOLDER` environment variable:
+
+ ```text
+ https://77ethers-carbonalpha-demo.hf.space/
+ ```
+
+ ## Intended Use
+
+ This model is intended for the CarbonAlpha walkthrough demo and OpenEnv
+ evaluation. It is not a financial advisor and should not be used to make real
+ investment decisions.
+
+ The behaviors worth evaluating are:
+
+ - strict `<think>...</think>` plus JSON formatting;
+ - valid portfolio weights and bounded interventions;
+ - recognition of macro regime shifts;
+ - carbon-budget awareness;
+ - performance against the environment's equal-weight baseline.
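
The first item in the list above, strict `<think>...</think>` plus JSON formatting, can be tested with a small parser. This is a sketch under an assumed layout: exactly one closed `<think>` block followed by a single JSON object and nothing else. The `parse_completion` helper and the regular expression are hypothetical and are not taken from the repository.

```python
import json
import re

# Assumed layout: one closed <think> block, then one JSON object, nothing else.
COMPLETION_RE = re.compile(
    r"\A\s*<think>(?P<think>.*?)</think>\s*(?P<json>\{.*\})\s*\Z",
    re.DOTALL,
)


def parse_completion(text: str):
    """Return (reasoning, action_dict), or None if the format is violated."""
    match = COMPLETION_RE.match(text)
    if match is None:
        return None
    # Reject a second <think> opener hidden inside the captured reasoning.
    if "<think>" in match.group("think"):
        return None
    try:
        action = json.loads(match.group("json"))
    except json.JSONDecodeError:
        return None
    return match.group("think").strip(), action
```

A completion that never closes its `<think>` tag, or that emits JSON with no reasoning block, returns `None`, which matches the "closed `<think>`" check reported in the manual eval summary.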
+
+ ## Training Data
+
+ The Qwen2.5 SFT warm-start used:
+
+ ```text
+ sft_traces/curriculum_400_e80_m160_h160.jsonl
+ ```
+
+ Trace mix:
+
+ - 80 easy traces;
+ - 160 medium / ambiguous traces;
+ - 160 hard traces.
+
+ The trace schema follows `sft_traces/merged_v6_aligned.jsonl`, with the same
+ prompt and completion contract used during inference.
+
+ ## Training Pipeline
+
+ ### SFT
+
+ SFT artifact:
+
+ ```text
+ 77ethers/CarbonAlpha/sft_qwen25_7b_curriculum400_v1
+ ```
+
+ Training script:
+
+ ```text
+ scripts/hf_sft_qwen25_7b.py
+ ```
+
+ Configuration:
+
+ - QLoRA over `unsloth/Qwen2.5-7B-Instruct`;
+ - LoRA rank 16;
+ - `lora_alpha=16`;
+ - 220 SFT steps;
+ - effective batch size 4;
+ - Hugging Face Jobs L40S.
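
For a sense of scale, the step count and batch size above can be combined with the 400-trace curriculum. The arithmetic below is a rough sketch that assumes every optimizer step consumes exactly one effective batch of whole traces, with no sequence packing; the card does not state this.

```python
# Figures from the card: trace mix, SFT steps, and effective batch size.
TRACE_MIX = {"easy": 80, "medium": 160, "hard": 160}
SFT_STEPS = 220
EFFECTIVE_BATCH = 4

total_traces = sum(TRACE_MIX.values())          # size of the curriculum file
examples_seen = SFT_STEPS * EFFECTIVE_BATCH     # assumes no packing
passes = examples_seen / total_traces           # approximate epochs
shares = {name: count / total_traces for name, count in TRACE_MIX.items()}

print(f"{total_traces} traces, {examples_seen} examples seen, ~{passes:.1f} passes")
print(shares)
```

Under that assumption the warm-start sees each trace a little more than twice, with hard and medium traces each making up 40% of the mix.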
+
+ SFT result:
+
+ - generation sanity: 5/5 valid actions;
+ - holdout: 5/5 valid;
+ - mean holdout regret: `+0.02796`;
+ - beats baseline on 3/5 holdout seeds.
+
+ ### GRPO
+
+ Best GRPO artifact:
+
+ ```text
+ 77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
+ ```
+
+ Training script:
+
+ ```text
+ scripts/hf_grpo_qwen25_adapter.py
+ ```
+
+ GRPO configuration:
+
+ - warm-start from `sft_qwen25_7b_curriculum400_v1`;
+ - `use_vllm=False`;
+ - 100 GRPO steps;
+ - 128 generated Phase-1 prompts;
+ - 2 generations per prompt;
+ - batch size 2;
+ - learning rate `2e-6`;
+ - `loss_type="dapo"`;
+ - KL beta `0.02`.
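
One consequence of using 2 generations per prompt is worth spelling out. In the common GRPO formulation, each completion's advantage is its reward minus the group mean, divided by the group standard deviation. The sketch below uses the population standard deviation and a small `eps`; both are assumptions, and the card does not say which normalization the training script or the `dapo` loss variant applies.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages for the completions of a single prompt.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for the two completions of one prompt.
print(group_advantages([0.3, 0.1]))   # equal and opposite
print(group_advantages([0.5, 0.5]))   # a tie gives no learning signal
```

With a group of two, the advantages are always equal and opposite, and a tie yields zeros, so only prompts where the two completions score differently contribute a gradient.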
+
+ Reward functions:
+
+ - format reward;
+ - action-contract reward;
+ - reasoning-shape reward;
+ - Phase-1 simulator regret reward;
+ - carbon-guard reward.
+
+ Important engineering choice: we avoided vLLM for the Qwen2.5 GRPO run because
+ earlier vLLM-based Qwen3 rollouts collapsed to one-token completions. The
+ plain-Transformers path was slower but healthier and easier to debug.
+
+ ## Evidence of Training
+
+ The 100-step GRPO run was launched as a Hugging Face Job:
+
+ ```text
+ https://huggingface.co/jobs/77ethers/69ed1ce0d70108f37acdeea3
+ ```
+
+ Raw evidence committed in this repo:
+
+ ```text
+ training_logs/qwen25_grpo_phase1_100_v1.log
+ training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl
+ ```
+
+ The parsed JSONL contains 100 real GRPO metric rows extracted from the job log.
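
The extraction step that produced the JSONL is not shown in this card. One plausible way to do it, sketched below, assumes the job log prints each metric row as a Python dict literal on its own line, as the Hugging Face `Trainer` logging callbacks typically do. The `extract_metric_rows` helper, the `loss` filter key, and the sample values are all hypothetical.

```python
import ast
import json
import re

# Grab the outermost {...} span on a line, if any.
DICT_RE = re.compile(r"\{.*\}")


def extract_metric_rows(log_text: str, required_key: str = "loss") -> list:
    """Collect dict literals that look like metric rows from raw log text."""
    rows = []
    for line in log_text.splitlines():
        match = DICT_RE.search(line)
        if match is None:
            continue
        try:
            row = ast.literal_eval(match.group(0))
        except (ValueError, SyntaxError, TypeError):
            continue  # braces that are not a Python literal
        if isinstance(row, dict) and required_key in row:
            rows.append(row)
    return rows


def to_jsonl(rows: list) -> str:
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


# Made-up log excerpt, for illustration only.
sample_log = (
    "step 1\n"
    "{'loss': 0.012, 'reward': 0.41, 'epoch': 0.01}\n"
    "INFO saving {checkpoint}\n"
    "{'loss': 0.010, 'reward': 0.45, 'epoch': 0.02}\n"
)
rows = extract_metric_rows(sample_log)
print(len(rows))
print(to_jsonl(rows))
```

Using `ast.literal_eval` rather than `eval` keeps the parser safe on untrusted log text, since it only accepts literals.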
+
+ Loss and reward plots generated from those rows:
+
+ ![Qwen2.5 GRPO loss curve](assets/loss_curve.png)
+
+ ![Qwen2.5 GRPO reward curve](assets/reward_curve.png)
+
+ Additional rollout-health plot:
+
+ ![Qwen2.5 GRPO completion length health](assets/qwen25_grpo_phase1_100_completion_lengths.png)
+
+ The completion-length plot is included because one-token rollout collapse was
+ the main failure mode in earlier GRPO attempts. In this successful run,
+ completion lengths stayed well above the smoke threshold throughout training.
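
A collapse check of this kind can be run directly on the metric rows. The sketch below flags any row whose mean completion length falls under a threshold. The field name `completions/mean_length` and the threshold of 8 tokens are assumptions; the card gives neither the JSONL schema nor the smoke-threshold value.

```python
import json


def collapsed_rows(jsonl_text: str,
                   length_key: str = "completions/mean_length",
                   threshold: float = 8.0) -> list:
    """Return 1-based indices of metric rows whose mean length is too short.

    Rows that lack the length field are treated as collapsed (length 0).
    """
    flagged = []
    row_index = 0
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row_index += 1
        row = json.loads(line)
        if row.get(length_key, 0.0) < threshold:
            flagged.append(row_index)
    return flagged


# Made-up rows: a healthy step followed by a one-token collapse.
sample = (
    '{"loss": 0.012, "completions/mean_length": 212.5}\n'
    '{"loss": 0.000, "completions/mean_length": 1.0}\n'
)
print(collapsed_rows(sample))
```

An empty result over all 100 rows would correspond to the healthy run described above.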
+
+ ## Evaluation
+
+ ### Holdout
+
+ Holdout seeds:
+
+ ```text
+ 100, 200, 300, 400, 500
+ ```
+
+ Best GRPO holdout results:
+
+ | Metric | Value |
+ |---|---:|
+ | Valid completions | 5/5 |
+ | Mean holdout regret | `+0.1058` |
+ | Beats baseline | 5/5 |
+ | Previous v6 SFT mean regret bar | `+0.034` |
+
+ Per-seed holdout:
+
+ | Seed | Shock | Regret |
+ |---:|---|---:|
+ | 100 | `hard_rare_earth_rotation` | `+0.0755` |
+ | 200 | `easy_tech_earnings` | `+0.1210` |
+ | 300 | `easy_tech_earnings` | `+0.1442` |
+ | 400 | `hard_deflation_pulse` | `+0.1527` |
+ | 500 | `ambig_ai_efficiency` | `+0.0358` |
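
The summary figures can be recomputed from the per-seed table. The snippet below assumes that a positive regret means the model outperformed the equal-weight baseline on that seed, which is consistent with the 5/5 "beats baseline" row but is not defined explicitly in the card.

```python
# Per-seed regrets copied from the holdout table.
HOLDOUT_REGRET = {
    100: 0.0755,
    200: 0.1210,
    300: 0.1442,
    400: 0.1527,
    500: 0.0358,
}

mean_regret = sum(HOLDOUT_REGRET.values()) / len(HOLDOUT_REGRET)
# Assumption: regret > 0 means the model beat the baseline on that seed.
beats = sum(1 for regret in HOLDOUT_REGRET.values() if regret > 0)

print(f"mean regret {mean_regret:+.4f}, beats baseline {beats}/{len(HOLDOUT_REGRET)}")
# prints: mean regret +0.1058, beats baseline 5/5
```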
+
+ ### Manual Macro Eval
+
+ Eval set:
+
+ ```text
+ evals/macro_eval_10.jsonl
+ ```
+
+ Report:
+
+ ```text
+ evals/macro_eval_10_grpo_report.json
+ ```
+
+ Summary:
+
+ - GRPO adapter: 10/10 valid JSON actions;
+ - GRPO adapter: 10/10 closed `<think>`;
+ - base model: 9/10 valid JSON actions;
+ - GRPO was stronger on rare-earth export controls, global deflation pulse, and
+   yen carry unwind.
+
+ Known weaknesses:
+
+ - `q02_oil_chokepoint_inflation`: the model understood the inflation regime
+   and hedged, but underweighted OIL despite the direct supply shock.
+ - `q04_ai_efficiency_paradox`: the model correctly liked TECH and cut
+   REAL_ESTATE, but gave GREEN too much weight despite lower data-center power
+   demand expectations.
+
+ These are targeted follow-up items, not hidden failures.
+
+ ## Comparison With Qwen3 Base Branch
+
+ We also tested an isolated Qwen3-4B-Base branch:
+
+ ```text
+ 77ethers/CarbonAlpha/grpo_qwen3_4b_base_smoke_v2
+ ```
+
+ Result:
+
+ - smoke gate passed mechanically;
+ - no one-token collapse;
+ - completions were too long, often near the 400-token cap;
+ - holdout: 4/5 valid;
+ - mean holdout regret: `-0.0229`;
+ - did not beat the Qwen2.5 GRPO model.
+
+ Conclusion: Qwen3 Base is a viable research branch, but the current production
+ candidate remains Qwen2.5-7B SFT plus GRPO.
+
+ ## Limitations
+
+ - The GRPO run is Phase 1 only, so it is strongest at optimizing the simulator
+   reward on easy shocks.
+ - The model still has known second-order reasoning weaknesses in specific
+   macro setups.
+ - The reward environment is synthetic and should be interpreted as a benchmark,
+   not a market simulator.
+ - The model is private on Hugging Face and requires `HF_API_TOKEN` for loading.
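
Because the repository is private, any loader has to pass a token. The sketch below builds download arguments from the two environment variables named in this card. It assumes the artifact path splits into the repo `77ethers/CarbonAlpha` plus an adapter subfolder, and that `MODEL_SUBFOLDER` holds that subfolder name. The `adapter_download_kwargs` helper is hypothetical and is not the Space's actual loading code.

```python
import os

# Assumption: artifact paths in the card are "<repo_id>/<subfolder>".
REPO_ID = "77ethers/CarbonAlpha"
DEFAULT_SUBFOLDER = "grpo_qwen25_7b_adapter_phase1_100_v1"


def adapter_download_kwargs(env=os.environ) -> dict:
    """Build keyword arguments for downloading one adapter subfolder."""
    token = env.get("HF_API_TOKEN")
    if not token:
        raise RuntimeError("HF_API_TOKEN is not set; the private repo cannot be read")
    subfolder = env.get("MODEL_SUBFOLDER", DEFAULT_SUBFOLDER)
    return {
        "repo_id": REPO_ID,
        "allow_patterns": [f"{subfolder}/*"],  # fetch only this adapter
        "token": token,
    }


# Usage (needs `huggingface_hub` and network access, so it is left commented out):
# from huggingface_hub import snapshot_download
# local_dir = snapshot_download(**adapter_download_kwargs())
```

Failing fast on a missing token gives a clearer error than the authorization failure that would otherwise surface from the Hub.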
+
+ ## Reproducibility
+
+ Final notebook:
+
+ ```text
+ notebooks/carbonalpha_final_pipeline.ipynb
+ ```
+
+ Colab link:
+
+ ```text
+ https://colab.research.google.com/github/capabl-machines/gridops/blob/round-2/notebooks/carbonalpha_final_pipeline.ipynb
+ ```
+
+ The notebook verifies artifacts, loads metrics from Hugging Face, runs an
+ environment smoke test, shows the manual eval set, and includes opt-in cells
+ to relaunch the exact HF Jobs training runs.