File size: 6,719 Bytes
43f214b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8cd6af2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
---
language:
- en
library_name: peft
base_model: unsloth/Qwen2.5-7B-Instruct
tags:
- grpo
- trl
- peft
- qwen2.5
- openenv
- portfolio-reasoning
- climate
license: other
pipeline_tag: text-generation
---

# CarbonAlpha Model Card

## Model Summary

CarbonAlpha is a climate-aware portfolio reasoning agent for the
`portfolio_env` OpenEnv environment. It reads one macro-news event, reasons
through first-order and second-order effects, and emits a constrained
`PortfolioAction`:

```json
{
  "weights": [w_tech, w_oil, w_green, w_real_estate, w_bonds],
  "infra_commit": 0.0,
  "carbon_offset_buy": 0.0,
  "put_hedge": 0.0,
  "tech_bet": "status_quo"
}
```

Current best research model:

```text
77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
```

Base model:

```text
unsloth/Qwen2.5-7B-Instruct
```

Adapter lineage:

1. SFT warm-start on 400 curriculum traces.
2. GRPO Phase 1 for 100 steps.
3. Holdout and manual macro-eval checks before promotion.

The live Space can load this adapter through the `MODEL_SUBFOLDER`
environment variable:

```text
https://77ethers-carbonalpha-demo.hf.space/
```

## Intended Use

This model is intended for the CarbonAlpha walkthrough demo and OpenEnv
evaluation. It is not a financial advisor and should not be used to make real
investment decisions.

The useful behavior to evaluate is:

- strict `<think>...</think>` plus JSON formatting;
- valid portfolio weights and bounded interventions;
- recognition of macro regime shifts;
- carbon-budget awareness;
- performance against the environment's equal-weight baseline.

## Training Data

The Qwen2.5 SFT warm-start used:

```text
sft_traces/curriculum_400_e80_m160_h160.jsonl
```

Trace mix:

- 80 easy traces;
- 160 medium / ambiguous traces;
- 160 hard traces.

The trace schema follows `sft_traces/merged_v6_aligned.jsonl`, with the same
prompt and completion contract used during inference.

## Training Pipeline

### SFT

SFT artifact:

```text
77ethers/CarbonAlpha/sft_qwen25_7b_curriculum400_v1
```

Training script:

```text
scripts/hf_sft_qwen25_7b.py
```

Configuration:

- QLoRA over `unsloth/Qwen2.5-7B-Instruct`;
- LoRA rank 16;
- `lora_alpha=16`;
- 220 SFT steps;
- effective batch size 4;
- Hugging Face Jobs L40S.

SFT result:

- generation sanity: 5/5 valid actions;
- holdout: 5/5 valid;
- mean holdout regret: `+0.02796`;
- beats baseline on 3/5 holdout seeds.

### GRPO

Best GRPO artifact:

```text
77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
```

Training script:

```text
scripts/hf_grpo_qwen25_adapter.py
```

GRPO configuration:

- warm-start from `sft_qwen25_7b_curriculum400_v1`;
- `use_vllm=False`;
- 100 GRPO steps;
- 128 generated Phase-1 prompts;
- 2 generations per prompt;
- batch size 2;
- learning rate `2e-6`;
- `loss_type="dapo"`;
- KL beta `0.02`.

Reward functions:

- format reward;
- action-contract reward;
- reasoning-shape reward;
- Phase-1 simulator regret reward;
- carbon-guard reward.

Important engineering choice: we avoided vLLM for the Qwen2.5 GRPO run because
earlier vLLM-based Qwen3 rollouts collapsed to one-token completions. The
plain-Transformers path was slower but healthier and easier to debug.

## Evidence of Training

The 100-step GRPO run was launched as a Hugging Face Job:

```text
https://huggingface.co/jobs/77ethers/69ed1ce0d70108f37acdeea3
```

Raw evidence committed in this repo:

```text
training_logs/qwen25_grpo_phase1_100_v1.log
training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl
```

The parsed JSONL contains 100 real GRPO metric rows extracted from the job log.

Loss and reward plots generated from those rows:

![Qwen2.5 GRPO loss curve](assets/loss_curve.png)

![Qwen2.5 GRPO reward curve](assets/reward_curve.png)

Additional rollout-health plot:

![Qwen2.5 GRPO completion length health](assets/qwen25_grpo_phase1_100_completion_lengths.png)

The completion-length plot is included because one-token rollout collapse was
the main failure mode in earlier GRPO attempts. In this successful run,
completion lengths stayed well above the smoke threshold throughout training.

## Evaluation

### Holdout

Holdout seeds:

```text
100, 200, 300, 400, 500
```

Best GRPO holdout results:

| Metric | Value |
|---|---:|
| Valid completions | 5/5 |
| Mean holdout regret | `+0.1058` |
| Beats baseline | 5/5 |
| Previous v6 SFT mean regret bar | `+0.034` |

Per-seed holdout:

| Seed | Shock | Regret |
|---:|---|---:|
| 100 | `hard_rare_earth_rotation` | `+0.0755` |
| 200 | `easy_tech_earnings` | `+0.1210` |
| 300 | `easy_tech_earnings` | `+0.1442` |
| 400 | `hard_deflation_pulse` | `+0.1527` |
| 500 | `ambig_ai_efficiency` | `+0.0358` |

### Manual Macro Eval

Eval set:

```text
evals/macro_eval_10.jsonl
```

Report:

```text
evals/macro_eval_10_grpo_report.json
```

Summary:

- GRPO adapter: 10/10 valid JSON actions;
- GRPO adapter: 10/10 closed `<think>`;
- base model: 9/10 valid JSON actions;
- GRPO was stronger on rare-earth export controls, global deflation pulse, and
  yen carry unwind.

Known weaknesses:

- `q02_oil_chokepoint_inflation`: the model understood the inflation regime
  and hedged, but underweighted OIL despite the direct supply shock.
- `q04_ai_efficiency_paradox`: the model correctly liked TECH and cut
  REAL_ESTATE, but gave GREEN too much weight despite lower data-center power
  demand expectations.

These are targeted follow-up items, not hidden failures.

## Comparison With Qwen3 Base Branch

We also tested an isolated Qwen3-4B-Base branch:

```text
77ethers/CarbonAlpha/grpo_qwen3_4b_base_smoke_v2
```

Result:

- smoke gate passed mechanically;
- no one-token collapse;
- completions were too long, often near the 400-token cap;
- holdout: 4/5 valid;
- mean holdout regret: `-0.0229`;
- did not beat the Qwen2.5 GRPO model.

Conclusion: Qwen3 Base is a viable research branch, but the current production
candidate remains Qwen2.5-7B SFT plus GRPO.

## Limitations

- The GRPO run is Phase 1 only, so it is strongest on easy-shock simulator
  reward optimization.
- The model still has known second-order reasoning weaknesses in specific
  macro setups.
- The reward environment is synthetic and should be interpreted as a benchmark,
  not a market simulator.
- The model is private on Hugging Face and requires `HF_API_TOKEN` for loading.

## Reproducibility

Final notebook:

```text
notebooks/carbonalpha_final_pipeline.ipynb
```

Colab link:

```text
https://colab.research.google.com/github/capabl-machines/gridops/blob/round-2/notebooks/carbonalpha_final_pipeline.ipynb
```

The notebook verifies artifacts, loads metrics from Hugging Face, runs an
environment smoke test, shows the manual eval set, and includes opt-in cells
to relaunch the exact HF Jobs training runs.