---
language:
- en
library_name: peft
base_model: unsloth/Qwen2.5-7B-Instruct
tags:
- grpo
- trl
- peft
- qwen2.5
- openenv
- portfolio-reasoning
- climate
license: other
pipeline_tag: text-generation
---
# CarbonAlpha Model Card
## Model Summary
CarbonAlpha is a climate-aware portfolio reasoning agent for the
`portfolio_env` OpenEnv environment. It reads one macro-news event, reasons
through first-order and second-order effects, and emits a constrained
`PortfolioAction`:
```json
{
"weights": [w_tech, w_oil, w_green, w_real_estate, w_bonds],
"infra_commit": 0.0,
"carbon_offset_buy": 0.0,
"put_hedge": 0.0,
"tech_bet": "status_quo"
}
```
Current best research model:
```text
77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
```
Base model:
```text
unsloth/Qwen2.5-7B-Instruct
```
Adapter lineage:
1. SFT warm-start on 400 curriculum traces.
2. GRPO Phase 1 for 100 steps.
3. Holdout and manual macro-eval checks before promotion.
The live Space (URL below) can load this adapter through its `MODEL_SUBFOLDER`
environment variable:
```text
https://77ethers-carbonalpha-demo.hf.space/
```
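To load the adapter outside the Space, something like the following sketch may work. It assumes the adapter files sit in a subfolder of the `77ethers/CarbonAlpha` repo (as the path above suggests), that `transformers` and `peft` are installed, and that the access token is read from `HF_API_TOKEN`. The helper names `split_adapter_path` and `load_carbonalpha` are hypothetical, and this is not the Space's actual loading code.

```python
import os

BASE_MODEL = "unsloth/Qwen2.5-7B-Instruct"
ADAPTER_PATH = "77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1"


def split_adapter_path(path: str) -> tuple[str, str]:
    """Split 'owner/repo/subfolder' into (repo_id, subfolder)."""
    owner, repo, *rest = path.split("/")
    return f"{owner}/{repo}", "/".join(rest)


def load_carbonalpha(adapter_path: str = ADAPTER_PATH):
    """Load the base model and attach the LoRA adapter (downloads weights)."""
    # Heavy imports are deferred so the path helper stays importable anywhere.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id, subfolder = split_adapter_path(adapter_path)
    token = os.environ.get("HF_API_TOKEN")  # assumed token source for the private repo

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="auto")
    model = PeftModel.from_pretrained(base, repo_id, subfolder=subfolder, token=token)
    return tokenizer, model
```

Calling `load_carbonalpha()` downloads the 7B base weights, so it is best run on a machine with enough memory and a valid token.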
## Intended Use
This model is intended for the CarbonAlpha walkthrough demo and OpenEnv
evaluation. It is not a financial advisor and should not be used to make real
investment decisions.
The useful behavior to evaluate is:
- strict `<think>...</think>` plus JSON formatting;
- valid portfolio weights and bounded interventions;
- recognition of macro regime shifts;
- carbon-budget awareness;
- performance against the environment's equal-weight baseline.
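The first two checks in this list can be approximated with a small parser. The sketch below is illustrative only: it assumes five non-negative weights that sum to 1 within a tolerance and non-negative numeric interventions. The real `portfolio_env` contract and its bounds are not stated in this card and may differ, and `parse_action` is a hypothetical helper name.

```python
import json
import re

ASSETS = ["tech", "oil", "green", "real_estate", "bonds"]
# One closed <think> block, then a single JSON object, nothing else.
COMPLETION_RE = re.compile(r"^\s*<think>(.*?)</think>\s*(\{.*\})\s*$", re.DOTALL)


def _is_number(value) -> bool:
    """True for int/float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_action(completion: str, tol: float = 1e-3):
    """Return the action dict if it passes the assumed checks, else None."""
    match = COMPLETION_RE.match(completion)
    if match is None:
        return None
    try:
        action = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    weights = action.get("weights")
    if not isinstance(weights, list) or len(weights) != len(ASSETS):
        return None
    if not all(_is_number(w) and w >= 0 for w in weights):
        return None
    if abs(sum(weights) - 1.0) > tol:  # assumed: long-only, fully invested
        return None
    for key in ("infra_commit", "carbon_offset_buy", "put_hedge"):
        value = action.get(key)
        if not _is_number(value) or value < 0:  # assumed lower bound of 0
            return None
    if not isinstance(action.get("tech_bet"), str):
        return None
    return action
```

A completion that omits the closing `</think>` tag, or whose weights do not sum to 1, yields `None` under these assumptions.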
## Training Data
The Qwen2.5 SFT warm-start used:
```text
sft_traces/curriculum_400_e80_m160_h160.jsonl
```
Trace mix:
- 80 easy traces;
- 160 medium / ambiguous traces;
- 160 hard traces.
The trace schema follows `sft_traces/merged_v6_aligned.jsonl`, with the same
prompt and completion contract used during inference.
## Training Pipeline
### SFT
SFT artifact:
```text
77ethers/CarbonAlpha/sft_qwen25_7b_curriculum400_v1
```
Training script:
```text
scripts/hf_sft_qwen25_7b.py
```
Configuration:
- QLoRA over `unsloth/Qwen2.5-7B-Instruct`;
- LoRA rank 16;
- `lora_alpha=16`;
- 220 SFT steps;
- effective batch size 4;
- Hugging Face Jobs L40S.
SFT result:
- generation sanity: 5/5 valid actions;
- holdout: 5/5 valid;
- mean holdout regret: `+0.02796`;
- beats baseline on 3/5 holdout seeds.
### GRPO
Best GRPO artifact:
```text
77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
```
Training script:
```text
scripts/hf_grpo_qwen25_adapter.py
```
GRPO configuration:
- warm-start from `sft_qwen25_7b_curriculum400_v1`;
- `use_vllm=False`;
- 100 GRPO steps;
- 128 generated Phase-1 prompts;
- 2 generations per prompt;
- batch size 2;
- learning rate `2e-6`;
- `loss_type="dapo"`;
- KL beta `0.02`.
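As a rough illustration, the settings above might map onto TRL's `GRPOConfig` as shown below. This is a sketch, not the contents of `scripts/hf_grpo_qwen25_adapter.py`: it assumes a TRL version that accepts `loss_type="dapo"`, and reading "batch size 2" as `per_device_train_batch_size` is a guess. The `build_grpo_config` helper and the `grpo_out` output directory are hypothetical.

```python
# Values taken from the configuration list above; the key names follow TRL's GRPOConfig.
GRPO_KWARGS = {
    "use_vllm": False,                 # plain-Transformers generation
    "max_steps": 100,                  # 100 GRPO steps
    "num_generations": 2,              # 2 generations per prompt
    "per_device_train_batch_size": 2,  # assumed meaning of "batch size 2"
    "learning_rate": 2e-6,
    "loss_type": "dapo",
    "beta": 0.02,                      # KL coefficient
}


def build_grpo_config(output_dir: str = "grpo_out"):
    """Build a trl.GRPOConfig from the values above (requires trl)."""
    from trl import GRPOConfig  # deferred: heavy import

    return GRPOConfig(output_dir=output_dir, **GRPO_KWARGS)
```

The 128 Phase-1 prompts belong to the training dataset rather than the config, so they are not represented here.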
Reward functions:
- format reward;
- action-contract reward;
- reasoning-shape reward;
- Phase-1 simulator regret reward;
- carbon-guard reward.
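This card does not include the reward implementations. For orientation, a format reward in the shape TRL expects for custom reward functions (a callable that takes `completions` plus keyword arguments and returns one float per completion) could look like the sketch below. The 1.0/0.0 scoring and the function name are assumptions, and it presumes plain-string completions rather than chat-message lists.

```python
import json
import re

# A non-empty, closed <think> block followed by one JSON object.
THINK_THEN_JSON = re.compile(r"^\s*<think>.+?</think>\s*(\{.*\})\s*$", re.DOTALL)


def format_reward(completions, **kwargs):
    """Hypothetical format reward: 1.0 if the layout parses, else 0.0."""
    rewards = []
    for text in completions:
        score = 0.0
        match = THINK_THEN_JSON.match(text)
        if match is not None:
            try:
                json.loads(match.group(1))
                score = 1.0
            except json.JSONDecodeError:
                pass  # layout matched, but the JSON payload is malformed
        rewards.append(score)
    return rewards
```

The other four rewards depend on the environment's simulator and carbon budget, so they are not sketched here.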
Important engineering choice: we avoided vLLM for the Qwen2.5 GRPO run because
earlier vLLM-based Qwen3 rollouts collapsed to one-token completions. The
plain-Transformers path was slower but healthier and easier to debug.
## Evidence of Training
The 100-step GRPO run was launched as a Hugging Face Job:
```text
https://huggingface.co/jobs/77ethers/69ed1ce0d70108f37acdeea3
```
Raw evidence committed in this repo:
```text
training_logs/qwen25_grpo_phase1_100_v1.log
training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl
```
The parsed JSONL contains 100 real GRPO metric rows extracted from the job log.
Loss and reward plots generated from those rows:


Additional rollout-health plot:

The completion-length plot is included because one-token rollout collapse was
the main failure mode in earlier GRPO attempts. In this successful run,
completion lengths stayed well above the smoke threshold throughout training.
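The parsed metric rows can be inspected without plotting. The sketch below averages selected metrics over the JSONL rows. The key names `loss` and `reward` are assumptions about the row schema, which this card does not spell out, and `summarize_rows` is a hypothetical helper.

```python
import json
from statistics import mean


def summarize_rows(lines, keys=("loss", "reward")):
    """Average each metric over the JSONL rows that contain it."""
    rows = [json.loads(line) for line in lines if line.strip()]
    summary = {"n_rows": len(rows)}
    for key in keys:
        values = [row[key] for row in rows if key in row]
        if values:
            summary[f"mean_{key}"] = mean(values)
    return summary


# Usage with the committed file (key names may need adjusting):
# with open("training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl") as fh:
#     print(summarize_rows(fh))
```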
## Evaluation
### Holdout
Holdout seeds:
```text
100, 200, 300, 400, 500
```
Best GRPO holdout results:
| Metric | Value |
|---|---:|
| Valid completions | 5/5 |
| Mean holdout regret | `+0.1058` |
| Beats baseline | 5/5 |
| Previous v6 SFT mean regret bar | `+0.034` |
Per-seed holdout:
| Seed | Shock | Regret |
|---:|---|---:|
| 100 | `hard_rare_earth_rotation` | `+0.0755` |
| 200 | `easy_tech_earnings` | `+0.1210` |
| 300 | `easy_tech_earnings` | `+0.1442` |
| 400 | `hard_deflation_pulse` | `+0.1527` |
| 500 | `ambig_ai_efficiency` | `+0.0358` |
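The summary figures can be recomputed from the per-seed table. The snippet below assumes that a positive regret means the model beat the equal-weight baseline on that seed, which is the sign convention the tables suggest.

```python
# Per-seed regrets copied from the table above.
regrets = {100: 0.0755, 200: 0.1210, 300: 0.1442, 400: 0.1527, 500: 0.0358}

mean_regret = sum(regrets.values()) / len(regrets)
beats_baseline = sum(r > 0 for r in regrets.values())  # assumed sign convention

print(f"mean holdout regret: {mean_regret:+.4f}")          # +0.1058
print(f"beats baseline: {beats_baseline}/{len(regrets)}")  # 5/5
```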
### Manual Macro Eval
Eval set:
```text
evals/macro_eval_10.jsonl
```
Report:
```text
evals/macro_eval_10_grpo_report.json
```
Summary:
- GRPO adapter: 10/10 valid JSON actions;
- GRPO adapter: 10/10 closed `<think>`;
- base model: 9/10 valid JSON actions;
- the GRPO adapter was stronger than the base model on rare-earth export
controls, global deflation pulse, and yen carry unwind.
Known weaknesses:
- `q02_oil_chokepoint_inflation`: the model understood the inflation regime
and hedged, but underweighted OIL despite the direct supply shock.
- `q04_ai_efficiency_paradox`: the model correctly liked TECH and cut
REAL_ESTATE, but gave GREEN too much weight despite lower data-center power
demand expectations.
These are targeted follow-up items, not hidden failures.
## Comparison With Qwen3 Base Branch
We also tested an isolated Qwen3-4B-Base branch:
```text
77ethers/CarbonAlpha/grpo_qwen3_4b_base_smoke_v2
```
Result:
- smoke gate passed mechanically;
- no one-token collapse;
- completions were too long, often near the 400-token cap;
- holdout: 4/5 valid;
- mean holdout regret: `-0.0229`;
- did not beat the Qwen2.5 GRPO model.
Conclusion: Qwen3 Base is a viable research branch, but the current production
candidate remains Qwen2.5-7B SFT plus GRPO.
## Limitations
- The GRPO run is Phase 1 only, so it is strongest on easy-shock simulator
reward optimization.
- The model still has known second-order reasoning weaknesses in specific
macro setups.
- The reward environment is synthetic and should be interpreted as a benchmark,
not a market simulator.
- The model is private on Hugging Face and requires `HF_API_TOKEN` for loading.
## Reproducibility
Final notebook:
```text
notebooks/carbonalpha_final_pipeline.ipynb
```
Colab link:
```text
https://colab.research.google.com/github/capabl-machines/gridops/blob/round-2/notebooks/carbonalpha_final_pipeline.ipynb
```
The notebook verifies artifacts, loads metrics from Hugging Face, runs an
environment smoke test, shows the manual eval set, and includes opt-in cells
to relaunch the exact HF Jobs training runs.