---
language:
- en
library_name: peft
base_model: unsloth/Qwen2.5-7B-Instruct
tags:
- grpo
- trl
- peft
- qwen2.5
- openenv
- portfolio-reasoning
- climate
license: other
pipeline_tag: text-generation
---
# CarbonAlpha Model Card
## Model Summary
CarbonAlpha is a climate-aware portfolio reasoning agent for the
`portfolio_env` OpenEnv environment. It reads one macro-news event, reasons
through first-order and second-order effects, and emits a constrained
`PortfolioAction`:
```json
{
  "weights": [w_tech, w_oil, w_green, w_real_estate, w_bonds],
  "infra_commit": 0.0,
  "carbon_offset_buy": 0.0,
  "put_hedge": 0.0,
  "tech_bet": "status_quo"
}
```
Current best research model:
```text
77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
```
Base model:
```text
unsloth/Qwen2.5-7B-Instruct
```
Adapter lineage:
1. SFT warm-start on 400 curriculum traces.
2. GRPO Phase 1 for 100 steps.
3. Holdout and manual macro-eval checks before promotion.
The live Space, which can load this adapter through the `MODEL_SUBFOLDER`
environment variable, is hosted at:
```text
https://77ethers-carbonalpha-demo.hf.space/
```
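For local experiments, loading the adapter might look like the sketch below. It assumes that the artifact path `77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1` splits into the repo `77ethers/CarbonAlpha` plus a subfolder, that `MODEL_SUBFOLDER` carries only the subfolder name, and that `transformers`, `peft`, and `accelerate` are installed. `MODEL_REPO` and both helper functions are hypothetical names used for illustration; the Space's actual loading code may differ.

```python
import os

BASE_MODEL = "unsloth/Qwen2.5-7B-Instruct"


def adapter_location(env=None):
    """Resolve (repo_id, subfolder) for the adapter from environment variables."""
    env = os.environ if env is None else env
    # MODEL_REPO is a hypothetical variable; only MODEL_SUBFOLDER is named in this card.
    repo_id = env.get("MODEL_REPO", "77ethers/CarbonAlpha")
    subfolder = env.get("MODEL_SUBFOLDER", "grpo_qwen25_7b_adapter_phase1_100_v1")
    return repo_id, subfolder


def load_carbonalpha(env=None):
    """Load the base model and attach the LoRA adapter (in practice this needs a GPU)."""
    # Imported lazily so the path helper above works without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    env = os.environ if env is None else env
    repo_id, subfolder = adapter_location(env)
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, torch_dtype="auto", device_map="auto"
    )
    # The repo is private, so a token is passed through to the Hub download.
    model = PeftModel.from_pretrained(
        base, repo_id, subfolder=subfolder, token=env.get("HF_API_TOKEN")
    )
    return tokenizer, model
```

Setting `MODEL_SUBFOLDER=sft_qwen25_7b_curriculum400_v1` before calling `load_carbonalpha()` would, under the same assumptions, load the SFT checkpoint instead.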
## Intended Use
This model is intended for the CarbonAlpha walkthrough demo and OpenEnv
evaluation. It is not a financial advisor and should not be used to make real
investment decisions.
The useful behavior to evaluate is:
- strict `<think>...</think>` plus JSON formatting;
- valid portfolio weights and bounded interventions;
- recognition of macro regime shifts;
- carbon-budget awareness;
- performance against the environment's equal-weight baseline.
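The first two checks can be sketched as a small validator. This is an illustrative approximation, not the environment's own contract check: it assumes long-only weights that sum to 1 and interventions bounded in `[0, 1]`, neither of which is stated in this card, and the function names are hypothetical.

```python
import json
import re

N_ASSETS = 5  # tech, oil, green, real_estate, bonds
SCALAR_FIELDS = ("infra_commit", "carbon_offset_buy", "put_hedge")
COMPLETION_RE = re.compile(r"^\s*<think>(.*?)</think>\s*(\{.*\})\s*$", re.DOTALL)


def _is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_completion(text):
    """Return (reasoning, action_dict), or None if the output format is violated."""
    match = COMPLETION_RE.match(text)
    if match is None:
        return None
    try:
        action = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    return match.group(1).strip(), action


def action_problems(action, tol=1e-6):
    """List contract violations; an empty list means every check passed."""
    problems = []
    weights = action.get("weights")
    if not isinstance(weights, list) or len(weights) != N_ASSETS:
        problems.append("weights must be a list of 5 values")
    elif not all(_is_number(w) for w in weights):
        problems.append("weights must be numeric")
    else:
        # ASSUMPTION: long-only portfolio whose weights sum to 1.
        if any(w < 0 for w in weights):
            problems.append("weights must be non-negative")
        if abs(sum(weights) - 1.0) > tol:
            problems.append("weights must sum to 1")
    for field in SCALAR_FIELDS:
        value = action.get(field)
        # ASSUMPTION: each intervention is a fraction in [0, 1].
        if not _is_number(value) or not 0.0 <= value <= 1.0:
            problems.append(f"{field} must be a number in [0, 1]")
    if not isinstance(action.get("tech_bet"), str):
        problems.append("tech_bet must be a string")
    return problems
```

A completion counts as valid in this sketch when `parse_completion` returns a result and `action_problems` returns an empty list.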
## Training Data
The Qwen2.5 SFT warm-start used:
```text
sft_traces/curriculum_400_e80_m160_h160.jsonl
```
Trace mix:
- 80 easy traces;
- 160 medium / ambiguous traces;
- 160 hard traces.
The trace schema follows `sft_traces/merged_v6_aligned.jsonl`, with the same
prompt and completion contract used during inference.
## Training Pipeline
### SFT
SFT artifact:
```text
77ethers/CarbonAlpha/sft_qwen25_7b_curriculum400_v1
```
Training script:
```text
scripts/hf_sft_qwen25_7b.py
```
Configuration:
- QLoRA over `unsloth/Qwen2.5-7B-Instruct`;
- LoRA rank 16;
- `lora_alpha=16`;
- 220 SFT steps;
- effective batch size 4;
- Hugging Face Jobs L40S.
SFT result:
- generation sanity: 5/5 valid actions;
- holdout: 5/5 valid;
- mean holdout regret: `+0.02796`;
- beats baseline on 3/5 holdout seeds.
### GRPO
Best GRPO artifact:
```text
77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
```
Training script:
```text
scripts/hf_grpo_qwen25_adapter.py
```
GRPO configuration:
- warm-start from `sft_qwen25_7b_curriculum400_v1`;
- `use_vllm=False`;
- 100 GRPO steps;
- 128 generated Phase-1 prompts;
- 2 generations per prompt;
- batch size 2;
- learning rate `2e-6`;
- `loss_type="dapo"`;
- KL beta `0.02`.
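As a rough guide, the listed values could map onto TRL's `GRPOConfig` as sketched below. The mapping is an assumption: the card does not say whether "batch size 2" means the per-device batch size or whether the 100 steps were set through `max_steps`, and `scripts/hf_grpo_qwen25_adapter.py` may configure the run differently. `build_grpo_config` and the default output directory are hypothetical.

```python
# Values copied from the configuration list; the key names are assumed mappings.
GRPO_KWARGS = {
    "use_vllm": False,                 # plain-Transformers generation path
    "max_steps": 100,                  # ASSUMPTION: "100 GRPO steps" set via max_steps
    "num_generations": 2,              # generations per prompt
    "per_device_train_batch_size": 2,  # ASSUMPTION: "batch size 2" is per device
    "learning_rate": 2e-6,
    "loss_type": "dapo",
    "beta": 0.02,                      # KL coefficient
}


def build_grpo_config(output_dir="outputs/grpo_phase1"):
    """Build a GRPOConfig from the values above (requires a TRL version with DAPO loss)."""
    from trl import GRPOConfig  # imported lazily so the dict is usable without TRL

    return GRPOConfig(output_dir=output_dir, **GRPO_KWARGS)
```

TRL expects the effective generation batch to be divisible by `num_generations`, which holds for these values on a single device.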
Reward functions:
- format reward;
- action-contract reward;
- reasoning-shape reward;
- Phase-1 simulator regret reward;
- carbon-guard reward.
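To illustrate the shape of such a function, here is a minimal format reward written against TRL's custom reward function convention (a callable that receives `completions` plus keyword arguments and returns one float per completion). The binary 0/1 scoring and the regex are assumptions for illustration; the rewards used in the actual run are not reproduced here.

```python
import re

# One <think>...</think> block followed by a single JSON-looking object.
FORMAT_RE = re.compile(r"^\s*<think>.+?</think>\s*\{.*\}\s*$", re.DOTALL)


def _completion_text(completion):
    """Accept plain-string completions or conversational ones (list of message dicts)."""
    if isinstance(completion, str):
        return completion
    return completion[0]["content"]


def format_reward(completions, **kwargs):
    """Return 1.0 for each completion that matches the expected layout, else 0.0."""
    return [
        1.0 if FORMAT_RE.match(_completion_text(c)) else 0.0
        for c in completions
    ]
```

Functions of this shape can be passed together as a list through the `reward_funcs` argument of `GRPOTrainer`.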
Important engineering choice: we avoided vLLM for the Qwen2.5 GRPO run because
earlier vLLM-based Qwen3 rollouts collapsed to one-token completions. The
plain-Transformers path was slower but healthier and easier to debug.
## Evidence of Training
The 100-step GRPO run was launched as a Hugging Face Job:
```text
https://huggingface.co/jobs/77ethers/69ed1ce0d70108f37acdeea3
```
Raw evidence committed in this repo:
```text
training_logs/qwen25_grpo_phase1_100_v1.log
training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl
```
The parsed JSONL contains 100 real GRPO metric rows extracted from the job log.
Loss and reward plots generated from those rows:
![Qwen2.5 GRPO loss curve](assets/loss_curve.png)
![Qwen2.5 GRPO reward curve](assets/reward_curve.png)
Additional rollout-health plot:
![Qwen2.5 GRPO completion length health](assets/qwen25_grpo_phase1_100_completion_lengths.png)
The completion-length plot is included because one-token rollout collapse was
the main failure mode in earlier GRPO attempts. In this successful run,
completion lengths stayed well above the smoke threshold throughout training.
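A check of this kind can be sketched over the parsed rows file. The metric key `completions/mean_length` and the threshold of 8 tokens are hypothetical placeholders; the actual column names in `qwen25_grpo_phase1_100_v1_rows.jsonl` and the smoke threshold used for the run are not stated in this card.

```python
import json


def load_rows(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def collapsed_steps(rows, key="completions/mean_length", threshold=8.0):
    """Return indices of rows whose mean completion length is missing or at/below threshold."""
    flagged = []
    for index, row in enumerate(rows):
        value = row.get(key)
        if value is None or value <= threshold:
            flagged.append(index)
    return flagged
```

An empty result from `collapsed_steps(load_rows(path))` would indicate that no logged step fell to the collapse threshold under these assumptions.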
## Evaluation
### Holdout
Holdout seeds:
```text
100, 200, 300, 400, 500
```
Best GRPO holdout results:
| Metric | Value |
|---|---:|
| Valid completions | 5/5 |
| Mean holdout regret | `+0.1058` |
| Beats baseline | 5/5 |
| Previous v6 SFT mean regret bar | `+0.034` |
Per-seed holdout:
| Seed | Shock | Regret |
|---:|---|---:|
| 100 | `hard_rare_earth_rotation` | `+0.0755` |
| 200 | `easy_tech_earnings` | `+0.1210` |
| 300 | `easy_tech_earnings` | `+0.1442` |
| 400 | `hard_deflation_pulse` | `+0.1527` |
| 500 | `ambig_ai_efficiency` | `+0.0358` |
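The summary figures can be re-derived from the per-seed table with a few lines of Python. The sketch assumes that a positive regret means the model beat the equal-weight baseline on that seed, which is consistent with the table (five positive values, 5/5 beats) but is not spelled out in this card.

```python
from statistics import mean

# Per-seed regrets copied from the holdout table.
PER_SEED_REGRET = {100: 0.0755, 200: 0.1210, 300: 0.1442, 400: 0.1527, 500: 0.0358}


def summarize(regrets):
    """Compute the mean regret and the number of seeds with positive regret."""
    values = list(regrets.values())
    return {
        "mean_regret": round(mean(values), 4),
        # ASSUMPTION: regret > 0 means the model beat the baseline on that seed.
        "beats_baseline": sum(v > 0 for v in values),
        "n": len(values),
    }


print(summarize(PER_SEED_REGRET))
# {'mean_regret': 0.1058, 'beats_baseline': 5, 'n': 5}
```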
### Manual Macro Eval
Eval set:
```text
evals/macro_eval_10.jsonl
```
Report:
```text
evals/macro_eval_10_grpo_report.json
```
Summary:
- GRPO adapter: 10/10 valid JSON actions;
- GRPO adapter: 10/10 closed `<think>`;
- base model: 9/10 valid JSON actions;
- GRPO was stronger on rare-earth export controls, global deflation pulse, and
yen carry unwind.
Known weaknesses:
- `q02_oil_chokepoint_inflation`: the model understood the inflation regime
and hedged, but underweighted OIL despite the direct supply shock.
- `q04_ai_efficiency_paradox`: the model correctly liked TECH and cut
REAL_ESTATE, but gave GREEN too much weight despite lower data-center power
demand expectations.
These are targeted follow-up items, not hidden failures.
## Comparison With Qwen3 Base Branch
We also tested an isolated Qwen3-4B-Base branch:
```text
77ethers/CarbonAlpha/grpo_qwen3_4b_base_smoke_v2
```
Result:
- smoke gate passed mechanically;
- no one-token collapse;
- completions were too long, often near the 400-token cap;
- holdout: 4/5 valid;
- mean holdout regret: `-0.0229`;
- did not beat the Qwen2.5 GRPO model.
Conclusion: Qwen3 Base is a viable research branch, but the current production
candidate remains Qwen2.5-7B SFT plus GRPO.
## Limitations
- The GRPO run covers Phase 1 only, so the model is strongest at optimizing
simulator reward on easy shocks.
- The model still has known second-order reasoning weaknesses in specific
macro setups.
- The reward environment is synthetic and should be interpreted as a benchmark,
not a market simulator.
- The model is private on Hugging Face and requires `HF_API_TOKEN` for loading.
## Reproducibility
Final notebook:
```text
notebooks/carbonalpha_final_pipeline.ipynb
```
Colab link:
```text
https://colab.research.google.com/github/capabl-machines/gridops/blob/round-2/notebooks/carbonalpha_final_pipeline.ipynb
```
The notebook verifies artifacts, loads metrics from Hugging Face, runs an
environment smoke test, shows the manual eval set, and includes opt-in cells
to relaunch the exact HF Jobs training runs.