+# CarbonAlpha Model Card
+## Model Summary
+CarbonAlpha is a climate-aware portfolio reasoning agent for the
+`portfolio_env` OpenEnv environment. It reads one macro-news event, reasons
+through first-order and second-order effects, and emits a constrained
+`PortfolioAction`:
+```json
+{
+  "weights": [w_tech, w_oil, w_green, w_real_estate, w_bonds],
+  "infra_commit": 0.0,
+  "carbon_offset_buy": 0.0,
+  "put_hedge": 0.0,
+  "tech_bet": "status_quo"
+}
+```
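A minimal sketch of how such an action could be checked before it reaches the environment. The field names and asset order come from the JSON above. The constraints themselves (five non-negative weights summing to 1, interventions in `[0, 1]`) and the function name `validate_action` are assumptions for illustration, not the confirmed `portfolio_env` contract. The allowed `tech_bet` values are not listed in this card, so the sketch only checks that the field is a string.

```python
import math

ASSETS = ["tech", "oil", "green", "real_estate", "bonds"]  # order taken from the JSON above
INTERVENTIONS = ["infra_commit", "carbon_offset_buy", "put_hedge"]


def validate_action(action: dict, tol: float = 1e-6) -> list[str]:
    """Return a list of problems; an empty list means the action passed.

    Assumed constraints (not confirmed by portfolio_env): weights are
    non-negative and sum to 1, interventions lie in [0, 1].
    """
    problems = []
    weights = action.get("weights")
    if not isinstance(weights, list) or len(weights) != len(ASSETS):
        problems.append(f"weights must be a list of {len(ASSETS)} numbers")
    elif any(not isinstance(w, (int, float)) or w < 0 for w in weights):
        problems.append("weights must be non-negative numbers")
    elif not math.isclose(sum(weights), 1.0, abs_tol=tol):
        problems.append(f"weights sum to {sum(weights):.6f}, expected 1.0")
    for name in INTERVENTIONS:
        value = action.get(name)
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            problems.append(f"{name} must be a number in [0, 1]")
    if not isinstance(action.get("tech_bet"), str):
        problems.append("tech_bet must be a string")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log every contract violation in a single evaluation pass.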
+Current best research model:
+```text
+77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
+```
+Base model:
+```text
+unsloth/Qwen2.5-7B-Instruct
+```
+Adapter lineage:
+1. SFT warm-start on 400 curriculum traces.
+2. GRPO Phase 1 for 100 steps.
+3. Holdout and manual macro-eval checks before promotion.
+The live Space, linked below, can load this adapter through the
+`MODEL_SUBFOLDER` environment variable:
+```text
+https://77ethers-carbonalpha-demo.hf.space/
+```
+## Intended Use
+This model is intended for the CarbonAlpha walkthrough demo and OpenEnv
+evaluation. It is not a financial advisor and should not be used to make real
+investment decisions.
+The behaviors worth evaluating are:
+- strict `<think>...</think>` plus JSON formatting;
+- valid portfolio weights and bounded interventions;
+- recognition of macro regime shifts;
+- carbon-budget awareness;
+- performance against the environment's equal-weight baseline.
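The first item in the list above, the `<think>...</think>` plus JSON format, can be checked mechanically. The sketch below assumes a completion contains exactly one closed `<think>` block followed by a single JSON object and nothing else. That layout and the name `parse_completion` are illustrative assumptions, not the project's actual parser.

```python
import json
import re

# Assumed layout: one closed <think> block, then a single JSON object.
_COMPLETION_RE = re.compile(r"\s*<think>(.*?)</think>\s*(\{.*\})\s*", re.DOTALL)


def parse_completion(text: str):
    """Return (reasoning, action_dict), or None if the format is violated."""
    match = _COMPLETION_RE.fullmatch(text)
    if match is None:
        return None  # missing or unclosed <think>, or no JSON object after it
    try:
        action = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None  # text after </think> is not valid JSON
    if not isinstance(action, dict):
        return None
    return match.group(1).strip(), action
```

An unclosed `<think>` tag and malformed JSON both return `None`, so a single check covers both the "closed `<think>`" and "valid JSON action" counts.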
+## Training Data
+The Qwen2.5 SFT warm-start used:
+```text
+sft_traces/curriculum_400_e80_m160_h160.jsonl
+```
+Trace mix:
+- 80 easy traces;
+- 160 medium / ambiguous traces;
+- 160 hard traces.
+The trace schema follows `sft_traces/merged_v6_aligned.jsonl`, with the same
+prompt and completion contract used during inference.
+## Training Pipeline
+### SFT
+SFT artifact:
+```text
+77ethers/CarbonAlpha/sft_qwen25_7b_curriculum400_v1
+```
+Training script:
+```text
+scripts/hf_sft_qwen25_7b.py
+```
+Configuration:
+- QLoRA over `unsloth/Qwen2.5-7B-Instruct`;
+- LoRA rank 16;
+- `lora_alpha=16`;
+- 220 SFT steps;
+- effective batch size 4;
+- Hugging Face Jobs L40S.
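Two derived quantities follow from the configuration above and are worth a quick sanity check: the LoRA scaling factor and the approximate number of passes over the 400 traces. The sketch assumes standard LoRA scaling (`lora_alpha / r`, with rank-stabilized scaling disabled) and that each optimizer step consumes exactly one effective batch. Neither assumption is confirmed by the training script.

```python
# Values taken from the configuration list above.
LORA_RANK = 16
LORA_ALPHA = 16
SFT_STEPS = 220
EFFECTIVE_BATCH_SIZE = 4
NUM_TRACES = 400  # size of the curriculum file

# Standard LoRA scales the low-rank update by alpha / r
# (assumes rank-stabilized scaling is not enabled).
lora_scale = LORA_ALPHA / LORA_RANK

# Assumes every optimizer step consumes exactly one effective batch.
examples_seen = SFT_STEPS * EFFECTIVE_BATCH_SIZE
approx_epochs = examples_seen / NUM_TRACES

print(f"LoRA scale: {lora_scale}")
print(f"examples seen: {examples_seen}, about {approx_epochs:.1f} epochs")
```

Under these assumptions the adapter update is applied at scale 1.0 and the warm-start sees each trace a little more than twice.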
+SFT result:
+- generation sanity: 5/5 valid actions;
+- holdout: 5/5 valid;
+- mean holdout regret: `+0.02796`;
+- beats baseline on 3/5 holdout seeds.
+### GRPO
+Best GRPO artifact:
+```text
+77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
+```
+Training script:
+```text
+scripts/hf_grpo_qwen25_adapter.py
+```
+GRPO configuration:
+- warm-start from `sft_qwen25_7b_curriculum400_v1`;
+- `use_vllm=False`;
+- 100 GRPO steps;
+- 128 generated Phase-1 prompts;
+- 2 generations per prompt;
+- batch size 2;
+- learning rate `2e-6`;
+- `loss_type="dapo"`;
+- KL beta `0.02`.
+Reward functions:
+- format reward;
+- action-contract reward;
+- reasoning-shape reward;
+- Phase-1 simulator regret reward;
+- carbon-guard reward.
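With only 2 generations per prompt, the group-relative advantage used by GRPO takes a particularly simple form. The sketch below assumes the five reward components are summed with equal weight (this card does not state the weights) and uses the common `(r - mean) / (std + eps)` normalization with a population standard deviation. Implementations differ on sample versus population std. The function names and reward numbers are hypothetical.

```python
import statistics


def total_reward(components: dict[str, float]) -> float:
    """Sum reward components; equal weighting is an assumption."""
    return sum(components.values())


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Two hypothetical completions for the same prompt (illustrative numbers only).
completion_a = {"format": 1.0, "action_contract": 1.0, "reasoning_shape": 0.5,
                "regret": 0.10, "carbon_guard": 0.0}
completion_b = {"format": 1.0, "action_contract": 0.0, "reasoning_shape": 0.5,
                "regret": -0.05, "carbon_guard": 0.0}

rewards = [total_reward(completion_a), total_reward(completion_b)]
advantages = group_advantages(rewards)
```

With a group of two, the advantages are always equal and opposite. When both completions earn the same reward, both advantages are zero and that prompt contributes no learning signal.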
+Important engineering choice: we avoided vLLM for the Qwen2.5 GRPO run because
+earlier vLLM-based Qwen3 rollouts collapsed to one-token completions. The
+plain-Transformers path was slower but healthier and easier to debug.
+## Evidence of Training
+The 100-step GRPO run was launched as a Hugging Face Job:
+```text
+https://huggingface.co/jobs/77ethers/69ed1ce0d70108f37acdeea3
+```
+Raw evidence committed in this repo:
+```text
+training_logs/qwen25_grpo_phase1_100_v1.log
+training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl
+```
+The parsed JSONL contains 100 real GRPO metric rows extracted from the job log.
+Loss and reward plots generated from those rows:
+![Qwen2.5 GRPO loss curve](assets/loss_curve.png)
+![Qwen2.5 GRPO reward curve](assets/reward_curve.png)
+Additional rollout-health plot:
+![Qwen2.5 GRPO completion length health](assets/qwen25_grpo_phase1_100_completion_lengths.png)
+The completion-length plot is included because one-token rollout collapse was
+the main failure mode in earlier GRPO attempts. In this successful run,
+completion lengths stayed well above the smoke threshold throughout training.
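A check of this kind can be sketched as a scan over the parsed metric rows. The key name `completion_length` and the 8-token threshold below are placeholders. This card gives neither the real field names in `qwen25_grpo_phase1_100_v1_rows.jsonl` nor the actual smoke threshold.

```python
import json


def find_collapsed_steps(lines, key="completion_length", min_tokens=8.0):
    """Return indices of rows whose mean completion length is below min_tokens.

    `key` and `min_tokens` are placeholders, not the project's real values.
    Rows that lack the key are skipped rather than treated as failures.
    """
    collapsed = []
    for index, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        row = json.loads(line)
        length = row.get(key)
        if length is not None and length < min_tokens:
            collapsed.append(index)
    return collapsed
```

The function accepts any iterable of JSON lines, so an open file handle on the rows file can be passed directly once `key` is set to the real column name.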
+## Evaluation
+### Holdout
+Holdout seeds:
+```text
+100, 200, 300, 400, 500
+```
+Best GRPO holdout results:
+| Metric | Value |
+|---|---:|
+| Valid completions | 5/5 |
+| Mean holdout regret | `+0.1058` |
+| Beats baseline | 5/5 |
+| Previous v6 SFT mean regret bar | `+0.034` |
+Per-seed holdout:
+| Seed | Shock | Regret |
+|---:|---|---:|
+| 100 | `hard_rare_earth_rotation` | `+0.0755` |
+| 200 | `easy_tech_earnings` | `+0.1210` |
+| 300 | `easy_tech_earnings` | `+0.1442` |
+| 400 | `hard_deflation_pulse` | `+0.1527` |
+| 500 | `ambig_ai_efficiency` | `+0.0358` |
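The summary metrics can be recomputed from the per-seed table. The snippet assumes that a positive regret means the agent outperformed the equal-weight baseline, which matches how the summary and per-seed tables line up but is not spelled out in this card.

```python
# Per-seed regrets copied from the table above.
holdout_regret = {100: 0.0755, 200: 0.1210, 300: 0.1442, 400: 0.1527, 500: 0.0358}

mean_regret = sum(holdout_regret.values()) / len(holdout_regret)
# Assumption: a positive regret means the agent outperformed the baseline.
beats_baseline = sum(1 for r in holdout_regret.values() if r > 0)

print(f"mean holdout regret: {mean_regret:+.4f}")  # +0.1058
print(f"beats baseline: {beats_baseline}/{len(holdout_regret)}")  # 5/5
```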
+### Manual Macro Eval
+Eval set:
+```text
+evals/macro_eval_10.jsonl
+```
+Report:
+```text
+evals/macro_eval_10_grpo_report.json
+```
+Summary:
+- GRPO adapter: 10/10 valid JSON actions;
+- GRPO adapter: 10/10 closed `<think>`;
+- base model: 9/10 valid JSON actions;
+- GRPO was stronger on rare-earth export controls, global deflation pulse, and
+  yen carry unwind.
+Known weaknesses:
+- `q02_oil_chokepoint_inflation`: the model understood the inflation regime
+  and hedged, but underweighted OIL despite the direct supply shock.
+- `q04_ai_efficiency_paradox`: the model correctly liked TECH and cut
+  REAL_ESTATE, but gave GREEN too much weight despite lower data-center power
+  demand expectations.
+These are targeted follow-up items, not hidden failures.
+## Comparison With Qwen3 Base Branch
+We also tested an isolated Qwen3-4B-Base branch:
+```text
+77ethers/CarbonAlpha/grpo_qwen3_4b_base_smoke_v2
+```
+Result:
+- smoke gate passed mechanically;
+- no one-token collapse;
+- completions were too long, often near the 400-token cap;
+- holdout: 4/5 valid;
+- mean holdout regret: `-0.0229`;
+- did not beat the Qwen2.5 GRPO model.
+Conclusion: Qwen3 Base is a viable research branch, but the current production
+candidate remains Qwen2.5-7B SFT plus GRPO.
+## Limitations
+- The GRPO run is Phase 1 only, so it is strongest on easy-shock simulator
+  reward optimization.
+- The model still has known second-order reasoning weaknesses in specific
+  macro setups.
+- The reward environment is synthetic and should be interpreted as a benchmark,
+  not a market simulator.
+- The model is private on Hugging Face and requires `HF_API_TOKEN` for loading.
+## Reproducibility
+Final notebook:
+```text
+notebooks/carbonalpha_final_pipeline.ipynb
+```
+Colab link:
+```text
+https://colab.research.google.com/github/capabl-machines/gridops/blob/round-2/notebooks/carbonalpha_final_pipeline.ipynb
+```
+The notebook verifies artifacts, loads metrics from Hugging Face, runs an
+environment smoke test, shows the manual eval set, and includes opt-in cells
+to relaunch the exact HF Jobs training runs.