--- language: - en library_name: peft base_model: unsloth/Qwen2.5-7B-Instruct tags: - grpo - trl - peft - qwen2.5 - openenv - portfolio-reasoning - climate license: other pipeline_tag: text-generation --- # CarbonAlpha Model Card ## Model Summary CarbonAlpha is a climate-aware portfolio reasoning agent for the `portfolio_env` OpenEnv environment. It reads one macro-news event, reasons through first-order and second-order effects, and emits a constrained `PortfolioAction`: ```json { "weights": [w_tech, w_oil, w_green, w_real_estate, w_bonds], "infra_commit": 0.0, "carbon_offset_buy": 0.0, "put_hedge": 0.0, "tech_bet": "status_quo" } ``` Current best research model: ```text 77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1 ``` Base model: ```text unsloth/Qwen2.5-7B-Instruct ``` Adapter lineage: 1. SFT warm-start on 400 curriculum traces. 2. GRPO Phase 1 for 100 steps. 3. Holdout and manual macro-eval checks before promotion. The live Space can load this adapter through the `MODEL_SUBFOLDER` environment variable: ```text https://77ethers-carbonalpha-demo.hf.space/ ``` ## Intended Use This model is intended for the CarbonAlpha walkthrough demo and OpenEnv evaluation. It is not a financial advisor and should not be used to make real investment decisions. The useful behavior to evaluate is: - strict `...` plus JSON formatting; - valid portfolio weights and bounded interventions; - recognition of macro regime shifts; - carbon-budget awareness; - performance against the environment's equal-weight baseline. ## Training Data The Qwen2.5 SFT warm-start used: ```text sft_traces/curriculum_400_e80_m160_h160.jsonl ``` Trace mix: - 80 easy traces; - 160 medium / ambiguous traces; - 160 hard traces. The trace schema follows `sft_traces/merged_v6_aligned.jsonl`, with the same prompt and completion contract used during inference. ## Training Pipeline ### SFT SFT artifact: ```text 77ethers/CarbonAlpha/sft_qwen25_7b_curriculum400_v1 ``` Training script: ```text scripts/hf_sft_qwen25_7b.py ``` Configuration: - QLoRA over `unsloth/Qwen2.5-7B-Instruct`; - LoRA rank 16; - `lora_alpha=16`; - 220 SFT steps; - effective batch size 4; - Hugging Face Jobs L40S. SFT result: - generation sanity: 5/5 valid actions; - holdout: 5/5 valid; - mean holdout regret: `+0.02796`; - beats baseline on 3/5 holdout seeds. ### GRPO Best GRPO artifact: ```text 77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1 ``` Training script: ```text scripts/hf_grpo_qwen25_adapter.py ``` GRPO configuration: - warm-start from `sft_qwen25_7b_curriculum400_v1`; - `use_vllm=False`; - 100 GRPO steps; - 128 generated Phase-1 prompts; - 2 generations per prompt; - batch size 2; - learning rate `2e-6`; - `loss_type="dapo"`; - KL beta `0.02`. Reward functions: - format reward; - action-contract reward; - reasoning-shape reward; - Phase-1 simulator regret reward; - carbon-guard reward. Important engineering choice: we avoided vLLM for the Qwen2.5 GRPO run because earlier vLLM-based Qwen3 rollouts collapsed to one-token completions. The plain-Transformers path was slower but healthier and easier to debug. ## Evidence of Training The 100-step GRPO run was launched as a Hugging Face Job: ```text https://huggingface.co/jobs/77ethers/69ed1ce0d70108f37acdeea3 ``` Raw evidence committed in this repo: ```text training_logs/qwen25_grpo_phase1_100_v1.log training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl ``` The parsed JSONL contains 100 real GRPO metric rows extracted from the job log. Loss and reward plots generated from those rows: ![Qwen2.5 GRPO loss curve](assets/loss_curve.png) ![Qwen2.5 GRPO reward curve](assets/reward_curve.png) Additional rollout-health plot: ![Qwen2.5 GRPO completion length health](assets/qwen25_grpo_phase1_100_completion_lengths.png) The completion-length plot is included because one-token rollout collapse was the main failure mode in earlier GRPO attempts. In this successful run, completion lengths stayed well above the smoke threshold throughout training. ## Evaluation ### Holdout Holdout seeds: ```text 100, 200, 300, 400, 500 ``` Best GRPO holdout results: | Metric | Value | |---|---:| | Valid completions | 5/5 | | Mean holdout regret | `+0.1058` | | Beats baseline | 5/5 | | Previous v6 SFT mean regret bar | `+0.034` | Per-seed holdout: | Seed | Shock | Regret | |---:|---|---:| | 100 | `hard_rare_earth_rotation` | `+0.0755` | | 200 | `easy_tech_earnings` | `+0.1210` | | 300 | `easy_tech_earnings` | `+0.1442` | | 400 | `hard_deflation_pulse` | `+0.1527` | | 500 | `ambig_ai_efficiency` | `+0.0358` | ### Manual Macro Eval Eval set: ```text evals/macro_eval_10.jsonl ``` Report: ```text evals/macro_eval_10_grpo_report.json ``` Summary: - GRPO adapter: 10/10 valid JSON actions; - GRPO adapter: 10/10 closed ``; - base model: 9/10 valid JSON actions; - GRPO was stronger on rare-earth export controls, global deflation pulse, and yen carry unwind. Known weaknesses: - `q02_oil_chokepoint_inflation`: the model understood the inflation regime and hedged, but underweighted OIL despite the direct supply shock. - `q04_ai_efficiency_paradox`: the model correctly liked TECH and cut REAL_ESTATE, but gave GREEN too much weight despite lower data-center power demand expectations. These are targeted follow-up items, not hidden failures. ## Comparison With Qwen3 Base Branch We also tested an isolated Qwen3-4B-Base branch: ```text 77ethers/CarbonAlpha/grpo_qwen3_4b_base_smoke_v2 ``` Result: - smoke gate passed mechanically; - no one-token collapse; - completions were too long, often near the 400-token cap; - holdout: 4/5 valid; - mean holdout regret: `-0.0229`; - did not beat the Qwen2.5 GRPO model. Conclusion: Qwen3 Base is a viable research branch, but the current production candidate remains Qwen2.5-7B SFT plus GRPO. ## Limitations - The GRPO run is Phase 1 only, so it is strongest on easy-shock simulator reward optimization. - The model still has known second-order reasoning weaknesses in specific macro setups. - The reward environment is synthetic and should be interpreted as a benchmark, not a market simulator. - The model is private on Hugging Face and requires `HF_API_TOKEN` for loading. ## Reproducibility Final notebook: ```text notebooks/carbonalpha_final_pipeline.ipynb ``` Colab link: ```text https://colab.research.google.com/github/capabl-machines/gridops/blob/round-2/notebooks/carbonalpha_final_pipeline.ipynb ``` The notebook verifies artifacts, loads metrics from Hugging Face, runs an environment smoke test, shows the manual eval set, and includes opt-in cells to relaunch the exact HF Jobs training runs.