Add model card metadata header

43f214b verified about 1 month ago

6.72 kB

	---
	language:
	- en
	library_name: peft
	base_model: unsloth/Qwen2.5-7B-Instruct
	tags:
	- grpo
	- trl
	- peft
	- qwen2.5
	- openenv
	- portfolio-reasoning
	- climate
	license: other
	pipeline_tag: text-generation
	---

	# CarbonAlpha Model Card

	## Model Summary

	CarbonAlpha is a climate-aware portfolio reasoning agent for the
	`portfolio_env` OpenEnv environment. It reads one macro-news event, reasons
	through first-order and second-order effects, and emits a constrained
	`PortfolioAction`:

	```json
	{
	"weights": [w_tech, w_oil, w_green, w_real_estate, w_bonds],
	"infra_commit": 0.0,
	"carbon_offset_buy": 0.0,
	"put_hedge": 0.0,
	"tech_bet": "status_quo"
	}
	```

	Current best research model:

	```text
	77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
	```

	Base model:

	```text
	unsloth/Qwen2.5-7B-Instruct
	```

	Adapter lineage:

	1. SFT warm-start on 400 curriculum traces.
	2. GRPO Phase 1 for 100 steps.
	3. Holdout and manual macro-eval checks before promotion.

	The live Space can load this adapter through the `MODEL_SUBFOLDER`
	environment variable:

	```text
	https://77ethers-carbonalpha-demo.hf.space/
	```

	## Intended Use

	This model is intended for the CarbonAlpha walkthrough demo and OpenEnv
	evaluation. It is not a financial advisor and should not be used to make real
	investment decisions.

	The useful behavior to evaluate is:

	- strict `<think>...</think>` plus JSON formatting;
	- valid portfolio weights and bounded interventions;
	- recognition of macro regime shifts;
	- carbon-budget awareness;
	- performance against the environment's equal-weight baseline.

	## Training Data

	The Qwen2.5 SFT warm-start used:

	```text
	sft_traces/curriculum_400_e80_m160_h160.jsonl
	```

	Trace mix:

	- 80 easy traces;
	- 160 medium / ambiguous traces;
	- 160 hard traces.

	The trace schema follows `sft_traces/merged_v6_aligned.jsonl`, with the same
	prompt and completion contract used during inference.

	## Training Pipeline

	### SFT

	SFT artifact:

	```text
	77ethers/CarbonAlpha/sft_qwen25_7b_curriculum400_v1
	```

	Training script:

	```text
	scripts/hf_sft_qwen25_7b.py
	```

	Configuration:

	- QLoRA over `unsloth/Qwen2.5-7B-Instruct`;
	- LoRA rank 16;
	- `lora_alpha=16`;
	- 220 SFT steps;
	- effective batch size 4;
	- Hugging Face Jobs L40S.

	SFT result:

	- generation sanity: 5/5 valid actions;
	- holdout: 5/5 valid;
	- mean holdout regret: `+0.02796`;
	- beats baseline on 3/5 holdout seeds.

	### GRPO

	Best GRPO artifact:

	```text
	77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1
	```

	Training script:

	```text
	scripts/hf_grpo_qwen25_adapter.py
	```

	GRPO configuration:

	- warm-start from `sft_qwen25_7b_curriculum400_v1`;
	- `use_vllm=False`;
	- 100 GRPO steps;
	- 128 generated Phase-1 prompts;
	- 2 generations per prompt;
	- batch size 2;
	- learning rate `2e-6`;
	- `loss_type="dapo"`;
	- KL beta `0.02`.

	Reward functions:

	- format reward;
	- action-contract reward;
	- reasoning-shape reward;
	- Phase-1 simulator regret reward;
	- carbon-guard reward.

	Important engineering choice: we avoided vLLM for the Qwen2.5 GRPO run because
	earlier vLLM-based Qwen3 rollouts collapsed to one-token completions. The
	plain-Transformers path was slower but healthier and easier to debug.

	## Evidence of Training

	The 100-step GRPO run was launched as a Hugging Face Job:

	```text
	https://huggingface.co/jobs/77ethers/69ed1ce0d70108f37acdeea3
	```

	Raw evidence committed in this repo:

	```text
	training_logs/qwen25_grpo_phase1_100_v1.log
	training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl
	```

	The parsed JSONL contains 100 real GRPO metric rows extracted from the job log.

	Loss and reward plots generated from those rows:

	![Qwen2.5 GRPO loss curve](assets/loss_curve.png)

	![Qwen2.5 GRPO reward curve](assets/reward_curve.png)

	Additional rollout-health plot:

	![Qwen2.5 GRPO completion length health](assets/qwen25_grpo_phase1_100_completion_lengths.png)

	The completion-length plot is included because one-token rollout collapse was
	the main failure mode in earlier GRPO attempts. In this successful run,
	completion lengths stayed well above the smoke threshold throughout training.

	## Evaluation

	### Holdout

	Holdout seeds:

	```text
	100, 200, 300, 400, 500
	```

	Best GRPO holdout results:

	\| Metric \| Value \|
	\|---\|---:\|
	\| Valid completions \| 5/5 \|
	\| Mean holdout regret \| `+0.1058` \|
	\| Beats baseline \| 5/5 \|
	\| Previous v6 SFT mean regret bar \| `+0.034` \|

	Per-seed holdout:

	\| Seed \| Shock \| Regret \|
	\|---:\|---\|---:\|
	\| 100 \| `hard_rare_earth_rotation` \| `+0.0755` \|
	\| 200 \| `easy_tech_earnings` \| `+0.1210` \|
	\| 300 \| `easy_tech_earnings` \| `+0.1442` \|
	\| 400 \| `hard_deflation_pulse` \| `+0.1527` \|
	\| 500 \| `ambig_ai_efficiency` \| `+0.0358` \|

	### Manual Macro Eval

	Eval set:

	```text
	evals/macro_eval_10.jsonl
	```

	Report:

	```text
	evals/macro_eval_10_grpo_report.json
	```

	Summary:

	- GRPO adapter: 10/10 valid JSON actions;
	- GRPO adapter: 10/10 closed `<think>`;
	- base model: 9/10 valid JSON actions;
	- GRPO was stronger on rare-earth export controls, global deflation pulse, and
	yen carry unwind.

	Known weaknesses:

	- `q02_oil_chokepoint_inflation`: the model understood the inflation regime
	and hedged, but underweighted OIL despite the direct supply shock.
	- `q04_ai_efficiency_paradox`: the model correctly liked TECH and cut
	REAL_ESTATE, but gave GREEN too much weight despite lower data-center power
	demand expectations.

	These are targeted follow-up items, not hidden failures.

	## Comparison With Qwen3 Base Branch

	We also tested an isolated Qwen3-4B-Base branch:

	```text
	77ethers/CarbonAlpha/grpo_qwen3_4b_base_smoke_v2
	```

	Result:

	- smoke gate passed mechanically;
	- no one-token collapse;
	- completions were too long, often near the 400-token cap;
	- holdout: 4/5 valid;
	- mean holdout regret: `-0.0229`;
	- did not beat the Qwen2.5 GRPO model.

	Conclusion: Qwen3 Base is a viable research branch, but the current production
	candidate remains Qwen2.5-7B SFT plus GRPO.

	## Limitations

	- The GRPO run is Phase 1 only, so it is strongest on easy-shock simulator
	reward optimization.
	- The model still has known second-order reasoning weaknesses in specific
	macro setups.
	- The reward environment is synthetic and should be interpreted as a benchmark,
	not a market simulator.
	- The model is private on Hugging Face and requires `HF_API_TOKEN` for loading.

	## Reproducibility

	Final notebook:

	```text
	notebooks/carbonalpha_final_pipeline.ipynb
	```

	Colab link:

	```text
	https://colab.research.google.com/github/capabl-machines/gridops/blob/round-2/notebooks/carbonalpha_final_pipeline.ipynb
	```

	The notebook verifies artifacts, loads metrics from Hugging Face, runs an
	environment smoke test, shows the manual eval set, and includes opt-in cells
	to relaunch the exact HF Jobs training runs.