	---
	base_model: unsloth/Qwen2.5-3B-Instruct-bnb-4bit
	library_name: peft
	pipeline_tag: text-generation
	model_name: methanol-apc
	tags:
	- base_model:adapter:unsloth/Qwen2.5-3B-Instruct-bnb-4bit
	- grpo
	- lora
	- peft
	- trl
	- unsloth
	- reinforcement-learning
	- process-control
	- methanol
	license: apache-2.0
	---

	# Model Card for methanol-apc

	LoRA adapter for [`unsloth/Qwen2.5-3B-Instruct-bnb-4bit`](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct-bnb-4bit), fine-tuned with GRPO ([Group Relative Policy Optimization](https://huggingface.co/papers/2402.03300)) using [Unsloth](https://github.com/unslothai/unsloth) to act as an autonomous Advanced Process Control (APC) operator for a methanol synthesis reactor.

	The agent receives simulated sensor readings (temperature, pressure, H₂/CO ratio, catalyst health, …) and emits a JSON control action (feed rates, cooling water flow, and compressor power) that is scored by the [`methanol-apc` OpenEnv environment](https://huggingface.co/spaces/glitchfilter/methanol-apc-env).

	- Model on Hugging Face: [glitchfilter/methanol-apc](https://huggingface.co/glitchfilter/methanol-apc)
	- Environment: [glitchfilter/methanol-apc-env (HF Space)](https://huggingface.co/spaces/glitchfilter/methanol-apc-env) · [Bhavneet1492/openenv-methanol-apc (GitHub)](https://github.com/Bhavneet1492/openenv-methanol-apc)
	- Base model: [unsloth/Qwen2.5-3B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct-bnb-4bit)

	## Quick start

	```python
	from unsloth import FastLanguageModel
	from peft import PeftModel
	import torch

	# Load the 4-bit base model, then attach the LoRA adapter on top of it.
	model, tokenizer = FastLanguageModel.from_pretrained(
	    model_name="unsloth/Qwen2.5-3B-Instruct-bnb-4bit",
	    max_seq_length=2048,
	    load_in_4bit=True,
	)
	model = PeftModel.from_pretrained(model, "glitchfilter/methanol-apc")
	FastLanguageModel.for_inference(model)

	system_prompt = (
	    "You are an AI controller for a methanol synthesis reactor. "
	    "Output a JSON control action with fields: "
	    '{"feed_rate_h2": <0-10>, "feed_rate_co": <0-5>, '
	    '"cooling_water_flow": <0-100>, "compressor_power": <0-100>}.'
	)
	sensors = "T=248.3°C P=85.0bar H2=4.50mol/s CO=2.20mol/s ratio=2.05 cool=55L/min cat_health=98%"

	messages = [
	    {"role": "system", "content": system_prompt},
	    {"role": "user", "content": f"Current sensor readings:\n{sensors}\n\nProvide control action as JSON:"},
	]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	with torch.no_grad():
	    out = model.generate(
	        **inputs,
	        max_new_tokens=128,
	        temperature=0.3,
	        do_sample=True,
	        pad_token_id=tokenizer.eos_token_id,
	    )
	# Decode only the newly generated tokens, skipping the prompt.
	print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
	```
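
The decoded completion is free text, so a caller usually has to extract the JSON object and check it before use. A minimal sketch, assuming the field names and ranges given in the system prompt above; the helper `parse_action` and its clamping behavior are illustrative and are not taken from the environment's code:

```python
import json
import re

# Field ranges copied from the system prompt; clamping out-of-range
# values (instead of rejecting them) is an assumption of this sketch.
ACTION_RANGES = {
    "feed_rate_h2": (0.0, 10.0),
    "feed_rate_co": (0.0, 5.0),
    "cooling_water_flow": (0.0, 100.0),
    "compressor_power": (0.0, 100.0),
}


def parse_action(text):
    """Return a clamped action dict, or None if no valid JSON action is found."""
    # Grab the first flat {...} span; the action has no nested objects.
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    action = {}
    for name, (low, high) in ACTION_RANGES.items():
        value = raw.get(name)
        # Reject missing or non-numeric fields (bool is excluded explicitly).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        action[name] = min(max(float(value), low), high)
    return action


reply = (
    'Action: {"feed_rate_h2": 4.6, "feed_rate_co": 2.2, '
    '"cooling_water_flow": 120, "compressor_power": 70}'
)
print(parse_action(reply))  # cooling_water_flow is clamped to 100.0
```

Returning `None` for malformed output lets the caller fall back to a safe default action or skip the step.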

	## Training procedure

	The adapter was trained with GRPO, using Unsloth's 4-bit quantized base model and LoRA adapters to speed up training.

	Pipeline: `LLM generates JSON action` → `reward fn parses & scores` → `env.step()` → `multi-component reward` → `GRPO update`.
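
For the final `GRPO update` stage, the GRPO paper cited below normalizes each completion's reward against the other completions sampled for the same prompt. A rough sketch of that normalization follows. The function name, the `eps` term, and the sample rewards are illustrative, and TRL's implementation may differ in details such as the standard-deviation estimator:

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight made-up rewards, one per completion in a group of size 8.
group_rewards = [0.42, 0.10, -0.05, 0.31, 0.42, 0.18, -0.20, 0.25]
advantages = group_relative_advantages(group_rewards)
print([round(a, 2) for a in advantages])
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean are discouraged. If every completion in a group earns the same reward, all advantages are zero and the group contributes no learning signal.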

	### Key design choices

	- Curriculum learning over three task types:
	  - `startup` (40%, easy): ramp the reactor to operating temperature
	  - `optimization` (35%, medium): maximize profit at steady state
	  - `disturbance_rejection` (25%, hard): handle cooling system failures
	- Multi-component reward combining:
	  1. Physics reward from `env.step` (× 0.55)
	  2. Format-compliance bonus for valid JSON actions (+0.10)
	  3. Action-quality score grounded in stoichiometry and cooling adequacy (range [−0.30, +0.20])
	  4. 3-step lookahead penalty to surface delayed thermal-runaway consequences (range [−0.20, 0])
	- Deterministic replay: each prompt stores `(task, seed, num_warmup)` so all GRPO group completions evaluate against an identical environment state.
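
The reward components listed above could be combined as in the following sketch. It assumes the components are added after scaling and clipping to the stated ranges, which this card does not confirm. The function name `combine_reward` and its arguments are hypothetical:

```python
def clip(value, low, high):
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


def combine_reward(physics, valid_json, action_quality, lookahead_penalty):
    """Additive combination of the four components (an assumed scheme)."""
    total = 0.55 * physics                         # scaled physics reward from env.step
    if valid_json:
        total += 0.10                              # format-compliance bonus
    total += clip(action_quality, -0.30, 0.20)     # stoichiometry / cooling score
    total += clip(lookahead_penalty, -0.20, 0.0)   # 3-step lookahead penalty
    return total


print(combine_reward(physics=0.8, valid_json=True,
                     action_quality=0.15, lookahead_penalty=-0.05))
```

Under this assumed scheme, the shaping terms add between −0.50 and +0.30 on top of the scaled physics reward.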

	### Hyperparameters

	| Hyperparameter | Value |
	|---|---|
	| Base model | `unsloth/Qwen2.5-3B-Instruct-bnb-4bit` (4-bit) |
	| LoRA `r` / `alpha` / dropout | 16 / 32 / 0 |
	| LoRA target modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
	| Max sequence length | 2048 |
	| Max completion length | 120 tokens |
	| Train steps | 200 |
	| Per-device batch × grad accum | 2 × 4 |
	| GRPO group size (`num_generations`) | 8 |
	| Learning rate | 5e-6 |
	| Warmup ratio | 0.05 |
	| Max grad norm | 1.0 |
	| Sampling temperature | 0.7 |
	| KL coefficient | 0.05 |
	| Precision | fp16 (bf16 where supported) |
	| Gradient checkpointing | Unsloth |
	| Prompt dataset size | 300 |
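
One way to build the 300-prompt dataset with the curriculum mix and the `(task, seed, num_warmup)` replay keys described under key design choices is sketched below. The card does not say whether the notebook uses exact proportions or random sampling, so the fixed split, the seed scheme, and the `num_warmup` range are assumptions:

```python
import random

# Task shares from the curriculum description.
TASK_MIX = {"startup": 0.40, "optimization": 0.35, "disturbance_rejection": 0.25}


def build_prompt_records(size=300, base_seed=0, max_warmup=5):
    """Create replay keys; each record pins down one environment state."""
    rng = random.Random(base_seed)  # local RNG keeps the dataset reproducible
    records = []
    for task, share in TASK_MIX.items():
        for _ in range(round(size * share)):
            records.append({
                "task": task,
                "seed": rng.randrange(2**31),               # environment seed
                "num_warmup": rng.randint(0, max_warmup),   # hypothetical range
            })
    rng.shuffle(records)
    return records


records = build_prompt_records()
print(len(records), records[0])
```

At reward time, each completion in a GRPO group would rebuild the environment from the same record, so differences in reward come from the actions and not from the starting state.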

	### Framework versions

	- PEFT 0.18.1
	- Unsloth (`git+https://github.com/unslothai/unsloth.git`)
	- TRL ≥ 0.15
	- `openenv-core[core]` ≥ 0.2.2

	## Evaluation

	The trained agent is compared against a random-action baseline on the `optimization` task (5 episodes × 15 steps). Plots are produced by the training notebook and saved to [plots/](plots/):

	| Plot | File |
	|---|---|
	| Training loss | [plots/loss_curve.png](plots/loss_curve.png) |
	| Reward per step (trained) | [plots/reward_curve.png](plots/reward_curve.png) |
	| Baseline vs trained | [plots/baseline_vs_trained.png](plots/baseline_vs_trained.png) |
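
The comparison protocol (5 episodes × 15 steps per policy) can be sketched as a small harness. The environment below is a hypothetical stand-in with a made-up reward, not the `methanol-apc` environment, and its `reset`/`step` signatures are assumptions. `fixed_policy` is a placeholder for the trained agent. The sketch only shows how a random-action baseline and another policy would be averaged over the same seeds:

```python
import random
from statistics import mean

ACTION_RANGES = {
    "feed_rate_h2": (0.0, 10.0),
    "feed_rate_co": (0.0, 5.0),
    "cooling_water_flow": (0.0, 100.0),
    "compressor_power": (0.0, 100.0),
}


class ToyEnv:
    """Hypothetical stand-in: reward peaks when the H2/CO feed ratio is 2."""

    def reset(self, seed):
        self.rng = random.Random(seed)
        return {"ratio": 2.0}

    def step(self, action):
        ratio = action["feed_rate_h2"] / max(action["feed_rate_co"], 1e-6)
        reward = -abs(ratio - 2.0) + self.rng.uniform(-0.01, 0.01)
        return {"ratio": ratio}, reward


def random_policy(obs, rng):
    """Baseline: sample every field uniformly within its allowed range."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in ACTION_RANGES.items()}


def fixed_policy(obs, rng):
    """Placeholder for the trained agent: always feeds at a 2:1 ratio."""
    return {"feed_rate_h2": 4.4, "feed_rate_co": 2.2,
            "cooling_water_flow": 55.0, "compressor_power": 70.0}


def evaluate(policy, episodes=5, steps=15):
    """Mean per-step reward, averaged over episodes with fixed seeds."""
    episode_means = []
    for ep in range(episodes):
        env = ToyEnv()
        obs = env.reset(seed=ep)
        rng = random.Random(1000 + ep)  # policy RNG, seeded per episode
        rewards = []
        for _ in range(steps):
            obs, reward = env.step(policy(obs, rng))
            rewards.append(reward)
        episode_means.append(mean(rewards))
    return mean(episode_means)


print("baseline:", evaluate(random_policy))
print("policy:  ", evaluate(fixed_policy))
```

Reusing the same episode seeds for both policies keeps the comparison paired, so differences in the averages reflect the policies and not the initial conditions.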

	## Intended use & limitations

	This adapter is a research artifact demonstrating GRPO-based fine-tuning for closed-loop chemical-process control on a simulated environment. It is not suitable for, and must not be deployed against, any real industrial reactor or safety-critical system. The simulator is a simplified model of methanol synthesis (ICI low-pressure process, Cu/ZnO/Al₂O₃ catalyst) and does not capture the full dynamics, instrumentation, or failure modes of a physical plant.

	## Citations

	GRPO:

	```bibtex
	@article{shao2024deepseekmath,
	  title  = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
	  author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
	  year   = {2024},
	  eprint = {arXiv:2402.03300}
	}
	```

	Unsloth:

	```bibtex
	@software{unsloth2024,
	  title  = {{Unsloth: 2x faster, 50\% less memory LLM finetuning}},
	  author = {Daniel Han and Michael Han and {Unsloth team}},
	  url    = {https://github.com/unslothai/unsloth},
	  year   = {2024}
	}
	```