---
base_model: unsloth/Qwen2.5-3B-Instruct-bnb-4bit
library_name: peft
pipeline_tag: text-generation
model_name: methanol-apc
tags:
- base_model:adapter:unsloth/Qwen2.5-3B-Instruct-bnb-4bit
- grpo
- lora
- peft
- trl
- unsloth
- reinforcement-learning
- process-control
- methanol
license: apache-2.0
---
# Model Card for methanol-apc
LoRA adapter for [`unsloth/Qwen2.5-3B-Instruct-bnb-4bit`](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct-bnb-4bit), fine-tuned with **GRPO** ([Group Relative Policy Optimization](https://huggingface.co/papers/2402.03300)) using [Unsloth](https://github.com/unslothai/unsloth) to act as an autonomous **Advanced Process Control (APC)** operator for a methanol synthesis reactor.
The agent reads simulated sensor readings (temperature, pressure, H₂/CO ratio, catalyst health, …) and emits a JSON control action — feed rates, cooling water flow, and compressor power — that is scored by the [`methanol-apc` OpenEnv environment](https://huggingface.co/spaces/glitchfilter/methanol-apc-env).
- **Model on Hugging Face:** [glitchfilter/methanol-apc](https://huggingface.co/glitchfilter/methanol-apc)
- **Environment:** [glitchfilter/methanol-apc-env (HF Space)](https://huggingface.co/spaces/glitchfilter/methanol-apc-env) · [Bhavneet1492/openenv-methanol-apc (GitHub)](https://github.com/Bhavneet1492/openenv-methanol-apc)
- **Base model:** [unsloth/Qwen2.5-3B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct-bnb-4bit)
## Quick start
```python
from unsloth import FastLanguageModel
import torch
from peft import PeftModel

# Load the 4-bit base model, then attach the LoRA adapter on top of it.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(model, "glitchfilter/methanol-apc")
FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode

system_prompt = (
    "You are an AI controller for a methanol synthesis reactor. "
    "Output a JSON control action with fields: "
    '{"feed_rate_h2": <0-10>, "feed_rate_co": <0-5>, '
    '"cooling_water_flow": <0-100>, "compressor_power": <0-100>}.'
)
sensors = "T=248.3°C P=85.0bar H2=4.50mol/s CO=2.20mol/s ratio=2.05 cool=55L/min cat_health=98%"
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Current sensor readings:\n{sensors}\n\nProvide control action as JSON:"},
]

# Render the chat template to text, tokenize, and sample one completion.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.3,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (skip the prompt).
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
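The completion is free text, so it is worth validating before passing it to an environment. The following is a minimal sketch of one way to do that, not the parser used by the reward function: the field names and ranges are copied from the system prompt above, while `parse_action`, the clamping behaviour, and the sample `reply` string are illustrative assumptions.

```python
import json
import math
import re

# Ranges copied from the system prompt above; everything else is illustrative.
ACTION_BOUNDS = {
    "feed_rate_h2": (0.0, 10.0),
    "feed_rate_co": (0.0, 5.0),
    "cooling_water_flow": (0.0, 100.0),
    "compressor_power": (0.0, 100.0),
}


def parse_action(text):
    """Return a clamped action dict, or None if no valid JSON action is found."""
    # Take the first flat {...} span so surrounding prose or code fences are ignored.
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    action = {}
    for key, (low, high) in ACTION_BOUNDS.items():
        value = raw.get(key)
        # Reject missing, non-numeric, or non-finite fields (bool is a subclass of int).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not math.isfinite(value):
            return None
        # Clamp into the documented range instead of rejecting (an assumption).
        action[key] = min(max(float(value), low), high)
    return action


# Hypothetical model reply with an out-of-range cooling value.
reply = 'Action: {"feed_rate_h2": 4.6, "feed_rate_co": 2.3, "cooling_water_flow": 120, "compressor_power": 70}'
print(parse_action(reply))
```

Clamping is only one possible policy; rejecting out-of-range actions outright may be preferable when a format penalty should apply.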
## Training procedure
The adapter was trained with **GRPO**, using Unsloth's 4-bit quantized base model and LoRA adapters to keep memory use low.
**Pipeline:** `LLM generates JSON action` → `reward fn parses & scores` → `env.step()` → `multi-component reward` → `GRPO update`.
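For readers new to GRPO, the "group relative" part of the update can be illustrated in a few lines: each completion's reward is normalized against the other completions sampled for the same prompt. This is a minimal sketch, not the TRL or Unsloth implementation; the `eps` term, the use of the population standard deviation, and the reward values are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for one group of 8 completions.
group_rewards = [0.64, 0.10, -0.25, 0.40, 0.55, 0.05, 0.30, -0.10]
advantages = group_relative_advantages(group_rewards)
for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:+.2f}  advantage={advantage:+.3f}")
```

Completions that score above the group mean get a positive advantage and are reinforced; a group in which all completions score the same contributes no learning signal.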
### Key design choices
- **Curriculum learning** over three task types:
  - `startup` (40%): easy; ramp reactor to operating temperature
  - `optimization` (35%): medium; maximize profit at steady state
  - `disturbance_rejection` (25%): hard; handle cooling system failures
- **Multi-component reward** combining:
  1. Physics reward from `env.step` (× 0.55)
  2. Format-compliance bonus for valid JSON actions (+0.10)
  3. Action-quality score grounded in stoichiometry / cooling adequacy ([−0.30, +0.20])
  4. 3-step lookahead penalty to surface delayed thermal-runaway consequences ([−0.20, 0])
- **Deterministic replay**: each prompt stores `(task, seed, num_warmup)` so all GRPO group completions evaluate against an identical environment state.
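The reward components listed above could be combined as in the sketch below. The weight and ranges are taken from the list. The rest is assumption: that the components are simply summed, that out-of-range scores are clipped, and the names `combined_reward` and `clip`. The training notebook may combine them differently.

```python
def clip(value, low, high):
    """Clamp value into [low, high]."""
    return max(low, min(high, value))


def combined_reward(physics_reward, is_valid_json, action_quality, lookahead_penalty):
    """Sum the four components using the weight and ranges listed above.

    Summation and clipping are assumptions made for illustration.
    """
    total = 0.55 * physics_reward                 # 1. physics reward from env.step
    if is_valid_json:
        total += 0.10                             # 2. format-compliance bonus
    total += clip(action_quality, -0.30, 0.20)    # 3. action-quality score
    total += clip(lookahead_penalty, -0.20, 0.0)  # 4. 3-step lookahead penalty
    return total


# Made-up component values for a single completion.
example = combined_reward(
    physics_reward=0.8,
    is_valid_json=True,
    action_quality=0.15,
    lookahead_penalty=-0.05,
)
print(round(example, 2))
```

With these inputs the terms are 0.44, 0.10, 0.15 and -0.05. A completion that fails to parse would lose at least the format bonus.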
### Hyperparameters
| Hyperparameter | Value |
|---|---|
| Base model | `unsloth/Qwen2.5-3B-Instruct-bnb-4bit` (4-bit) |
| LoRA `r` / `alpha` / dropout | 16 / 32 / 0 |
| LoRA target modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| Max sequence length | 2048 |
| Max completion length | 120 tokens |
| Train steps | 200 |
| Per-device batch × grad accum | 2 × 4 |
| GRPO group size (`num_generations`) | 8 |
| Learning rate | 5e-6 |
| Warmup ratio | 0.05 |
| Max grad norm | 1.0 |
| Sampling temperature | 0.7 |
| KL coefficient | 0.05 |
| Precision | fp16 (bf16 where supported) |
| Gradient checkpointing | Unsloth |
| Prompt dataset size | 300 |
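To make the 300-prompt dataset concrete, the sketch below samples the task mix described under Key design choices and attaches a `(task, seed, num_warmup)` record to each prompt. It is an illustration under stated assumptions, not the notebook's code: `build_prompt_specs`, `max_warmup`, and the toy `replay_state` function are hypothetical, and `replay_state` only stands in for resetting and warming up the real environment.

```python
import random
from collections import Counter

# Task mix from the design choices above.
TASK_WEIGHTS = {"startup": 0.40, "optimization": 0.35, "disturbance_rejection": 0.25}


def build_prompt_specs(n_prompts=300, max_warmup=5, rng_seed=0):
    """Sample one replay record per prompt (max_warmup is an assumed bound)."""
    rng = random.Random(rng_seed)
    tasks = list(TASK_WEIGHTS)
    weights = list(TASK_WEIGHTS.values())
    specs = []
    for _ in range(n_prompts):
        specs.append({
            "task": rng.choices(tasks, weights=weights, k=1)[0],
            "seed": rng.randrange(2**31),
            "num_warmup": rng.randint(0, max_warmup),
        })
    return specs


def replay_state(spec):
    """Toy stand-in for env reset + warmup: same record in, same state out."""
    rng = random.Random(spec["seed"])
    temperature = 200.0
    for _ in range(spec["num_warmup"]):
        temperature += rng.uniform(0.0, 10.0)
    return temperature


specs = build_prompt_specs()
print(Counter(spec["task"] for spec in specs))
# Every completion in a GRPO group would be scored from this same state.
print(replay_state(specs[0]) == replay_state(specs[0]))
```

Because the state depends only on the stored record, reward differences within a group reflect the actions rather than different starting conditions.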
### Framework versions
- PEFT 0.18.1
- Unsloth (`git+https://github.com/unslothai/unsloth.git`)
- TRL ≥ 0.15
- `openenv-core[core]` ≥ 0.2.2
## Evaluation
The trained agent is compared against a random-action baseline on the `optimization` task (5 episodes × 15 steps). Plots are produced by the training notebook and saved to [plots/](plots/):
| Plot | File |
|---|---|
| Training loss | [plots/loss_curve.png](plots/loss_curve.png) |
| Reward per step (trained) | [plots/reward_curve.png](plots/reward_curve.png) |
| Baseline vs trained | [plots/baseline_vs_trained.png](plots/baseline_vs_trained.png) |
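The comparison protocol (5 episodes × 15 steps against a random-action baseline) can be sketched as follows. This is not the notebook's evaluation code. `StubEnv` is a hypothetical stand-in with a toy reward and does not reflect the `methanol-apc` environment API. `fixed_policy` is a placeholder for where the trained model would be called. The action ranges are the ones given in the Quick start system prompt.

```python
import random
from statistics import mean

ACTION_BOUNDS = {
    "feed_rate_h2": (0.0, 10.0),
    "feed_rate_co": (0.0, 5.0),
    "cooling_water_flow": (0.0, 100.0),
    "compressor_power": (0.0, 100.0),
}


class StubEnv:
    """Hypothetical stand-in; NOT the methanol-apc environment."""

    def reset(self, seed):
        self._rng = random.Random(seed)
        return {"T": 248.0}

    def step(self, action):
        # Toy reward: highest when cooling flow is near 55, plus seeded noise.
        reward = 1.0 - abs(action["cooling_water_flow"] - 55.0) / 100.0
        reward += self._rng.uniform(-0.01, 0.01)
        return {"T": 248.0}, reward


def random_policy(observation, rng):
    """Baseline: sample every field uniformly within its range."""
    return {name: rng.uniform(low, high) for name, (low, high) in ACTION_BOUNDS.items()}


def fixed_policy(observation, rng):
    """Placeholder for the trained agent (would call the model instead)."""
    return {"feed_rate_h2": 4.5, "feed_rate_co": 2.2,
            "cooling_water_flow": 55.0, "compressor_power": 70.0}


def evaluate(policy, episodes=5, steps=15, base_seed=0):
    """Return the mean episode return over seeded episodes."""
    returns = []
    for episode in range(episodes):
        env = StubEnv()
        rng = random.Random(base_seed + episode)
        observation = env.reset(seed=base_seed + episode)
        total = 0.0
        for _ in range(steps):
            observation, reward = env.step(policy(observation, rng))
            total += reward
        returns.append(total)
    return mean(returns)


print("random baseline:", round(evaluate(random_policy), 2))
print("placeholder agent:", round(evaluate(fixed_policy), 2))
```

Seeding both the environment and the baseline policy per episode keeps the comparison repeatable across runs.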
## Intended use & limitations
This adapter is a **research artifact** demonstrating GRPO-based fine-tuning for closed-loop chemical-process control on a *simulated* environment. It is **not** suitable for, and must not be deployed against, any real industrial reactor or safety-critical system. The simulator is a simplified model of methanol synthesis (ICI low-pressure process, Cu/ZnO/Al₂O₃ catalyst) and does not capture the full dynamics, instrumentation, or failure modes of a physical plant.
## Citations
GRPO:
```bibtex
@article{shao2024deepseekmath,
title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
year = {2024},
eprint = {2402.03300},
archivePrefix = {arXiv}
}
```
Unsloth:
```bibtex
@software{unsloth2024,
title = {{Unsloth: 2x faster, 50\% less memory LLM finetuning}},
author = {Daniel Han and Michael Han and {Unsloth team}},
url = {https://github.com/unslothai/unsloth},
year = {2024}
}
```