# Model Card for methanol-apc
LoRA adapter for `unsloth/Qwen2.5-3B-Instruct-bnb-4bit`, fine-tuned with GRPO (Group Relative Policy Optimization) using Unsloth to act as an autonomous Advanced Process Control (APC) operator for a methanol synthesis reactor.
The agent reads simulated sensor readings (temperature, pressure, H₂/CO ratio, catalyst health, …) and emits a JSON control action — feed rates, cooling water flow, and compressor power — that is scored by the methanol-apc OpenEnv environment.
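For example, a well-formed action (illustrative values only; the actual numbers depend on the sensor state) looks like:

```json
{"feed_rate_h2": 4.4, "feed_rate_co": 2.2, "cooling_water_flow": 60.0, "compressor_power": 75.0}
```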
- Model on Hugging Face: glitchfilter/methanol-apc
- Environment: glitchfilter/methanol-apc-env (HF Space) · Bhavneet1492/openenv-methanol-apc (GitHub)
- Base model: unsloth/Qwen2.5-3B-Instruct-bnb-4bit
## Quick start
```python
import torch
from unsloth import FastLanguageModel
from peft import PeftModel

# Load the 4-bit base model, then attach the trained LoRA adapter
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(model, "glitchfilter/methanol-apc")
FastLanguageModel.for_inference(model)

system_prompt = (
    "You are an AI controller for a methanol synthesis reactor. "
    "Output a JSON control action with fields: "
    '{"feed_rate_h2": <0-10>, "feed_rate_co": <0-5>, '
    '"cooling_water_flow": <0-100>, "compressor_power": <0-100>}.'
)
sensors = "T=248.3°C P=85.0bar H2=4.50mol/s CO=2.20mol/s ratio=2.05 cool=55L/min cat_health=98%"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Current sensor readings:\n{sensors}\n\nProvide control action as JSON:"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.3,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
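The completion should be a single JSON object. A minimal way to parse it (assuming the model returns bare JSON with no surrounding text):

```python
import json

reply = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
action = json.loads(reply)  # raises json.JSONDecodeError if extra text surrounds the JSON
print(action["cooling_water_flow"])
```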
## Training procedure
Trained with GRPO, with the 4-bit quantized base model and LoRA adapters accelerated by Unsloth.
Pipeline: the LLM generates a JSON action → the reward function parses and scores it → `env.step()` → multi-component reward → GRPO update.
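For reference, the LoRA setup in the hyperparameter table below corresponds to Unsloth's standard adapter helper. A minimal sketch (the full training script ships with the notebook; `model` here is the 4-bit base loaded as in the quick start, before `for_inference`):

```python
from unsloth import FastLanguageModel

# Attach LoRA adapters to the 4-bit base model; values mirror the hyperparameter table
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # matches the "Gradient checkpointing: Unsloth" row
)
```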
### Key design choices
- Curriculum learning over three task types:
  - `startup` (40%, easy): ramp the reactor to operating temperature
  - `optimization` (35%, medium): maximize profit at steady state
  - `disturbance_rejection` (25%, hard): handle cooling-system failures
- Multi-component reward (see the sketch after this list) combining:
  - Physics reward from `env.step()` (× 0.55)
  - Format-compliance bonus for valid JSON actions (+0.10)
  - Action-quality score grounded in stoichiometry / cooling adequacy ([−0.30, +0.20])
  - 3-step lookahead penalty to surface delayed thermal-runaway consequences ([−0.20, 0])
- Deterministic replay: each prompt stores `(task, seed, num_warmup)` so all GRPO group completions evaluate against an identical environment state.
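A minimal sketch of how these components combine. The helper functions, the invalid-JSON handling, and the `env.step` return shape are placeholders; the real implementation lives in the training notebook and the environment repo:

```python
import json

def action_quality(action, obs) -> float:
    """Placeholder for the stoichiometry / cooling-adequacy score in [-0.30, +0.20]."""
    return 0.0

def lookahead_penalty(env, action, horizon: int = 3) -> float:
    """Placeholder for the delayed thermal-runaway penalty in [-0.20, 0]."""
    return 0.0

def combined_reward(completion: str, env) -> float:
    """Score one GRPO completion against a deterministically replayed environment."""
    try:
        action = json.loads(completion)
        reward = 0.10                        # format-compliance bonus for valid JSON
    except json.JSONDecodeError:
        return 0.0                           # no bonus or physics score for invalid JSON (assumed)

    obs, physics_reward = env.step(action)   # return shape assumed; see the env repo for the real API
    reward += 0.55 * physics_reward          # physics reward weighted by 0.55
    reward += action_quality(action, obs)
    reward += lookahead_penalty(env, action)
    return reward
```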
Hyperparameters
| Base model | unsloth/Qwen2.5-3B-Instruct-bnb-4bit (4-bit) |
LoRA r / alpha / dropout |
16 / 32 / 0 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max sequence length | 2048 |
| Max completion length | 120 tokens |
| Train steps | 200 |
| Per-device batch × grad accum | 2 × 4 |
GRPO group size (num_generations) |
8 |
| Learning rate | 5e-6 |
| Warmup ratio | 0.05 |
| Max grad norm | 1.0 |
| Sampling temperature | 0.7 |
| KL coefficient | 0.05 |
| Precision | fp16 (bf16 where supported) |
| Gradient checkpointing | Unsloth |
| Prompt dataset size | 300 |
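These values map onto TRL's `GRPOConfig` roughly as follows. A sketch, not the exact training script; `output_dir`, the reward-function wrapper, and the dataset variable are assumptions:

```python
from trl import GRPOConfig, GRPOTrainer

# Configuration mirroring the hyperparameter table above
config = GRPOConfig(
    output_dir="methanol-apc-grpo",          # assumed output path
    max_steps=200,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_generations=8,                       # GRPO group size
    max_completion_length=120,
    learning_rate=5e-6,
    warmup_ratio=0.05,
    max_grad_norm=1.0,
    temperature=0.7,                         # sampling temperature for rollouts
    beta=0.05,                               # KL coefficient
    fp16=True,                               # or bf16=True where supported
)

trainer = GRPOTrainer(
    model=model,                             # LoRA-wrapped model from the sketch above
    args=config,
    reward_funcs=[reward_fn],                # hypothetical wrapper around combined_reward
    train_dataset=prompt_dataset,            # the 300-prompt dataset (assumed name)
)
trainer.train()
```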
### Framework versions

- PEFT 0.18.1
- Unsloth (`git+https://github.com/unslothai/unsloth.git`)
- TRL ≥ 0.15
- `openenv-core[core]` ≥ 0.2.2
## Evaluation

The trained agent is compared against a random-action baseline on the `optimization` task (5 episodes × 15 steps); a minimal sketch of the baseline loop follows the table. Plots are produced by the training notebook and saved to `plots/`:

| Plot | File |
|---|---|
| Training loss | `plots/loss_curve.png` |
| Reward per step (trained) | `plots/reward_curve.png` |
| Baseline vs trained | `plots/baseline_vs_trained.png` |
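A minimal sketch of the baseline side of this comparison, assuming a Gym-style client for the methanol-apc environment (`env`, the `reset` task argument, and the `step` return shape are all assumptions; see the environment repo for the real API):

```python
import random

def random_policy(obs):
    """Random-action baseline: sample each field uniformly within the prompt bounds."""
    return {
        "feed_rate_h2": random.uniform(0, 10),
        "feed_rate_co": random.uniform(0, 5),
        "cooling_water_flow": random.uniform(0, 100),
        "compressor_power": random.uniform(0, 100),
    }

def run_episode(env, policy, steps=15):
    """Roll out one episode and return the cumulative reward."""
    obs = env.reset(task="optimization")     # task selection argument assumed
    total = 0.0
    for _ in range(steps):
        obs, reward = env.step(policy(obs))  # return shape assumed
        total += reward
    return total

# 5 episodes × 15 steps, matching the evaluation protocol above
baseline_scores = [run_episode(env, random_policy) for _ in range(5)]
```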
## Intended use & limitations
This adapter is a research artifact demonstrating GRPO-based fine-tuning for closed-loop chemical-process control on a simulated environment. It is not suitable for, and must not be deployed against, any real industrial reactor or safety-critical system. The simulator is a simplified model of methanol synthesis (ICI low-pressure process, Cu/ZnO/Al₂O₃ catalyst) and does not capture the full dynamics, instrumentation, or failure modes of a physical plant.
## Citations
GRPO:

```bibtex
@article{shao2024deepseekmath,
  title  = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
  author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
  year   = {2024},
  eprint = {arXiv:2402.03300}
}
```
Unsloth:

```bibtex
@software{unsloth2024,
  title  = {{Unsloth: 2x faster, 50\% less memory LLM finetuning}},
  author = {Daniel Han and Michael Han and {Unsloth team}},
  url    = {https://github.com/unslothai/unsloth},
  year   = {2024}
}
```