# Model Card for methanol-apc
LoRA adapter for `unsloth/Qwen2.5-3B-Instruct-bnb-4bit`, fine-tuned with GRPO (Group Relative Policy Optimization) using Unsloth to act as an autonomous Advanced Process Control (APC) operator for a methanol synthesis reactor.
The agent reads simulated sensor readings (temperature, pressure, H₂/CO ratio, catalyst health, …) and emits a JSON control action — feed rates, cooling water flow, and compressor power — that is scored by the methanol-apc OpenEnv environment.
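For example, a well-formed action (illustrative values only; the actual numbers depend on the sensor state) looks like:

```json
{"feed_rate_h2": 4.4, "feed_rate_co": 2.2, "cooling_water_flow": 60.0, "compressor_power": 75.0}
```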
- Model on Hugging Face: glitchfilter/methanol-apc
- Environment: glitchfilter/methanol-apc-env (HF Space) · Bhavneet1492/openenv-methanol-apc (GitHub)
- Base model: unsloth/Qwen2.5-3B-Instruct-bnb-4bit
## Quick start
```python
import torch
from unsloth import FastLanguageModel
from peft import PeftModel

# Load the 4-bit base model, then attach the trained LoRA adapter
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(model, "glitchfilter/methanol-apc")
FastLanguageModel.for_inference(model)

system_prompt = (
    "You are an AI controller for a methanol synthesis reactor. "
    "Output a JSON control action with fields: "
    '{"feed_rate_h2": <0-10>, "feed_rate_co": <0-5>, '
    '"cooling_water_flow": <0-100>, "compressor_power": <0-100>}.'
)
sensors = "T=248.3°C P=85.0bar H2=4.50mol/s CO=2.20mol/s ratio=2.05 cool=55L/min cat_health=98%"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Current sensor readings:\n{sensors}\n\nProvide control action as JSON:"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.3,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
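The completion should be a single JSON object. A minimal way to parse it (assuming the model returns bare JSON with no surrounding text):

```python
import json

reply = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
action = json.loads(reply)  # raises json.JSONDecodeError if extra text surrounds the JSON
print(action["cooling_water_flow"])
```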
## Training procedure
Trained with GRPO, with the 4-bit quantized base model and LoRA adapters accelerated by Unsloth.
Pipeline: the LLM generates a JSON action → the reward function parses and scores it → `env.step()` → multi-component reward → GRPO update.
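For reference, the LoRA setup in the hyperparameter table below corresponds to Unsloth's standard adapter helper. A minimal sketch (the full training script ships with the notebook; `model` here is the 4-bit base loaded as in the quick start, before `for_inference`):

```python
from unsloth import FastLanguageModel

# Attach LoRA adapters to the 4-bit base model; values mirror the hyperparameter table
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # matches the "Gradient checkpointing: Unsloth" row
)
```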
### Key design choices
- Curriculum learning over three task types:
  - `startup` (40%, easy): ramp the reactor to operating temperature
  - `optimization` (35%, medium): maximize profit at steady state
  - `disturbance_rejection` (25%, hard): handle cooling-system failures
- Multi-component reward (see the sketch after this list) combining:
  - Physics reward from `env.step()` (× 0.55)
  - Format-compliance bonus for valid JSON actions (+0.10)
  - Action-quality score grounded in stoichiometry / cooling adequacy ([−0.30, +0.20])
  - 3-step lookahead penalty to surface delayed thermal-runaway consequences ([−0.20, 0])
- Deterministic replay: each prompt stores `(task, seed, num_warmup)` so all GRPO group completions evaluate against an identical environment state.
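A minimal sketch of how these components combine. The helper functions, the invalid-JSON handling, and the `env.step` return shape are placeholders; the real implementation lives in the training notebook and the environment repo:

```python
import json

def action_quality(action, obs) -> float:
    """Placeholder for the stoichiometry / cooling-adequacy score in [-0.30, +0.20]."""
    return 0.0

def lookahead_penalty(env, action, horizon: int = 3) -> float:
    """Placeholder for the delayed thermal-runaway penalty in [-0.20, 0]."""
    return 0.0

def combined_reward(completion: str, env) -> float:
    """Score one GRPO completion against a deterministically replayed environment."""
    try:
        action = json.loads(completion)
        reward = 0.10                        # format-compliance bonus for valid JSON
    except json.JSONDecodeError:
        return 0.0                           # no bonus or physics score for invalid JSON (assumed)

    obs, physics_reward = env.step(action)   # return shape assumed; see the env repo for the real API
    reward += 0.55 * physics_reward          # physics reward weighted by 0.55
    reward += action_quality(action, obs)
    reward += lookahead_penalty(env, action)
    return reward
```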
Hyperparameters
| Base model | unsloth/Qwen2.5-3B-Instruct-bnb-4bit (4-bit) |
LoRA r / alpha / dropout |
16 / 32 / 0 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max sequence length | 2048 |
| Max completion length | 120 tokens |
| Train steps | 200 |
| Per-device batch × grad accum | 2 × 4 |
GRPO group size (num_generations) |
8 |
| Learning rate | 5e-6 |
| Warmup ratio | 0.05 |
| Max grad norm | 1.0 |
| Sampling temperature | 0.7 |
| KL coefficient | 0.05 |
| Precision | fp16 (bf16 where supported) |
| Gradient checkpointing | Unsloth |
| Prompt dataset size | 300 |
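These values map onto TRL's `GRPOConfig` roughly as follows. A sketch, not the exact training script; `output_dir`, the reward-function wrapper, and the dataset variable are assumptions:

```python
from trl import GRPOConfig, GRPOTrainer

# Configuration mirroring the hyperparameter table above
config = GRPOConfig(
    output_dir="methanol-apc-grpo",          # assumed output path
    max_steps=200,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_generations=8,                       # GRPO group size
    max_completion_length=120,
    learning_rate=5e-6,
    warmup_ratio=0.05,
    max_grad_norm=1.0,
    temperature=0.7,                         # sampling temperature for rollouts
    beta=0.05,                               # KL coefficient
    fp16=True,                               # or bf16=True where supported
)

trainer = GRPOTrainer(
    model=model,                             # LoRA-wrapped model from the sketch above
    args=config,
    reward_funcs=[reward_fn],                # hypothetical wrapper around combined_reward
    train_dataset=prompt_dataset,            # the 300-prompt dataset (assumed name)
)
trainer.train()
```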
### Framework versions

- PEFT 0.18.1
- Unsloth (`git+https://github.com/unslothai/unsloth.git`)
- TRL ≥ 0.15
- `openenv-core[core]` ≥ 0.2.2
## Evaluation

The trained agent is compared against a random-action baseline on the `optimization` task (5 episodes × 15 steps); a minimal sketch of the baseline loop follows the table. Plots are produced by the training notebook and saved to `plots/`:

| Plot | File |
|---|---|
| Training loss | `plots/loss_curve.png` |
| Reward per step (trained) | `plots/reward_curve.png` |
| Baseline vs trained | `plots/baseline_vs_trained.png` |
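A minimal sketch of the baseline side of this comparison, assuming a Gym-style client for the methanol-apc environment (`env`, the `reset` task argument, and the `step` return shape are all assumptions; see the environment repo for the real API):

```python
import random

def random_policy(obs):
    """Random-action baseline: sample each field uniformly within the prompt bounds."""
    return {
        "feed_rate_h2": random.uniform(0, 10),
        "feed_rate_co": random.uniform(0, 5),
        "cooling_water_flow": random.uniform(0, 100),
        "compressor_power": random.uniform(0, 100),
    }

def run_episode(env, policy, steps=15):
    """Roll out one episode and return the cumulative reward."""
    obs = env.reset(task="optimization")     # task selection argument assumed
    total = 0.0
    for _ in range(steps):
        obs, reward = env.step(policy(obs))  # return shape assumed
        total += reward
    return total

# 5 episodes × 15 steps, matching the evaluation protocol above
baseline_scores = [run_episode(env, random_policy) for _ in range(5)]
```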
## Intended use & limitations
This adapter is a research artifact demonstrating GRPO-based fine-tuning for closed-loop chemical-process control on a simulated environment. It is not suitable for, and must not be deployed against, any real industrial reactor or safety-critical system. The simulator is a simplified model of methanol synthesis (ICI low-pressure process, Cu/ZnO/Al₂O₃ catalyst) and does not capture the full dynamics, instrumentation, or failure modes of a physical plant.
## Citations
GRPO:

```bibtex
@article{shao2024deepseekmath,
  title  = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
  author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
  year   = {2024},
  eprint = {arXiv:2402.03300}
}
```
Unsloth:

```bibtex
@software{unsloth2024,
  title  = {{Unsloth: 2x faster, 50\% less memory LLM finetuning}},
  author = {Daniel Han and Michael Han and {Unsloth team}},
  url    = {https://github.com/unslothai/unsloth},
  year   = {2024}
}
```