# Model Card for methanol-apc

LoRA adapter for unsloth/Qwen2.5-3B-Instruct-bnb-4bit, fine-tuned with GRPO (Group Relative Policy Optimization) using Unsloth to act as an autonomous Advanced Process Control (APC) operator for a methanol synthesis reactor.

The agent reads simulated sensor readings (temperature, pressure, H₂/CO ratio, catalyst health, …) and emits a JSON control action — feed rates, cooling water flow, and compressor power — that is scored by the methanol-apc OpenEnv environment.
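Because the action schema is fixed, it can be checked mechanically before being sent to the environment. A minimal sketch — the `ACTION_BOUNDS` table and `validate_action` helper below are illustrative, not part of the environment's API:

```python
# Bounds taken from the system prompt used at training time.
ACTION_BOUNDS = {
    "feed_rate_h2": (0.0, 10.0),
    "feed_rate_co": (0.0, 5.0),
    "cooling_water_flow": (0.0, 100.0),
    "compressor_power": (0.0, 100.0),
}

def validate_action(action: dict) -> bool:
    """True iff all four fields are present and inside their bounds."""
    if set(action) != set(ACTION_BOUNDS):
        return False
    return all(lo <= float(action[k]) <= hi
               for k, (lo, hi) in ACTION_BOUNDS.items())

print(validate_action({"feed_rate_h2": 4.5, "feed_rate_co": 2.2,
                       "cooling_water_flow": 60, "compressor_power": 75}))  # True
print(validate_action({"feed_rate_h2": 12.0}))  # False (missing fields)
```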

## Quick start

```python
from unsloth import FastLanguageModel
from peft import PeftModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(model, "glitchfilter/methanol-apc")
FastLanguageModel.for_inference(model)

system_prompt = (
    "You are an AI controller for a methanol synthesis reactor. "
    "Output a JSON control action with fields: "
    '{"feed_rate_h2": <0-10>, "feed_rate_co": <0-5>, '
    '"cooling_water_flow": <0-100>, "compressor_power": <0-100>}.'
)
sensors = "T=248.3°C P=85.0bar H2=4.50mol/s CO=2.20mol/s ratio=2.05 cool=55L/min cat_health=98%"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Current sensor readings:\n{sensors}\n\nProvide control action as JSON:"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.3,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
# Decode only the newly generated tokens (the control action).
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
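The completion is plain text, so downstream code should parse it defensively. A hedged sketch with a hypothetical `extract_action` helper; the model usually emits bare JSON but may wrap it in prose or a code fence:

```python
import json
import re

def extract_action(completion: str):
    """Pull the first {...} object out of a completion and parse it.
    Returns None when no parseable JSON object is found.
    (Illustrative helper; assumes a flat, non-nested JSON action.)"""
    match = re.search(r"\{.*?\}", completion, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

raw = 'Control action: {"feed_rate_h2": 4.6, "feed_rate_co": 2.25, "cooling_water_flow": 58, "compressor_power": 72}'
print(extract_action(raw)["feed_rate_h2"])  # 4.6
```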

## Training procedure

Trained with GRPO, using Unsloth's 4-bit quantized base model and LoRA adapters to keep training fast and memory-efficient.

Pipeline: LLM generates JSON action → reward fn parses & scores → env.step() → multi-component reward → GRPO update.
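GRPO needs no learned value model: each prompt's group of completions is scored, and advantages are computed relative to the group. A minimal sketch of that normalization — the standard GRPO formula, with `group_relative_advantages` as an illustrative helper rather than the notebook's exact code:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Score each completion relative to its GRPO group:
    advantage_i = (r_i - group_mean) / (group_std + eps).
    The epsilon guards against a group where all rewards are equal."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# One GRPO group of 8 completions for the same prompt/seed:
rewards = [0.55, 0.10, 0.72, -0.30, 0.55, 0.40, 0.05, 0.61]
advs = group_relative_advantages(rewards)
# Advantages sum to ~0 by construction; above-average completions
# are reinforced, below-average ones are penalized.
```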

### Key design choices

- Curriculum learning over three task types:
  - startup (40%) — easy: ramp the reactor to operating temperature
  - optimization (35%) — medium: maximize profit at steady state
  - disturbance_rejection (25%) — hard: handle cooling system failures
- Multi-component reward combining:
  1. Physics reward from env.step (× 0.55)
  2. Format-compliance bonus for valid JSON actions (+0.10)
  3. Action-quality score grounded in stoichiometry / cooling adequacy ([−0.30, +0.20])
  4. 3-step lookahead penalty to surface delayed thermal-runaway consequences ([−0.20, 0])
- Deterministic replay: each prompt stores (task, seed, num_warmup) so all GRPO group completions evaluate against an identical environment state.
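The reward components above combine roughly as follows. Weights and clipping ranges mirror the list; `combined_reward` is an illustrative sketch, not the notebook's exact function:

```python
def combined_reward(physics, valid_json, action_quality, lookahead_penalty):
    """Sketch of the multi-component reward.
    physics:           raw env.step reward, weighted by 0.55
    valid_json:        bool, +0.10 bonus for a parseable JSON action
    action_quality:    clipped to [-0.30, +0.20]
    lookahead_penalty: clipped to [-0.20, 0.0]
    """
    reward = 0.55 * physics
    if valid_json:
        reward += 0.10
    reward += max(-0.30, min(0.20, action_quality))
    reward += max(-0.20, min(0.0, lookahead_penalty))
    return reward

print(round(combined_reward(1.0, True, 0.15, -0.05), 3))  # 0.75
```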

### Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Base model | unsloth/Qwen2.5-3B-Instruct-bnb-4bit (4-bit) |
| LoRA r / alpha / dropout | 16 / 32 / 0 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max sequence length | 2048 |
| Max completion length | 120 tokens |
| Train steps | 200 |
| Per-device batch × grad accum | 2 × 4 |
| GRPO group size (num_generations) | 8 |
| Learning rate | 5e-6 |
| Warmup ratio | 0.05 |
| Max grad norm | 1.0 |
| Sampling temperature | 0.7 |
| KL coefficient | 0.05 |
| Precision | fp16 (bf16 where supported) |
| Gradient checkpointing | Unsloth |
| Prompt dataset size | 300 |
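These values can be wired into peft and TRL configs roughly as follows. This is a sketch; field names assume TRL ≥ 0.15 and may differ in your installed release, so check its documentation:

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA setup matching the table above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# GRPO training arguments matching the table above.
grpo_config = GRPOConfig(
    max_steps=200,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_generations=8,          # GRPO group size
    max_completion_length=120,
    learning_rate=5e-6,
    warmup_ratio=0.05,
    max_grad_norm=1.0,
    temperature=0.7,            # sampling temperature for rollouts
    beta=0.05,                  # KL coefficient
)
```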

## Framework versions

- PEFT 0.18.1
- Unsloth (git+https://github.com/unslothai/unsloth.git)
- TRL ≥ 0.15
- openenv-core[core] ≥ 0.2.2

## Evaluation

The trained agent is compared against a random-action baseline on the optimization task (5 episodes × 15 steps). Plots are produced by the training notebook and saved to plots/:

| Plot | File |
| --- | --- |
| Training loss | plots/loss_curve.png |
| Reward per step (trained) | plots/reward_curve.png |
| Baseline vs trained | plots/baseline_vs_trained.png |
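The random-action baseline can be reproduced by sampling each control uniformly within its bounds. An illustrative sketch; the notebook's exact sampler may differ:

```python
import random

def random_action(rng: random.Random) -> dict:
    """Random-action baseline: draw each control uniformly
    inside the bounds from the system prompt."""
    return {
        "feed_rate_h2": rng.uniform(0.0, 10.0),
        "feed_rate_co": rng.uniform(0.0, 5.0),
        "cooling_water_flow": rng.uniform(0.0, 100.0),
        "compressor_power": rng.uniform(0.0, 100.0),
    }

rng = random.Random(0)
episode = [random_action(rng) for _ in range(15)]  # one 15-step episode
print(len(episode))  # 15
```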

## Intended use & limitations

This adapter is a research artifact demonstrating GRPO-based fine-tuning for closed-loop chemical-process control on a simulated environment. It is not suitable for, and must not be deployed against, any real industrial reactor or safety-critical system. The simulator is a simplified model of methanol synthesis (ICI low-pressure process, Cu/ZnO/Al₂O₃ catalyst) and does not capture the full dynamics, instrumentation, or failure modes of a physical plant.

## Citations

GRPO:

```bibtex
@article{shao2024deepseekmath,
  title         = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
  author        = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
  year          = {2024},
  eprint        = {2402.03300},
  archivePrefix = {arXiv}
}
```

Unsloth:

```bibtex
@software{unsloth2024,
  title  = {{Unsloth: 2x faster, 50\% less memory LLM finetuning}},
  author = {Daniel Han and Michael Han and {Unsloth team}},
  url    = {https://github.com/unslothai/unsloth},
  year   = {2024}
}
```