---
base_model: unsloth/Qwen2.5-3B-Instruct-bnb-4bit
library_name: peft
pipeline_tag: text-generation
model_name: methanol-apc
tags:
  - base_model:adapter:unsloth/Qwen2.5-3B-Instruct-bnb-4bit
  - grpo
  - lora
  - peft
  - trl
  - unsloth
  - reinforcement-learning
  - process-control
  - methanol
license: apache-2.0
---

# Model Card for methanol-apc

LoRA adapter for [`unsloth/Qwen2.5-3B-Instruct-bnb-4bit`](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct-bnb-4bit), fine-tuned with **GRPO** ([Group Relative Policy Optimization](https://huggingface.co/papers/2402.03300)) using [Unsloth](https://github.com/unslothai/unsloth) to act as an autonomous **Advanced Process Control (APC)** operator for a methanol synthesis reactor.

The agent receives simulated sensor readings (temperature, pressure, H₂/CO ratio, catalyst health, …) and emits a JSON control action (H₂ and CO feed rates, cooling water flow, and compressor power) that is scored by the [`methanol-apc` OpenEnv environment](https://huggingface.co/spaces/glitchfilter/methanol-apc-env).

- **Model on Hugging Face:** [glitchfilter/methanol-apc](https://huggingface.co/glitchfilter/methanol-apc)
- **Environment:** [glitchfilter/methanol-apc-env (HF Space)](https://huggingface.co/spaces/glitchfilter/methanol-apc-env) · [Bhavneet1492/openenv-methanol-apc (GitHub)](https://github.com/Bhavneet1492/openenv-methanol-apc)
- **Base model:** [unsloth/Qwen2.5-3B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct-bnb-4bit)

## Quick start

```python
from unsloth import FastLanguageModel  # import Unsloth before peft so its patches apply
import torch
from peft import PeftModel

# Load the 4-bit base model, then attach the LoRA adapter on top of it.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(model, "glitchfilter/methanol-apc")
FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode

# The system prompt lists the four action fields and their allowed ranges.
system_prompt = (
    "You are an AI controller for a methanol synthesis reactor. "
    "Output a JSON control action with fields: "
    '{"feed_rate_h2": <0-10>, "feed_rate_co": <0-5>, '
    '"cooling_water_flow": <0-100>, "compressor_power": <0-100>}.'
)
sensors = "T=248.3°C P=85.0bar H2=4.50mol/s CO=2.20mol/s ratio=2.05 cool=55L/min cat_health=98%"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": f"Current sensor readings:\n{sensors}\n\nProvide control action as JSON:"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, temperature=0.3, do_sample=True,
                         pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
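
The decoded completion is plain text, so it still has to be turned into an action before it can be sent to the environment. A minimal sketch, assuming the four fields and ranges named in the system prompt above; `parse_action` and `ACTION_BOUNDS` are hypothetical helpers, not part of the environment's API, and clamping out-of-range values (rather than rejecting them) is an illustrative choice:

```python
import json
import math
import re

# Field ranges copied from the system prompt; the helper names are hypothetical.
ACTION_BOUNDS = {
    "feed_rate_h2": (0.0, 10.0),
    "feed_rate_co": (0.0, 5.0),
    "cooling_water_flow": (0.0, 100.0),
    "compressor_power": (0.0, 100.0),
}

def parse_action(text):
    """Return a clamped action dict, or None if no valid JSON action is found."""
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)  # first flat {...} object
    if match is None:
        return None
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    action = {}
    for name, (low, high) in ACTION_BOUNDS.items():
        value = raw.get(name)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None  # missing or non-numeric field
        if not math.isfinite(value):
            return None  # reject NaN / Infinity
        action[name] = min(max(float(value), low), high)  # clamp into range
    return action
```

Returning `None` for malformed output keeps the "valid JSON" check separate from whatever the caller decides to do with an invalid completion.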

## Training procedure

Trained with **GRPO** on top of Unsloth's 4-bit quantized base model, with LoRA adapters as the only trainable weights.

**Pipeline:** `LLM generates JSON action` → `reward fn parses & scores` → `env.step()` → `multi-component reward` → `GRPO update`.
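
The `GRPO update` at the end of this pipeline works on rewards that are normalized within each group of completions sampled for the same prompt. A minimal sketch of that normalization as described in the GRPO paper; the function name and the `eps` value are illustrative, and TRL's implementation may differ in detail:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize one prompt's group of rewards: (r - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (illustrative choice)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group earns the same reward, all advantages are zero and that group contributes no reward-driven signal to the update.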

### Key design choices

- **Curriculum learning** over three task types:
  - `startup` (40%) — easy: ramp reactor to operating temperature
  - `optimization` (35%) — medium: maximize profit at steady state
  - `disturbance_rejection` (25%) — hard: handle cooling system failures
- **Multi-component reward** combining:
  1. Physics reward from `env.step` (× 0.55)
  2. Format-compliance bonus for valid JSON actions (+0.10)
  3. Action-quality score grounded in stoichiometry / cooling adequacy ([−0.30, +0.20])
  4. 3-step lookahead penalty to surface delayed thermal-runaway consequences ([−0.20, 0])
- **Deterministic replay**: each prompt stores `(task, seed, num_warmup)` so all GRPO group completions evaluate against an identical environment state.
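
One way the four reward components listed above could be combined is a plain sum, with each bounded term clipped to its stated range. This is a sketch under that assumption: the card gives the weights and ranges but not the exact combination rule, and `combine_reward` and its argument names are hypothetical:

```python
def clip(value, low, high):
    """Clamp value into [low, high]."""
    return min(max(value, low), high)

def combine_reward(physics_reward, valid_json, action_quality, lookahead_penalty):
    """Sum the four components; the additive form is an assumption."""
    total = 0.55 * physics_reward                 # physics reward from env.step, scaled
    total += 0.10 if valid_json else 0.0          # format-compliance bonus
    total += clip(action_quality, -0.30, 0.20)    # stoichiometry / cooling score
    total += clip(lookahead_penalty, -0.20, 0.0)  # 3-step lookahead penalty
    return total
```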

### Hyperparameters

| Hyperparameter | Value |
|---|---|
| Base model | `unsloth/Qwen2.5-3B-Instruct-bnb-4bit` (4-bit) |
| LoRA `r` / `alpha` / dropout | 16 / 32 / 0 |
| LoRA target modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| Max sequence length | 2048 |
| Max completion length | 120 tokens |
| Train steps | 200 |
| Per-device batch × grad accum | 2 × 4 |
| GRPO group size (`num_generations`) | 8 |
| Learning rate | 5e-6 |
| Warmup ratio | 0.05 |
| Max grad norm | 1.0 |
| Sampling temperature | 0.7 |
| KL coefficient | 0.05 |
| Precision | fp16 (bf16 where supported) |
| Gradient checkpointing | Unsloth |
| Prompt dataset size | 300 |
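
The 300-entry prompt dataset, the task mix, and the `(task, seed, num_warmup)` replay record described under Key design choices could be put together along these lines. A sketch only: `build_prompt_records`, the seed range, the warmup range of 0 to 5 steps, and the reading of the percentages as sampling weights are assumptions, not values taken from the training notebook:

```python
import random

# Task mix from "Key design choices", treated here as sampling weights (assumption).
TASK_WEIGHTS = {"startup": 0.40, "optimization": 0.35, "disturbance_rejection": 0.25}

def build_prompt_records(n=300, master_seed=0, max_warmup=5):
    """Sample n (task, seed, num_warmup) records reproducibly."""
    rng = random.Random(master_seed)  # local RNG, global random state is untouched
    tasks = list(TASK_WEIGHTS)
    weights = list(TASK_WEIGHTS.values())
    records = []
    for _ in range(n):
        records.append({
            "task": rng.choices(tasks, weights=weights, k=1)[0],
            "seed": rng.randrange(2**31),              # hypothetical environment seed
            "num_warmup": rng.randint(0, max_warmup),  # hypothetical warmup step count
        })
    return records
```

Because each record fully determines the starting state, the environment can be rebuilt from it before scoring each of the 8 completions in a group.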

### Framework versions

- PEFT 0.18.1
- Unsloth (`git+https://github.com/unslothai/unsloth.git`)
- TRL ≥ 0.15
- `openenv-core[core]` ≥ 0.2.2

## Evaluation

The trained agent is compared against a random-action baseline on the `optimization` task (5 episodes × 15 steps). Plots are produced by the training notebook and saved to [plots/](plots/):

| Plot | File |
|---|---|
| Training loss | [plots/loss_curve.png](plots/loss_curve.png) |
| Reward per step (trained) | [plots/reward_curve.png](plots/reward_curve.png) |
| Baseline vs trained | [plots/baseline_vs_trained.png](plots/baseline_vs_trained.png) |
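
The comparison protocol (5 episodes of 15 steps each, with both agents on the same task) can be sketched as below. `ToyEnv` is a hypothetical stand-in so the snippet runs on its own; it is not the `methanol-apc` environment and its reward has no physical meaning. `evaluate`, `random_policy`, and the `reset`/`step` signatures are likewise assumptions:

```python
import random
from statistics import mean

class ToyEnv:
    """Hypothetical stand-in: reward is higher when cooling is near a target."""
    def reset(self, seed):
        self.target = random.Random(seed).uniform(40.0, 60.0)
        return {"target": self.target}

    def step(self, action):
        reward = -abs(action["cooling_water_flow"] - self.target) / 100.0
        return {"target": self.target}, reward

def random_policy(obs, rng):
    """Random-action baseline: ignore the observation entirely."""
    return {"cooling_water_flow": rng.uniform(0.0, 100.0)}

def evaluate(policy, episodes=5, steps=15, base_seed=0):
    """Mean per-step reward over `episodes` seeded episodes of `steps` steps."""
    rng = random.Random(base_seed)
    rewards = []
    for ep in range(episodes):
        env = ToyEnv()
        obs = env.reset(seed=base_seed + ep)  # same episode seeds for every policy
        for _ in range(steps):
            obs, reward = env.step(policy(obs, rng))
            rewards.append(reward)
    return mean(rewards)
```

Reusing the same episode seeds for both agents means any gap in mean reward comes from the policies rather than from different starting conditions.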

## Intended use & limitations

This adapter is a **research artifact** demonstrating GRPO-based fine-tuning for closed-loop chemical-process control in a *simulated* environment. It is **not** suitable for, and must not be deployed on, any real industrial reactor or safety-critical system. The simulator is a simplified model of methanol synthesis (ICI low-pressure process, Cu/ZnO/Al₂O₃ catalyst) and does not capture the full dynamics, instrumentation, or failure modes of a physical plant.

## Citations

GRPO:

```bibtex
@article{shao2024deepseekmath,
  title  = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
  author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
  year   = {2024},
  eprint = {arXiv:2402.03300}
}
```

Unsloth:

```bibtex
@software{unsloth2024,
  title  = {{Unsloth: 2x faster, 50\% less memory LLM finetuning}},
  author = {Daniel Han and Michael Han and {Unsloth team}},
  url    = {https://github.com/unslothai/unsloth},
  year   = {2024}
}
```