---
base_model: unsloth/Qwen2.5-3B-Instruct-bnb-4bit
library_name: peft
pipeline_tag: text-generation
model_name: methanol-apc
tags:
- base_model:adapter:unsloth/Qwen2.5-3B-Instruct-bnb-4bit
- grpo
- lora
- peft
- trl
- unsloth
- reinforcement-learning
- process-control
- methanol
license: apache-2.0
---

# Model Card for methanol-apc

LoRA adapter for [`unsloth/Qwen2.5-3B-Instruct-bnb-4bit`](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct-bnb-4bit), fine-tuned with **GRPO** ([Group Relative Policy Optimization](https://huggingface.co/papers/2402.03300)) using [Unsloth](https://github.com/unslothai/unsloth) to act as an autonomous **Advanced Process Control (APC)** operator for a methanol synthesis reactor.

The agent reads simulated sensor data (temperature, pressure, H₂/CO ratio, catalyst health, …) and emits a JSON control action (feed rates, cooling water flow, and compressor power) that is scored by the [`methanol-apc` OpenEnv environment](https://huggingface.co/spaces/glitchfilter/methanol-apc-env).

- **Model on Hugging Face:** [glitchfilter/methanol-apc](https://huggingface.co/glitchfilter/methanol-apc)
- **Environment:** [glitchfilter/methanol-apc-env (HF Space)](https://huggingface.co/spaces/glitchfilter/methanol-apc-env) · [Bhavneet1492/openenv-methanol-apc (GitHub)](https://github.com/Bhavneet1492/openenv-methanol-apc)
- **Base model:** [unsloth/Qwen2.5-3B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct-bnb-4bit)

## Quick start

```python
import torch
from unsloth import FastLanguageModel
from peft import PeftModel

# Load the 4-bit base model, then attach the LoRA adapter.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(model, "glitchfilter/methanol-apc")
FastLanguageModel.for_inference(model)

system_prompt = (
    "You are an AI controller for a methanol synthesis reactor. "
    "Output a JSON control action with fields: "
    '{"feed_rate_h2": <0-10>, "feed_rate_co": <0-5>, '
    '"cooling_water_flow": <0-100>, "compressor_power": <0-100>}.'
)
sensors = "T=248.3°C P=85.0bar H2=4.50mol/s CO=2.20mol/s ratio=2.05 cool=55L/min cat_health=98%"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Current sensor readings:\n{sensors}\n\nProvide control action as JSON:"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.3,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Training procedure

Trained with **GRPO**, accelerated by Unsloth's 4-bit quantized base model and LoRA adapters.

**Pipeline:** `LLM generates JSON action` → `reward fn parses & scores` → `env.step()` → `multi-component reward` → `GRPO update`.

### Key design choices

- **Curriculum learning** over three task types:
  - `startup` (40%, easy): ramp the reactor to operating temperature
  - `optimization` (35%, medium): maximize profit at steady state
  - `disturbance_rejection` (25%, hard): handle cooling system failures
- **Multi-component reward** combining:
  1. Physics reward from `env.step` (× 0.55)
  2. Format-compliance bonus for valid JSON actions (+0.10)
  3. Action-quality score grounded in stoichiometry and cooling adequacy ([−0.30, +0.20])
  4. 3-step lookahead penalty to surface delayed thermal-runaway consequences ([−0.20, 0])
- **Deterministic replay**: each prompt stores `(task, seed, num_warmup)`, so all completions in a GRPO group are evaluated against an identical environment state.

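
The reward components listed above can be sketched as follows. This is a minimal illustration, not the training notebook's code: the helper names `parse_action` and `combine_reward` are hypothetical, the action bounds are copied from the quick-start system prompt, and it assumes that the components are summed and that an invalid action simply earns no format bonus. The actual reward function may differ on both points.

```python
import json
import re

# Bounds copied from the system prompt in the quick start.
ACTION_BOUNDS = {
    "feed_rate_h2": (0.0, 10.0),
    "feed_rate_co": (0.0, 5.0),
    "cooling_water_flow": (0.0, 100.0),
    "compressor_power": (0.0, 100.0),
}


def parse_action(completion):
    """Return the action dict if the completion holds a valid JSON action, else None."""
    match = re.search(r"\{.*\}", completion, flags=re.DOTALL)
    if match is None:
        return None
    try:
        action = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    for key, (low, high) in ACTION_BOUNDS.items():
        value = action.get(key)
        # Reject missing, non-numeric, or out-of-range fields.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not low <= value <= high:
            return None
    return action


def clip(x, low, high):
    return max(low, min(high, x))


def combine_reward(physics, action_quality, lookahead, valid_format):
    """Sum the four components (assumed additive) with the documented weights and ranges."""
    total = 0.55 * physics                      # physics reward from env.step
    total += 0.10 if valid_format else 0.0      # format-compliance bonus
    total += clip(action_quality, -0.30, 0.20)  # action-quality score
    total += clip(lookahead, -0.20, 0.0)        # 3-step lookahead penalty
    return total
```

Under these assumptions, a completion would first go through `parse_action`, and `valid_format` would be `parse_action(completion) is not None`; the physics, quality, and lookahead terms would come from the environment and are passed in here as plain numbers.
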
### Hyperparameters

| Hyperparameter | Value |
|---|---|
| Base model | `unsloth/Qwen2.5-3B-Instruct-bnb-4bit` (4-bit) |
| LoRA `r` / `alpha` / dropout | 16 / 32 / 0 |
| LoRA target modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| Max sequence length | 2048 |
| Max completion length | 120 tokens |
| Train steps | 200 |
| Per-device batch × grad accum | 2 × 4 |
| GRPO group size (`num_generations`) | 8 |
| Learning rate | 5e-6 |
| Warmup ratio | 0.05 |
| Max grad norm | 1.0 |
| Sampling temperature | 0.7 |
| KL coefficient | 0.05 |
| Precision | fp16 (bf16 where supported) |
| Gradient checkpointing | Unsloth |
| Prompt dataset size | 300 |

### Framework versions

- PEFT 0.18.1
- Unsloth (`git+https://github.com/unslothai/unsloth.git`)
- TRL ≥ 0.15
- `openenv-core[core]` ≥ 0.2.2

## Evaluation

The trained agent is compared against a random-action baseline on the `optimization` task (5 episodes × 15 steps). Plots are produced by the training notebook and saved to [plots/](plots/):

| Plot | File |
|---|---|
| Training loss | [plots/loss_curve.png](plots/loss_curve.png) |
| Reward per step (trained) | [plots/reward_curve.png](plots/reward_curve.png) |
| Baseline vs trained | [plots/baseline_vs_trained.png](plots/baseline_vs_trained.png) |

## Intended use & limitations

This adapter is a **research artifact** demonstrating GRPO-based fine-tuning for closed-loop chemical-process control on a *simulated* environment. It is **not** suitable for, and must not be deployed against, any real industrial reactor or safety-critical system. The simulator is a simplified model of methanol synthesis (ICI low-pressure process, Cu/ZnO/Al₂O₃ catalyst) and does not capture the full dynamics, instrumentation, or failure modes of a physical plant.

## Citations

GRPO:

```bibtex
@article{shao2024deepseekmath,
  title  = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
  author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
  year   = {2024},
  eprint = {arXiv:2402.03300}
}
```

Unsloth:

```bibtex
@software{unsloth2024,
  title  = {{Unsloth: 2x faster, 50\% less memory LLM finetuning}},
  author = {Daniel Han and Michael Han and {Unsloth team}},
  url    = {https://github.com/unslothai/unsloth},
  year   = {2024}
}
```