--- license: mit library_name: peft base_model: Qwen/Qwen2.5-0.5B-Instruct tags: - trl - grpo - reinforcement-learning - epistemic-agency - hackathon --- # 🛡️ Epistemic Agent v2 - Autonomy Calibration Hub This model is a **Calibrated Epistemic Agent** trained specifically for the **OpenEnv India Hackathon 2026**. It was fine-tuned using **Group Relative Policy Optimization (GRPO)** to master the balance between autonomous action and information gathering. ## 🧠 Model Description Unlike typical LLMs that "hallucinate" or guess when faced with ambiguous instructions, this agent has been trained to use the `INVESTIGATE` action when it detects uncertainty. - **Objective**: Learn when to take direct autonomous action vs. when to pause and gather forensics. - **Algorithm**: GRPO (Group Relative Policy Optimization) - **Base Model**: Qwen2.5-0.5B-Instruct - **Task Alignment**: Autonomy Calibration Benchmark (OpenEnv) ## 📊 Training Performance The agent was trained on high-ambiguity scenarios across three domains: **Email Triage, DevOps Incidents, and Financial Requests.** | Benchmark | Blind Baseline | Calibrated Agent (Ours) | Improvement | |-----------|----------------|--------------------------|-------------| | Email Triage | 0.378 | **0.798** | +42.0% | | DevOps Incident | 0.572 | **0.939** | +36.7% | | Financial Request | 0.773 | **0.990** | +21.7% | ### Key Behavioral Signal: The model demonstrates an **Investigation Rate of 100%** on ambiguous signals, effectively resolving partial observability before committing to high-stakes decisions. ## 🛠️ Training Procedure - **Steps**: 100 - **Group Size (G)**: 8 generations per prompt - **Reward Range**: (0.01, 0.99) - Strictly OpenEnv compliant. - **Penalty Logic**: Severe negative rewards (-0.90) for "Act" decisions on Ambiguous states. ## 🚀 How to Use This model is designed to be used in conjunction with the [Autonomy Calibration Benchmark](https://huggingface.co/spaces/JOY0021/autonomy-calibration-benchmark). ```python from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel base_model = "Qwen/Qwen2.5-0.5B-Instruct" adapter = "JOY0021/autonomy-grpo-agent-v2" tokenizer = AutoTokenizer.from_pretrained(base_model) model = AutoModelForCausalLM.from_pretrained(base_model) model = PeftModel.from_pretrained(model, adapter)