---
license: mit
library_name: peft
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- trl
- grpo
- reinforcement-learning
- epistemic-agency
- hackathon
---

# 🛡️ Epistemic Agent v2 - Autonomy Calibration Hub

This model is a **Calibrated Epistemic Agent** trained for the **OpenEnv India Hackathon 2026**.
It was fine-tuned with **Group Relative Policy Optimization (GRPO)** to balance autonomous action against information gathering.

## 🧠 Model Description

Unlike typical LLMs that "hallucinate" or guess when faced with ambiguous instructions, this agent has been trained to use the `INVESTIGATE` action when it detects uncertainty.

- **Objective**: Learn when to act autonomously and when to pause and gather more evidence first.
- **Algorithm**: GRPO (Group Relative Policy Optimization)
- **Base Model**: Qwen2.5-0.5B-Instruct
- **Task Alignment**: Autonomy Calibration Benchmark (OpenEnv)

## 📊 Training Performance

The agent was trained on high-ambiguity scenarios across three domains: **Email Triage, DevOps Incidents, and Financial Requests.**

| Benchmark | Blind Baseline | Calibrated Agent (Ours) | Improvement (absolute) |
|-----------|----------------|--------------------------|------------------------|
| Email Triage | 0.378 | **0.798** | +0.420 |
| DevOps Incident | 0.572 | **0.939** | +0.367 |
| Financial Request | 0.773 | **0.990** | +0.217 |
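
The gains in the table are absolute differences between the two scores, so a change from 0.378 to 0.798 is +0.420 on the score scale (42.0 points if the scores are read as percentages). That is a different number from the relative gain over the baseline. A minimal check of the arithmetic, using only the scores from the table (the helper name `gains` is illustrative):

```python
# Scores copied from the table above: (blind baseline, calibrated agent).
scores = {
    "Email Triage": (0.378, 0.798),
    "DevOps Incident": (0.572, 0.939),
    "Financial Request": (0.773, 0.990),
}


def gains(baseline, calibrated):
    """Return (absolute gain, relative gain over the baseline)."""
    absolute = calibrated - baseline
    relative = absolute / baseline
    return absolute, relative


for name, (baseline, calibrated) in scores.items():
    absolute, relative = gains(baseline, calibrated)
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1%} relative")
```

Relative gains come out larger than the absolute ones because every baseline is below 1.0; Email Triage, for example, more than doubles its baseline score.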

### Key Behavioral Signal
The model demonstrates an **Investigation Rate of 100%** on ambiguous signals, effectively resolving partial observability before committing to high-stakes decisions.
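
One way such a rate could be computed is sketched below. This is an assumption, not the benchmark's own definition: it treats the investigation rate as the fraction of ambiguous episodes in which the agent's first action is `INVESTIGATE`. The episode record fields (`ambiguous`, `first_action`) and the `ACT` label are hypothetical.

```python
def investigation_rate(episodes):
    """Fraction of ambiguous episodes whose first action is INVESTIGATE.

    `episodes` is a list of dicts with hypothetical keys:
      - "ambiguous": bool, whether the scenario was ambiguous
      - "first_action": str, the first action the agent took
    Returns None when there are no ambiguous episodes.
    """
    ambiguous = [e for e in episodes if e["ambiguous"]]
    if not ambiguous:
        return None
    investigated = sum(e["first_action"] == "INVESTIGATE" for e in ambiguous)
    return investigated / len(ambiguous)


# Made-up episode log, for illustration only.
log = [
    {"ambiguous": True, "first_action": "INVESTIGATE"},
    {"ambiguous": True, "first_action": "INVESTIGATE"},
    {"ambiguous": False, "first_action": "ACT"},
]
rate = investigation_rate(log)
print(rate)
```

Under this definition, unambiguous episodes do not affect the rate, so a 100% figure says nothing about how often the agent investigates when it does not need to.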

## 🛠️ Training Procedure

- **Steps**: 100
- **Group Size (G)**: 8 generations per prompt
- **Reward Range**: (0.01, 0.99) - Strictly OpenEnv compliant.
- **Penalty Logic**: Severe negative rewards (-0.90) for "Act" decisions on Ambiguous states.
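
The sketch below illustrates how a group of 8 rewards turns into GRPO training signal: each completion's reward is standardized against the other completions sampled for the same prompt. It is not the training code for this model. How the -0.90 penalty interacts with the (0.01, 0.99) range is not specified here, so the sketch assumes the penalty is added to a raw score that is then clamped into the range. The function names, the raw scores, and the use of the population standard deviation are all illustrative assumptions.

```python
from statistics import mean, pstdev

REWARD_MIN, REWARD_MAX = 0.01, 0.99  # reward range from the list above
AMBIGUOUS_ACT_PENALTY = -0.90        # penalty from the list above


def shaped_reward(raw_score, acted_on_ambiguous):
    """Hypothetical shaping: apply the penalty, then clamp into the range."""
    if acted_on_ambiguous:
        raw_score += AMBIGUOUS_ACT_PENALTY
    return min(REWARD_MAX, max(REWARD_MIN, raw_score))


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, G = 8 sampled completions (raw scores are made up).
raw = [0.9, 0.8, 0.85, 0.7, 0.9, 0.6, 0.95, 0.75]
acted = [False, False, True, False, False, True, False, False]

rewards = [shaped_reward(s, a) for s, a in zip(raw, acted)]
advantages = group_relative_advantages(rewards)

for r, adv in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={adv:+.2f}")
```

Because advantages are relative to the group mean, the two completions that acted on an ambiguous state receive strongly negative advantages, while the remaining completions are pushed up even though their own rewards are unchanged.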

## 🚀 How to Use

This model is designed to be used in conjunction with the [Autonomy Calibration Benchmark](https://huggingface.co/spaces/JOY0021/autonomy-calibration-benchmark).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen2.5-0.5B-Instruct"
adapter = "JOY0021/autonomy-grpo-agent-v2"

# Load the tokenizer and base model, then attach the GRPO-trained PEFT adapter.
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter)
```