---
license: mit
library_name: peft
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- trl
- grpo
- reinforcement-learning
- epistemic-agency
- hackathon
---
# 🛡️ Epistemic Agent v2 - Autonomy Calibration Hub
This model is a **Calibrated Epistemic Agent** trained specifically for the **OpenEnv India Hackathon 2026**.
It was fine-tuned using **Group Relative Policy Optimization (GRPO)** to master the balance between autonomous action and information gathering.
## 🧠 Model Description
Unlike typical LLMs that "hallucinate" or guess when faced with ambiguous instructions, this agent has been trained to use the `INVESTIGATE` action when it detects uncertainty.
- **Objective**: Learn when to take direct autonomous action vs. when to pause and gather forensics.
- **Algorithm**: GRPO (Group Relative Policy Optimization)
- **Base Model**: Qwen2.5-0.5B-Instruct
- **Task Alignment**: Autonomy Calibration Benchmark (OpenEnv)
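The act-or-investigate decision has to be read back out of the model's text output. The following is a minimal sketch of one way to do that, assuming the agent writes the action name as plain text in its completion. Only `INVESTIGATE` comes from this card; the `ACT` label, the `parse_action` helper, and the cautious fallback are hypothetical, and the benchmark's actual output format may differ.

```python
import re

# Hypothetical action vocabulary: INVESTIGATE comes from this card,
# ACT is an assumed label for direct autonomous action.
ACTIONS = ("INVESTIGATE", "ACT")


def parse_action(completion: str, default: str = "INVESTIGATE") -> str:
    """Return the first known action name found in the completion.

    Falls back to `default` (the cautious choice) when no action is found.
    """
    pattern = r"\b(" + "|".join(ACTIONS) + r")\b"
    match = re.search(pattern, completion.upper())
    return match.group(1) if match else default
```

Defaulting to `INVESTIGATE` on unparseable output is a design choice in this sketch, not something the card specifies.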
## 📊 Training Performance
The agent was trained on high-ambiguity scenarios across three domains: **Email Triage, DevOps Incidents, and Financial Requests.**
| Benchmark | Blind Baseline | Calibrated Agent (Ours) | Improvement (absolute) |
|-----------|----------------|--------------------------|------------------------|
| Email Triage | 0.378 | **0.798** | +0.420 |
| DevOps Incident | 0.572 | **0.939** | +0.367 |
| Financial Request | 0.773 | **0.990** | +0.217 |
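The improvement column in the table above corresponds to the absolute difference between the two scores, not a relative percentage change. A short sketch that recomputes both from the table values (the `improvement` helper is illustrative only):

```python
# Scores copied from the table above: (blind baseline, calibrated agent).
scores = {
    "Email Triage": (0.378, 0.798),
    "DevOps Incident": (0.572, 0.939),
    "Financial Request": (0.773, 0.990),
}


def improvement(baseline: float, calibrated: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent)."""
    absolute = calibrated - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


for name, (baseline, calibrated) in scores.items():
    absolute, relative = improvement(baseline, calibrated)
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```

Measured relative to the baseline, the gains are larger than the absolute figures suggest: roughly 111% for Email Triage, 64% for DevOps Incident, and 28% for Financial Request.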
### Key Behavioral Signal
The model demonstrates an **Investigation Rate of 100%** on ambiguous signals, effectively resolving partial observability before committing to high-stakes decisions.
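As a rough illustration of how such a rate could be computed, the sketch below counts the share of ambiguous episodes in which the agent chose `INVESTIGATE`. The episode schema (`ambiguous` and `action` keys) and the `investigation_rate` helper are assumptions made for this example, not the benchmark's actual logging format.

```python
def investigation_rate(episodes: list[dict]) -> float:
    """Fraction of ambiguous episodes in which the chosen action was INVESTIGATE."""
    ambiguous = [e for e in episodes if e["ambiguous"]]
    if not ambiguous:
        return 0.0
    investigated = sum(e["action"] == "INVESTIGATE" for e in ambiguous)
    return investigated / len(ambiguous)


# Hypothetical evaluation log; non-ambiguous episodes are ignored by the metric.
episodes = [
    {"ambiguous": True, "action": "INVESTIGATE"},
    {"ambiguous": True, "action": "INVESTIGATE"},
    {"ambiguous": False, "action": "ACT"},
]
rate = investigation_rate(episodes)
```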
## 🛠️ Training Procedure
- **Steps**: 100
- **Group Size (G)**: 8 generations per prompt
- **Reward Range**: (0.01, 0.99), strictly OpenEnv compliant.
- **Penalty Logic**: Severe negative reward (-0.90) for "Act" decisions on ambiguous states.
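GRPO scores each completion relative to the other completions sampled for the same prompt rather than against a learned value function. The sketch below shows that group-relative normalization for one group of G = 8 rewards. The reward values are made up for illustration, and the population standard deviation and `eps` term are choices of this sketch; the actual training code may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One hypothetical group of G = 8 completions for the same prompt.
rewards = [0.99, 0.01, 0.99, 0.99, 0.01, 0.99, 0.99, 0.99]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean get a negative one. A group in which all eight rewards are equal yields advantages of zero and contributes no learning signal.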
## 🚀 How to Use
This model is designed to be used in conjunction with the [Autonomy Calibration Benchmark](https://huggingface.co/spaces/JOY0021/autonomy-calibration-benchmark).
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen2.5-0.5B-Instruct"
adapter = "JOY0021/autonomy-grpo-agent-v2"

# Load the base model and tokenizer, then attach the GRPO-trained PEFT adapter.
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter)
```