	---
	license: mit
	library_name: peft
	base_model: Qwen/Qwen2.5-0.5B-Instruct
	tags:
	- trl
	- grpo
	- reinforcement-learning
	- epistemic-agency
	- hackathon
	---

	# 🛡️ Epistemic Agent v2 - Autonomy Calibration Hub

	This repository contains a PEFT adapter for Qwen/Qwen2.5-0.5B-Instruct, trained as a calibrated epistemic agent for the OpenEnv India Hackathon 2026.
	It was fine-tuned with Group Relative Policy Optimization (GRPO) to balance autonomous action against information gathering.

	## 🧠 Model Description

	Unlike typical LLMs that "hallucinate" or guess when faced with ambiguous instructions, this agent has been trained to use the `INVESTIGATE` action when it detects uncertainty.

	- Objective: Learn when to take direct autonomous action vs. when to pause and gather forensics.
	- Algorithm: GRPO (Group Relative Policy Optimization)
	- Base Model: Qwen2.5-0.5B-Instruct
	- Task Alignment: Autonomy Calibration Benchmark (OpenEnv)

	## 📊 Training Performance

	The agent was trained on high-ambiguity scenarios across three domains: Email Triage, DevOps Incidents, and Financial Requests.

	| Benchmark | Blind Baseline | Calibrated Agent (Ours) | Improvement (absolute) |
	|-----------|----------------|--------------------------|------------------------|
	| Email Triage | 0.378 | 0.798 | +0.420 |
	| DevOps Incident | 0.572 | 0.939 | +0.367 |
	| Financial Request | 0.773 | 0.990 | +0.217 |
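The improvement figures in the table match the absolute difference between the two scores (for example, 0.798 - 0.378 = 0.420), not a relative change. The short sketch below recomputes both quantities from the table values so the two readings can be compared. The helper name `gains` is illustrative and not part of the benchmark.

```python
# Scores copied from the table above: (blind baseline, calibrated agent).
scores = {
    "Email Triage": (0.378, 0.798),
    "DevOps Incident": (0.572, 0.939),
    "Financial Request": (0.773, 0.990),
}


def gains(baseline, calibrated):
    """Return (absolute gain, relative gain), each rounded to 3 decimals."""
    absolute = calibrated - baseline
    relative = absolute / baseline
    return round(absolute, 3), round(relative, 3)


results = {name: gains(b, c) for name, (b, c) in scores.items()}

for name, (absolute, relative) in results.items():
    print(f"{name}: absolute +{absolute:.3f}, relative +{relative:.1%}")
```

Read as a relative change, the Email Triage gain would be roughly 111% rather than 42%, so the table column is best interpreted as an absolute score difference.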

	### Key Behavioral Signal
	The reported Investigation Rate is 100% on ambiguous signals: the agent chooses `INVESTIGATE` to resolve partial observability before committing to a high-stakes decision.
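One way to compute an investigation rate like the one quoted above is the fraction of ambiguous episodes in which the agent's first action is `INVESTIGATE`. The sketch below assumes a hypothetical episode record with an `ambiguous` flag and a `first_action` string. Neither the record layout nor the `"ACT"` label comes from the benchmark itself.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical log record; the benchmark's real format may differ.
    ambiguous: bool      # whether the scenario was labeled ambiguous
    first_action: str    # e.g. "INVESTIGATE" or "ACT" (assumed labels)


def investigation_rate(episodes):
    """Share of ambiguous episodes whose first action is INVESTIGATE."""
    ambiguous = [e for e in episodes if e.ambiguous]
    if not ambiguous:
        return None  # undefined when there are no ambiguous episodes
    investigated = sum(e.first_action == "INVESTIGATE" for e in ambiguous)
    return investigated / len(ambiguous)


episodes = [
    Episode(ambiguous=True, first_action="INVESTIGATE"),
    Episode(ambiguous=True, first_action="INVESTIGATE"),
    Episode(ambiguous=False, first_action="ACT"),
]
rate = investigation_rate(episodes)
print(rate)
```

A rate computed this way says nothing about unambiguous inputs. An agent that always investigates would also score 100%, so it is worth reading the figure alongside the task scores in the table.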

	## 🛠️ Training Procedure

	- Steps: 100
	- Group Size (G): 8 generations per prompt
	- Reward Range: (0.01, 0.99) - Strictly OpenEnv compliant.
	- Penalty Logic: Severe negative rewards (-0.90) for "Act" decisions on Ambiguous states.
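The settings above can be illustrated with a minimal sketch of reward shaping followed by GRPO's group-relative advantage, where each sampled completion is scored against the mean and standard deviation of its own group. Several details here are assumptions rather than the author's implementation: that the -0.90 penalty is added to a raw task score before clamping into (0.01, 0.99), the `"ACT"` action label, the function names, and the use of the population standard deviation (libraries differ on this and on the epsilon term).

```python
import statistics

REWARD_MIN, REWARD_MAX = 0.01, 0.99   # reward range from the card
AMBIGUOUS_ACT_PENALTY = -0.90         # penalty from the card
GROUP_SIZE = 8                        # generations per prompt (G)


def shaped_reward(task_score, action, ambiguous):
    """Assumed shaping: penalize acting on ambiguous states, then clamp."""
    raw = task_score
    if ambiguous and action == "ACT":
        raw += AMBIGUOUS_ACT_PENALTY
    return min(REWARD_MAX, max(REWARD_MIN, raw))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Eight hypothetical completions for one ambiguous prompt: (action, task score).
group = [
    ("INVESTIGATE", 0.90), ("ACT", 0.80), ("INVESTIGATE", 0.70), ("ACT", 0.60),
    ("INVESTIGATE", 0.90), ("INVESTIGATE", 0.80), ("ACT", 0.90), ("INVESTIGATE", 0.95),
]
assert len(group) == GROUP_SIZE

rewards = [shaped_reward(score, action, ambiguous=True) for action, score in group]
advantages = group_relative_advantages(rewards)

for (action, _), reward, adv in zip(group, rewards, advantages):
    print(f"{action:<12} reward={reward:.2f} advantage={adv:+.2f}")
```

Under these assumptions every `ACT` completion in the group lands at the reward floor and receives a negative advantage, so the policy update shifts probability toward `INVESTIGATE` on ambiguous prompts.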

	## 🚀 How to Use

	This model is designed to be used in conjunction with the [Autonomy Calibration Benchmark](https://huggingface.co/spaces/JOY0021/autonomy-calibration-benchmark).

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	base_model = "Qwen/Qwen2.5-0.5B-Instruct"
	adapter = "JOY0021/autonomy-grpo-agent-v2"

	# Load the tokenizer and base model, then attach the PEFT adapter weights.
	tokenizer = AutoTokenizer.from_pretrained(base_model)
	model = AutoModelForCausalLM.from_pretrained(base_model)
	model = PeftModel.from_pretrained(model, adapter)