---
license: mit
library_name: peft
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- trl
- grpo
- reinforcement-learning
- epistemic-agency
- hackathon
---

# 🛡️ Epistemic Agent v2 - Autonomy Calibration Hub

This model is a **Calibrated Epistemic Agent** trained specifically for the **OpenEnv India Hackathon 2026**.
It was fine-tuned using **Group Relative Policy Optimization (GRPO)** to master the balance between autonomous action and information gathering.

## 🧠 Model Description

Unlike typical LLMs that "hallucinate" or guess when faced with ambiguous instructions, this agent has been trained to use the `INVESTIGATE` action when it detects uncertainty.

- **Objective**: Learn when to take direct autonomous action vs. when to pause and gather forensics.
- **Algorithm**: GRPO (Group Relative Policy Optimization)
- **Base Model**: Qwen2.5-0.5B-Instruct
- **Task Alignment**: Autonomy Calibration Benchmark (OpenEnv)

## Training Performance

The agent was trained on high-ambiguity scenarios across three domains: **Email Triage, DevOps Incidents, and Financial Requests.**

| Benchmark | Blind Baseline | Calibrated Agent (Ours) | Improvement (absolute) |
|-----------|----------------|--------------------------|------------------------|
| Email Triage | 0.378 | **0.798** | +0.420 |
| DevOps Incident | 0.572 | **0.939** | +0.367 |
| Financial Request | 0.773 | **0.990** | +0.217 |
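
The improvement column corresponds to the absolute difference between the two scores (for Email Triage, 0.798 - 0.378 = 0.420); measured relative to the baseline, the gains are larger. A minimal sketch that computes both readings from the scores reported in the table (the `gains` helper is illustrative and not part of the benchmark):

```python
# Scores copied from the table above: (blind baseline, calibrated agent).
SCORES = {
    "Email Triage": (0.378, 0.798),
    "DevOps Incident": (0.572, 0.939),
    "Financial Request": (0.773, 0.990),
}


def gains(baseline: float, calibrated: float) -> tuple[float, float]:
    """Return (absolute gain in score, relative gain as a fraction of baseline)."""
    absolute = calibrated - baseline
    relative = absolute / baseline
    return absolute, relative


if __name__ == "__main__":
    for name, (base, cal) in SCORES.items():
        absolute, relative = gains(base, cal)
        print(f"{name}: +{absolute:.3f} absolute, +{relative:.1%} relative")
```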

### Key Behavioral Signal

The model demonstrates an **Investigation Rate of 100%** on ambiguous signals, resolving partial observability before it commits to high-stakes decisions.
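
The card does not define how the investigation rate is computed. One plausible reading, sketched here under that assumption, is the fraction of ambiguous episodes in which the agent's first action was `INVESTIGATE`. The `Episode` record and its field names are hypothetical and are not part of the benchmark's API.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    ambiguous: bool     # hypothetical flag: was the scenario ambiguous?
    first_action: str   # hypothetical field: first action the agent emitted


def investigation_rate(episodes: list[Episode]) -> float:
    """Fraction of ambiguous episodes whose first action was INVESTIGATE."""
    ambiguous = [e for e in episodes if e.ambiguous]
    if not ambiguous:
        return 0.0  # no ambiguous episodes: the rate is undefined, report 0.0
    investigated = sum(e.first_action == "INVESTIGATE" for e in ambiguous)
    return investigated / len(ambiguous)
```

Under this definition, unambiguous episodes are ignored, so an agent that acts directly on clear-cut cases is not penalized.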

## 🛠️ Training Procedure

- **Steps**: 100
- **Group Size (G)**: 8 generations per prompt
- **Reward Range**: (0.01, 0.99), strictly OpenEnv compliant.
- **Penalty Logic**: Severe negative rewards (-0.90) for "Act" decisions on ambiguous states.
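
GRPO scores each completion relative to the other completions sampled for the same prompt. A minimal sketch of that normalization for a group of G = 8 rewards follows. It uses the common mean and standard-deviation form of the group-relative advantage and is not taken from this model's training code. The `shaped_reward` helper is hypothetical: the card does not say how the -0.90 penalty interacts with the (0.01, 0.99) reward range, so treating it as an additive penalty that is then clamped into the range is an assumption.

```python
import statistics

REWARD_MIN, REWARD_MAX = 0.01, 0.99   # reward range from the list above
ACT_ON_AMBIGUOUS_PENALTY = -0.90      # penalty value from the list above


def shaped_reward(raw_score: float, acted_on_ambiguous: bool) -> float:
    """Hypothetical shaping: add the penalty, then clamp into the reward range."""
    if acted_on_ambiguous:
        raw_score += ACT_ON_AMBIGUOUS_PENALTY
    return min(REWARD_MAX, max(REWARD_MIN, raw_score))


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # One prompt, G = 8 sampled completions; the last two acted on an ambiguous state.
    raw = [0.9, 0.8, 0.85, 0.9, 0.7, 0.95, 0.9, 0.8]
    acted = [False] * 6 + [True] * 2
    rewards = [shaped_reward(s, a) for s, a in zip(raw, acted)]
    for r, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.2f} advantage={adv:+.2f}")
```

Because the advantage is computed within the group, a completion that acts on an ambiguous state is pushed down only relative to its siblings; if all eight completions earn the same reward, every advantage is zero and the prompt contributes no learning signal.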

## How to Use

This model is designed to be used in conjunction with the [Autonomy Calibration Benchmark](https://huggingface.co/spaces/JOY0021/autonomy-calibration-benchmark).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen2.5-0.5B-Instruct"   # base checkpoint named in the card metadata
adapter = "JOY0021/autonomy-grpo-agent-v2"  # PEFT adapter weights for this model

# Load the base tokenizer and model, then attach the adapter on top of the base model.
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter)