# How We Fine-Tuned a 0.5B Model to Handle Customer Support Tickets Using GRPO
**OpenEnv × Scalar Hackathon**
---
## The Goal
Fine-tune `Qwen2.5-0.5B-Instruct`, a tiny 0.5-billion-parameter model, to act as a customer support agent using reinforcement learning. This is actual RL: the model takes actions, gets scored, and learns purely from the reward signal.
The environment is a live API that presents support tickets and grades the agent's responses across three progressively harder tasks. The challenge was getting a model this small to learn multi-step reasoning well enough to handle real customer scenarios.
---
## The Environment
The [Support Ticket Environment](https://algocore-support-ticket-env.hf.space) presents tickets from five categories (billing, account, technical, refund, and general) and asks the agent to respond with a structured JSON action:
```json
{"action_type": "classify", "category": "billing"}
{"action_type": "reply", "reply_text": "We've processed your refund..."}
{"action_type": "escalate", "reply_text": "Escalating to engineering."}
{"action_type": "close", "reply_text": "Closing ticket."}
```
There are three tasks of increasing difficulty. Each returns a reward between 0.0 and 1.0, and that reward is the only training signal: no human-labelled target outputs are used.
| Task | What it tests |
|------|--------------|
| **Task 1: Classify** | Classify the ticket into the correct category |
| **Task 2: Classify + Act** | Classify first, then take the correct action (reply / escalate / close) |
| **Task 3: Full Resolution** | Classify, take the correct action, and write a quality reply addressing the specific issue |
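A minimal sketch of how a raw completion might be validated against the action format above. The rule that `classify` needs a `category` while the other actions need a `reply_text` is inferred from the four example actions, not confirmed by the environment, and `parse_action` is a hypothetical helper name.

```python
import json

# Values taken from the post; tuples keep membership tests safe for any JSON value.
VALID_ACTIONS = ("classify", "reply", "escalate", "close")
VALID_CATEGORIES = ("billing", "account", "technical", "refund", "general")


def parse_action(completion):
    """Return the action dict if `completion` is a well-formed action, else None."""
    try:
        action = json.loads(completion.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    action_type = action.get("action_type")
    if action_type not in VALID_ACTIONS:
        return None
    if action_type == "classify":
        # Assumption: a classify action must name one of the five categories.
        if action.get("category") not in VALID_CATEGORIES:
            return None
    elif not isinstance(action.get("reply_text"), str):
        # Assumption: reply / escalate / close carry a text field.
        return None
    return action
```

Anything that fails validation comes back as `None`, which a reward function can then treat as a malformed output.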
---
## The Algorithm: GRPO
We used **GRPO** (Group Relative Policy Optimization), an RL algorithm designed specifically for language models, via `trl.GRPOTrainer`. The algorithm:
- Generates a **group** of completions for each prompt
- Scores all completions with the reward function
- Normalises advantages **within the group**, so no separate value network is needed
- Applies PPO-style clipped-ratio updates, with a KL penalty against a frozen reference model
Because there is no value network to store and update, GRPO is significantly more memory-efficient than standard PPO for LLMs, while still training stably. A 0.5B model + LoRA + GRPO fits comfortably on a single T4 GPU.
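The group-normalisation step in the list above can be illustrated with a few lines of plain Python. This is a sketch of the idea, not TRL's code: the population standard deviation and the `eps` value are choices made here, and the rewards are made up.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation may use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, G = 4 sampled completions, each scored by the reward function.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_advantages(rewards)
```

Completions that beat the group mean get a positive advantage and the rest get a negative one. If every completion in a group scores the same, all advantages are zero and that prompt contributes no gradient.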
**Key hyperparameters:**
```
Group size G = 4
KL beta = 0.04
LoRA rank = 16 (applied across all attention and MLP projection layers)
Framework = HuggingFace transformers + peft (no Unsloth)
```
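As a rough guide, these settings might map onto `peft.LoraConfig` and `trl.GRPOConfig` keyword arguments as sketched below. The module names are the usual Qwen2-style projection names and are an assumption, since the post does not list them. The dictionaries are deliberately plain Python, so nothing here depends on a particular library version.

```python
# Assumed to be passed as peft.LoraConfig(**LORA_KWARGS).
LORA_KWARGS = {
    "r": 16,  # LoRA rank from the post
    "target_modules": [
        # attention projections (assumed names)
        "q_proj", "k_proj", "v_proj", "o_proj",
        # MLP projections (assumed names)
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",
}

# Assumed to be passed as trl.GRPOConfig(**GRPO_KWARGS), alongside the usual
# training arguments (output directory, batch size, learning rate, ...).
GRPO_KWARGS = {
    "num_generations": 4,  # group size G
    "beta": 0.04,          # KL penalty coefficient
}
```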
---
## Reward Engineering: Where the Real Work Happened
Before training, we did a thorough audit of the reward functions and built a precise, well-calibrated scoring system. This is what made the training signal reliable.
### Diverse ticket bank: 50 tickets across all categories
We built a rich training set of 50 realistic support tickets spanning all five categories with varied scenarios, specific resolution hints, and a range of correct actions. This gave the model genuine breadth to learn from rather than a narrow set of patterns.
### Perfectly calibrated reward weights
The Task 3 reward is a weighted sum across four components, tuned to sum to exactly 1.00:
```
0.20 (classify) + 0.40 (action) + 0.25 (reply quality) + 0.15 (efficiency) = 1.00
```
Because the weights sum to 1.00, the Task 3 reward stays bounded in [0, 1] as long as each component score is itself in [0, 1].
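A minimal sketch of the weighted sum, assuming each component is scored in [0, 1] before weighting. How the environment actually computes the individual component scores is not shown in this post, so the `components` dictionary and `task3_reward` are hypothetical.

```python
# Weights from the post.
TASK3_WEIGHTS = {
    "classify": 0.20,
    "action": 0.40,
    "reply_quality": 0.25,
    "efficiency": 0.15,
}


def task3_reward(components):
    """Weighted sum of component scores, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in TASK3_WEIGHTS.items():
        score = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * score
    return total


# Correct classification and action, but no credit for reply or efficiency.
partial = task3_reward({"classify": 1.0, "action": 1.0})  # about 0.60
```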
### Two-tier reply quality scoring
Reply quality uses a two-tier keyword system:
```
Category keywords (broad relevance)        → +0.03 each
Resolution hint keywords (ticket-specific) → +0.05 each
```
A generic reply scores lower than a reply that directly addresses the customer's specific issue, which is exactly the behaviour we want the model to learn.
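The two tiers could be scored roughly as follows. Case-insensitive substring matching is an assumption, as are the example keywords and replies; the post only gives the per-keyword increments.

```python
def reply_quality(reply_text, category_keywords, hint_keywords):
    """Raw two-tier keyword score: +0.03 per category hit, +0.05 per hint hit."""
    text = reply_text.lower()
    category_hits = sum(1 for kw in category_keywords if kw.lower() in text)
    hint_hits = sum(1 for kw in hint_keywords if kw.lower() in text)
    return 0.03 * category_hits + 0.05 * hint_hits


# Hypothetical keywords for a refund ticket about a duplicate charge.
category_keywords = ["refund", "payment"]
hint_keywords = ["duplicate charge", "5-7 business days"]

generic = reply_quality(
    "We are looking into your refund.", category_keywords, hint_keywords
)
specific = reply_quality(
    "We reversed the duplicate charge; your refund arrives in 5-7 business days.",
    category_keywords,
    hint_keywords,
)
```

Here the generic reply matches one category keyword, while the specific reply matches one category keyword and both resolution hints, so it scores higher.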
### Classification credit carries forward
In Task 2, a correct classification at step 0 earns 0.3 credit that is preserved into step 1. The model is rewarded for the full quality of its decision-making across both steps, not just the final action.
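A sketch of how the carried credit might combine with the action step. The 0.3 classification credit is from the description above; the 0.7 for a correct action is purely an assumption chosen so that a perfect episode totals 1.0, and `task2_reward` is a hypothetical name.

```python
CLASSIFY_CREDIT = 0.3  # from the post
ACTION_CREDIT = 0.7    # assumption: the remainder of a 1.0 maximum


def task2_reward(predicted_category, true_category, predicted_action, true_action):
    """Episode reward: classification credit from step 0 plus action credit from step 1."""
    reward = 0.0
    if predicted_category == true_category:
        reward += CLASSIFY_CREDIT  # earned at step 0 and kept regardless of step 1
    if predicted_action == true_action:
        reward += ACTION_CREDIT
    return reward
```

Under this sketch, an episode with the right category but the wrong action still ends with 0.3 rather than 0.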
### Real correctness tracking in Task 3
Classification correctness is tracked precisely: the 0.20 bonus in Task 3 is awarded only when the model actually classifies correctly, creating a genuine learning signal for accurate categorisation.
### Format reward shaping
A second reward function runs in parallel, reinforcing correct JSON structure:
```
Valid action_type                  → +0.15 bonus
Valid action_type + valid category → +0.20 bonus
Invalid action_type                → -0.20 penalty
```
This kept the model outputting parseable, structured responses throughout training.
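A minimal sketch of a format reward with these three outcomes. Treating unparseable output the same as an invalid `action_type` is an assumption. The batched wrapper follows the convention, as we understand it, that TRL reward functions receive a list of completions and return a list of floats.

```python
import json

VALID_ACTIONS = ("classify", "reply", "escalate", "close")
VALID_CATEGORIES = ("billing", "account", "technical", "refund", "general")


def format_reward(completion):
    """Score the structure of one completion, independent of task correctness."""
    try:
        action = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return -0.20  # assumption: unparseable output gets the penalty
    if not isinstance(action, dict) or action.get("action_type") not in VALID_ACTIONS:
        return -0.20
    if action.get("category") in VALID_CATEGORIES:
        return 0.20
    return 0.15


def format_reward_func(completions, **kwargs):
    """Batched wrapper: one float per completion string."""
    return [format_reward(c) for c in completions]
```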
---
## Training Results
Trained on 1,000 prompts for 3 epochs on Kaggle T4 ×2. Final loss: **0.0008**.
| Task | Before GRPO | After GRPO | Delta |
|------|-------------|------------|-------|
| Task 1: Classify | 0.667 | **1.000** | +0.333 |
| Task 2: Action | 0.117 | **0.450** | +0.333 |
| Task 3: Full Resolve | 0.083 | **0.258** | +0.175 |
| **Overall** | **0.289** | **0.569** | **+0.280** |
**Task 1 is fully solved**: the model scores a perfect 1.000 on ticket classification after training.
Task 2 nearly quadrupled its score (0.117 to 0.450), showing the model learned to chain the classification and action selection steps.
Task 3 more than tripled its score, demonstrating that even a 0.5B model can learn to write contextually relevant replies when the reward signal is precise.
Overall, the agent nearly doubled its score from 0.289 to 0.569, a **+0.280 improvement purely from GRPO with no supervised labels**.
- **Training Logs:** [WandB Run](https://wandb.ai/vighneshdev1990-/support-ticket-grpo/runs/33q716zb)
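The Overall row in the table above is consistent with an unweighted mean of the three task scores. That reading is an inference from the numbers rather than something the environment documents. A quick check:

```python
# Scores copied from the results table.
before = {"task1": 0.667, "task2": 0.117, "task3": 0.083}
after = {"task1": 1.000, "task2": 0.450, "task3": 0.258}


def overall(scores):
    """Unweighted mean of the per-task scores, rounded to three decimals."""
    return round(sum(scores.values()) / len(scores), 3)


deltas = {task: round(after[task] - before[task], 3) for task in before}
overall_before = overall(before)
overall_after = overall(after)
overall_delta = round(overall_after - overall_before, 3)
```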
---
## What Made It Work
Careful reward engineering mattered more than any hyperparameter choice. The key decisions:
- **Resolution hints in scoring:** giving the reward function ticket-specific targets meant the model got a stronger signal for replies that actually resolved the customer's issue, not just topically relevant ones.
- **Accumulated classification credit:** rewarding the full decision chain across steps encouraged coherent multi-step behaviour rather than optimising each step in isolation.
- **Diverse ticket bank:** 50 varied tickets pushed the model to generalise across real support scenarios rather than overfit to a narrow distribution.
- **Parallel reward functions:** running both the task reward and the format reward simultaneously gave the model signal on output quality and structural correctness at every step.
---
## Reproduce It
- **Model:** [AlgoCore/support-ticket-grpo-model](https://huggingface.co/AlgoCore/support-ticket-grpo-model)
- **Environment:** [algocore-support-ticket-env.hf.space](https://algocore-support-ticket-env.hf.space)
- **Notebook:** Included in the HF repo; runs end-to-end on Kaggle T4 with `HF_TOKEN` and `WANDB_API_KEY` in Secrets.