# How We Fine-Tuned a 0.5B Model to Handle Customer Support Tickets Using GRPO

**OpenEnv × Scalar Hackathon**

---

## The Goal

Fine-tune `Qwen2.5-0.5B-Instruct`, a tiny 0.5-billion-parameter model, to act as a customer support agent using reinforcement learning. This is actual RL: the model takes actions, gets scored, and learns purely from the reward signal.

The environment: a live API that presents support tickets and grades the agent's responses across three progressively harder tasks. The challenge was making a model this small learn multi-step reasoning well enough to handle real customer scenarios.

---

## The Environment

The [Support Ticket Environment](https://algocore-support-ticket-env.hf.space) presents tickets from five categories (billing, account, technical, refund, and general) and asks the agent to respond with a structured JSON action:

```json
{"action_type": "classify", "category": "billing"}
{"action_type": "reply",    "reply_text": "We've processed your refund..."}
{"action_type": "escalate", "reply_text": "Escalating to engineering."}
{"action_type": "close",    "reply_text": "Closing ticket."}
```
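To make the action schema concrete, here is a minimal validation sketch. The `parse_action` helper is hypothetical, and so are its rules that `classify` needs a `category` and every other action needs a `reply_text`. These are inferred from the four examples above, not taken from the environment's source.

```python
import json

# Values taken from the examples and category list in this post.
VALID_ACTIONS = ("classify", "reply", "escalate", "close")
VALID_CATEGORIES = ("billing", "account", "technical", "refund", "general")


def parse_action(text):
    """Return the parsed action dict, or None if it does not fit the schema.

    Assumption: 'classify' requires a known category and the other
    action types require a string 'reply_text'.
    """
    try:
        action = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    if action.get("action_type") not in VALID_ACTIONS:
        return None
    if action["action_type"] == "classify":
        if action.get("category") not in VALID_CATEGORIES:
            return None
    elif not isinstance(action.get("reply_text"), str):
        return None
    return action
```

A model completion that fails this check can be scored as a format error before it is ever sent to the environment.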

There are three tasks of increasing difficulty. Each returns a reward between 0.0 and 1.0, and none of the training signal comes from human labels:

| Task | What it tests |
|------|--------------|
| **Task 1: Classify** | Classify the ticket into the correct category |
| **Task 2: Classify + Act** | Classify first, then take the correct action (reply / escalate / close) |
| **Task 3: Full Resolution** | Classify, take the correct action, and write a quality reply addressing the specific issue |

---

## The Algorithm: GRPO

We used **GRPO** (Group Relative Policy Optimization) via `trl.GRPOTrainer`. It is a modern RL algorithm designed specifically for language models:

- Generates a **group** of completions for each prompt
- Scores all completions with the reward function
- Normalises advantages **within the group**, so no separate value network is needed
- Applies PPO-style clipped-ratio updates, with a KL penalty against a frozen reference model

Dropping the value network makes GRPO significantly more memory-efficient than standard PPO for LLMs, and training stayed stable in our runs. A 0.5B model + LoRA + GRPO fits comfortably on a single T4 GPU.
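The group-normalisation step can be sketched in a few lines of plain Python. This illustrates the idea only and is not the `trl` implementation: the `group_advantages` name, the use of the population standard deviation, and the `eps` value are all assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one group of completions for a single prompt.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation. eps avoids division by
    zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; libraries may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, G = 4 completions scored by the reward function.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get a positive advantage and the rest get a negative one. When all four completions score the same, every advantage is zero and that prompt contributes no policy gradient.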

**Key hyperparameters:**

```
Group size G  =  4
KL beta       =  0.04
LoRA rank     =  16   (applied across all attention and MLP projection layers)
Framework     =  HuggingFace transformers + peft  (no Unsloth)
```

---

## Reward Engineering: Where the Real Work Happened

Before training, we did a thorough audit of the reward functions and built a precise, well-calibrated scoring system. This is what made the training signal reliable.

### Diverse ticket bank: 50 tickets across all categories

We built a rich training set of 50 realistic support tickets spanning all five categories with varied scenarios, specific resolution hints, and a range of correct actions. This gave the model genuine breadth to learn from rather than a narrow set of patterns.

### Perfectly calibrated reward weights

The Task 3 reward is a weighted sum across four components, tuned to sum to exactly 1.00:

```
0.20 (classify) + 0.40 (action) + 0.25 (reply quality) + 0.15 (efficiency) = 1.00
```
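As a sketch of how such a weighted sum could be computed, assuming each component is itself a score in [0, 1] (the `task3_reward` function and the component names are hypothetical, only the weights come from the formula above):

```python
# Weights from the formula above; the dictionary keys are illustrative names.
TASK3_WEIGHTS = {
    "classify": 0.20,
    "action": 0.40,
    "reply_quality": 0.25,
    "efficiency": 0.15,
}


def task3_reward(components):
    """Weighted sum of per-component scores, each clamped to [0, 1].

    Missing components count as 0.0. The clamp is an assumption that
    keeps the total inside [0, 1] even if a component misbehaves.
    """
    total = 0.0
    for name, weight in TASK3_WEIGHTS.items():
        score = min(max(components.get(name, 0.0), 0.0), 1.0)
        total += weight * score
    return min(total, 1.0)
```

With these weights, a correct action alone is worth twice as much as a correct classification, so the reward favours doing the right thing over labelling the ticket correctly.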

Every reward signal is clean, bounded, and meaningful.

### Two-tier reply quality scoring

Reply quality uses a two-tier keyword system:

```
Category keywords (broad relevance)         →  +0.03 each
Resolution hint keywords (ticket-specific)  →  +0.05 each
```

A generic reply scores lower than a reply that directly addresses the customer's specific issue, which is exactly the behaviour we want the model to learn.
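A minimal sketch of the two-tier scoring, assuming case-insensitive substring matching and a cap at 1.0. Both are assumptions, and the keyword lists below are made up for illustration:

```python
CATEGORY_BONUS = 0.03  # per matched category keyword (from the table above)
HINT_BONUS = 0.05      # per matched resolution-hint keyword


def reply_quality(reply_text, category_keywords, hint_keywords):
    """Score a reply by counting keyword hits in two tiers.

    Assumptions: matching is a case-insensitive substring test and
    the total is capped at 1.0.
    """
    text = reply_text.lower()
    category_hits = sum(1 for kw in category_keywords if kw.lower() in text)
    hint_hits = sum(1 for kw in hint_keywords if kw.lower() in text)
    return min(CATEGORY_BONUS * category_hits + HINT_BONUS * hint_hits, 1.0)


# Hypothetical keywords for a duplicate-charge billing ticket.
category_kw = ["billing", "invoice", "charge"]
hint_kw = ["duplicate", "refunded"]

generic = reply_quality("We are looking into your billing issue.", category_kw, hint_kw)
specific = reply_quality(
    "We refunded the duplicate charge on your invoice.", category_kw, hint_kw
)
```

Here the generic reply matches one category keyword, while the specific reply matches two category keywords and both hints, so it scores higher.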

### Classification credit carries forward

In Task 2, a correct classification at step 0 earns 0.3 credit that is preserved into step 1. The model is rewarded for the full quality of its decision-making across both steps, not just the final action.
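The carry-forward logic might look like the sketch below. Only the 0.3 classification credit comes from the description above. The `Task2Episode` class is hypothetical, and the 0.7 action credit is an assumption chosen so that a perfect episode totals 1.0.

```python
CLASSIFY_CREDIT = 0.3  # stated in the text
ACTION_CREDIT = 0.7    # assumption: remainder so a perfect episode totals 1.0


class Task2Episode:
    """Tracks credit across the two steps of a Task 2 episode."""

    def __init__(self, true_category, true_action):
        self.true_category = true_category
        self.true_action = true_action
        self.carried = 0.0  # classification credit preserved into step 1

    def step(self, action):
        """Return the cumulative reward after applying one action dict."""
        if action.get("action_type") == "classify":
            if action.get("category") == self.true_category:
                self.carried = CLASSIFY_CREDIT
            return self.carried
        earned = ACTION_CREDIT if action.get("action_type") == self.true_action else 0.0
        return self.carried + earned


episode = Task2Episode(true_category="billing", true_action="escalate")
step0 = episode.step({"action_type": "classify", "category": "billing"})
step1 = episode.step({"action_type": "escalate", "reply_text": "Escalating."})
```

Because the credit is stored on the episode rather than recomputed, a wrong action at step 1 still leaves the model with the reward it earned for classifying correctly.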

### Real correctness tracking in Task 3

Classification correctness is tracked precisely: the 0.20 bonus in Task 3 is only awarded when the model actually classifies correctly, which creates a genuine learning signal for accurate categorisation.

### Format reward shaping

A second reward function runs in parallel, reinforcing correct JSON structure:

```
Valid action_type                           →  +0.15 bonus
Valid action_type + valid category          →  +0.20 bonus
Invalid action_type                         →  −0.20 penalty
```

This kept the model outputting parseable, structured responses throughout training.
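A format reward along these lines could be written as follows. The function names are hypothetical, and treating unparseable JSON the same as an invalid `action_type` is an assumption. The batch wrapper assumes plain-string completions, following the usual convention that a GRPO reward function receives a list of completions and returns one float per completion.

```python
import json

VALID_ACTIONS = ("classify", "reply", "escalate", "close")
VALID_CATEGORIES = ("billing", "account", "technical", "refund", "general")


def format_reward(completion):
    """Score the JSON structure of a single completion string."""
    try:
        action = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return -0.20  # assumption: unparseable output counts as invalid
    if not isinstance(action, dict) or action.get("action_type") not in VALID_ACTIONS:
        return -0.20
    if action.get("category") in VALID_CATEGORIES:
        return 0.20
    return 0.15


def format_reward_func(completions, **kwargs):
    """Batch wrapper: one score per completion (assumes string completions)."""
    return [format_reward(c) for c in completions]
```

Keeping this separate from the task reward means a well-formed but wrong answer still earns a small positive signal, which helps early in training when most outputs are not yet valid JSON.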

---

## Training Results

Trained on 1,000 prompts for 3 epochs on Kaggle T4 ×2. Final loss: **0.0008**.

| Task | Before GRPO | After GRPO | Delta |
|------|-------------|------------|-------|
| Task 1: Classify | 0.667 | **1.000** | +0.333 |
| Task 2: Action | 0.117 | **0.450** | +0.333 |
| Task 3: Full Resolve | 0.083 | **0.258** | +0.175 |
| **Overall** | **0.289** | **0.569** | **+0.280** |

**Task 1 is fully solved:** the model classifies support tickets with perfect accuracy after training.

Task 2 nearly quadrupled its score (0.117 to 0.450), showing clear gains on both the classification and action-selection steps.

Task 3 more than tripled its score, demonstrating that even a 0.5B model can learn to write contextually relevant replies when the reward signal is precise.

Overall, the agent nearly doubled its score from 0.289 to 0.569, a **+0.280 improvement purely from GRPO with no supervised labels**.
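The figures in the table can be checked with a short script. It assumes the Overall row is the unweighted mean of the three task scores, which is consistent with the reported numbers but is not stated explicitly.

```python
# Scores copied from the results table.
before = {"task1": 0.667, "task2": 0.117, "task3": 0.083}
after = {"task1": 1.000, "task2": 0.450, "task3": 0.258}


def overall(scores):
    """Unweighted mean across tasks (assumed definition of 'Overall')."""
    return sum(scores.values()) / len(scores)


for task in before:
    delta = after[task] - before[task]
    ratio = after[task] / before[task]
    print(f"{task}: delta={delta:+.3f}, ratio={ratio:.2f}x")

print(f"overall: {overall(before):.3f} -> {overall(after):.3f}")
```

The ratios are what back the "nearly quadrupled" and "more than tripled" descriptions for Tasks 2 and 3.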

- **Training Logs:** [WandB Run](https://wandb.ai/vighneshdev1990-/support-ticket-grpo/runs/33q716zb)

---

## What Made It Work

Careful reward engineering mattered more than any hyperparameter choice. The key decisions:

- **Resolution hints in scoring:** giving the reward function ticket-specific targets meant the model got a stronger signal for replies that actually resolved the customer's issue, not just topically relevant ones.
- **Accumulated classification credit:** rewarding the full decision chain across steps encouraged coherent multi-step behaviour rather than optimising each step in isolation.
- **Diverse ticket bank:** 50 varied tickets pushed the model to generalise across real support scenarios rather than overfit to a narrow distribution.
- **Parallel reward functions:** running both the task reward and the format reward simultaneously gave the model signal on output quality and structural correctness at every step.

---

## Reproduce It

- **Model:** [AlgoCore/support-ticket-grpo-model](https://huggingface.co/AlgoCore/support-ticket-grpo-model)
- **Environment:** [algocore-support-ticket-env.hf.space](https://algocore-support-ticket-env.hf.space)
- **Notebook:** Included in the HF repo. It runs end-to-end on Kaggle T4 with `HF_TOKEN` and `WANDB_API_KEY` in Secrets.