Vighnesh

	# How We Fine-Tuned a 0.5B Model to Handle Customer Support Tickets Using GRPO

	OpenEnv × Scalar Hackathon

	---

	## The Goal

	Fine-tune `Qwen2.5-0.5B-Instruct` — a tiny 0.5 billion parameter model — to act as a customer support agent using reinforcement learning. Actual RL: the model takes actions, gets scored, and learns purely from the reward signal.

	The environment: a live API that presents support tickets and grades the agent's responses across three progressively harder tasks. The challenge was making a model this small learn multi-step reasoning well enough to handle real customer scenarios.

	---

	## The Environment

	The [Support Ticket Environment](https://algocore-support-ticket-env.hf.space) presents tickets from five categories — billing, account, technical, refund, and general — and asks the agent to respond with a structured JSON action:

	```json
	{"action_type": "classify", "category": "billing"}
	{"action_type": "reply", "reply_text": "We've processed your refund..."}
	{"action_type": "escalate", "reply_text": "Escalating to engineering."}
	{"action_type": "close", "reply_text": "Closing ticket."}
	```
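A minimal sketch of how a client might validate these actions before sending them. The helper name `parse_action` and the rule that non-classify actions must carry a string `reply_text` are assumptions drawn from the four examples above, not the environment's documented schema.

```python
import json

# The four action types and five categories named in this post.
VALID_ACTIONS = {"classify", "reply", "escalate", "close"}
VALID_CATEGORIES = {"billing", "account", "technical", "refund", "general"}


def parse_action(text):
    """Return the action as a dict, or None if it is malformed (hypothetical helper)."""
    try:
        action = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(action, dict):
        return None
    kind = action.get("action_type")
    if not isinstance(kind, str) or kind not in VALID_ACTIONS:
        return None
    if kind == "classify":
        # A classify action must name one of the five categories.
        category = action.get("category")
        if not isinstance(category, str) or category not in VALID_CATEGORIES:
            return None
    elif not isinstance(action.get("reply_text"), str):
        # Assumption: reply / escalate / close always carry free text, as in the examples.
        return None
    return action
```

Returning `None` for anything malformed keeps the caller's logic simple: a completion either yields a usable action or it does not.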

	There are three tasks of increasing difficulty. Each returns a reward between 0.0 and 1.0, and that score is the only training signal: no human-labelled targets are used.

	| Task | What it tests |
	|------|---------------|
	| Task 1: Classify | Classify the ticket into the correct category |
	| Task 2: Classify + Act | Classify first, then take the correct action (reply / escalate / close) |
	| Task 3: Full Resolution | Classify, take the correct action, and write a quality reply addressing the specific issue |

	---

	## The Algorithm: GRPO

	We used GRPO (Group Relative Policy Optimization) via `trl.GRPOTrainer` — a modern RL algorithm designed specifically for language models:

	- Generates a group of completions for each prompt
	- Scores all completions with the reward function
	- Normalises advantages within the group — no separate value network needed
	- Applies PPO-style clipped-ratio updates, with a KL penalty against a frozen reference model

	This makes GRPO significantly more memory-efficient than standard PPO for LLMs while delivering stable, consistent training. A 0.5B model + LoRA + GRPO fits comfortably on a single T4 GPU.
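The group-normalisation step can be sketched in a few lines of plain Python. This is an illustration of the idea, not the `trl` implementation: the function name, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one group of completions for the same prompt.

    Each completion's advantage is its reward minus the group mean, divided by
    the group standard deviation. eps avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, G = 4 completions, each scored by the reward function.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get a positive advantage and the rest get a negative one, which is why no learned value network is needed as a baseline. When all four completions score the same, every advantage is zero and the prompt contributes no policy gradient.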

	Key hyperparameters:

	```
	Group size G = 4
	KL beta = 0.04
	LoRA rank = 16 (applied across all attention and MLP projection layers)
	Framework = HuggingFace transformers + peft (no Unsloth)
	```
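To give a sense of scale for rank 16, here is a back-of-the-envelope count of the trainable LoRA parameters. The layer shapes are assumptions taken from the published `Qwen2.5-0.5B-Instruct` config as we understand it (hidden size 896, MLP size 4864, 24 layers, 2 key/value heads of 64 dims), and the seven projection names are the usual ones for this architecture, so check them against the actual checkpoint.

```python
# Assumed layer shapes for Qwen2.5-0.5B-Instruct; verify against the model config.
HIDDEN = 896
INTERMEDIATE = 4864
KV_DIM = 128  # 2 key/value heads x 64 dims each
NUM_LAYERS = 24
RANK = 16

# (in_features, out_features) for each projection that receives a LoRA adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank=RANK):
    """Each adapter adds an A matrix (rank x in) and a B matrix (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


trainable = lora_param_count()
```

Under these assumptions the adapters come to roughly 8.8 million trainable parameters, under 2 percent of the base model, which is consistent with the setup fitting on a T4.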

	---

	## Reward Engineering: Where the Real Work Happened

	Before training, we audited the reward functions and rebuilt the scoring so that every component is bounded and tied to a specific behaviour. This is what made the training signal reliable.

	### Diverse ticket bank — 50 tickets across all categories

	We built a rich training set of 50 realistic support tickets spanning all five categories with varied scenarios, specific resolution hints, and a range of correct actions. This gave the model genuine breadth to learn from rather than a narrow set of patterns.

	### Perfectly calibrated reward weights

	The Task 3 reward is a weighted sum across four components, tuned to sum to exactly 1.00:

	```
	0.20 (classify) + 0.40 (action) + 0.25 (reply quality) + 0.15 (efficiency) = 1.00
	```

	Because each component is capped by its weight and the weights sum to 1.00, the Task 3 reward stays in the same 0.0 to 1.0 range as the other tasks.
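As a sketch, the weighted sum could look like the following. The weights are the ones listed above; the function name, the component keys, and the assumption that each component arrives as a score in [0, 1] are ours.

```python
# Weights from the Task 3 formula above.
TASK3_WEIGHTS = {
    "classify": 0.20,
    "action": 0.40,
    "reply_quality": 0.25,
    "efficiency": 0.15,
}


def task3_reward(components):
    """Combine per-component scores into one reward (hypothetical helper).

    components maps each component name to a score; values are clamped to
    [0, 1] and missing components count as 0.
    """
    total = 0.0
    for name, weight in TASK3_WEIGHTS.items():
        score = min(max(components.get(name, 0.0), 0.0), 1.0)
        total += weight * score
    return total


# Correct classification and action, a half-quality reply, full efficiency.
example = task3_reward(
    {"classify": 1.0, "action": 1.0, "reply_quality": 0.5, "efficiency": 1.0}
)
```

Clamping each component before weighting means a single misbehaving scorer cannot push the total outside the 0.0 to 1.0 range.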

	### Two-tier reply quality scoring

	Reply quality uses a two-tier keyword system:

	```
	Category keywords (broad relevance) → +0.03 each
	Resolution hint keywords (ticket-specific) → +0.05 each
	```

	A generic reply scores lower than a reply that directly addresses the customer's specific issue — exactly the behaviour we want the model to learn.
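A minimal sketch of the two-tier scoring, assuming case-insensitive substring matching and a `cap` parameter on the total. Both of those, along with the function name and the sample keyword lists, are illustrative assumptions; only the +0.03 and +0.05 increments come from the scheme above.

```python
CATEGORY_BONUS = 0.03  # per category keyword found in the reply
HINT_BONUS = 0.05      # per resolution-hint keyword found in the reply


def reply_keyword_score(reply, category_keywords, hint_keywords, cap=1.0):
    """Score a reply by the keywords it mentions (hypothetical helper)."""
    text = reply.lower()
    category_hits = sum(1 for kw in category_keywords if kw.lower() in text)
    hint_hits = sum(1 for kw in hint_keywords if kw.lower() in text)
    return min(CATEGORY_BONUS * category_hits + HINT_BONUS * hint_hits, cap)


# Made-up keyword lists for a billing ticket about a duplicate charge.
category_kw = ["billing", "invoice", "charge"]
hint_kw = ["duplicate", "refund"]

generic = reply_keyword_score(
    "We are looking into your billing issue.", category_kw, hint_kw
)
specific = reply_keyword_score(
    "We found the duplicate charge on your invoice and issued a refund.",
    category_kw,
    hint_kw,
)
```

In this example the generic reply matches one category keyword, while the specific reply matches two category keywords and both hints, so it scores higher.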

	### Classification credit carries forward

	In Task 2, a correct classification at step 0 earns 0.3 credit that is preserved into step 1. The model is rewarded for the full quality of its decision-making across both steps, not just the final action.
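The carry-forward can be sketched as a cumulative reward over the two steps. The 0.3 classification credit is from the description above; the 0.7 action credit is an assumption, chosen only so that a perfect episode totals 1.0, and the function name is hypothetical.

```python
CLASSIFY_CREDIT = 0.3  # stated above
ACTION_CREDIT = 0.7    # assumption: the remainder, so a perfect episode scores 1.0


def task2_rewards(predicted_category, true_category, action, correct_action):
    """Return the cumulative reward after step 0 and after step 1."""
    step0 = CLASSIFY_CREDIT if predicted_category == true_category else 0.0
    # The step-0 credit is carried forward rather than reset at step 1.
    step1 = step0 + (ACTION_CREDIT if action == correct_action else 0.0)
    return step0, step1


# Correct classification followed by the wrong action still keeps the 0.3.
partial = task2_rewards("billing", "billing", "close", "escalate")
```

Keeping the step-0 credit means a model that classifies correctly but picks the wrong action is still ranked above one that gets both steps wrong.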

	### Real correctness tracking in Task 3

	Classification correctness is tracked precisely — the 0.20 bonus in Task 3 is only awarded when the model actually classifies correctly, creating a genuine learning signal for accurate categorisation.

	### Format reward shaping

	A second reward function runs in parallel, reinforcing correct JSON structure:

	```
	Valid action_type → +0.15 bonus
	Valid action_type + valid category → +0.20 bonus
	Invalid action_type → −0.20 penalty
	```

	This kept the model outputting parseable, structured responses throughout training.
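A sketch of such a format reward, written as a function that takes a batch of completions and returns one float per completion, which is the shape `trl` expects from custom reward functions. It assumes plain-string completions, treats unparseable output the same as an invalid `action_type`, and reads the +0.20 case as replacing (not adding to) the +0.15 case; all three are our assumptions.

```python
import json

VALID_ACTIONS = {"classify", "reply", "escalate", "close"}
VALID_CATEGORIES = {"billing", "account", "technical", "refund", "general"}


def format_reward(completions, **kwargs):
    """Score the JSON structure of each completion, following the table above."""
    scores = []
    for text in completions:
        try:
            action = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            action = None
        kind = action.get("action_type") if isinstance(action, dict) else None
        if not isinstance(kind, str) or kind not in VALID_ACTIONS:
            # Assumption: unparseable output counts as an invalid action_type.
            scores.append(-0.20)
            continue
        category = action.get("category")
        if isinstance(category, str) and category in VALID_CATEGORIES:
            scores.append(0.20)
        else:
            scores.append(0.15)
    return scores


batch = [
    '{"action_type": "classify", "category": "billing"}',
    '{"action_type": "reply", "reply_text": "Hi"}',
    "oops",
]
batch_scores = format_reward(batch)
```

The penalty for invalid output is larger than the bonus for a bare valid `action_type`, so drifting into free-form text costs more than a well-formed but incomplete action gains.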

	---

	## Training Results

	Trained on 1,000 prompts, 3 epochs, Kaggle T4 ×2. Final loss: 0.0008.

	| Task | Before GRPO | After GRPO | Delta |
	|------|-------------|------------|-------|
	| Task 1: Classify | 0.667 | 1.000 | +0.333 |
	| Task 2: Classify + Act | 0.117 | 0.450 | +0.333 |
	| Task 3: Full Resolution | 0.083 | 0.258 | +0.175 |
	| Overall | 0.289 | 0.569 | +0.280 |

	Task 1 is fully solved: the score rose from 0.667 to 1.000, meaning the trained model classified every evaluation ticket correctly.

	Task 2 nearly quadrupled its score (0.117 to 0.450, roughly 3.8×), showing clear gains in both classification and action selection.

	Task 3 more than tripled its score (0.083 to 0.258). The absolute score is still low, but the gain suggests that even a 0.5B model can start to write contextually relevant replies when the reward signal is precise.

	Overall, the agent nearly doubled its score from 0.289 to 0.569 — a +0.280 improvement purely from GRPO with no supervised labels.
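The deltas and multipliers quoted above can be checked directly from the table. This sketch assumes the Overall row is the unweighted mean of the three task scores, which the published numbers are consistent with.

```python
# (before, after) scores copied from the results table.
RESULTS = {
    "Task 1": (0.667, 1.000),
    "Task 2": (0.117, 0.450),
    "Task 3": (0.083, 0.258),
}


def summarize(results):
    """Compute per-task deltas and ratios plus an unweighted overall mean."""
    rows = {}
    for task, (before, after) in results.items():
        rows[task] = {
            "delta": round(after - before, 3),
            "ratio": round(after / before, 2),
        }
    before_mean = sum(b for b, _ in results.values()) / len(results)
    after_mean = sum(a for _, a in results.values()) / len(results)
    rows["Overall"] = {
        "before": round(before_mean, 3),
        "after": round(after_mean, 3),
        "delta": round(after_mean - before_mean, 3),
        "ratio": round(after_mean / before_mean, 2),
    }
    return rows


summary = summarize(RESULTS)
```

The ratios come out at about 3.85 for Task 2, 3.11 for Task 3, and 1.97 overall, matching "nearly quadrupled", "more than tripled", and "nearly doubled".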

	- Training Logs: [WandB Run](https://wandb.ai/vighneshdev1990-/support-ticket-grpo/runs/33q716zb)

	---

	## What Made It Work

	Careful reward engineering mattered more than any hyperparameter choice. The key decisions:

	- Resolution hints in scoring — giving the reward function ticket-specific targets meant the model got a stronger signal for replies that actually resolved the customer's issue, not just topically relevant ones.
	- Accumulated classification credit — rewarding the full decision chain across steps encouraged coherent multi-step behaviour rather than optimising each step in isolation.
	- Diverse ticket bank — 50 varied tickets pushed the model to generalise across real support scenarios rather than overfit to a narrow distribution.
	- Parallel reward functions — running both the task reward and the format reward simultaneously gave the model signal on output quality and structural correctness at every step.

	---

	## Reproduce It

	- Model: [AlgoCore/support-ticket-grpo-model](https://huggingface.co/AlgoCore/support-ticket-grpo-model)
	- Environment: [algocore-support-ticket-env.hf.space](https://algocore-support-ticket-env.hf.space)
	- Notebook: Included in the HF repo — runs end-to-end on Kaggle T4 with `HF_TOKEN` and `WANDB_API_KEY` in Secrets.