Vighnesh

How We Fine-Tuned a 0.5B Model to Handle Customer Support Tickets Using GRPO

OpenEnv × Scalar Hackathon


The Goal

Fine-tune Qwen2.5-0.5B-Instruct, a tiny 0.5-billion-parameter model, to act as a customer support agent using reinforcement learning. This is actual RL: the model takes actions, gets scored, and learns purely from the reward signal.

The environment: a live API that presents support tickets and grades the agent's responses across three progressively harder tasks. The challenge was making a model this small learn multi-step reasoning well enough to handle real customer scenarios.


The Environment

The Support Ticket Environment presents tickets from five categories (billing, account, technical, refund, and general) and asks the agent to respond with a structured JSON action:

{"action_type": "classify", "category": "billing"}
{"action_type": "reply",    "reply_text": "We've processed your refund..."}
{"action_type": "escalate", "reply_text": "Escalating to engineering."}
{"action_type": "close",    "reply_text": "Closing ticket."}

The environment defines three tasks of increasing difficulty. Each returns a scalar reward between 0.0 and 1.0, and no human-labelled completions are used for supervision:

Task                        What it tests
Task 1: Classify            Classify the ticket into the correct category
Task 2: Classify + Act      Classify first, then take the correct action (reply / escalate / close)
Task 3: Full Resolution     Classify, take the correct action, and write a quality reply addressing the specific issue

The Algorithm: GRPO

We used GRPO (Group Relative Policy Optimization) via trl.GRPOTrainer. GRPO is a recent RL algorithm designed for language models that:

  • Generates a group of completions for each prompt
  • Scores all completions with the reward function
  • Normalises advantages within the group, so no separate value network is needed
  • Applies PPO-style clipped-ratio updates, with a KL penalty against a frozen reference model

Because there is no value network to store or train, GRPO is significantly more memory-efficient than standard PPO for LLMs, and training remained stable in our runs. A 0.5B model with LoRA and GRPO fits comfortably on a single T4 GPU.
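The group normalisation step can be sketched in a few lines. This is an illustration of the idea, not TRL's internal code: the function name, the use of the population standard deviation, and the `eps` value are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalise the rewards of one prompt's completions within their group.

    Each completion's advantage is its reward minus the group mean, divided
    by the group standard deviation. `eps` avoids division by zero when all
    completions score the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std, for simplicity
    return [(r - mu) / (sigma + eps) for r in rewards]


# With a group size of 4, two good and two bad completions get advantages of
# opposite sign, while a group with identical rewards gives no signal at all.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat_group = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
```

The second case matters in practice: if every completion in a group earns the same reward, that prompt contributes no learning signal for the step.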

Key hyperparameters:

Group size G  =  4
KL beta       =  0.04
LoRA rank     =  16   (applied across all attention and MLP projection layers)
Framework     =  HuggingFace transformers + peft  (no Unsloth)

Reward Engineering: Where the Real Work Happened

Before training, we did a thorough audit of the reward functions and built a precise, well-calibrated scoring system. This is what made the training signal reliable.

Diverse ticket bank: 50 tickets across all categories

We built a rich training set of 50 realistic support tickets spanning all five categories with varied scenarios, specific resolution hints, and a range of correct actions. This gave the model genuine breadth to learn from rather than a narrow set of patterns.

Perfectly calibrated reward weights

The Task 3 reward is a weighted sum across four components, tuned to sum to exactly 1.00:

0.20 (classify) + 0.40 (action) + 0.25 (reply quality) + 0.15 (efficiency) = 1.00

Because the weights sum to 1.00, the total Task 3 reward cannot exceed 1.00, and each component contributes a known maximum share.
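A minimal sketch of the weighted sum, assuming each component is first scored as a fraction in [0, 1] and then scaled by its weight. The function name and the clamping step are illustrative assumptions; only the four weights come from the formula above.

```python
# Weights taken from the Task 3 formula in this post.
TASK3_WEIGHTS = {
    "classify": 0.20,
    "action": 0.40,
    "reply_quality": 0.25,
    "efficiency": 0.15,
}


def task3_reward(classify, action, reply_quality, efficiency):
    """Combine four component scores (each assumed to lie in [0, 1])."""
    components = {
        "classify": classify,
        "action": action,
        "reply_quality": reply_quality,
        "efficiency": efficiency,
    }
    total = 0.0
    for name, score in components.items():
        clamped = min(max(score, 0.0), 1.0)  # assumption: clamp to [0, 1]
        total += TASK3_WEIGHTS[name] * clamped
    return total


# Correct action and a half-quality reply, but wrong category and no
# efficiency credit: 0.40 + 0.25 * 0.5 = 0.525 (up to float rounding).
partial = task3_reward(classify=0.0, action=1.0, reply_quality=0.5, efficiency=0.0)
```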

Two-tier reply quality scoring

Reply quality uses a two-tier keyword system:

Category keywords (broad relevance)         →  +0.03 each
Resolution hint keywords (ticket-specific)  →  +0.05 each

A generic reply scores lower than a reply that directly addresses the customer's specific issue, which is exactly the behaviour we want the model to learn.
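A hedged sketch of two-tier keyword scoring. The per-keyword values (+0.03 and +0.05) come from the table above. The function name, the case-insensitive substring matching, the example keywords, and the cap of 0.25 (chosen here to match the reply-quality weight) are all assumptions.

```python
def reply_quality_points(reply_text, category_keywords, hint_keywords, cap=0.25):
    """Score a reply by counting matched keywords in two tiers."""
    text = reply_text.lower()
    # Tier 1: broad category relevance, +0.03 per matched keyword.
    category_hits = sum(1 for kw in category_keywords if kw.lower() in text)
    # Tier 2: ticket-specific resolution hints, +0.05 per matched keyword.
    hint_hits = sum(1 for kw in hint_keywords if kw.lower() in text)
    points = 0.03 * category_hits + 0.05 * hint_hits
    return min(points, cap)  # assumption: cap at the component's weight


# Hypothetical keyword lists for a refund ticket.
category_kw = ["refund", "payment"]
hint_kw = ["original card", "5 business days"]

generic = reply_quality_points("We are looking into your refund.", category_kw, hint_kw)
specific = reply_quality_points(
    "Your refund will reach your original card within 5 business days.",
    category_kw,
    hint_kw,
)
```

Here the generic reply matches one category keyword, while the specific reply matches one category keyword and both hints, so it scores higher.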

Classification credit carries forward

In Task 2, a correct classification at step 0 earns 0.3 credit that is preserved into step 1. The model is rewarded for the full quality of its decision-making across both steps, not just the final action.
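One way the carried credit might look in code. Only the 0.3 classification credit is stated above. The 0.7 action credit, the function name, and the idea that the two simply add up are assumptions for illustration.

```python
CLASSIFY_CREDIT = 0.3  # from the post: earned at step 0
ACTION_CREDIT = 0.7    # hypothetical: remainder of a 1.0 episode reward


def task2_episode_reward(pred_category, true_category, pred_action, true_action):
    """Reward for a two-step Task 2 episode with carried-forward credit."""
    # Step 0: classification credit is earned once...
    reward = CLASSIFY_CREDIT if pred_category == true_category else 0.0
    # Step 1: ...and kept while the action is scored on top of it.
    if pred_action == true_action:
        reward += ACTION_CREDIT
    return reward


# A correct classification followed by a wrong action still keeps the 0.3
# earned at step 0, instead of being scored on the final action alone.
kept = task2_episode_reward("billing", "billing", "close", "escalate")
```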

Real correctness tracking in Task 3

Classification correctness is tracked precisely: the 0.20 bonus in Task 3 is only awarded when the model actually classifies correctly, creating a genuine learning signal for accurate categorisation.

Format reward shaping

A second reward function runs in parallel, reinforcing correct JSON structure:

Valid action_type                           →  +0.15 bonus
Valid action_type + valid category          →  +0.20 bonus
Invalid action_type                         →  −0.20 penalty

This kept the model outputting parseable, structured responses throughout training.
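A sketch of a format reward along these lines. trl.GRPOTrainer accepts reward functions that take a list of completions and return one float per completion, which is the shape of the wrapper at the bottom. Several details are assumptions: that completions arrive as plain strings, that +0.20 is the total (not an extra on top of +0.15), and that unparseable JSON gets the same penalty as an invalid action_type.

```python
import json

VALID_ACTION_TYPES = {"classify", "reply", "escalate", "close"}
VALID_CATEGORIES = {"billing", "account", "technical", "refund", "general"}


def format_reward(completion: str) -> float:
    """Score the structure of a single completion."""
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return -0.20  # assumption: unparseable output counts as invalid
    if not isinstance(action, dict):
        return -0.20

    action_type = action.get("action_type")
    if not isinstance(action_type, str) or action_type not in VALID_ACTION_TYPES:
        return -0.20  # invalid action_type

    category = action.get("category")
    if isinstance(category, str) and category in VALID_CATEGORIES:
        return 0.20  # valid action_type + valid category
    return 0.15  # valid action_type only


def format_reward_func(completions, **kwargs):
    """Batch wrapper: one score per completion, extra kwargs ignored."""
    return [format_reward(c) for c in completions]
```

A function like `format_reward_func` could be passed alongside the task reward in the trainer's list of reward functions, so both signals are computed for every completion.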


Training Results

Trained on 1,000 prompts for 3 epochs on Kaggle (T4 ×2). Final loss: 0.0008.

Task                     Before GRPO   After GRPO   Delta
Task 1: Classify         0.667         1.000        +0.333
Task 2: Action           0.117         0.450        +0.333
Task 3: Full Resolve     0.083         0.258        +0.175
Overall                  0.289         0.569        +0.280

Task 1 is fully solved: the model classifies support tickets with perfect accuracy after training.

Task 2 nearly quadrupled its score (0.117 to 0.450), showing the model learned to chain classification and action selection far more often than before.

Task 3 more than tripled its score, demonstrating that even a 0.5B model can learn to write contextually relevant replies when the reward signal is precise.

Overall, the agent nearly doubled its score from 0.289 to 0.569, a +0.280 improvement purely from GRPO with no supervised labels.
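The arithmetic behind these claims can be checked directly from the table. The reading of the Overall row as an unweighted mean of the three task scores is an inference from the numbers, not something the results state explicitly.

```python
before = {"Task 1": 0.667, "Task 2": 0.117, "Task 3": 0.083}
after = {"Task 1": 1.000, "Task 2": 0.450, "Task 3": 0.258}

# Overall score as the unweighted mean of the three tasks (inferred).
overall_before = round(sum(before.values()) / len(before), 3)
overall_after = round(sum(after.values()) / len(after), 3)

# Improvement ratios behind "nearly quadrupled", "more than tripled",
# and "nearly doubled".
task2_ratio = after["Task 2"] / before["Task 2"]
task3_ratio = after["Task 3"] / before["Task 3"]
overall_ratio = overall_after / overall_before

print(overall_before, overall_after)
print(round(task2_ratio, 2), round(task3_ratio, 2), round(overall_ratio, 2))
```

The means reproduce the Overall row, and the ratios come out at roughly 3.85 for Task 2, 3.11 for Task 3, and 1.97 overall.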


What Made It Work

Careful reward engineering mattered more than any hyperparameter choice. The key decisions:

  • Resolution hints in scoring: giving the reward function ticket-specific targets meant the model got a stronger signal for replies that actually resolved the customer's issue, not just topically relevant ones.
  • Accumulated classification credit: rewarding the full decision chain across steps encouraged coherent multi-step behaviour rather than optimising each step in isolation.
  • Diverse ticket bank: 50 varied tickets pushed the model to generalise across real support scenarios rather than overfit to a narrow distribution.
  • Parallel reward functions: running both the task reward and the format reward simultaneously gave the model signal on output quality and structural correctness at every step.

Reproduce It