| # How We Fine-Tuned a 0.5B Model to Handle Customer Support Tickets Using GRPO | |
| **OpenEnv × Scalar Hackathon** | |
| --- | |
| ## The Goal | |
| Fine-tune `Qwen2.5-0.5B-Instruct`, a tiny 0.5-billion-parameter model, to act as a customer support agent using reinforcement learning. This is actual RL: the model takes actions, gets scored, and learns purely from the reward signal. | |
| The environment: a live API that presents support tickets and grades the agent's responses across three progressively harder tasks. The challenge was making a model this small learn multi-step reasoning well enough to handle real customer scenarios. | |
| --- | |
| ## The Environment | |
| The [Support Ticket Environment](https://algocore-support-ticket-env.hf.space) presents tickets from five categories (billing, account, technical, refund, and general) and asks the agent to respond with a structured JSON action: | |
| ```json | |
| {"action_type": "classify", "category": "billing"} | |
| {"action_type": "reply", "reply_text": "We've processed your refund..."} | |
| {"action_type": "escalate", "reply_text": "Escalating to engineering."} | |
| {"action_type": "close", "reply_text": "Closing ticket."} | |
| ``` | |
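The action format above can be checked before an action is sent to the environment. The following is a minimal sketch, not the environment's own validator: the function name `parse_action` is hypothetical, and the rule that non-classify actions need a non-empty `reply_text` is an assumption inferred from the examples.

```python
import json
from typing import Optional

# Action types and categories as listed in the post.
VALID_ACTION_TYPES = {"classify", "reply", "escalate", "close"}
VALID_CATEGORIES = {"billing", "account", "technical", "refund", "general"}


def parse_action(text: str) -> Optional[dict]:
    """Return the action dict if `text` is a well-formed action, else None."""
    try:
        action = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None

    action_type = action.get("action_type")
    if not isinstance(action_type, str) or action_type not in VALID_ACTION_TYPES:
        return None

    if action_type == "classify":
        # A classify action must name one of the five categories.
        category = action.get("category")
        if not isinstance(category, str) or category not in VALID_CATEGORIES:
            return None
    else:
        # Assumption: reply / escalate / close carry a non-empty reply_text.
        reply = action.get("reply_text")
        if not isinstance(reply, str) or not reply.strip():
            return None
    return action
```

Returning `None` instead of raising keeps the caller simple: a rollout loop can treat any malformed completion as a failed step.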
| There are three tasks of increasing difficulty. Each returns a reward between 0.0 and 1.0, and that reward is the only training signal; no human-labelled targets are involved: | |
| | Task | What it tests | | |
| |------|--------------| | |
| | **Task 1: Classify** | Classify the ticket into the correct category | | |
| | **Task 2: Classify + Act** | Classify first, then take the correct action (reply / escalate / close) | | |
| | **Task 3: Full Resolution** | Classify, take the correct action, and write a quality reply addressing the specific issue | | |
| --- | |
| ## The Algorithm: GRPO | |
| We used **GRPO** (Group Relative Policy Optimization), an RL algorithm designed for language models, via `trl.GRPOTrainer`. GRPO: | |
| - Generates a **group** of completions for each prompt | |
| - Scores all completions with the reward function | |
| - Normalises advantages **within the group**, so no separate value network is needed | |
| - Applies PPO-style clipped ratio updates against a frozen reference model with KL penalty | |
| This makes GRPO significantly more memory-efficient than standard PPO for LLMs while delivering stable, consistent training. A 0.5B model + LoRA + GRPO fits comfortably on a single T4 GPU. | |
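The group-normalisation step is the part that replaces the value network, and it is small enough to sketch in plain Python. This is an illustration of the idea only, not the `trl` implementation: the function name is hypothetical, the population standard deviation and the `eps` value are assumptions, and real implementations work on batched tensors.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalise the rewards of one prompt's completions against each other.

    Each completion's advantage is its reward minus the group mean, divided by
    the group standard deviation. `eps` avoids division by zero when every
    completion in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, G = 4 sampled completions, scored by the reward function.
advantages = group_relative_advantages([0.2, 0.4, 0.4, 1.0])
```

Completions that beat their siblings get a positive advantage and the rest get a negative one. When all four completions score the same, every advantage is zero and that prompt contributes no policy-gradient signal.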
| **Key hyperparameters:** | |
| ``` | |
| Group size G = 4 | |
| KL beta = 0.04 | |
| LoRA rank = 16 (applied across all attention and MLP projection layers) | |
| Framework = HuggingFace transformers + peft (no Unsloth) | |
| ``` | |
| --- | |
| ## Reward Engineering: Where the Real Work Happened | |
| Before training, we did a thorough audit of the reward functions and built a precise, well-calibrated scoring system. This is what made the training signal reliable. | |
| ### Diverse ticket bank: 50 tickets across all categories | |
| We built a rich training set of 50 realistic support tickets spanning all five categories with varied scenarios, specific resolution hints, and a range of correct actions. This gave the model genuine breadth to learn from rather than a narrow set of patterns. | |
| ### Perfectly calibrated reward weights | |
| The Task 3 reward is a weighted sum across four components, tuned to sum to exactly 1.00: | |
| ``` | |
| 0.20 (classify) + 0.40 (action) + 0.25 (reply quality) + 0.15 (efficiency) = 1.00 | |
| ``` | |
| Every reward signal is clean, bounded, and meaningful. | |
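The weighted sum can be sketched as follows. The weights come from the formula above; the function name, the component key names, and the assumption that each component is itself a score in [0, 1] are illustrative and may differ from the environment's grader.

```python
from typing import Dict

# Weights from the post; they sum to 1.00.
TASK3_WEIGHTS = {
    "classify": 0.20,
    "action": 0.40,
    "reply_quality": 0.25,
    "efficiency": 0.15,
}


def task3_reward(components: Dict[str, float]) -> float:
    """Combine per-component scores into a single Task 3 reward.

    Assumption: every component score lies in [0, 1]; values outside that
    range are clamped, and missing components count as 0.
    """
    total = 0.0
    for name, weight in TASK3_WEIGHTS.items():
        score = min(max(components.get(name, 0.0), 0.0), 1.0)
        total += weight * score
    return total
```

Under these assumptions, a run that classifies correctly and picks the right action but writes a poor, slow reply earns about 0.60, which shows how heavily the action choice is weighted.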
| ### Two-tier reply quality scoring | |
| Reply quality uses a two-tier keyword system: | |
| ``` | |
| Category keywords (broad relevance) → +0.03 each | |
| Resolution hint keywords (ticket-specific) → +0.05 each | |
| ``` | |
| A generic reply scores lower than a reply that directly addresses the customer's specific issue, which is exactly the behaviour we want the model to learn. | |
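A rough sketch of the two-tier scoring, using the per-keyword increments above. The function name, the whole-word matching rule, the single-word keyword lists, and the `cap` parameter are all assumptions made for illustration; the post does not say how the raw score is bounded.

```python
import re
from typing import Iterable

CATEGORY_KEYWORD_BONUS = 0.03  # broad relevance
HINT_KEYWORD_BONUS = 0.05      # ticket-specific resolution hints


def reply_quality(
    reply_text: str,
    category_keywords: Iterable[str],
    hint_keywords: Iterable[str],
    cap: float = 0.25,  # assumption: bounded by the Task 3 reply-quality weight
) -> float:
    """Score a reply by counting keyword hits in two tiers."""
    # Assumption: keywords are single words matched case-insensitively
    # against whole words of the reply.
    words = set(re.findall(r"[a-z0-9']+", reply_text.lower()))
    category_hits = sum(1 for kw in category_keywords if kw.lower() in words)
    hint_hits = sum(1 for kw in hint_keywords if kw.lower() in words)
    score = CATEGORY_KEYWORD_BONUS * category_hits + HINT_KEYWORD_BONUS * hint_hits
    return min(score, cap)


# Hypothetical keyword lists for one billing ticket.
category_kw = ["billing", "invoice", "charge"]
hint_kw = ["duplicate", "refund"]
generic = reply_quality("We are looking into your billing issue.", category_kw, hint_kw)
specific = reply_quality("We refunded the duplicate charge on your invoice.", category_kw, hint_kw)
```

Here `specific` scores higher than `generic` because it hits two category keywords and one resolution hint, while the generic reply only mentions the category.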
| ### Classification credit carries forward | |
| In Task 2, a correct classification at step 0 earns 0.3 credit that is preserved into step 1. The model is rewarded for the full quality of its decision-making across both steps, not just the final action. | |
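One way to picture the carried credit is a small episode object that banks the classification reward at step 0 and adds it to the action reward at step 1. Only the 0.3 figure comes from the text. The class, its field names, and the choice to give the remaining 0.7 to a correct action are assumptions for illustration.

```python
from dataclasses import dataclass

CLASSIFY_CREDIT = 0.3                   # stated in the post
ACTION_CREDIT = 1.0 - CLASSIFY_CREDIT   # assumption: the remainder goes to the action


@dataclass
class Task2Episode:
    """Hypothetical two-step Task 2 episode with carried classification credit."""

    true_category: str
    correct_action: str
    carried_credit: float = 0.0

    def classify(self, predicted_category: str) -> float:
        """Step 0: bank the classification credit if the category is right."""
        if predicted_category == self.true_category:
            self.carried_credit = CLASSIFY_CREDIT
        return self.carried_credit

    def act(self, action_type: str) -> float:
        """Step 1: final reward is the carried credit plus the action credit."""
        action_credit = ACTION_CREDIT if action_type == self.correct_action else 0.0
        return self.carried_credit + action_credit


episode = Task2Episode(true_category="billing", correct_action="reply")
episode.classify("billing")
final_reward = episode.act("escalate")  # wrong action, classification credit survives
```

With this shape, a wrong action after a correct classification still ends the episode with the banked 0.3 rather than zero, so the two steps are rewarded as one chain.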
| ### Real correctness tracking in Task 3 | |
| Classification correctness is tracked precisely: the 0.20 bonus in Task 3 is only awarded when the model actually classifies correctly, creating a genuine learning signal for accurate categorisation. | |
| ### Format reward shaping | |
| A second reward function runs in parallel, reinforcing correct JSON structure: | |
| ``` | |
| Valid action_type → +0.15 bonus | |
| Valid action_type + valid category → +0.20 bonus | |
| Invalid action_type → -0.20 penalty | |
| ``` | |
| This kept the model outputting parseable, structured responses throughout training. | |
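The format reward table could be implemented along these lines. The bonus and penalty values come from the table above. The function names are hypothetical, and two readings are assumed: unparseable output counts as an invalid `action_type`, and the +0.20 tier replaces the +0.15 tier rather than adding to it. The batch wrapper follows the general shape of a custom reward function that takes a list of completions and returns a list of floats, assuming plain-string completions.

```python
import json
from typing import List

VALID_ACTION_TYPES = {"classify", "reply", "escalate", "close"}
VALID_CATEGORIES = {"billing", "account", "technical", "refund", "general"}


def format_reward(completion: str) -> float:
    """Score one completion on JSON structure only, ignoring task correctness."""
    try:
        action = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return -0.20  # assumption: unparseable output is penalised like an invalid type
    if not isinstance(action, dict):
        return -0.20

    action_type = action.get("action_type")
    if not isinstance(action_type, str) or action_type not in VALID_ACTION_TYPES:
        return -0.20

    category = action.get("category")
    if isinstance(category, str) and category in VALID_CATEGORIES:
        return 0.20  # valid action_type and valid category
    return 0.15      # valid action_type only


def format_reward_func(completions: List[str], **kwargs) -> List[float]:
    """Batch wrapper: one reward per completion, extra keyword arguments ignored."""
    return [format_reward(c) for c in completions]
```

Because this function never looks at the ticket, it can run alongside the task reward without double-counting correctness.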
| --- | |
| ## Training Results | |
| Trained on 1,000 prompts, 3 epochs, Kaggle T4 ×2. Final loss: **0.0008**. | |
| | Task | Before GRPO | After GRPO | Delta | | |
| |------|-------------|------------|-------| | |
| | Task 1: Classify | 0.667 | **1.000** | +0.333 | | |
| | Task 2: Action | 0.117 | **0.450** | +0.333 | | |
| | Task 3: Full Resolve | 0.083 | **0.258** | +0.175 | | |
| | **Overall** | **0.289** | **0.569** | **+0.280** | | |
| **Task 1 is fully solved**: the model scores a perfect 1.000 on ticket classification after training. | |
| Task 2 nearly quadrupled its score (0.117 to 0.450), showing that the model learned to handle both the classification and the action-selection step. | |
| Task 3 more than tripled its score, demonstrating that even a 0.5B model can learn to write contextually relevant replies when the reward signal is precise. | |
| Overall, the agent nearly doubled its score from 0.289 to 0.569, a **+0.280 improvement purely from GRPO with no supervised labels**. | |
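The deltas and ratios quoted above can be recomputed from the table. The sketch below also checks one assumption: that the Overall row is the unweighted mean of the three task scores, which is consistent with the reported 0.289 and 0.569 but is not stated in the post.

```python
from typing import Dict

# Scores copied from the results table.
before = {"task1": 0.667, "task2": 0.117, "task3": 0.083}
after = {"task1": 1.000, "task2": 0.450, "task3": 0.258}


def summarize(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Return per-task deltas and ratios, plus an unweighted overall mean."""
    summary = {}
    for task in before:
        summary[task] = {
            "delta": after[task] - before[task],
            "ratio": after[task] / before[task],
        }
    # Assumption: "Overall" is the plain mean over the three tasks.
    overall_before = sum(before.values()) / len(before)
    overall_after = sum(after.values()) / len(after)
    summary["overall"] = {
        "before": overall_before,
        "after": overall_after,
        "delta": overall_after - overall_before,
        "ratio": overall_after / overall_before,
    }
    return summary


results = summarize(before, after)
```

The ratios come out at roughly 3.8x for Task 2, 3.1x for Task 3, and just under 2x overall, matching the "nearly quadrupled", "more than tripled", and "nearly doubled" descriptions.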
| - **Training Logs:** [WandB Run](https://wandb.ai/vighneshdev1990-/support-ticket-grpo/runs/33q716zb) | |
| --- | |
| ## What Made It Work | |
| Careful reward engineering mattered more than any hyperparameter choice. The key decisions: | |
| - **Resolution hints in scoring:** giving the reward function ticket-specific targets meant the model got a stronger signal for replies that actually resolved the customer's issue, not just topically relevant ones. | |
| - **Accumulated classification credit:** rewarding the full decision chain across steps encouraged coherent multi-step behaviour rather than optimising each step in isolation. | |
| - **Diverse ticket bank:** 50 varied tickets pushed the model to generalise across real support scenarios rather than overfit to a narrow distribution. | |
| - **Parallel reward functions:** running both the task reward and the format reward simultaneously gave the model signal on output quality and structural correctness at every step. | |
| --- | |
| ## Reproduce It | |
| - **Model:** [AlgoCore/support-ticket-grpo-model](https://huggingface.co/AlgoCore/support-ticket-grpo-model) | |
| - **Environment:** [algocore-support-ticket-env.hf.space](https://algocore-support-ticket-env.hf.space) | |
| - **Notebook:** Included in the HF repo. It runs end-to-end on Kaggle T4 with `HF_TOKEN` and `WANDB_API_KEY` in Secrets. | |