arxiv:2510.22143

Benchmarking and Learning Real-World Customer Service Dialogue

Published on Jan 8

AI-generated summary

OlaBench and OlaMind address the gap between industrial customer service benchmarks and real-world deployment by introducing a comprehensive evaluation framework and a reinforcement learning approach that improves dialogue quality and operational efficiency.

Abstract

Existing benchmarks and training pipelines for industrial intelligent customer service (ICS) remain misaligned with real-world dialogue requirements, overemphasizing verifiable task success while under-measuring subjective service quality and realistic failure modes, leaving a gap between offline gains and deployable dialogue behavior. We close this gap with a benchmark-to-optimization loop: we first introduce OlaBench, an ICS benchmark spanning retrieval-augmented generation, workflow-based systems, and agentic settings, which evaluates service capability, safety, and latency sensitivity; moreover, motivated by OlaBench results showing state-of-the-art LLMs still fall short, we propose OlaMind, which distills reusable reasoning patterns and service strategies from expert dialogues and applies rubric-aware staged exploration-exploitation reinforcement learning to improve model capability. OlaMind surpasses GPT-5.2 and Gemini 3 Pro on OlaBench (78.72 vs. 70.58/70.84) and, in online A/B tests, delivers an average +23.67% issue resolution and -6.6% human transfer rate versus the baseline, bridging offline gains to deployment. Together, OlaBench and OlaMind advance ICS systems toward more anthropomorphic, professional, and reliable deployment.

Community

TianhongGao

1 day ago

Hi everyone, we are the authors of this paper.

Official project page:
https://olamind-olabench.github.io/

We bridge the gap between offline evaluation and real-world deployment in industrial customer service with OlaBench, a multi-dimensional benchmark. We also propose OlaMind, a staged rubric-aware RL framework that leverages expert reasoning patterns to achieve state-of-the-art benchmark results and validated online gains.

Thanks for your interest!

wow-kong

about 13 hours ago

impressive
