arxiv:2602.09379

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Published on Jun 11 · Submitted by Xu Shihao on Jun 24 · Lyncia Lyncia

Abstract

A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy.

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

Community

LingxiDiagBench: Benchmarking LLMs for Chinese Psychiatric Consultation and Diagnosis [Accepted by KDD 2026]

TL;DR: A large-scale multi-agent benchmark revealing that while LLMs can distinguish depression from anxiety with 92.3% accuracy, they struggle badly at 12-way differential diagnosis (28.5%), and better conversational quality doesn't guarantee better diagnosis.


Dataset Link: https://huggingface.co/datasets/XuShihao6715/LingxiDiag-16K

The Problem

Mental health care faces a global workforce crisis. Psychiatric diagnosis depends on nuanced, multi-turn clinical interviews, yet existing AI benchmarks fall short in three key ways: they use template-based synthetic dialogues with little variability, omit the information needed for differential diagnosis, and rarely support dynamic multi-turn consultation evaluation.

What's New

This paper introduces LingxiDiagBench, the first large-scale, real-data-driven, multi-disease diagnostic benchmark for Chinese psychiatric AI. At its core is LingxiDiag-16K: 16,000 synthetic consultation dialogues generated from 1,709 real outpatient EMRs collected at Shanghai Mental Health Center, carefully preserving real clinical demographic and diagnostic distributions across 12 ICD-10 categories.

The benchmark covers two evaluation paradigms:

  • Static: Fixed dialogue transcripts for reproducible diagnosis and next-question prediction tasks
  • Dynamic: Real-time multi-turn consultation where LLMs act as Doctor Agents interviewing LLM-powered Patient Agents
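The dynamic paradigm can be pictured as a simple turn-taking loop. The sketch below is only an illustration under stated assumptions: `run_consultation`, the `STOP` token, and the scripted stand-in agents are hypothetical names, not part of the released framework, and in the benchmark each agent would be backed by an LLM rather than a lookup table.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a turn is (speaker, utterance); an agent maps the
# dialogue so far to its next utterance.
Turn = Tuple[str, str]
Agent = Callable[[List[Turn]], str]
STOP = "[END]"  # assumed sentinel the doctor emits to end the interview


def run_consultation(doctor: Agent, patient: Agent,
                     diagnose: Agent, max_turns: int = 10):
    """Alternate doctor questions and patient answers, then diagnose."""
    history: List[Turn] = []
    for _ in range(max_turns):
        question = doctor(history)
        if question == STOP:
            break
        history.append(("doctor", question))
        history.append(("patient", patient(history)))
    return history, diagnose(history)


# Scripted stand-ins so the sketch runs without any model.
def scripted_doctor(history: List[Turn]) -> str:
    questions = ["How has your mood been?", "How are you sleeping?"]
    asked = sum(1 for speaker, _ in history if speaker == "doctor")
    return questions[asked] if asked < len(questions) else STOP


def scripted_patient(history: List[Turn]) -> str:
    answers = {"mood": "I have felt low for weeks.",
               "sleeping": "I wake up very early."}
    last_question = history[-1][1].lower()
    for keyword, answer in answers.items():
        if keyword in last_question:
            return answer
    return "I am not sure."


def keyword_diagnosis(history: List[Turn]) -> str:
    patient_text = " ".join(u for s, u in history if s == "patient").lower()
    return "depression" if "low" in patient_text else "other"


transcript, label = run_consultation(
    scripted_doctor, scripted_patient, keyword_diagnosis)
```

In this framing, the static paradigm corresponds to calling only the diagnosis step on a fixed transcript, which is why it is easier to reproduce.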

Four doctor consultation strategies are compared: Free-form, Symptom-Tree, APA-Guided, and APA-Guided + MRD-RAG.

Key Findings

  • 🟢 Binary classification (depression vs. anxiety) is largely solved: top models hit 92.3% accuracy
  • 🟡 4-way classification (including comorbidity) drops to 43.0%; comorbidity recognition remains hard
  • 🔴 12-way differential diagnosis reaches only 28.5%, a substantial open challenge
  • ⚠️ Dynamic < Static: Interactive consultation often underperforms static evaluation, suggesting poor information-gathering strategies hurt downstream reasoning
  • Consultation quality ≠ Diagnostic accuracy: LLM-as-a-Judge scores correlate with diagnostic accuracy at only r = 0.43, showing that asking good questions and reaching correct diagnoses are only loosely coupled skills
  • ✅ RAG helps: APA-Guided + MRD-RAG improves overall classification by ~5% over APA-Guided alone
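To make the r = 0.43 figure concrete, here is a minimal sketch of a correlation coefficient computed between per-dialogue judge scores and diagnostic correctness. It assumes r denotes Pearson's correlation, which the summary does not state explicitly; the function name and the sample values are made up for illustration.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up example data: judge score per dialogue and whether the final
# diagnosis was correct (1.0) or not (0.0).
judge_scores = [3.0, 4.5, 4.0, 2.5, 5.0]
correct = [0.0, 1.0, 0.0, 1.0, 1.0]
r_example = pearson_r(judge_scores, correct)

# Share of variance explained by a correlation of 0.43.
shared_variance = 0.43 ** 2
```

Under the Pearson assumption, r = 0.43 corresponds to roughly 18% shared variance, which is consistent with the abstract calling the correlation only moderate.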

Why It Matters

LingxiDiagBench provides a standardized, reproducible platform to systematically evaluate and improve AI psychiatric diagnosis, something the field has been missing. The benchmark design is language-agnostic and grounded in international clinical standards (DSM-5/ICD-10), making it extensible beyond Chinese.

Benchmark Results Takeaways

📊 Static Evaluation: Best Model per Task

Performance on fixed consultation transcripts across both the synthetic (LingxiDiag-16K) and real clinical (LingxiDiag-Clinical) test sets:

| Task | Best Model (Synthetic) | Acc (Synthetic) | Best Model (Real) | Acc (Real) |
|---|---|---|---|---|
| 2-class (Depression vs. Anxiety) | Gemini-3-Flash | 0.854 | Qwen3-4B | 0.887 |
| 4-class (+ Comorbidity + Others) | Grok-4.1-Fast | 0.470 | Qwen3-32B | 0.524 |
| 12-class (Full ICD-10 Differential) | GPT-5-Mini | 0.409 | TF-IDF + SVM | 0.320 |
| 12-class Top-3 Accuracy | TF-IDF + LR | 0.645 | Qwen3-4B | 0.698 |
| Overall Score | TF-IDF + LR | 0.533 | Qwen3-32B | 0.548 |
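The table reports both plain accuracy and Top-3 accuracy for the 12-class task. As a rough sketch of how such a metric is typically computed (the function name is hypothetical, and the ICD-10 codes below are illustrative rather than the benchmark's actual 12 categories), a prediction counts as a Top-k hit when the gold label appears among the k highest-ranked candidates:

```python
from typing import List, Sequence


def top_k_accuracy(ranked_predictions: Sequence[List[str]],
                   gold_labels: Sequence[str], k: int) -> float:
    """Fraction of cases whose gold label is among the top-k predictions."""
    if len(ranked_predictions) != len(gold_labels) or not gold_labels:
        raise ValueError("need non-empty inputs of equal length")
    hits = sum(1 for ranked, gold in zip(ranked_predictions, gold_labels)
               if gold in ranked[:k])
    return hits / len(gold_labels)


# Illustrative data only: each inner list is ordered from most to least
# likely diagnosis for one patient.
predictions = [
    ["F32", "F41", "F31"],
    ["F41", "F32", "F42"],
    ["F20", "F31", "F32"],
]
gold = ["F32", "F32", "F41"]

top1 = top_k_accuracy(predictions, gold, k=1)  # only the first case hits
top3 = top_k_accuracy(predictions, gold, k=3)  # first two cases hit
```

With k = 1 this reduces to ordinary accuracy, so Top-3 is always at least as high as the corresponding 12-class accuracy for the same model.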

🤖 Dynamic Evaluation: Best Strategy per Dataset

Performance of the end-to-end consultation pipeline (Doctor Agent → Patient Agent → Diagnosis), across both data settings:

| Strategy | Best Model | 2-class Acc | 4-class Acc | 12-class Acc | Clf-Ovl |
|---|---|---|---|---|---|
| Synthetic (LingxiDiag-16K) | | | | | |
| Free-form | Grok-4.1-Fast | 88.6% | 34.0% | 25.5% | 40.1% |
| Symptom-Tree | DeepSeek-V3.2 | 86.5% | 31.0% | 21.5% | 38.0% |
| APA-Guided | DeepSeek-V3.2 | 88.5% | 31.5% | 23.0% | 41.2% |
| APA-Guided + MRD-RAG | Grok-4.1-Fast | 88.5% | 43.0% | 28.5% | 45.4% |
| Real (LingxiDiag-Clinical) | | | | | |
| Free-form | Qwen3-8B | 88.8% | 40.0% | 43.0% | 49.0% |
| Symptom-Tree | GPT-OSS-20B | 91.2% | 43.0% | 44.5% | 50.0% |
| APA-Guided | Qwen3-32B | 80.0% | 36.0% | 46.5% | 48.3% |
| APA-Guided + MRD-RAG | GPT-OSS-20B | 78.8% | 37.5% | 45.5% | 47.2% |

Cross-Dataset Transfer: Does Synthetic Training Generalize to Real Data?

To validate that LingxiDiag-16K encodes clinically meaningful knowledge (not just surface statistics), models fine-tuned on synthetic data were evaluated on real clinical cases:

| Model | 12-class Acc (Real, Zero-shot) | 12-class Acc (Real, +LoRA SFT) | Gain |
|---|---|---|---|
| Qwen3-8B | 4.1% | 41.4% | +37.3% |
| Qwen3-32B | 20.4% | 39.7% | +19.3% |
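The Gain column reads most naturally as an absolute difference in percentage points rather than a relative improvement. The short check below recomputes it from the two accuracy columns; the helper name is hypothetical and the relative ratio is added only for comparison.

```python
from typing import Dict, Tuple

# (zero-shot accuracy %, accuracy % after LoRA SFT), copied from the table.
results: Dict[str, Tuple[float, float]] = {
    "Qwen3-8B": (4.1, 41.4),
    "Qwen3-32B": (20.4, 39.7),
}


def transfer_gain(zero_shot: float, fine_tuned: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, fine-tuned / zero-shot ratio)."""
    absolute = round(fine_tuned - zero_shot, 1)
    ratio = round(fine_tuned / zero_shot, 1)
    return absolute, ratio


gains = {model: transfer_gain(*scores) for model, scores in results.items()}
```

The absolute values reproduce the table (+37.3 and +19.3 points), while the ratios show that the smaller model improves roughly tenfold from a very low zero-shot starting point.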

The authors emphasize that this benchmark is for research purposes only and must not be deployed in clinical settings without rigorous validation and human oversight.


