microsoft
/

UserLM-8b

@@ -1,65 +0,0 @@
----
-stage: early-research
-research-area: agents
-last-updated: 2026-03-22
----
-# UserLM-8b: Flipping the Dialogue
-**Team:** Microsoft Research, Georgia Institute of Technology
-**Contributors:** Tarek Naous, Philippe Laban, Wei Xu, Jennifer Neville
----
-## What it is
-UserLM-8b is a purpose-built User Language Model — an LLM post-trained to simulate the "user" role in multi-turn conversations, rather than the typical "assistant" role. Trained via full-parameter fine-tuning of Llama3-8b-Base on ~344K filtered WildChat conversations with LLM-generated high-level user intents, UserLM-8b can initiate conversations from a task intent, generate realistic follow-up utterances conditioned on conversation state, and naturally terminate dialogues. It enables more realistic evaluation of assistant LLMs by producing user behavior that better matches real human patterns — indirect phrasing, gradual intent revelation, and natural conversation endings.
-🔗 [microsoft/UserLM-8b (HuggingFace)](https://huggingface.co/microsoft/UserLM-8b) · [Paper (arXiv)](https://arxiv.org/abs/2510.06552) · Accepted at ICLR 2026
-## Core idea
-Prior work simulates users by prompting assistant LMs to role-play, but assistant LMs are post-trained to be cooperative, exhaustive, and well-structured — the opposite of how real users behave. We show that better assistants actually make *worse* user simulators, as their assistant-like tendencies (structured responses, sycophantic compliance, inability to end conversations) become more ingrained. UserLM's key insight: "flip the dialogue" by training on real user utterances from WildChat, conditioning on LLM-generated high-level intents that capture the user's goal without specifying exact language. This produces models that naturally decompose intent across turns, use diverse phrasing, make realistic language choices (including typos and informal grammar), and know when to stop talking. The approach is hard to replicate with prompting alone because the distributional gap between assistant and user text is fundamental — confirmed by AI-text detectors classifying UserLM output as 80% human-like vs. 0-3% for prompted assistants. When deployed for evaluation, GPT-4o's task performance drops from 74.6% to 57.4%, revealing assistant weaknesses hidden by overly cooperative prompted simulators.
-## Why it matters
-**To the field:** Introduces user language modeling as a distinct post-training objective, demonstrating that base LLMs can be trained toward opposing roles (user vs. assistant). Establishes a six-metric evaluation framework for user simulators covering multi-turn interaction (first-turn diversity, intent decomposition, dialogue termination) and simulation robustness (naturalness, user role adherence, intent adherence). The finding that stronger assistants make worse simulators challenges the assumption that general LLM capability transfers across roles.
-**Product integration:** Directly enables more realistic evaluation of Microsoft's conversational AI products. By replacing prompted assistant simulators with UserLM in evaluation pipelines, teams can surface assistant failures that would be missed by overly cooperative simulated users — particularly relevant for Copilot and other multi-turn interfaces where users naturally underspecify.
-**Future directions:** Opens research on personalized user LMs (simulating specific demographics or domains), user LMs as foundations for judge models (more realistic preference evaluation), synthetic data generation for training more robust assistants, and scaling user LMs beyond 8B parameters for higher-fidelity simulation.
-## Collaborations
-- **Academic:** Georgia Institute of Technology (Tarek Naous, Wei Xu) — co-authorship; Tarek Naous as MSR intern (Summer 2025)
-## Current status
-**Headline:** UserLM-8b achieves 60-70% lower perplexity on real user utterances than all baselines, and reveals a 17% gap in GPT-4o task performance hidden by prompted simulators.
-- Accepted at ICLR 2026
-- UserLM-8b achieves PPL of 5.60 on WildChat (vs. 26+ for prompted assistants), 14.92 on out-of-domain PRISM
-- Intrinsic evaluation: 94.55% first-turn diversity (on par with real users at 94.01%), 63.54 F1 on dialogue termination (vs. 1-15 for prompted assistants)
-- Simulation robustness: 80%+ naturalness score (vs. 0-3% for prompted assistants), 93%+ role and intent adherence
-- Extrinsic evaluation: GPT-4o assistant performance drops from 74.6% to 57.4% with UserLM-8b simulator
-- Released on HuggingFace with 368 likes, 612 monthly downloads, 14 community quantizations
-## Related landscape
-- [USP-8b (Wang et al., 2025) — Fine-tuned Llama3-8b on LMSys-Chat for user simulation](https://arxiv.org/abs/2502.18968)
-- [ChatBench (Chang et al., 2025) — Human-AI evaluation framework bridging static and interactive benchmarks](https://arxiv.org/abs/2504.07114)
-- [PlatoLM (Kong et al., 2024) — Teaching LLMs in multi-round dialogue via user simulator](https://aclanthology.org/2024.acl-long.423/)
-- [Ivey et al., 2024 — Analysis showing LLM-based user simulators fail to replicate human qualities](https://arxiv.org/abs/2409.08330)
-- [Lost in Conversation (Laban et al., 2025) — Companion work on LLM multi-turn degradation](https://arxiv.org/abs/2505.06120)
-## Real-world impact
-- Released UserLM-8b as an open model (MIT license) for the research community to build more realistic evaluation pipelines
-- Demonstrated that conventional evaluation with prompted assistants overestimates multi-turn performance by ~17%, motivating reevaluation of existing benchmarks
-- Established user language modeling as a new post-training paradigm, distinct from and complementary to assistant training
-## Publications & links
-- [Flipping the Dialogue: Training and Evaluating User Language Models — ICLR 2026](https://arxiv.org/abs/2510.06552)
-- [HuggingFace: microsoft/UserLM-8b](https://huggingface.co/microsoft/UserLM-8b)
-- [GitHub: microsoft/lost_in_conversation (companion codebase)](https://github.com/microsoft/lost_in_conversation)