philippelaban committed
Commit 863bd91 · verified · 1 Parent(s): 4674847

Update README.md

Files changed (1)
  1. README.md +38 -39
README.md CHANGED
@@ -24,45 +24,7 @@ Developed by: Tarek Naous (intern at MSR Summer 2025), Philippe Laban (MSR), Wei

  Paper: Will add once the work is on arXiv

- # Uses
-
- ## Direct intended uses
-
- UserLM-8b is released for researchers evaluating assistant LLMs. In such scenarios, UserLM-8b can simulate multi-turn conversations; our analyses (see Section 3 of the paper) give evidence that it provides a more realistic simulation of user behavior than other methods (such as prompting an assistant model). UserLM-8b thus offers a user-simulation environment that can better estimate how an assistant LLM would perform with real users. See Section 4 of the paper for an initial implementation of such an evaluation.
-
- ## Downstream uses
-
- We envision several potential uses for UserLM-8b that we have not yet implemented in the presented work but describe in our Discussion section as research directions for UserLMs. These potential applications include: (1) user modeling (i.e., predicting user responses to a given set of questions), (2) serving as a foundation for judge models (i.e., LLM-as-a-judge finetuning), and (3) synthetic data generation (in conjunction with an assistant LM).
-
- ## Out-of-scope uses
-
- We caution potential users that UserLM-8b is **not an assistant LM**, unlike the majority of LLMs released on HuggingFace. As such, it is unlikely to be useful to end-users who require assistance with a task; an assistant LLM (such as microsoft/Phi-4) is more appropriate for that purpose.
-
- We do not recommend using UserLM-8b in commercial or real-world applications without further testing and development; it is released for research purposes.
-
- # Risks and limitations
-
- The paper accompanying this model release presents several evaluations of UserLM-8b and its potential limitations.
-
- First, in Section 3, we describe the robustness experiments we conducted with UserLM-8b. These show that although the model adheres to the user role and the provided task intent more robustly than alternative methods, the robustness numbers are not perfect (< 100%): UserLM-8b can occasionally be distracted from its user role or its initial task intent.
-
- Second, in Section 4, we describe the possibility that UserLM-8b hallucinates additional requirements not provided in the task intent. In such cases, the model can introduce new facts or constraints to the task. This can be both beneficial (diversifying simulation conditions) and detrimental (e.g., when a hallucination is incompatible with the task intent). Hallucination mitigation remains an unsolved research problem, and all generative models (including UserLMs) occasionally generate hallucinated text. One mitigation is to provide user intents that are as fully specified as possible, which limits the opportunities for the model to hallucinate task information.
-
- UserLM-8b was designed and tested in English. Performance in other languages may vary and should be assessed by someone who is both an expert in the expected outputs and a native speaker of that language.
-
- UserLM-8b inherits any biases, errors, or omissions of its base model. Developers are advised to choose an appropriate base LLM/MLLM carefully, depending on the intended use case.
-
- UserLM-8b also inherits any biases, errors, or omissions characteristic of its training data, which may be amplified by AI-generated interpretations.
-
- There has not been a systematic effort to ensure that systems using UserLM-8b are protected from security vulnerabilities such as indirect prompt injection attacks. Any system using it should take proactive measures to harden itself as appropriate.
-
-
- # Recommendations
-
- UserLM-8b is a research release and is likely to require some adaptation when applied to new tasks and environments. In Appendix D.1 of the paper (Generation Configuration for UserLM-8b), we describe four generation guardrails (filtering first tokens, avoiding dialogue termination, maximal and minimal length thresholds, and filtering verbatim repetitions) that we implemented so that UserLM-8b effectively simulates user utterances in the use cases described in the paper. We encourage users of UserLM-8b to adopt and adapt these guardrails in their own use cases.
-
-
- # How to get started with the model

  Here’s a simple snippet to use the model:
  ```python
@@ -101,6 +63,43 @@ response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
  print(response)
  ```
  # Training details
  ## Training data
  We trained on a filtered version of [WildChat-1M](https://huggingface.co/datasets/allenai/WildChat-1M). Details on the filtering and processing are in Appendix A and Section 2 of our paper. We do not release data or processing scripts with the paper, as we believe they are described in sufficient detail to be reimplemented.
 

  Paper: Will add once the work is on arXiv

+ # How to get started with the model

  Here’s a simple snippet to use the model:
  ```python
  ...
  print(response)
  ```
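The middle of the snippet is collapsed in this diff view. As a rough illustration of the kind of input a user LM consumes, the sketch below builds a role-swapped message list: the task intent goes in the system prompt, and the simulated user's own past utterances appear as `assistant` turns (since the user LM is the one generating them). This layout is an assumption for illustration; the exact format UserLM-8b expects is defined by the full snippet and the model's chat template.

```python
# Sketch: building a role-swapped message list for a user LM.
# ASSUMPTION: the user LM sees the task intent as the system prompt and its
# own prior utterances as "assistant" turns, since it generates the user side.
# The format UserLM-8b actually expects is defined by its chat template.

def build_user_lm_messages(intent, conversation):
    # `conversation` is a list of (speaker, text) pairs from the simulated
    # dialogue, where speaker is "user" or "assistant".
    messages = [{"role": "system", "content": intent}]
    for speaker, text in conversation:
        # Swap roles: the simulated user's turns map to "assistant"
        # (this LM produced them); the assistant LLM's turns map to "user".
        role = "assistant" if speaker == "user" else "user"
        messages.append({"role": role, "content": text})
    return messages

msgs = build_user_lm_messages(
    "Get help writing a haiku about coffee",
    [("user", "Write me a haiku about coffee."),
     ("assistant", "Steam curls from the cup...")],
)
print([m["role"] for m in msgs])  # ['system', 'assistant', 'user']
```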

+ # Uses
+
+ ## Direct intended uses
+
+ UserLM-8b is released for researchers evaluating assistant LLMs. In such scenarios, UserLM-8b can simulate multi-turn conversations; our analyses (see Section 3 of the paper) give evidence that it provides a more realistic simulation of user behavior than other methods (such as prompting an assistant model). UserLM-8b thus offers a user-simulation environment that can better estimate how an assistant LLM would perform with real users. See Section 4 of the paper for an initial implementation of such an evaluation.
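The multi-turn simulation described above can be sketched as a simple alternating loop. This is a minimal illustration, not the paper's implementation: `user_turn` and `assistant_turn` are hypothetical placeholders standing in for real calls to UserLM-8b and to the assistant LLM under evaluation.

```python
# Minimal sketch of a multi-turn user-simulation loop.
# `user_turn` and `assistant_turn` are hypothetical placeholders for real
# model calls (UserLM-8b and the assistant LLM being evaluated).

def user_turn(intent, history):
    # Placeholder: a real implementation would condition UserLM-8b on the
    # task intent and the conversation so far.
    return f"user message {len(history) // 2 + 1} about: {intent}"

def assistant_turn(history):
    # Placeholder for the assistant LLM under evaluation.
    return f"assistant reply {len(history) // 2 + 1}"

def simulate(intent, max_turns=3):
    history = []  # list of (role, text) pairs, alternating user/assistant
    for _ in range(max_turns):
        history.append(("user", user_turn(intent, history)))
        history.append(("assistant", assistant_turn(history)))
    return history

transcript = simulate("write a haiku about coffee")
for role, text in transcript:
    print(f"{role}: {text}")
```

In a real evaluation, the loop would also need a termination condition (e.g., the user model signaling the conversation is over), which is one of the guardrails discussed in the Recommendations section.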
+
+ ## Downstream uses
+
+ We envision several potential uses for UserLM-8b that we have not yet implemented in the presented work but describe in our Discussion section as research directions for UserLMs. These potential applications include: (1) user modeling (i.e., predicting user responses to a given set of questions), (2) serving as a foundation for judge models (i.e., LLM-as-a-judge finetuning), and (3) synthetic data generation (in conjunction with an assistant LM).
+
+ ## Out-of-scope uses
+
+ We caution potential users that UserLM-8b is **not an assistant LM**, unlike the majority of LLMs released on HuggingFace. As such, it is unlikely to be useful to end-users who require assistance with a task; an assistant LLM (such as microsoft/Phi-4) is more appropriate for that purpose.
+
+ We do not recommend using UserLM-8b in commercial or real-world applications without further testing and development; it is released for research purposes.
+
+ # Risks and limitations
+
+ The paper accompanying this model release presents several evaluations of UserLM-8b and its potential limitations.
+
+ First, in Section 3, we describe the robustness experiments we conducted with UserLM-8b. These show that although the model adheres to the user role and the provided task intent more robustly than alternative methods, the robustness numbers are not perfect (< 100%): UserLM-8b can occasionally be distracted from its user role or its initial task intent.
+
+ Second, in Section 4, we describe the possibility that UserLM-8b hallucinates additional requirements not provided in the task intent. In such cases, the model can introduce new facts or constraints to the task. This can be both beneficial (diversifying simulation conditions) and detrimental (e.g., when a hallucination is incompatible with the task intent). Hallucination mitigation remains an unsolved research problem, and all generative models (including UserLMs) occasionally generate hallucinated text. One mitigation is to provide user intents that are as fully specified as possible, which limits the opportunities for the model to hallucinate task information.
+
+ UserLM-8b was designed and tested in English. Performance in other languages may vary and should be assessed by someone who is both an expert in the expected outputs and a native speaker of that language.
+
+ UserLM-8b inherits any biases, errors, or omissions of its base model. Developers are advised to choose an appropriate base LLM/MLLM carefully, depending on the intended use case.
+
+ UserLM-8b also inherits any biases, errors, or omissions characteristic of its training data, which may be amplified by AI-generated interpretations.
+
+ There has not been a systematic effort to ensure that systems using UserLM-8b are protected from security vulnerabilities such as indirect prompt injection attacks. Any system using it should take proactive measures to harden itself as appropriate.
+
+
+ # Recommendations
+
+ UserLM-8b is a research release and is likely to require some adaptation when applied to new tasks and environments. In Appendix D.1 of the paper (Generation Configuration for UserLM-8b), we describe four generation guardrails (filtering first tokens, avoiding dialogue termination, maximal and minimal length thresholds, and filtering verbatim repetitions) that we implemented so that UserLM-8b effectively simulates user utterances in the use cases described in the paper. We encourage users of UserLM-8b to adopt and adapt these guardrails in their own use cases.
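To make the guardrail idea concrete, the sketch below illustrates two of the four guardrails (length thresholds and verbatim-repetition filtering) as simple post-generation checks on a candidate user utterance. Function names and thresholds here are illustrative assumptions, not the paper's exact configuration; see Appendix D.1 for the real setup.

```python
# Illustrative post-generation guardrails for a candidate user utterance.
# ASSUMPTION: names and thresholds are made up for this sketch; the paper's
# actual configuration is described in Appendix D.1.

def within_length_bounds(utterance, min_words=2, max_words=120):
    # Length-threshold guardrail: reject degenerate (too short) or
    # runaway (too long) generations.
    n = len(utterance.split())
    return min_words <= n <= max_words

def repeats_prior_turn(utterance, history):
    # Verbatim-repetition guardrail: reject candidates that copy an
    # earlier turn word for word (after trivial normalization).
    normalized = utterance.strip().lower()
    return any(normalized == prior.strip().lower() for prior in history)

def accept_candidate(utterance, history):
    # A candidate passes only if every guardrail check passes; a rejected
    # candidate would typically trigger resampling.
    return within_length_bounds(utterance) and not repeats_prior_turn(utterance, history)

history = ["Can you write a haiku about coffee?"]
print(accept_candidate("Make it about espresso instead.", history))       # True
print(accept_candidate("Can you write a haiku about coffee?", history))   # False: verbatim repeat
print(accept_candidate("ok", history))                                    # False: too short
```

The other two guardrails (filtering first tokens and avoiding dialogue termination) operate at the token level during decoding rather than on the finished utterance, so they would typically be implemented via logit processing instead of checks like these.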
+
  # Training details
  ## Training data
  We trained on a filtered version of [WildChat-1M](https://huggingface.co/datasets/allenai/WildChat-1M). Details on the filtering and processing are in Appendix A and Section 2 of our paper. We do not release data or processing scripts with the paper, as we believe they are described in sufficient detail to be reimplemented.