SocialR1-4B

SocialR1-4B is a social reasoning model built on Qwen3-4B, trained with trajectory-level reinforcement learning (GRPO) using the Social-R1 framework. It improves social reasoning by aligning the model's reasoning process with Social Information Processing (SIP) theory.

📄 Paper: Social-R1: Enhancing Social Reasoning in LLMs through Trajectory-Level Reinforcement Learning

Highlights

  • 🧠 SIP-Guided Reasoning: Enforces stage-consistent social inference: Cue Encoding → Cue Interpretation → Goal Clarification → Response Generation
  • 🎯 Multi-Dimensional Reward: Combines structural reward, content reward, inference efficiency, and format reward with curriculum-style weighting
  • 📊 Strong Performance: Enables a 4B-parameter model to match or outperform substantially larger baselines across static MCQ benchmarks, open-ended generation (FanToM), and interactive settings (SOTOPIA)
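
The four SIP stages above imply an ordered reasoning pipeline. A small check like the following can verify that a reasoning trace visits the stages in order — a hedged sketch only, since the bracketed stage-label format shown here is a hypothetical illustration, not the model's documented output format:

```python
# Hypothetical stage labels; the model's actual trace format may differ.
SIP_STAGES = [
    "Cue Encoding",
    "Cue Interpretation",
    "Goal Clarification",
    "Response Generation",
]

def stages_in_order(trace: str) -> bool:
    """Return True if every SIP stage label appears in the trace, in order."""
    pos = -1
    for stage in SIP_STAGES:
        idx = trace.find(stage, pos + 1)
        if idx <= pos:  # missing (-1) or out of order
            return False
        pos = idx
    return True

trace = (
    "[Cue Encoding] She crossed her arms and looked away. "
    "[Cue Interpretation] She may feel hurt or defensive. "
    "[Goal Clarification] I want to reassure her without pressing. "
    "[Response Generation] 'I didn't mean to upset you.'"
)
print(stages_in_order(trace))  # True
```

A structural reward along these lines could score whether a rollout respects the SIP ordering, though the paper's actual reward computation is not specified in this card.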

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jincenzi/SocialR1-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# The first part of the prompt is the model's expected reasoning-format
# instruction; the question after it is an example social-reasoning query.
messages = [
    {"role": "user", "content": (
        "You should first think about the reasoning process in the mind and then provide the answer. "
        "The reasoning process and answer are enclosed within <think> </think> and <Answer> </Answer> tags, respectively.\n\n"
        "Sally puts her ball in the basket and leaves the room. Anne moves the ball into a box. "
        "Where will Sally look for the ball when she returns?"
    )}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
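
Since the model wraps its reasoning in <think> </think> and its answer in <Answer> </Answer> tags, the final answer can be pulled out of a generation with a small regex helper (a minimal sketch; tag casing follows the prompt above):

```python
import re

def extract_answer(generation: str) -> "str | None":
    """Return the text inside the first <Answer>...</Answer> span, or None."""
    match = re.search(r"<Answer>(.*?)</Answer>", generation, flags=re.DOTALL)
    return match.group(1).strip() if match else None

sample = (
    "<think>Sally did not see Anne move the ball, so she holds a false belief.</think>"
    "<Answer>Sally will look in the basket.</Answer>"
)
print(extract_answer(sample))  # Sally will look in the basket.
```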

Training Details

  • Base Model: Qwen3-4B
  • Training Method: Group Relative Policy Optimization (GRPO)
  • Training Steps: 600
  • Hardware: 8× NVIDIA A100 (80GB)
  • Group Size: 5
  • KL Coefficient: 0.04
  • Learning Rate: 5×10⁻⁷
  • Reward Design: SIP structural reward ($R_\text{struct}$) + SIP content reward ($R_\text{cont}$) + inference efficiency ($R_\text{len}$) + format reward ($R_\text{fmt}$)
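
The card describes curriculum-style weighting of the four reward terms but not the schedule itself. One plausible shape is a weighted sum whose weights interpolate over training steps, e.g. from format/structure emphasis early toward content emphasis later — a hedged sketch with illustrative weight values, not the paper's actual schedule:

```python
def curriculum_weights(step: int, total_steps: int = 600) -> "dict[str, float]":
    """Linearly shift reward emphasis over training (illustrative values only)."""
    t = min(step / total_steps, 1.0)
    return {
        "struct": 0.4 * (1 - t) + 0.2 * t,  # R_struct: SIP stage structure
        "cont":   0.2 * (1 - t) + 0.5 * t,  # R_cont: reasoning content
        "len":    0.1,                      # R_len: inference efficiency
        "fmt":    0.3 * (1 - t) + 0.2 * t,  # R_fmt: tag/format compliance
    }

def total_reward(rewards: "dict[str, float]", step: int) -> float:
    """Combine per-component rewards with the current curriculum weights."""
    w = curriculum_weights(step)
    return sum(w[k] * rewards[k] for k in rewards)

r = {"struct": 1.0, "cont": 0.8, "len": 0.5, "fmt": 1.0}
print(round(total_reward(r, step=0), 3))    # 0.91
print(round(total_reward(r, step=600), 3))  # 0.85
```

Early in training this weighting rewards well-formed, SIP-structured traces; by step 600 the same rollout scores mainly on content quality.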

Evaluation

SocialR1-4B is evaluated across three complementary settings:

  • Static MCQ: ToMBench, ToMBench-Hard, SocialIQA, SimpleToM, EmoBench, MotiveBench, Hi-ToM, TactfulToM
  • Open-ended Generation: FanToM
  • Interactive Social Intelligence: SOTOPIA

Related Resources

Resource       Link
Paper          arXiv:2603.09249
SocialR1-8B    Jincenzi/SocialR1-8B

Citation

@inproceedings{wu2026socialr1,
  title={Social-R1: Enhancing Social Reasoning in LLMs through Trajectory-Level Reinforcement Learning},
  author={Wu, Jincenzi and Lei, Yuxuan and Lian, Jianxun and Huang, Yitian and Zhou, Lexin and Li, Haotian and Yang, Deng and Xie, Xing and Meng, Helen},
  booktitle={arXiv preprint arXiv:2603.09249},
  year={2026}
}

Contact

For questions or discussions, please contact jincenziwu@gmail.com.
