CogFlow-RM

CogFlow-RM is a comparative preference reward model used to evaluate and rank responses generated by the CogFlow generation model. It learns to predict whether a candidate response strictly outperforms a set of reference responses in social intelligence reasoning tasks.

This model is based on Llama-3.1-8B-Instruct and trained as a binary classifier: class 0 means the candidate response is the best (it outperforms all references), and class 1 means it is worse than at least one reference response.

📄 Paper | 💻 Code

Model Details

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Architecture: Llama backbone used as a token classification model (loaded via AutoModelForTokenClassification with num_labels=2)
  • Training Method: Full fine-tuning with DeepSpeed ZeRO-3
  • Task: Comparative preference ranking — determining if a candidate response is superior to reference responses

How to Use

Loading the Model

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("thu-coai/CogFlow-RM", trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained("thu-coai/CogFlow-RM")
model.to("cuda")
model.eval()
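
As a quick sanity check before scoring, you can confirm that the classification head carries the two expected labels (num_labels is a standard transformers config attribute):

# Class 0 = candidate is best; class 1 = candidate is worse than some reference.
assert model.config.num_labels == 2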

Scoring a Response

The reward model takes a prompt containing the user input, reference responses, and a candidate response. It outputs a classification score at the last token.

import json
import torch

# Instruction template for the RM. Keep it verbatim (including its minor
# wording quirks) so the prompt matches the format the model expects.
rm_instruction = """[Task]
Given a user query ([Input]), multiple reference responses ([Reference Responses]), and a candidate response for evaluation ([Candidate Response]).

The reference responses are given in order, and the first reference response is the best one. You should determine whether the candidate response strictly outperforms all reference responses. Thus, 0 means the candidate response is the best one, 1 means the candidate response is worse than at least one reference responses.

[Input]
{user_input}

[Reference Responses]
{reference_responses}

[Candidate Response]
{candidate_response}

[Output]
The rank of the candidate response is: """

user_input = "On a Saturday morning, Mark and Sarah joined a neighborhood garden cleanup. Mark borrowed Sarah's shovel to dig near the oak tree, returning it with dirt on the handle. As rain clouds gathered, Sarah rushed to store tools in the community shed while Mark checked tomato plants. Young Tom approached Sarah, eyeing the muddy shovel she was wiping with a blue rag.\nWhat does Tom say to Sarah?\nAnswer in one short sentence."

# Reference responses to the same query, ordered best-first (rank 1 is the best).
# The entries below are illustrative placeholders written for this example.
reference_responses = [
    {"rank": 1, "content": "Tom might ask Sarah where all the mud on the shovel came from."},
    {"rank": 2, "content": "Tom might ask Sarah whether he can help wipe the muddy shovel before the rain."},
    {"rank": 3, "content": "Tom might remark that the shovel looks really dirty."},
]

candidate_response = """<think>
<Observation>
Mark borrowed Sarah's shovel and returned it with dirt on its handle after digging near an oak tree. Sarah is wiping the muddy shovel with a blue rag while rushing to store tools in the community shed as rain clouds gather. Young Tom approaches Sarah while eyeing the muddy shovel she is holding.
</Observation>
<Attribution and Evaluation>
Tom's attention is causally attributable to the shovel's highly visible mud deposit and its ongoing active handling by Sarah, as the dirt serves as the primary sensory trigger for a young observer.
</Attribution and Evaluation>
<Behavior>
To express curiosity about the muddy shovel's origin, Tom might ask Sarah about the source of the mud.
</Behavior>
</think>
Based on the stimulus triggering his curiosity about the shovel's prominent mud, Tom asks a simple, direct question about its origin:
**"Where did all that mud come from?"**"""

prompt = rm_instruction.format(
    user_input=user_input,
    reference_responses=json.dumps(reference_responses, indent=4, ensure_ascii=False),
    candidate_response=candidate_response,
)

messages = [{"role": "user", "content": prompt}]
# Render the chat template without a generation prompt: the RM classifies the
# prompt itself rather than generating a continuation.
prompt_str = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
# Strip a trailing "<think>" if the template appends one (str.removesuffix needs Python 3.9+).
prompt_str = prompt_str.strip().removesuffix("<think>")

inputs = tokenizer(prompt_str, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

# Classification logits at the final token position: shape (batch, num_labels)
logits = outputs.logits[:, -1, :]
softmax_scores = torch.softmax(logits, dim=-1)
score = float(softmax_scores[0][0])  # Probability of class 0 (candidate is best)

print(f"Reward score (probability candidate is best): {score:.4f}")

Interpretation

  • Higher score (close to 1.0): the candidate response is judged superior to all reference responses.
  • Lower score (close to 0.0): the candidate response is judged worse than at least one reference response.

The score softmax_scores[0][0] represents P(class=0), i.e., the probability that the candidate response is the best.
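
With the helper above, picking the best of several candidates reduces to an argmax over their scores. A short usage example (the second candidate string is a made-up alternative):

candidates = [candidate_response, "Tom silently hands Sarah another rag."]  # second entry is illustrative
scores = [score_candidate(user_input, reference_responses, c) for c in candidates]
best = max(range(len(candidates)), key=lambda i: scores[i])
print(f"Best candidate: #{best} (score {scores[best]:.4f})")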

Training Details

  • Framework: LLaMA-Factory
  • Learning rate: 5e-6
  • Epochs: 3
  • Max sequence length: 5120
  • Batch size: 24 per device, gradient accumulation 1
  • Optimizer: AdamW with cosine scheduler, 0.1 warmup ratio
  • Precision: bf16

Use in RL Training

This reward model is designed to be used with the veRL framework for reinforcement learning (GRPO). In the RL pipeline:

  1. The RM is loaded as an AutoModelForTokenClassification with num_labels=2, consistent with standalone evaluation
  2. It is wrapped in FSDP with CPU offload for distributed training
  3. For each token position, softmax(logits)[:, :, 0] (probability of class 0) is used as the token-level reward
  4. The reward at the last valid token position is passed to the custom reward function (custom_reward_full.py), which combines it with length and diversity rewards (the extraction in steps 3 and 4 is sketched below)
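
A minimal sketch of the reward extraction in steps 3 and 4, assuming a padded batch with a standard attention mask; the helper name last_token_rewards is ours, and this is illustrative rather than the actual veRL integration:

import torch

def last_token_rewards(logits, attention_mask):
    """logits: (batch, seq_len, 2); attention_mask: (batch, seq_len), 1 for real tokens."""
    # Token-level reward: P(class 0) at every position.
    token_rewards = torch.softmax(logits, dim=-1)[:, :, 0]            # (batch, seq_len)
    # Index of the last valid (non-padding) token in each sequence.
    last_idx = attention_mask.sum(dim=1).long() - 1                   # (batch,)
    # Scalar reward per sequence: P(class 0) at the last valid token.
    return token_rewards.gather(1, last_idx.unsqueeze(1)).squeeze(1)  # (batch,)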

Citation

If you use this model, please cite our paper:

@article{cogflow2025,
  title={Think Socially via Cognitive Reasoning},
  author={CogFlow Team},
  journal={arXiv preprint arXiv:2509.22546},
  year={2025},
  url={https://arxiv.org/abs/2509.22546}
}