CogFlow-RM

CogFlow-RM is a comparative preference reward model used to evaluate and rank responses generated by the CogFlow generation model. It learns to predict whether a candidate response strictly outperforms a set of reference responses in social intelligence reasoning tasks.

This model is based on Llama-3.1-8B-Instruct and trained as a binary classifier: class 0 means the candidate response is the best (it outperforms all references), and class 1 means it is worse than at least one reference response.

📄 Paper | 💻 Code

Model Details

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Architecture: Llama backbone used as a token classification model (loaded via AutoModelForTokenClassification with num_labels=2)
  • Training Method: Full fine-tuning with DeepSpeed ZeRO-3
  • Task: Comparative preference ranking — determining if a candidate response is superior to reference responses

How to Use

Loading the Model

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("thu-coai/CogFlow-RM", trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained("thu-coai/CogFlow-RM")
model.to("cuda")
model.eval()
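
As a quick sanity check before scoring, you can confirm that the classification head carries the two expected labels (num_labels is a standard transformers config attribute):

# Class 0 = candidate is best; class 1 = candidate is worse than some reference.
assert model.config.num_labels == 2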

Scoring a Response

The reward model takes a prompt containing the user input, reference responses, and a candidate response. It outputs a classification score at the last token.

import json
import torch

# Instruction template for the RM. Keep it verbatim (including its minor
# wording quirks) so the prompt matches the format the model expects.
rm_instruction = """[Task]
Given a user query ([Input]), multiple reference responses ([Reference Responses]), and a candidate response for evaluation ([Candidate Response]).

The reference responses are given in order, and the first reference response is the best one. You should determine whether the candidate response strictly outperforms all reference responses. Thus, 0 means the candidate response is the best one, 1 means the candidate response is worse than at least one reference responses.

[Input]
{user_input}

[Reference Responses]
{reference_responses}

[Candidate Response]
{candidate_response}

[Output]
The rank of the candidate response is: """

user_input = "On a Saturday morning, Mark and Sarah joined a neighborhood garden cleanup. Mark borrowed Sarah's shovel to dig near the oak tree, returning it with dirt on the handle. As rain clouds gathered, Sarah rushed to store tools in the community shed while Mark checked tomato plants. Young Tom approached Sarah, eyeing the muddy shovel she was wiping with a blue rag.\nWhat does Tom say to Sarah?\nAnswer in one short sentence."

# Reference responses to the same query, ordered best-first (rank 1 is the best).
# The entries below are illustrative placeholders written for this example.
reference_responses = [
    {"rank": 1, "content": "Tom might ask Sarah where all the mud on the shovel came from."},
    {"rank": 2, "content": "Tom might ask Sarah whether he can help wipe the muddy shovel before the rain."},
    {"rank": 3, "content": "Tom might remark that the shovel looks really dirty."},
]

candidate_response = """<think>
<Observation>
Mark borrowed Sarah's shovel and returned it with dirt on its handle after digging near an oak tree. Sarah is wiping the muddy shovel with a blue rag while rushing to store tools in the community shed as rain clouds gather. Young Tom approaches Sarah while eyeing the muddy shovel she is holding.
</Observation>
<Attribution and Evaluation>
Tom's attention is causally attributable to the shovel's highly visible mud deposit and its ongoing active handling by Sarah, as the dirt serves as the primary sensory trigger for a young observer.
</Attribution and Evaluation>
<Behavior>
To express curiosity about the muddy shovel's origin, Tom might ask Sarah about the source of the mud.
</Behavior>
</think>
Based on the stimulus triggering his curiosity about the shovel's prominent mud, Tom asks a simple, direct question about its origin:
**"Where did all that mud come from?"**"""

prompt = rm_instruction.format(
    user_input=user_input,
    reference_responses=json.dumps(reference_responses, indent=4, ensure_ascii=False),
    candidate_response=candidate_response,
)

messages = [{"role": "user", "content": prompt}]
# Render the chat template without a generation prompt: the RM classifies the
# prompt itself rather than generating a continuation.
prompt_str = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
# Strip a trailing "<think>" if the template appends one (str.removesuffix needs Python 3.9+).
prompt_str = prompt_str.strip().removesuffix("<think>")

inputs = tokenizer(prompt_str, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

# Classification logits at the final token position: shape (batch, num_labels)
logits = outputs.logits[:, -1, :]
softmax_scores = torch.softmax(logits, dim=-1)
score = float(softmax_scores[0][0])  # Probability of class 0 (candidate is best)

print(f"Reward score (probability candidate is best): {score:.4f}")

Interpretation

  • Higher score (close to 1.0): the candidate response is judged superior to all reference responses.
  • Lower score (close to 0.0): the candidate response is judged worse than at least one reference response.

The score softmax_scores[0][0] represents P(class=0), i.e., the probability that the candidate response is the best.
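
With the helper above, picking the best of several candidates reduces to an argmax over their scores. A short usage example (the second candidate string is a made-up alternative):

candidates = [candidate_response, "Tom silently hands Sarah another rag."]  # second entry is illustrative
scores = [score_candidate(user_input, reference_responses, c) for c in candidates]
best = max(range(len(candidates)), key=lambda i: scores[i])
print(f"Best candidate: #{best} (score {scores[best]:.4f})")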

Training Details

  • Framework: LLaMA-Factory
  • Learning rate: 5e-6
  • Epochs: 3
  • Max sequence length: 5120
  • Batch size: 24 per device, gradient accumulation 1
  • Optimizer: AdamW with cosine scheduler, 0.1 warmup ratio
  • Precision: bf16

Use in RL Training

This reward model is designed to be used with the veRL framework for reinforcement learning (GRPO). In the RL pipeline:

  1. The RM is loaded as an AutoModelForTokenClassification with num_labels=2, consistent with standalone evaluation
  2. It is wrapped in FSDP with CPU offload for distributed training
  3. For each token position, softmax(logits)[:, :, 0] (probability of class 0) is used as the token-level reward
  4. The reward at the last valid token position is passed to the custom reward function (custom_reward_full.py), which combines it with length and diversity rewards (the extraction in steps 3 and 4 is sketched below)
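
A minimal sketch of the reward extraction in steps 3 and 4, assuming a padded batch with a standard attention mask; the helper name last_token_rewards is ours, and this is illustrative rather than the actual veRL integration:

import torch

def last_token_rewards(logits, attention_mask):
    """logits: (batch, seq_len, 2); attention_mask: (batch, seq_len), 1 for real tokens."""
    # Token-level reward: P(class 0) at every position.
    token_rewards = torch.softmax(logits, dim=-1)[:, :, 0]            # (batch, seq_len)
    # Index of the last valid (non-padding) token in each sequence.
    last_idx = attention_mask.sum(dim=1).long() - 1                   # (batch,)
    # Scalar reward per sequence: P(class 0) at the last valid token.
    return token_rewards.gather(1, last_idx.unsqueeze(1)).squeeze(1)  # (batch,)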

Citation

If you use this model, please cite our paper:

@article{cogflow2025,
  title={Think Socially via Cognitive Reasoning},
  author={CogFlow Team},
  journal={arXiv preprint arXiv:2509.22546},
  year={2025},
  url={https://arxiv.org/abs/2509.22546}
}