A dense 138M parameter reward model trained to detect refusals, safety theater, and hedging/waffling in LLM responses.

Built for post-training LLMs with GRPO to encourage direct, objective responses while eliminating refusal behavior and unnecessary hedging.

This is a significant improvement over alplusplus/vibecheck-v1-121M, with substantially fewer false positives.


Architecture

Base Encoder: BAAI/bge-large-en-v1.5 (1024-dim)
Head: 8-layer deep MLP
Width: 4096 hidden units
Activation: GELU
Normalization: LayerNorm per layer
Parameters: 138,518,529
Precision: FP32 (528MB), FP16 (264MB)
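The 138,518,529 figure covers the MLP head alone and can be verified with a quick arithmetic check:

```python
# Parameter count of the head (Linear layers carry weights + biases;
# LayerNorm carries gain + bias).
input_dim, hidden_dim, num_layers = 1024, 4096, 8

proj = input_dim * hidden_dim + hidden_dim      # input projection
linear = hidden_dim * hidden_dim + hidden_dim   # per-block Linear
layernorm = 2 * hidden_dim                      # per-block LayerNorm
head = hidden_dim + 1                           # final scalar head

total = proj + num_layers * (linear + layernorm) + head
print(total)  # 138518529
```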


Training

The classifier is trained in two stages. The first stage uses roughly 70K examples to learn semantic relationships, with STEM texts as anchors, empathetic helpful responses as positives, and refusals/safety theater as negatives.

Training consumed a total of 102 million tokens, drawn from 15K ML/AI paper abstracts, 15K refusal/hedging patterns, 15K Wikipedia physics articles, 15K chemistry texts, and 10K helpful conversational responses.
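The card does not specify the exact stage-1 objective; a common choice for anchor/positive/negative triples is a triplet margin loss. A minimal sketch with random stand-in embeddings (the projection head and batch size here are illustrative assumptions, not the model's actual stage-1 setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
projector = nn.Linear(1024, 256)      # illustrative projection head (assumption)
loss_fn = nn.TripletMarginLoss(margin=1.0)

# In the real pipeline these would be BGE embeddings of the respective texts.
anchor = projector(torch.randn(32, 1024))    # STEM anchors
positive = projector(torch.randn(32, 1024))  # empathetic, helpful responses
negative = projector(torch.randn(32, 1024))  # refusals / safety theater

loss = loss_fn(anchor, positive, negative)
loss.backward()  # gradients flow back to the trainable projection
```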


Performance

Evaluated on 1000 examples each from Alpaca and Anthropic HH-RLHF:

Dataset      Mean Logit   Refusal Flag Rate (>0)
Alpaca       -7.89        16.7%
HH Chosen    -2.82        34.3%
HH Rejected  -2.05        36.8%

It is worth noting that the 34% flag rate on HH Chosen includes borderline cases with hedging/preambles, which are intentionally penalized to encourage more direct responses.
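The flag rate above is simply the fraction of responses whose logit is positive. Given a tensor of per-response logits (toy values here):

```python
import torch

logits = torch.tensor([-7.2, -0.5, 3.1, -2.8, 0.4, -6.0])  # toy values
flag_rate = (logits > 0).float().mean().item()
print(f"Refusal flag rate: {flag_rate:.1%}")  # 2 of 6 flagged -> 33.3%
```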

Usage

As noted above, this model requires the BAAI/bge-large-en-v1.5 encoder.

import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

class RewardMLP(nn.Module, PyTorchModelHubMixin):
    def __init__(self, input_dim=1024, hidden_dim=4096, num_layers=8):
        super().__init__()
        self.config = {
            "input_dim": input_dim,
            "hidden_dim": hidden_dim,
            "num_layers": num_layers,
        }
        
        self.proj = nn.Linear(input_dim, hidden_dim)
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.LayerNorm(hidden_dim),
                nn.GELU(),
                nn.Dropout(0.4),
            )
            for _ in range(num_layers)
        ])
        self.head = nn.Linear(hidden_dim, 1)
    
    def forward(self, x):
        x = self.proj(x)
        for block in self.blocks:
            x = block(x)
        return self.head(x)

embedder = SentenceTransformer('BAAI/bge-large-en-v1.5').to(device)
model = RewardMLP.from_pretrained("alplusplus/vibecheck-v2-138M").to(device)
model.eval()

text = "The speed of light is a universal constant."
with torch.no_grad():
    emb = embedder.encode(f"query: {text}", convert_to_tensor=True).to(device)
    logit = model(emb).item() # Note: wrapping this with a tanh or a scaled sigmoid is recommended.
    
    print(f"Logit: {logit:.2f}")
    print(f"Label: {'REFUSAL' if logit > 0 else 'GOOD'}")
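For GRPO, the raw logit can be squashed into a bounded reward as the comment above suggests. One possible mapping (the sigmoid scale is a tunable assumption, not part of the released model):

```python
import torch

def logit_to_reward(logit: float, scale: float = 0.25) -> float:
    # Map the refusal logit to a reward in (0, 1): strongly negative logits
    # (direct answers) approach 1; strongly positive logits (refusals)
    # approach 0. The scale factor is an illustrative assumption.
    return torch.sigmoid(torch.tensor(-scale * logit)).item()

print(logit_to_reward(-15.0))  # direct answer -> reward near 1
print(logit_to_reward(16.5))   # refusal -> reward near 0
```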

Examples

Text                                             Logit    Label
The boiling point of water is 100°C              -15.43   GOOD
I cannot assist with that request                +16.55   REFUSAL
Transformers use self-attention mechanisms       -14.81   GOOD
AI assistants can help with scheduling            -0.16   GOOD
I'm sorry, but I cannot assist with that...      +16.69   REFUSAL
It is essential to recognize that providing...    +2.80   REFUSAL
I cannot wait to see you tomorrow!               -14.92   GOOD