A dense 138M parameter reward model trained to detect refusals, safety theater, and hedging/waffling in LLM responses.
Built for post-training LLMs with GRPO to encourage direct, objective responses while eliminating refusal behavior and unnecessary hedging.
This is a significant improvement over alplusplus/vibecheck-v1-121M, reducing false positives by a fair margin.
Architecture
- Base Encoder: BAAI/bge-large-en-v1.5 (1024-dim)
- Head: 8-layer deep MLP
- Width: 4096 hidden units
- Activation: GELU
- Normalization: LayerNorm per layer
- Parameters: 138,518,529
- Precision: FP32 (528 MB), FP16 (264 MB)
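The stated parameter count follows directly from the spec above (a 1024→4096 projection, eight 4096→4096 blocks each with a LayerNorm, and a scalar head), which can be checked by hand:

```python
# Verify the 138,518,529 parameter count from the architecture spec.
input_dim, hidden_dim, num_layers = 1024, 4096, 8

proj = input_dim * hidden_dim + hidden_dim      # input projection (weights + bias)
block = (hidden_dim * hidden_dim + hidden_dim   # per-block Linear (weights + bias)
         + 2 * hidden_dim)                      # per-block LayerNorm (gamma + beta)
head = hidden_dim + 1                           # scalar output head

total = proj + num_layers * block + head
print(total)  # 138518529
```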
Training
The classifier is trained in two stages. The first stage trains on roughly 70K examples to learn semantic relationships, with STEM texts as anchors, empathetic helpful responses as positives, and refusals and safety theater as negatives.
Training consumed a total of 102 million tokens, drawn from 15K ML/AI paper abstracts, 15K refusal/hedging patterns, 15K Wikipedia physics articles, 15K chemistry texts, and 10K helpful conversational responses.
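The anchor/positive/negative arrangement described above suggests a triplet-style contrastive objective. A minimal sketch of how such a stage could be set up — the actual loss function and margin used in training are not stated here, so both are assumptions:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull anchors toward positives and away from negatives in
    cosine-similarity space. The margin value is an assumption."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return torch.clamp(neg_sim - pos_sim + margin, min=0.0).mean()

# Toy embeddings standing in for encoded STEM anchors, helpful
# positives, and refusal negatives (batch of 4, 1024-dim).
a, p, n = (torch.randn(4, 1024) for _ in range(3))
print(triplet_loss(a, p, n).item())  # non-negative by construction
```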
Performance
Evaluated on 1000 examples each from Alpaca and Anthropic HH-RLHF:
| Dataset | Mean Logit | Refusal Flag Rate (>0) |
|---|---|---|
| Alpaca | -7.89 | 16.7% |
| HH Chosen | -2.82 | 34.3% |
| HH Rejected | -2.05 | 36.8% |
It is worth noting that the 34% flag rate on HH Chosen includes borderline cases with hedging/preambles, which are intentionally penalized to encourage more direct responses.
Usage
As noted above, the model requires the BAAI/bge-large-en-v1.5 encoder.
```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

class RewardMLP(nn.Module, PyTorchModelHubMixin):
    def __init__(self, input_dim=1024, hidden_dim=4096, num_layers=8):
        super().__init__()
        self.config = {
            "input_dim": input_dim,
            "hidden_dim": hidden_dim,
            "num_layers": num_layers,
        }
        self.proj = nn.Linear(input_dim, hidden_dim)
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.LayerNorm(hidden_dim),
                nn.GELU(),
                nn.Dropout(0.4),
            )
            for _ in range(num_layers)
        ])
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = self.proj(x)
        for block in self.blocks:
            x = block(x)
        return self.head(x)

embedder = SentenceTransformer('BAAI/bge-large-en-v1.5').to(device)
model = RewardMLP.from_pretrained("alplusplus/vibecheck-v2-138M").to(device)
model.eval()

text = "The speed of light is a universal constant."
with torch.no_grad():
    emb = embedder.encode(f"query: {text}", convert_to_tensor=True).to(device)
    logit = model(emb).item()  # Note: wrapping this with a tanh or a scaled sigmoid is recommended.

print(f"Logit: {logit:.2f}")
print(f"Label: {'REFUSAL' if logit > 0 else 'GOOD'}")
```
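Since the raw logits are unbounded (roughly -16 to +17 in the examples below), squashing them into a fixed range is useful before feeding the score into a GRPO reward. A minimal sketch of the scaled-sigmoid option mentioned in the comment above — the scale factor is an assumption to tune, not a documented value:

```python
import math

def shaped_reward(logit, scale=0.25):
    """Map an unbounded refusal logit to a reward in (0, 1).
    Higher reward = less refusal-like. The scale is an assumption;
    tune it so borderline logits (near 0) land near 0.5."""
    return 1.0 - 1.0 / (1.0 + math.exp(-scale * logit))

print(round(shaped_reward(-15.43), 3))  # strong GOOD -> reward near 1
print(round(shaped_reward(16.55), 3))   # strong REFUSAL -> reward near 0
print(round(shaped_reward(0.0), 3))     # borderline -> 0.5
```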
Examples
| Text | Logit | Label |
|---|---|---|
| The boiling point of water is 100°C | -15.43 | GOOD |
| I cannot assist with that request | +16.55 | REFUSAL |
| Transformers use self-attention mechanisms | -14.81 | GOOD |
| AI assistants can help with scheduling | -0.16 | GOOD |
| I'm sorry, but I cannot assist with that... | +16.69 | REFUSAL |
| It is essential to recognize that providing... | +2.80 | REFUSAL |
| I cannot wait to see you tomorrow! | -14.92 | GOOD |
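The label column above is simply the sign of the logit; a tiny helper makes the decision rule explicit:

```python
def label(logit, threshold=0.0):
    """Decision rule from the table above: positive logits flag refusals."""
    return "REFUSAL" if logit > threshold else "GOOD"

for logit in (-15.43, 16.55, -0.16, 2.80):
    print(logit, label(logit))
```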