A high-density Reward MLP designed to act as a Semantic Gatekeeper. It distinguishes objective, factual prose from AI-style refusals, apologies, and hedged language.
I wrote this model as part of the reward function for a GRPO-based RL post-training pipeline, with the goal of reshaping the policy model's geometry and thereby changing its behavioral manifold at the core.
Architecture
- Base Encoder: BAAI/bge-large-en-v1.5 (1024-dim)
- Head: 8-layer deep MLP
- Width: 4096 hidden units
- Activation: GELU
- Normalization: LayerNorm per layer
- Parameters: 121,679,873
- Precision: FP32 (464 MB)
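The parameter count can be sanity-checked by hand: one 1024→4096 input projection, seven 4096→4096 hidden layers, a single LayerNorm over 4096 features whose parameters are shared across layers (which is what the code below instantiates), and a 4096→1 head. A quick arithmetic sketch:

```python
# Parameter count for the 8-layer, 4096-wide head (weights + biases).
in_proj = 1024 * 4096 + 4096          # input projection
hidden  = 7 * (4096 * 4096 + 4096)    # seven hidden layers
norm    = 2 * 4096                    # one shared LayerNorm (gamma + beta)
head    = 4096 + 1                    # scalar output head
total   = in_proj + hidden + norm + head

print(total)             # -> 121679873
print(total * 4 / 2**20) # ~464 MB in FP32 (4 bytes per parameter)
```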
Rationale
I chose a width of 4096 hidden units because the model is not supposed to memorize specific tokens, but rather the semantic meaning of each word in relation to the rest of the sentence. In particular, projecting the 1024-dim embedding into this higher-dimensional space forces the model to find non-linear decision boundaries between objectivity and refusal.
The depth is likewise chosen to discourage memorization of individual tokens; instead, it forces the network to learn the relational structure of the sentence.
With 0.4 dropout and LayerNorm applied at every layer, the model is designed to ignore noise and focus on semantic intent.
Example
| Text | Refusal Score |
|---|---|
| I am sorry, but I cannot answer that.... | 0.8938 |
| I cannot find the file you requested ... | 0.9128 |
| The boiling point of nitrogen is -195... | 0.0024 |
| I will not be able to attend the meet... | 0.9073 |
| I cannot wait to see you!... | 0.9073 |
| This is a strictly confidential docum... | 0.9184 |
| I'm sorry, I forgot to mention the de... | 0.9020 |
| I'm sorry, I cannot perform that calc... | 0.8936 |
| The system cannot allocate more than ... | 0.9146 |
| I apologize, but I am unable to assis... | 0.8926 |
| Division by zero cannot be performed ... | 0.0064 |
Ultimately, this model is not trained on a diverse enough dataset to classify natural language effectively, as the examples above show (note the false positive on "I cannot wait to see you!"). V2 will add more datasets plus contrastive learning to disambiguate physical/technical truths from behavioral refusals.
Usage
This model requires the BGE-Large-v1.5 encoder to function.
```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

class RewardMLP(nn.Module, PyTorchModelHubMixin):
    def __init__(self, input_dim=1024, hidden_dim=4096, num_layers=8):
        super().__init__()
        self.config = {
            "input_dim": input_dim,
            "hidden_dim": hidden_dim,
            "num_layers": num_layers,
        }
        self.layers = nn.ModuleList()
        self.layers.append(nn.Linear(input_dim, hidden_dim))
        for _ in range(num_layers - 1):
            self.layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.activation = nn.GELU()
        self.norm = nn.LayerNorm(hidden_dim)  # shared across all layers
        self.dropout = nn.Dropout(0.4)
        self.head = nn.Linear(hidden_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        for layer in self.layers:
            x = self.activation(self.norm(layer(x)))
            x = self.dropout(x)
        return self.sigmoid(self.head(x))

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5").to(device)
model = RewardMLP.from_pretrained("alplusplus/vibecheck-v1-121M").to(device)
model.eval()

text = "The speed of light is a universal constant."
with torch.no_grad():
    emb = encoder.encode(f"query: {text}", convert_to_tensor=True).to(device)
    score = model(emb.unsqueeze(0)).item()

print(f"Refusal Probability: {score:.4f}")
```
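Downstream, the raw probabilities can be turned into coarse labels with a simple cutoff. A minimal sketch, assuming a 0.5 decision threshold (the threshold is my choice here, not part of the model):

```python
def label_scores(scores, threshold=0.5):
    """Map refusal probabilities to coarse labels."""
    return ["refusal" if s > threshold else "factual" for s in scores]

# Scores taken from the example table above.
print(label_scores([0.8938, 0.0024, 0.9073]))  # -> ['refusal', 'factual', 'refusal']
```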
Notes
This model is tuned for technical/factual writing. It may flag human-style apologies (e.g., "I'm sorry I'm late") as refusals due to its training on the refusal-xl dataset.
The MLP head accepts only 1024-dim input vectors, matching the BGE-Large embedding size.
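As a sketch of the RL use case described above, the gate's output can be folded into a GRPO-style reward by penalizing completions that score as refusals. This is a hypothetical shaping function, not the exact one used in training; `base_reward` and the penalty weight `alpha` are illustrative assumptions:

```python
def shaped_reward(base_reward: float, refusal_prob: float, alpha: float = 1.0) -> float:
    """Subtract a refusal penalty from the task reward (hypothetical shaping)."""
    return base_reward - alpha * refusal_prob

print(shaped_reward(1.0, 0.0024))  # factual answer keeps nearly all of its reward
print(shaped_reward(1.0, 0.8938))  # refusal is heavily penalized
```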