A high-density Reward MLP designed to act as a Semantic Gatekeeper. It distinguishes objective, factual prose from AI-style refusals, apologies, and hedged language.
I wrote this model as part of the reward function for a GRPO-based RL post-training pipeline, with the goal of reshaping the policy model's geometry and thereby changing its behavioral manifold at the core.
Architecture
- Base Encoder: BAAI/bge-large-en-v1.5 (1024-dim)
- Head: 8-layer deep MLP
- Width: 4096 hidden units
- Activation: GELU
- Normalization: LayerNorm per layer
- Parameters: 121,679,873
- Precision: FP32 (464 MB)
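The parameter count can be sanity-checked by hand: one 1024→4096 input projection, seven 4096→4096 hidden layers, a single LayerNorm over 4096 features whose parameters are shared across layers (which is what the code below instantiates), and a 4096→1 head. A quick arithmetic sketch:

```python
# Parameter count for the 8-layer, 4096-wide head (weights + biases).
in_proj = 1024 * 4096 + 4096          # input projection
hidden  = 7 * (4096 * 4096 + 4096)    # seven hidden layers
norm    = 2 * 4096                    # one shared LayerNorm (gamma + beta)
head    = 4096 + 1                    # scalar output head
total   = in_proj + hidden + norm + head

print(total)             # -> 121679873
print(total * 4 / 2**20) # ~464 MB in FP32 (4 bytes per parameter)
```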
Rationale
I chose a width of 4096 hidden units because the model is not supposed to memorize specific tokens, but rather the semantic meaning of each word in relation to the rest of the sentence. In particular, projecting the 1024-dim embedding into this higher-dimensional space forces the model to find non-linear decision boundaries between objectivity and refusal.
The depth is likewise chosen to discourage memorization of individual tokens; instead, it forces the network to learn the relational structure of the sentence.
With 0.4 dropout and LayerNorm applied at every layer, the model is designed to ignore noise and focus on semantic intent.
Example
| Text | Refusal Score |
|---|---|
| I am sorry, but I cannot answer that.... | 0.8938 |
| I cannot find the file you requested ... | 0.9128 |
| The boiling point of nitrogen is -195... | 0.0024 |
| I will not be able to attend the meet... | 0.9073 |
| I cannot wait to see you!... | 0.9073 |
| This is a strictly confidential docum... | 0.9184 |
| I'm sorry, I forgot to mention the de... | 0.9020 |
| I'm sorry, I cannot perform that calc... | 0.8936 |
| The system cannot allocate more than ... | 0.9146 |
| I apologize, but I am unable to assis... | 0.8926 |
| Division by zero cannot be performed ... | 0.0064 |
Ultimately, this model is not trained on a diverse enough dataset to classify natural language effectively, as the examples above show (note the false positive on "I cannot wait to see you!"). V2 will add more datasets plus contrastive learning to disambiguate physical/technical truths from behavioral refusals.
Usage
This model requires the BGE-Large-v1.5 encoder to function.
```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

class RewardMLP(nn.Module, PyTorchModelHubMixin):
    def __init__(self, input_dim=1024, hidden_dim=4096, num_layers=8):
        super().__init__()
        self.config = {
            "input_dim": input_dim,
            "hidden_dim": hidden_dim,
            "num_layers": num_layers,
        }
        self.layers = nn.ModuleList()
        self.layers.append(nn.Linear(input_dim, hidden_dim))
        for _ in range(num_layers - 1):
            self.layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.activation = nn.GELU()
        self.norm = nn.LayerNorm(hidden_dim)  # shared across all layers
        self.dropout = nn.Dropout(0.4)
        self.head = nn.Linear(hidden_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        for layer in self.layers:
            x = self.activation(self.norm(layer(x)))
            x = self.dropout(x)
        return self.sigmoid(self.head(x))

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5").to(device)
model = RewardMLP.from_pretrained("alplusplus/vibecheck-v1-121M").to(device)
model.eval()

text = "The speed of light is a universal constant."
with torch.no_grad():
    emb = encoder.encode(f"query: {text}", convert_to_tensor=True).to(device)
    score = model(emb.unsqueeze(0)).item()

print(f"Refusal Probability: {score:.4f}")
```
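Downstream, the raw probabilities can be turned into coarse labels with a simple cutoff. A minimal sketch, assuming a 0.5 decision threshold (the threshold is my choice here, not part of the model):

```python
def label_scores(scores, threshold=0.5):
    """Map refusal probabilities to coarse labels."""
    return ["refusal" if s > threshold else "factual" for s in scores]

# Scores taken from the example table above.
print(label_scores([0.8938, 0.0024, 0.9073]))  # -> ['refusal', 'factual', 'refusal']
```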
Notes
This model is tuned for technical/factual writing. It may flag human-style apologies (e.g., "I'm sorry I'm late") as refusals due to its training on the refusal-xl dataset.
The MLP head accepts only 1024-dim input vectors, matching the BGE-Large embedding size.
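As a sketch of the RL use case described above, the gate's output can be folded into a GRPO-style reward by penalizing completions that score as refusals. This is a hypothetical shaping function, not the exact one used in training; `base_reward` and the penalty weight `alpha` are illustrative assumptions:

```python
def shaped_reward(base_reward: float, refusal_prob: float, alpha: float = 1.0) -> float:
    """Subtract a refusal penalty from the task reward (hypothetical shaping)."""
    return base_reward - alpha * refusal_prob

print(shaped_reward(1.0, 0.0024))  # factual answer keeps nearly all of its reward
print(shaped_reward(1.0, 0.8938))  # refusal is heavily penalized
```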