# GriceBench-Detector
Detects cooperative communication failures in AI dialogue, one Gricean maxim at a time.

Part of the GriceBench system: GitHub | Repair Model | DPO Generator
## What This Model Does
GriceBench-Detector identifies which of Paul Grice's four conversational maxims a dialogue response violates. It returns four independent, calibrated violation probabilities, one per maxim, enabling targeted, explainable repair downstream.
| Output | Maxim | Violation Detected | Example |
|---|---|---|---|
| `quantity_prob` | Quantity | Response too short (<8 words) or too long (>38 words) | "Yes." to a detailed question |
| `quality_prob` | Quality | Factually inconsistent with knowledge evidence | Wrong date, incorrect name |
| `relation_prob` | Relation | Off-topic response | Jazz question answered with classical music facts |
| `manner_prob` | Manner | Ambiguous, jargon-heavy, or disorganized | Unclear pronoun references |
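The Quantity thresholds above are simple length rules and easy to check directly. A minimal sketch of that word-count rule (illustrative only; the trained head learns its own decision boundary, and the function name here is ours):

```python
def quantity_rule(response: str, min_words: int = 8, max_words: int = 38) -> bool:
    """Return True if the response length falls outside the cooperative range."""
    n_words = len(response.split())
    return n_words < min_words or n_words > max_words

# A one-word reply to a detailed question trips the rule:
print(quantity_rule("Yes."))  # True
```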
Used in the full GriceBench pipeline, this detector helps achieve a 95.0% cooperative rate, outperforming Mistral-7B-Instruct (89.1%) and Qwen2.5-7B-Instruct (84.2%).
## Quick Start
```python
import json

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel


class MaximDetector(nn.Module):
    """DeBERTa-v3 encoder with four independent binary heads, one per maxim."""

    def __init__(self, model_name="microsoft/deberta-v3-base", num_maxims=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifiers = nn.ModuleList([
            nn.Sequential(
                nn.Dropout(0.15),
                nn.Linear(hidden, hidden // 2), nn.GELU(),
                nn.Dropout(0.15),
                nn.Linear(hidden // 2, hidden // 4), nn.GELU(),
                nn.Dropout(0.15),
                nn.Linear(hidden // 4, 1),
            )
            for _ in range(num_maxims)
        ])

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0, :]  # [CLS] token representation
        return torch.cat([head(cls) for head in self.classifiers], dim=1)


tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = MaximDetector()
state_dict = torch.load("pytorch_model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Per-maxim calibration temperatures (see temperatures.json)
with open("temperatures.json") as f:
    temperatures = json.load(f)


def detect_violations(context: str, response: str, evidence: str = "") -> dict:
    input_text = f"Context: {context}\nEvidence: {evidence}\nResponse: {response}"
    inputs = tokenizer(
        input_text, return_tensors="pt",
        max_length=512, truncation=True, padding=True,
    )
    maxim_names = ["quantity", "quality", "relation", "manner"]
    temp_values = [
        temperatures.get("quantity", 0.9),
        temperatures.get("quality", 0.55),
        temperatures.get("relation", 0.75),
        temperatures.get("manner", 0.45),
    ]
    with torch.no_grad():
        # Pass tensors explicitly: the DeBERTa tokenizer also returns
        # token_type_ids, which MaximDetector.forward does not accept.
        logits = model(inputs["input_ids"], inputs["attention_mask"])
    probs, violations = {}, {}
    for i, (maxim, temp) in enumerate(zip(maxim_names, temp_values)):
        prob = torch.sigmoid(logits[0, i] / temp).item()
        probs[maxim] = round(prob, 4)
        violations[maxim] = prob > 0.5
    return {
        "violations": violations,
        "probabilities": probs,
        "is_cooperative": not any(violations.values()),
    }


result = detect_violations(
    context="What do you think about the latest developments in AI?",
    response="Yes.",
    evidence="AI has seen rapid advancement in large language models during 2024-2025.",
)
print(result)
```
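Given per-turn outputs from `detect_violations`, a corpus-level figure such as the 95.0% cooperative rate above can be computed by simple aggregation. A sketch (the helper name is ours, not part of the released code):

```python
def cooperative_rate(results: list[dict]) -> float:
    """Fraction of turns flagged cooperative (no maxim violated)."""
    if not results:
        return 0.0
    return sum(r["is_cooperative"] for r in results) / len(results)

# e.g. three evaluated turns, two cooperative:
rate = cooperative_rate([
    {"is_cooperative": True},
    {"is_cooperative": False},
    {"is_cooperative": True},
])
```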
## Performance
Evaluated on 1,000 held-out Topical-Chat dialogue turns (500 violation-injected, 500 clean).
| Maxim | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|
| Quantity | 1.000 | 1.000 | 1.000 | 1.000 |
| Quality | 0.928 | 0.866 | 1.000 | 0.999 |
| Relation | 1.000 | 1.000 | 1.000 | 1.000 |
| Manner | 0.891 | 0.864 | 0.919 | 0.979 |
| Macro Avg | 0.955 | – | – | – |
## Architecture & Training
- Base model: `microsoft/deberta-v3-base` (184M parameters)
- Heads: 4 independent binary classification heads (one per maxim)
- Loss: Focal Loss (α=0.25, γ=2.0) for class imbalance
- Calibration: per-head temperature scaling (see `temperatures.json`)
- Training data: 4,012 examples (weak supervision + ~1,000 gold labels)
- Epochs: 5 | LR: 2e-5 | Hardware: Kaggle T4 ×2, ~2-3 hours
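Focal loss down-weights easy examples so training focuses on hard ones: per example, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. A scalar sketch with the stated hyperparameters (illustrative only; the actual training code operates on batched tensors):

```python
import math

def binary_focal_loss(logit: float, target: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Per-example binary focal loss: (1 - p_t)^gamma shrinks the loss on easy examples."""
    p = 1.0 / (1.0 + math.exp(-logit))          # predicted P(violation)
    p_t = p if target == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction incurs far less loss than a confidently wrong one:
easy = binary_focal_loss(logit=5.0, target=1)
hard = binary_focal_loss(logit=-5.0, target=1)
```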
Calibrated temperatures:
| Maxim | Temperature | Effect |
|---|---|---|
| Quantity | 0.90 | Slightly sharper |
| Quality | 0.55 | Conservative (fewer false positives) |
| Relation | 0.75 | Balanced |
| Manner | 0.45 | Most conservative (subjective maxim) |
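Calibration simply divides each head's raw logit by its temperature before the sigmoid, as in the Quick Start's `torch.sigmoid(logits[0, i] / temp)`. A minimal sketch of the effect (function name is ours): with T < 1 the sigmoid sharpens, pushing confident logits further from 0.5.

```python
import math

def calibrated_prob(logit: float, temperature: float) -> float:
    """Temperature-scaled sigmoid: T < 1 sharpens probabilities, T > 1 softens them."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))
```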
## Files

| File | Description |
|---|---|
| `pytorch_model.pt` | Trained model weights |
| `temperatures.json` | Per-maxim calibration temperatures |
## Limitations & Biases
- Subjectivity: The "Manner" maxim is inherently subjective; detection reflects the labels in the training set.
- Domain Specificity: Performance is optimized for general knowledge dialogue (Topical-Chat). Results may vary in specialized domains.
- English-Only: This model is trained and evaluated exclusively on English dialogue.
- Prompt Sensitivity: Detection results can be sensitive to the formatting of the "Evidence" field.
## Citation

```bibtex
@article{prabhath2026gricebench,
  title={GriceBench: Operationalizing Gricean Maxims for Cooperative Dialogue Evaluation and Generation},
  author={Prabhath, Pushkar},
  year={2026},
  note={Under review, EMNLP 2026}
}
```
## Related Models
| Model | Role | Link |
|---|---|---|
| GriceBench-Detector | Detects violations (this model) | You are here |
| GriceBench-Repair | Repairs detected violations | Repair |
| GriceBench-DPO | Generates cooperative responses | DPO |
GitHub: https://github.com/PushkarPrabhath27/Research-Model
## Environmental Impact
| Aspect | Value |
|---|---|
| Hardware Used | 2x NVIDIA Tesla T4 GPUs (Kaggle) |
| Training Time | ~3 hours |
| Estimated Carbon Footprint | ~0.45 kg CO2eq |