๐Ÿ” GriceBench-Detector

Detects cooperative communication failures in AI dialogue, one Gricean maxim at a time.

License HuggingFace Python 3.8+

Part of the GriceBench system: GitHub | 🔧 Repair Model | ⚡ DPO Generator


What This Model Does

GriceBench-Detector identifies which of Paul Grice's four conversational maxims a dialogue response violates. It returns four independent, calibrated violation probabilities (one per maxim), enabling targeted, explainable repair downstream.

| Output | Maxim | Violation Detected | Example |
|---|---|---|---|
| quantity_prob | Quantity | Response too short (<8 words) or too long (>38 words) | "Yes." to a detailed question |
| quality_prob | Quality | Factually inconsistent with knowledge evidence | Wrong date, incorrect name |
| relation_prob | Relation | Off-topic response | Jazz question answered with classical music facts |
| manner_prob | Manner | Ambiguous, jargon-heavy, or disorganized | Unclear pronoun references |
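The Quantity row above states explicit length bounds (<8 or >38 words). As an illustration only, that check could be sketched as a simple word-count rule; the helper name and whitespace tokenization are assumptions, not taken from the training code:

```python
def quantity_out_of_range(response: str, min_words: int = 8, max_words: int = 38) -> bool:
    """Illustrative sketch: flag responses whose whitespace word count
    falls outside the cooperative range stated in the table above."""
    n = len(response.split())
    return n < min_words or n > max_words
```

The actual detector is a learned classifier; this rule only mirrors the thresholds the Quantity maxim is described with.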

Used in the full GriceBench pipeline, this detector helps achieve a 95.0% cooperative rate, outperforming Mistral-7B-Instruct (89.1%) and Qwen2.5-7B-Instruct (84.2%).


Quick Start

```python
import torch
import torch.nn as nn
import json
from transformers import AutoTokenizer, AutoModel

class MaximDetector(nn.Module):
    """DeBERTa-v3 encoder with four independent binary heads, one per maxim."""

    def __init__(self, model_name="microsoft/deberta-v3-base", num_maxims=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifiers = nn.ModuleList([
            nn.Sequential(
                nn.Dropout(0.15),
                nn.Linear(hidden, hidden // 2), nn.GELU(),
                nn.Dropout(0.15),
                nn.Linear(hidden // 2, hidden // 4), nn.GELU(),
                nn.Dropout(0.15),
                nn.Linear(hidden // 4, 1)
            ) for _ in range(num_maxims)
        ])

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0, :]  # [CLS] token representation
        return torch.cat([head(cls) for head in self.classifiers], dim=1)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = MaximDetector()
state_dict = torch.load("pytorch_model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Per-maxim calibration temperatures (see temperatures.json)
with open("temperatures.json") as f:
    temperatures = json.load(f)

def detect_violations(context: str, response: str, evidence: str = "") -> dict:
    input_text = f"Context: {context}\nEvidence: {evidence}\nResponse: {response}"
    inputs = tokenizer(
        input_text, return_tensors="pt",
        max_length=512, truncation=True, padding=True
    )

    maxim_names = ["quantity", "quality", "relation", "manner"]
    temp_values = [
        temperatures.get("quantity", 0.9),
        temperatures.get("quality", 0.55),
        temperatures.get("relation", 0.75),
        temperatures.get("manner", 0.45),
    ]

    with torch.no_grad():
        # Pass tensors explicitly: the tokenizer may also return token_type_ids,
        # which forward() does not accept.
        logits = model(inputs["input_ids"], inputs["attention_mask"])

    probs, violations = {}, {}
    for i, (maxim, temp) in enumerate(zip(maxim_names, temp_values)):
        # Temperature-scaled sigmoid yields a calibrated violation probability
        prob = torch.sigmoid(logits[0, i] / temp).item()
        probs[maxim] = round(prob, 4)
        violations[maxim] = prob > 0.5

    return {
        "violations": violations,
        "probabilities": probs,
        "is_cooperative": not any(violations.values())
    }

result = detect_violations(
    context="What do you think about the latest developments in AI?",
    response="Yes.",
    evidence="AI has seen rapid advancement in large language models during 2024-2025."
)
print(result)
```

Performance

Evaluated on 1,000 held-out Topical-Chat dialogue turns (500 violation-injected, 500 clean).

| Maxim | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|
| Quantity | 1.000 | 1.000 | 1.000 | 1.000 |
| Quality | 0.928 | 0.866 | 1.000 | 0.999 |
| Relation | 1.000 | 1.000 | 1.000 | 1.000 |
| Manner | 0.891 | 0.864 | 0.919 | 0.979 |
| Macro Avg | 0.955 | – | – | – |
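The per-maxim scores above can be reproduced from saved predictions with standard scikit-learn metrics. A minimal sketch (the function name and array shapes are illustrative, not from the evaluation code):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def per_maxim_metrics(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Compute the table's metrics for one maxim head.
    y_true: (N,) binary violation labels; y_prob: (N,) calibrated probabilities."""
    y_pred = (y_prob > threshold).astype(int)  # same 0.5 cutoff as detect_violations
    return {
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "auc_roc": roc_auc_score(y_true, y_prob),
    }
```

AUC-ROC is computed on the probabilities themselves, so it is threshold-free, while F1/precision/recall depend on the 0.5 decision cutoff.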

Architecture & Training

  • Base model: microsoft/deberta-v3-base (184M parameters)
  • Heads: 4 independent binary classification heads (one per maxim)
  • Loss: Focal Loss (α=0.25, γ=2.0) for class imbalance
  • Calibration: Per-head temperature scaling (see temperatures.json)
  • Training data: 4,012 examples (weak supervision + ~1,000 gold labels)
  • Epochs: 5 | LR: 2e-5 | Hardware: Kaggle T4 ×2, ~2–3 hours
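The Focal Loss listed above down-weights well-classified examples so the heads focus on hard cases. A minimal PyTorch sketch of the standard binary formulation with the stated α=0.25, γ=2.0 defaults (a generic reference implementation, not necessarily the exact training code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: alpha_t * (1 - p_t)^gamma * BCE."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

With γ=0 this reduces to α-weighted BCE; larger γ suppresses the contribution of easy examples more aggressively.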

Calibrated temperatures:

| Maxim | Temperature | Effect |
|---|---|---|
| Quantity | 0.90 | Slightly sharper |
| Quality | 0.55 | Conservative (fewer false positives) |
| Relation | 0.75 | Balanced |
| Manner | 0.45 | Most conservative (subjective maxim) |
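Per-head temperatures like these are typically fit post-hoc on a held-out set by minimizing NLL with the trained model frozen. A hedged sketch of fitting one scalar temperature for one head (the recipe and function name are assumptions about standard practice, not taken from this repo):

```python
import torch

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor, max_iter: int = 50) -> float:
    """Fit a scalar T for one maxim head by minimizing BCE NLL of
    sigmoid(logits / T) on held-out (logits, labels)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)
    bce = torch.nn.BCEWithLogitsLoss()

    def closure():
        opt.zero_grad()
        loss = bce(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```

T < 1 sharpens probabilities and T > 1 softens them, which matches the table: the subjective Manner head gets the strongest softening toward conservative predictions.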

Files

| File | Description |
|---|---|
| pytorch_model.pt | Trained model weights |
| temperatures.json | Per-maxim calibration temperatures |

Limitations & Biases

  • Subjectivity: The "Manner" maxim is inherently subjective; detection reflects the labels in the training set.
  • Domain Specificity: Performance is optimized for general knowledge dialogue (Topical-Chat). Results may vary in specialized domains.
  • English-Only: This model is trained and evaluated exclusively on English dialogue.
  • Prompt Sensitivity: Detection results can be sensitive to the formatting of the "Evidence" field.

Citation

@article{prabhath2026gricebench,
  title={GriceBench: Operationalizing Gricean Maxims for Cooperative Dialogue Evaluation and Generation},
  author={Prabhath, Pushkar},
  year={2026},
  note={Under review, EMNLP 2026}
}

Related Models

| Model | Role | Link |
|---|---|---|
| GriceBench-Detector | Detects violations (this model) | You are here |
| GriceBench-Repair | Repairs detected violations | 🔧 Repair |
| GriceBench-DPO | Generates cooperative responses | ⚡ DPO |

GitHub: https://github.com/PushkarPrabhath27/Research-Model


Environmental Impact

| Aspect | Value |
|---|---|
| Hardware Used | 2× NVIDIA Tesla T4 GPUs (Kaggle) |
| Training Time | ~3 hours |
| Estimated Carbon Footprint | ~0.45 kg CO2eq |