๐Ÿ” GriceBench-Detector

Detects cooperative communication failures in AI dialogue, one Gricean maxim at a time.

License HuggingFace Python 3.8+

Part of the GriceBench system: GitHub | 🔧 Repair Model | ⚡ DPO Generator


What This Model Does

GriceBench-Detector identifies which of Paul Grice's four conversational maxims a dialogue response violates. It returns four independent, calibrated violation probabilities (one per maxim), enabling targeted, explainable repair downstream.

| Output | Maxim | Violation Detected | Example |
|---|---|---|---|
| quantity_prob | Quantity | Response too short (<8 words) or too long (>38 words) | "Yes." to a detailed question |
| quality_prob | Quality | Factually inconsistent with knowledge evidence | Wrong date, incorrect name |
| relation_prob | Relation | Off-topic response | Jazz question answered with classical music facts |
| manner_prob | Manner | Ambiguous, jargon-heavy, or disorganized | Unclear pronoun references |
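The Quantity row above states explicit length bounds (<8 or >38 words). As an illustration only, that check could be sketched as a simple word-count rule; the helper name and whitespace tokenization are assumptions, not taken from the training code:

```python
def quantity_out_of_range(response: str, min_words: int = 8, max_words: int = 38) -> bool:
    """Illustrative sketch: flag responses whose whitespace word count
    falls outside the cooperative range stated in the table above."""
    n = len(response.split())
    return n < min_words or n > max_words
```

The actual detector is a learned classifier; this rule only mirrors the thresholds the Quantity maxim is described with.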

Used in the full GriceBench pipeline, this detector helps achieve a 95.0% cooperative rate, outperforming Mistral-7B-Instruct (89.1%) and Qwen2.5-7B-Instruct (84.2%).


Quick Start

```python
import torch
import torch.nn as nn
import json
from transformers import AutoTokenizer, AutoModel

class MaximDetector(nn.Module):
    """DeBERTa-v3 encoder with four independent binary heads, one per maxim."""

    def __init__(self, model_name="microsoft/deberta-v3-base", num_maxims=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifiers = nn.ModuleList([
            nn.Sequential(
                nn.Dropout(0.15),
                nn.Linear(hidden, hidden // 2), nn.GELU(),
                nn.Dropout(0.15),
                nn.Linear(hidden // 2, hidden // 4), nn.GELU(),
                nn.Dropout(0.15),
                nn.Linear(hidden // 4, 1)
            ) for _ in range(num_maxims)
        ])

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0, :]  # [CLS] token representation
        return torch.cat([head(cls) for head in self.classifiers], dim=1)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = MaximDetector()
state_dict = torch.load("pytorch_model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Per-maxim calibration temperatures (see temperatures.json)
with open("temperatures.json") as f:
    temperatures = json.load(f)

def detect_violations(context: str, response: str, evidence: str = "") -> dict:
    input_text = f"Context: {context}\nEvidence: {evidence}\nResponse: {response}"
    inputs = tokenizer(
        input_text, return_tensors="pt",
        max_length=512, truncation=True, padding=True
    )

    maxim_names = ["quantity", "quality", "relation", "manner"]
    temp_values = [
        temperatures.get("quantity", 0.9),
        temperatures.get("quality", 0.55),
        temperatures.get("relation", 0.75),
        temperatures.get("manner", 0.45),
    ]

    with torch.no_grad():
        # Pass tensors explicitly: the tokenizer may also return token_type_ids,
        # which forward() does not accept.
        logits = model(inputs["input_ids"], inputs["attention_mask"])

    probs, violations = {}, {}
    for i, (maxim, temp) in enumerate(zip(maxim_names, temp_values)):
        # Temperature-scaled sigmoid yields a calibrated violation probability
        prob = torch.sigmoid(logits[0, i] / temp).item()
        probs[maxim] = round(prob, 4)
        violations[maxim] = prob > 0.5

    return {
        "violations": violations,
        "probabilities": probs,
        "is_cooperative": not any(violations.values())
    }

result = detect_violations(
    context="What do you think about the latest developments in AI?",
    response="Yes.",
    evidence="AI has seen rapid advancement in large language models during 2024-2025."
)
print(result)
```

Performance

Evaluated on 1,000 held-out Topical-Chat dialogue turns (500 violation-injected, 500 clean).

| Maxim | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|
| Quantity | 1.000 | 1.000 | 1.000 | 1.000 |
| Quality | 0.928 | 0.866 | 1.000 | 0.999 |
| Relation | 1.000 | 1.000 | 1.000 | 1.000 |
| Manner | 0.891 | 0.864 | 0.919 | 0.979 |
| Macro Avg | 0.955 | – | – | – |
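The per-maxim scores above can be reproduced from saved predictions with standard scikit-learn metrics. A minimal sketch (the function name and array shapes are illustrative, not from the evaluation code):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def per_maxim_metrics(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Compute the table's metrics for one maxim head.
    y_true: (N,) binary violation labels; y_prob: (N,) calibrated probabilities."""
    y_pred = (y_prob > threshold).astype(int)  # same 0.5 cutoff as detect_violations
    return {
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "auc_roc": roc_auc_score(y_true, y_prob),
    }
```

AUC-ROC is computed on the probabilities themselves, so it is threshold-free, while F1/precision/recall depend on the 0.5 decision cutoff.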

Architecture & Training

  • Base model: microsoft/deberta-v3-base (184M parameters)
  • Heads: 4 independent binary classification heads (one per maxim)
  • Loss: Focal Loss (α=0.25, γ=2.0) for class imbalance
  • Calibration: Per-head temperature scaling (see temperatures.json)
  • Training data: 4,012 examples (weak supervision + ~1,000 gold labels)
  • Epochs: 5 | LR: 2e-5 | Hardware: Kaggle T4 ×2, ~2–3 hours
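The Focal Loss listed above down-weights well-classified examples so the heads focus on hard cases. A minimal PyTorch sketch of the standard binary formulation with the stated α=0.25, γ=2.0 defaults (a generic reference implementation, not necessarily the exact training code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: alpha_t * (1 - p_t)^gamma * BCE."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

With γ=0 this reduces to α-weighted BCE; larger γ suppresses the contribution of easy examples more aggressively.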

Calibrated temperatures:

| Maxim | Temperature | Effect |
|---|---|---|
| Quantity | 0.90 | Slightly sharper |
| Quality | 0.55 | Conservative (fewer false positives) |
| Relation | 0.75 | Balanced |
| Manner | 0.45 | Most conservative (subjective maxim) |
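Per-head temperatures like these are typically fit post-hoc on a held-out set by minimizing NLL with the trained model frozen. A hedged sketch of fitting one scalar temperature for one head (the recipe and function name are assumptions about standard practice, not taken from this repo):

```python
import torch

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor, max_iter: int = 50) -> float:
    """Fit a scalar T for one maxim head by minimizing BCE NLL of
    sigmoid(logits / T) on held-out (logits, labels)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)
    bce = torch.nn.BCEWithLogitsLoss()

    def closure():
        opt.zero_grad()
        loss = bce(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```

T < 1 sharpens probabilities and T > 1 softens them, which matches the table: the subjective Manner head gets the strongest softening toward conservative predictions.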

Files

| File | Description |
|---|---|
| pytorch_model.pt | Trained model weights |
| temperatures.json | Per-maxim calibration temperatures |

Limitations & Biases

  • Subjectivity: The "Manner" maxim is inherently subjective; detection reflects the labels in the training set.
  • Domain Specificity: Performance is optimized for general knowledge dialogue (Topical-Chat). Results may vary in specialized domains.
  • English-Only: This model is trained and evaluated exclusively on English dialogue.
  • Prompt Sensitivity: Detection results can be sensitive to the formatting of the "Evidence" field.

Citation

@article{prabhath2026gricebench,
  title={GriceBench: Operationalizing Gricean Maxims for Cooperative Dialogue Evaluation and Generation},
  author={Prabhath, Pushkar},
  year={2026},
  note={Under review, EMNLP 2026}
}

Related Models

| Model | Role | Link |
|---|---|---|
| GriceBench-Detector | Detects violations (this model) | You are here |
| GriceBench-Repair | Repairs detected violations | 🔧 Repair |
| GriceBench-DPO | Generates cooperative responses | ⚡ DPO |

GitHub: https://github.com/PushkarPrabhath27/Research-Model


Environmental Impact

| Aspect | Value |
|---|---|
| Hardware Used | 2× NVIDIA Tesla T4 GPUs (Kaggle) |
| Training Time | ~3 hours |
| Estimated Carbon Footprint | ~0.45 kg CO2eq |