---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- multi-label-classification
- dialogue
- conversational-ai
- gricean-maxims
- cooperative-communication
- deberta
- nlp
- pragmatics
datasets:
- topical-chat
metrics:
- f1
- precision
- recall
- roc_auc
pipeline_tag: text-classification
base_model: microsoft/deberta-v3-base
model-index:
- name: GriceBench-Detector
  results:
  - task:
      type: text-classification
      name: Multi-Label Gricean Maxim Violation Detection
    dataset:
      name: Topical-Chat (GriceBench held-out split, N=1000)
      type: topical-chat
      split: test
    metrics:
    - type: f1
      value: 0.955
      name: Macro F1
    - type: f1
      value: 1.000
      name: Quantity F1
    - type: f1
      value: 0.928
      name: Quality F1
    - type: f1
      value: 1.000
      name: Relation F1
    - type: f1
      value: 0.891
      name: Manner F1
---

<div align="center">

# πŸ” GriceBench-Detector

**Detects cooperative communication failures in AI dialogue β€” one Gricean maxim at a time.**

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![HuggingFace](https://img.shields.io/badge/🤗-GriceBench-yellow)](https://huggingface.co/Pushkar27)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)

**Part of the GriceBench system:**
[GitHub](https://github.com/PushkarPrabhath27/Research-Model) |
[🔧 Repair Model](https://huggingface.co/Pushkar27/GriceBench-Repair) |
[⚡ DPO Generator](https://huggingface.co/Pushkar27/GriceBench-DPO)

</div>

---

## What This Model Does

GriceBench-Detector identifies which of Paul Grice's four conversational maxims a dialogue response violates. It returns four independent, calibrated violation probabilities (one per maxim), enabling targeted, explainable repair downstream.

| Output | Maxim | Violation Detected | Example |
|--------|-------|-------------------|---------|
| `quantity_prob` | Quantity | Response too short (<8 words) or too long (>38 words) | "Yes." to a detailed question |
| `quality_prob` | Quality | Factually inconsistent with knowledge evidence | Wrong date, incorrect name |
| `relation_prob` | Relation | Off-topic response | Jazz question answered with classical music facts |
| `manner_prob` | Manner | Ambiguous, jargon-heavy, or disorganized | Unclear pronoun references |
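
The Quantity thresholds in the table can be read as a simple word-count rule. The sketch below only illustrates that rule and is not the detector itself: the function name `quantity_rule` is hypothetical, and counting words by whitespace splitting is an assumption, since the table does not say how words are counted.

```python
def quantity_rule(response: str, min_words: int = 8, max_words: int = 38) -> bool:
    """Return True when the word count falls outside [min_words, max_words]."""
    # Assumption: a "word" is any whitespace-separated token.
    n_words = len(response.split())
    return n_words < min_words or n_words > max_words


# "Yes." is a single word, so it falls below the 8-word minimum.
print(quantity_rule("Yes."))
```

The trained model learns this behaviour from labelled data, so its output near the thresholds may differ from the hard rule.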

Used in the full GriceBench pipeline, this detector helps achieve a **95.0% cooperative rate**, outperforming Mistral-7B-Instruct (89.1%) and Qwen2.5-7B-Instruct (84.2%).

---

## Quick Start

```python
import torch
import torch.nn as nn
import json
from transformers import AutoTokenizer, AutoModel

class MaximDetector(nn.Module):
    def __init__(self, model_name="microsoft/deberta-v3-base", num_maxims=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifiers = nn.ModuleList([
            nn.Sequential(
                nn.Dropout(0.15),
                nn.Linear(hidden, hidden // 2), nn.GELU(),
                nn.Dropout(0.15),
                nn.Linear(hidden // 2, hidden // 4), nn.GELU(),
                nn.Dropout(0.15),
                nn.Linear(hidden // 4, 1)
            ) for _ in range(num_maxims)
        ])

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0, :]
        return torch.cat([head(cls) for head in self.classifiers], dim=1)

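tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = MaximDetector()
# pytorch_model.pt and temperatures.json ship with this repository (see Files);
# download both to the working directory before running this script.
state_dict = torch.load("pytorch_model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()  # disable dropout for inference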

with open("temperatures.json") as f:
    temperatures = json.load(f)

def detect_violations(context: str, response: str, evidence: str = "") -> dict:
    input_text = f"Context: {context}\nEvidence: {evidence}\nResponse: {response}"
    inputs = tokenizer(
        input_text, return_tensors="pt",
        max_length=512, truncation=True, padding=True
    )

    maxim_names = ["quantity", "quality", "relation", "manner"]
    temp_values = [
        temperatures.get("quantity", 0.9),
        temperatures.get("quality", 0.55),
        temperatures.get("relation", 0.75),
        temperatures.get("manner", 0.45),
    ]

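    with torch.no_grad():
        # Pass only the tensors forward() accepts. The tokenizer may also
        # return token_type_ids, which model(**inputs) would reject.
        logits = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )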

    probs, violations = {}, {}
    for i, (maxim, temp) in enumerate(zip(maxim_names, temp_values)):
        prob = torch.sigmoid(logits[0, i] / temp).item()
        probs[maxim] = round(prob, 4)
        violations[maxim] = prob > 0.5

    return {
        "violations": violations,
        "probabilities": probs,
        "is_cooperative": not any(violations.values())
    }

result = detect_violations(
    context="What do you think about the latest developments in AI?",
    response="Yes.",
    evidence="AI has seen rapid advancement in large language models during 2024-2025."
)
print(result)
```

---

## Performance

Evaluated on **1,000 held-out Topical-Chat dialogue turns** (500 violation-injected, 500 clean).

| Maxim | F1 | Precision | Recall | AUC-ROC |
|-------|-----|-----------|--------|---------|
| Quantity | **1.000** | 1.000 | 1.000 | 1.000 |
| Quality | 0.928 | 0.866 | 1.000 | 0.999 |
| Relation | **1.000** | 1.000 | 1.000 | 1.000 |
| Manner | 0.891 | 0.864 | 0.919 | 0.979 |
| **Macro Avg** | **0.955** | n/a | n/a | n/a |
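
As a quick consistency check, the F1 column can be recomputed from the precision and recall columns, and the macro average from the four per-maxim scores. This is a minimal sketch; the helper name `f1` and the `scores` dictionary are illustrative and not part of the released code.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
scores = {
    "quantity": (1.000, 1.000),
    "quality": (0.866, 1.000),
    "relation": (1.000, 1.000),
    "manner": (0.864, 0.919),
}

per_maxim_f1 = {name: f1(p, r) for name, (p, r) in scores.items()}
# Macro F1 is the unweighted mean of the per-maxim F1 scores.
macro_f1 = sum(per_maxim_f1.values()) / len(per_maxim_f1)

for name, value in per_maxim_f1.items():
    print(f"{name}: {value:.3f}")
print(f"macro: {macro_f1:.3f}")
```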

---

## Architecture & Training

- **Base model:** `microsoft/deberta-v3-base` (184M parameters)
- **Heads:** 4 independent binary classification heads (one per maxim)
- **Loss:** Focal Loss (α=0.25, γ=2.0) for class imbalance
- **Calibration:** Per-head temperature scaling (see `temperatures.json`)
- **Training data:** 4,012 examples (weak supervision + ~1,000 gold labels)
- **Epochs:** 5 | **LR:** 2e-5 | **Hardware:** Kaggle T4 ×2, ~2–3 hours
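
To make the loss setting concrete, here is a minimal pure-Python sketch of the commonly used alpha-balanced binary focal loss for a single prediction, with the α and γ values listed above as defaults. The function name `binary_focal_loss` is hypothetical, and the exact variant used in training (reduction, clamping, per-head weighting) is not documented here, so treat this as an assumption rather than the training code.

```python
import math


def binary_focal_loss(prob: float, label: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Alpha-balanced focal loss for one binary prediction.

    prob  : predicted probability of the positive class (violation)
    label : 1 for violation, 0 for clean
    """
    eps = 1e-7
    prob = min(max(prob, eps), 1.0 - eps)  # keep log() finite
    p_t = prob if label == 1 else 1.0 - prob  # probability given to the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.95, 1)  # confident and correct
hard = binary_focal_loss(0.30, 1)  # true class receives low probability
print(easy < hard)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half of the ordinary binary cross-entropy, which is a convenient sanity check.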

**Calibrated temperatures:**

| Maxim | Temperature | Effect |
|-------|-------------|--------|
| Quantity | 0.90 | Slightly sharper |
| Quality | 0.55 | Conservative (fewer false positives) |
| Relation | 0.75 | Balanced |
| Manner | 0.45 | Most conservative (subjective maxim) |
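
The calibration step in Quick Start divides each head's logit by its temperature before the sigmoid. The sketch below reproduces that arithmetic without PyTorch; `calibrated_prob` is a hypothetical helper and the logit value is made up for illustration.

```python
import math


def calibrated_prob(logit: float, temperature: float) -> float:
    """Sigmoid of the temperature-scaled logit."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


# Temperatures from the table above.
temperatures = {"quantity": 0.90, "quality": 0.55, "relation": 0.75, "manner": 0.45}
logit = 1.0  # hypothetical raw head output

for maxim, temp in temperatures.items():
    raw = calibrated_prob(logit, 1.0)
    scaled = calibrated_prob(logit, temp)
    print(f"{maxim}: raw={raw:.3f} scaled={scaled:.3f}")
```

Since every temperature here is below 1, scaling moves probabilities further from 0.5. The sign of the logit is unchanged, so which responses cross a fixed 0.5 threshold does not change; what changes is how far from the threshold each probability sits.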

---

## Files

| File | Description |
|------|-------------|
| `pytorch_model.pt` | Trained model weights |
| `temperatures.json` | Per-maxim calibration temperatures |

---

## Limitations & Biases

- **Subjectivity:** The "Manner" maxim is inherently subjective; detection reflects the labels in the training set.
- **Domain Specificity:** Performance is optimized for general knowledge dialogue (Topical-Chat). Results may vary in specialized domains.
- **English-Only:** This model is trained and evaluated exclusively on English dialogue.
- **Prompt Sensitivity:** Detection results can be sensitive to the formatting of the "Evidence" field.
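
Because results depend on how the input is formatted, it may help to build the input string in one place. The sketch below mirrors the `Context` / `Evidence` / `Response` layout used in Quick Start. The helper names `_squash` and `build_input` are hypothetical, and collapsing runs of whitespace is an assumption, not a documented preprocessing step.

```python
def _squash(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def build_input(context: str, response: str, evidence: str = "") -> str:
    """Format one example with the Context/Evidence/Response layout from Quick Start."""
    # Assumption: whitespace normalization does not hurt detection quality.
    return (
        f"Context: {_squash(context)}\n"
        f"Evidence: {_squash(evidence)}\n"
        f"Response: {_squash(response)}"
    )


print(build_input("What about  jazz?", "Yes.", " Jazz began in\nNew Orleans. "))
```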

---

## Citation

```bibtex
@article{prabhath2026gricebench,
  title={GriceBench: Operationalizing Gricean Maxims for Cooperative Dialogue Evaluation and Generation},
  author={Prabhath, Pushkar},
  year={2026},
  note={Under review, EMNLP 2026}
}
```

---

## Related Models

| Model | Role | Link |
|-------|------|------|
| GriceBench-Detector | Detects violations (this model) | You are here |
| GriceBench-Repair | Repairs detected violations | [🔧 Repair](https://huggingface.co/Pushkar27/GriceBench-Repair) |
| GriceBench-DPO | Generates cooperative responses | [⚡ DPO](https://huggingface.co/Pushkar27/GriceBench-DPO) |

**GitHub:** https://github.com/PushkarPrabhath27/Research-Model

---

## Environmental Impact

| Aspect | Value |
|--------|-------|
| Hardware Used | 2x NVIDIA Tesla T4 GPUs (Kaggle) |
| Training Time | ~3 hours |
| Estimated Carbon Footprint | ~0.45 kg CO2eq |