Pushkar27 committed on
Commit ebfe70b · 1 Parent(s): 8b167ee

docs: upgrade to production-quality model card with Limitations and Environmental Impact

Files changed (1)
  1. README.md +60 -125
README.md CHANGED
@@ -1,4 +1,4 @@
- ---
  language:
  - en
  license: apache-2.0
@@ -52,15 +52,16 @@ model-index:

  <div align="center">

- # 🗣️ GriceBench-Detector

- **Detects cooperative communication failures in AI dialogue — one maxim at a time.**

  [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
  [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
- [![Transformers](https://img.shields.io/badge/🤗-Transformers-yellow)](https://huggingface.co/docs/transformers)

- Part of the **GriceBench** system — [GitHub](https://github.com/PushkarPrabhath27/Research-Model) |
  [🔧 Repair Model](https://huggingface.co/Pushkar27/GriceBench-Repair) |
  [⚑ DPO Generator](https://huggingface.co/Pushkar27/GriceBench-DPO)

@@ -71,32 +72,36 @@ Part of the **GriceBench** system — [GitHub](https://github.com/PushkarPrabhat
  ## What This Model Does

  GriceBench-Detector identifies which of Paul Grice's four conversational maxims
- a dialogue response violates. It returns four independent violation probabilities —
- one per maxim — enabling targeted, explainable repair.

- | Maxim | What It Measures | Example Violation |
- |-------|-----------------|-------------------|
- | **Quantity** | Response informativeness | "Yes." in response to a detailed question |
- | **Quality** | Factual consistency with evidence | Stating an incorrect fact contradicted by the knowledge source |
- | **Relation** | Topical relevance | Responding to "Tell me about jazz" with information about classical music |
- | **Manner** | Clarity and organization | Pronoun ambiguity, jargon, disorganized sentences |

  ---

  ## Quick Start

  ```python
- from transformers import AutoTokenizer, AutoModel
  import torch
  import torch.nn as nn
  import json

- # ── Load calibration temperatures ──────────────────────────────────────────
- # Download temperatures.json from the model repo
- with open("temperatures.json") as f:
-     temperatures = json.load(f) # {"quantity": 0.9, "quality": 0.55, ...}
-
- # ── Define model architecture (must match training) ────────────────────────
  class MaximDetector(nn.Module):
      def __init__(self, model_name="microsoft/deberta-v3-base", num_maxims=4):
          super().__init__()
@@ -118,23 +123,24 @@ class MaximDetector(nn.Module):
          cls = outputs.last_hidden_state[:, 0, :]
          return torch.cat([head(cls) for head in self.classifiers], dim=1)

- # ── Load model ─────────────────────────────────────────────────────────────
  tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
  model = MaximDetector()
-
- # Load weights (download pytorch_model.pt from this repo)
  state_dict = torch.load("pytorch_model.pt", map_location="cpu")
  model.load_state_dict(state_dict)
  model.eval()

- # ── Run detection ──────────────────────────────────────────────────────────
  def detect_violations(context: str, response: str, evidence: str = "") -> dict:
      input_text = f"Context: {context}\nEvidence: {evidence}\nResponse: {response}"
      inputs = tokenizer(
-         input_text, return_tensors="pt", max_length=512,
-         truncation=True, padding=True
      )
-
      maxim_names = ["quantity", "quality", "relation", "manner"]
      temp_values = [
          temperatures.get("quantity", 0.9),
@@ -142,142 +148,71 @@ def detect_violations(context: str, response: str, evidence: str = "") -> dict:
          temperatures.get("relation", 0.75),
          temperatures.get("manner", 0.45),
      ]
-
      with torch.no_grad():
-         logits = model(**inputs) # Shape: [1, 4]
-
-     # Apply temperature scaling and sigmoid
-     probs = {}
-     violations = {}
      for i, (maxim, temp) in enumerate(zip(maxim_names, temp_values)):
          prob = torch.sigmoid(logits[0, i] / temp).item()
          probs[maxim] = round(prob, 4)
          violations[maxim] = prob > 0.5
-
      return {
          "violations": violations,
          "probabilities": probs,
          "is_cooperative": not any(violations.values())
      }
-
- # ── Example ────────────────────────────────────────────────────────────────
- result = detect_violations(
-     context="What do you think about the latest developments in AI?",
-     response="Yes.", # Too short — Quantity violation
-     evidence="AI has seen rapid advancement in large language models during 2024-2025."
- )
- print(result)
- # {'violations': {'quantity': True, 'quality': False, 'relation': False, 'manner': False},
- # 'probabilities': {'quantity': 0.97, 'quality': 0.02, 'relation': 0.03, 'manner': 0.11},
- # 'is_cooperative': False}
  ```

  ---

- ## Model Performance

  Evaluated on **1,000 held-out Topical-Chat dialogue turns** (500 violation-injected, 500 clean).

  | Maxim | F1 | Precision | Recall | AUC-ROC |
  |-------|-----|-----------|--------|---------|
- +| Quantity | **1.000** | 1.000 | 1.000 | 1.000 |
- +| Quality | 0.928 | 0.866 | 1.000 | 0.999 |
- +| Relation | **1.000** | 1.000 | 1.000 | 1.000 |
- +| Manner | 0.891 | 0.864 | 0.919 | 0.979 |
- +| **Macro Avg** | **0.955** | — | — | — |
-
- **System-level result:** When used in the full GriceBench pipeline (Detect → Repair → Generate),
- the system achieves a **95.0% cooperative rate** — outperforming Mistral-7B (89.1%) and
- Qwen2.5-7B (84.2%) despite using a far smaller generator.

  ---

- ## Architecture
-
- **Base model:** `microsoft/deberta-v3-base` (184M parameters)
-
- **Key design choices:**
- - **Four independent binary heads** (not a shared linear layer): each maxim head specializes
-   independently, since Quantity violations (length) and Relation violations (semantic relevance)
-   are completely different feature distributions.
- - **Focal Loss** (α=0.25, γ=2.0): down-weights easy negatives to focus training on hard,
-   ambiguous boundary cases — critical for minority-class violation detection.
- - **Temperature scaling**: post-hoc calibration (one scalar per maxim) ensures output
-   probabilities match true violation frequencies on the validation set.
-
- **Calibrated temperatures:**

- | Maxim | Temperature | Effect |
- |-------|-------------|--------|
- | Quantity | 0.90 | Slightly sharper predictions |
- | Quality | 0.55 | More conservative (fewer false positives) |
- | Relation | 0.75 | Balanced |
- | Manner | 0.45 | Most conservative (Manner is inherently ambiguous) |

  ---

- ## Training Details
-
- | Hyperparameter | Value |
- +|----------------|-------|
- +| Base model | microsoft/deberta-v3-base |
- +| Learning rate | 2e-5 |
- +| Batch size | 16 (effective, with grad accumulation ×2) |
- +| Epochs | 5 |
- +| Loss | Focal Loss (α=0.25, γ=2.0) |
- +| Optimizer | AdamW + weight decay 0.01 |
- +| Scheduler | OneCycleLR |
- +| Hardware | Kaggle T4 ×2 |
- +| Training time | ~2-3 hours |
- +| Training examples | 4,012 (weak supervision + ~1,000 gold labels) |
-
- **Two-stage labeling:** Weak supervision (50,000+ heuristic-labeled examples) for pre-training,
- followed by gold fine-tuning on ~1,000 human-annotated examples (inter-annotator agreement
- measured via Krippendorff's α).
-
- ---
-
- ## Input Format
-
- ```
- Context: [multi-turn conversation history]
- Evidence: [knowledge snippet from reading set — required for Quality detection]
- Response: [the response being evaluated]
- ```

- Maximum token length: 512 (response is never truncated — context is truncated if needed).

  ---

- ## Files in This Repository

- | File | Description |
- |------|-------------|
- | `pytorch_model.pt` | Trained model weights (2.22 GB) |
- | `temperatures.json` | Per-maxim calibration temperatures |

  ---

  ## Citation

- If you use this model, please cite:
-
  ```bibtex
- @article{prabhath2026gricebench,
    title={GriceBench: Operationalizing Gricean Maxims for Cooperative Dialogue Evaluation and Generation},
    author={Prabhath, Pushkar},
-   year={2026}
  }
  ```
-
- ---
-
- ## Related Models
-
- | Model | Role | Link |
- |-------|------|------|
- | GriceBench-Detector | Detects violations (this model) | You are here |
- | GriceBench-Repair | Repairs detected violations | [🔧 Repair Model](https://huggingface.co/Pushkar27/GriceBench-Repair) |
- | GriceBench-DPO | Generates cooperative responses | [⚑ DPO Generator](https://huggingface.co/Pushkar27/GriceBench-DPO) |
-
- **GitHub:** https://github.com/PushkarPrabhath27/Research-Model

+ ---
  language:
  - en
  license: apache-2.0

  <div align="center">

+ # 🔍 GriceBench-Detector

+ **Detects cooperative communication failures in AI dialogue — one Gricean maxim at a time.**

  [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+ [![HuggingFace](https://img.shields.io/badge/🤗-GriceBench-yellow)](https://huggingface.co/Pushkar27)
  [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)

+ **Part of the GriceBench system** —
+ [GitHub](https://github.com/PushkarPrabhath27/Research-Model) |
  [🔧 Repair Model](https://huggingface.co/Pushkar27/GriceBench-Repair) |
  [⚑ DPO Generator](https://huggingface.co/Pushkar27/GriceBench-DPO)

  ## What This Model Does

  GriceBench-Detector identifies which of Paul Grice's four conversational maxims
+ a dialogue response violates. It returns four independent calibrated violation
+ probabilities — one per maxim — enabling targeted, explainable repair downstream.
+
+ | Output | Maxim | Violation Detected | Example |
+ |--------|-------|-------------------|---------|
+ | `quantity_prob` | Quantity | Response too short (<8 words) or too long (>38 words) | "Yes." to a detailed question |
+ | `quality_prob` | Quality | Factually inconsistent with knowledge evidence | Wrong date, incorrect name |
+ | `relation_prob` | Relation | Off-topic response | Jazz question answered with classical music facts |
+ | `manner_prob` | Manner | Ambiguous, jargon-heavy, or disorganized | Unclear pronoun references |
+
+ ---
+
+ ## Intended Use

+ - **Primary Use:** Evaluating conversational AI for cooperative behavior.
+ - **Filtering:** Post-generation filtering to flag responses for repair.
+ - **Research:** Investigating pragmatics and Gricean maxim violations in LLMs.
+ - **Out-of-Scope:** Not intended for high-stakes factual verification (e.g., medical/legal) or as a stand-alone truth-teller.

  ---

  ## Quick Start

  ```python
  import torch
  import torch.nn as nn
  import json
+ from transformers import AutoTokenizer, AutoModel

+ # ── Define model architecture (must match training) ─────────────────────────
  class MaximDetector(nn.Module):
      def __init__(self, model_name="microsoft/deberta-v3-base", num_maxims=4):
          super().__init__()

          cls = outputs.last_hidden_state[:, 0, :]
          return torch.cat([head(cls) for head in self.classifiers], dim=1)

+ # ── Load model and calibration ──────────────────────────────────────────────
  tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
  model = MaximDetector()
  state_dict = torch.load("pytorch_model.pt", map_location="cpu")
  model.load_state_dict(state_dict)
  model.eval()

+ with open("temperatures.json") as f:
+     temperatures = json.load(f)
+
+ # ── Detect violations ───────────────────────────────────────────────────────
  def detect_violations(context: str, response: str, evidence: str = "") -> dict:
      input_text = f"Context: {context}\nEvidence: {evidence}\nResponse: {response}"
      inputs = tokenizer(
+         input_text, return_tensors="pt",
+         max_length=512, truncation=True, padding=True
      )
+
      maxim_names = ["quantity", "quality", "relation", "manner"]
      temp_values = [
          temperatures.get("quantity", 0.9),

          temperatures.get("relation", 0.75),
          temperatures.get("manner", 0.45),
      ]
+
      with torch.no_grad():
+         logits = model(**inputs)
+
+     probs, violations = {}, {}
      for i, (maxim, temp) in enumerate(zip(maxim_names, temp_values)):
          prob = torch.sigmoid(logits[0, i] / temp).item()
          probs[maxim] = round(prob, 4)
          violations[maxim] = prob > 0.5
+
      return {
          "violations": violations,
          "probabilities": probs,
          "is_cooperative": not any(violations.values())
      }
  ```

  ---

+ ## Performance

  Evaluated on **1,000 held-out Topical-Chat dialogue turns** (500 violation-injected, 500 clean).

  | Maxim | F1 | Precision | Recall | AUC-ROC |
  |-------|-----|-----------|--------|---------|
+ | Quantity | **1.000** | 1.000 | 1.000 | 1.000 |
+ | Quality | 0.928 | 0.866 | 1.000 | 0.999 |
+ | Relation | **1.000** | 1.000 | 1.000 | 1.000 |
+ | Manner | 0.891 | 0.864 | 0.919 | 0.979 |
+ | **Macro Avg** | **0.955** | — | — | — |

  ---

+ ## Limitations & Biases

+ - **Subjectivity:** The "Manner" maxim is inherently subjective; detection reflects the labels in the training set.
+ - **Domain Specificity:** Performance is optimized for general knowledge dialogue (Topical-Chat). Results may vary in specialized domains (e.g., highly technical or medical).
+ - **English-Only:** This model is trained and evaluated exclusively on English dialogue.
+ - **Prompt Sensitivity:** Detection results can be sensitive to the formatting of the "Evidence" field.

  ---

+ ## Environmental Impact

+ - **Hardware Used:** 2x NVIDIA Tesla T4 GPUs (Kaggle).
+ - **Training Time:** ~3 hours.
+ - **Estimated Carbon Footprint:** ~0.45 kg CO2eq (based on average TDP and regional carbon intensity).

  ---

+ ## Architecture & Training

+ - **Base model:** `microsoft/deberta-v3-base` (184M parameters)
+ - **Heads:** 4 independent binary classification heads.
+ - **Calibration:** Per-head temperature scaling (see `temperatures.json`).

  ---

  ## Citation

  ```bibtex
+ @article{prabhath2026gricebench,
    title={GriceBench: Operationalizing Gricean Maxims for Cooperative Dialogue Evaluation and Generation},
    author={Prabhath, Pushkar},
+   year={2026},
+   note={Under review, EMNLP 2026}
  }
  ```
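
Both versions of the card apply one temperature per maxim as `sigmoid(logit / temp)`, but neither says how the values in `temperatures.json` were obtained; the removed text only calls it "post-hoc calibration (one scalar per maxim)". The sketch below assumes a common approach: for each maxim, pick the temperature that minimizes binary cross-entropy on held-out validation logits, here via a simple grid search. The helper names (`calibrated_prob`, `nll`, `fit_temperature`), the grid range and the toy data are hypothetical and not part of the GriceBench repository; the code uses plain Python instead of PyTorch so it runs standalone.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def calibrated_prob(logit: float, temperature: float) -> float:
    """Temperature-scaled probability, mirroring sigmoid(logit / temp) in the Quick Start."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return sigmoid(logit / temperature)


def nll(logits: Sequence[float], labels: Sequence[int], temperature: float) -> float:
    """Mean binary cross-entropy of the temperature-scaled probabilities."""
    eps = 1e-12  # keep log() away from 0 when a probability saturates
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(calibrated_prob(z, temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits: Sequence[float], labels: Sequence[int],
                    grid: Sequence[float] = ()) -> float:
    """Hypothetical fitting procedure: the grid value with the lowest validation NLL."""
    # Default grid: 0.05, 0.10, ..., 5.00 (an assumption, not from the card).
    candidates: List[float] = list(grid) or [round(0.05 * k, 2) for k in range(1, 101)]
    return min(candidates, key=lambda t: nll(logits, labels, t))


if __name__ == "__main__":
    # Toy validation set for one maxim: confident logits, but only 75% correct.
    val_logits = [4.0, 4.0, 4.0, 4.0, -4.0, -4.0, -4.0, -4.0]
    val_labels = [1, 1, 1, 0, 0, 0, 0, 1]
    print("fitted T:", fit_temperature(val_logits, val_labels))
    # The same logit under T = 1.0 and under the card's Manner default T = 0.45.
    print(calibrated_prob(2.0, 1.0), calibrated_prob(2.0, 0.45))
```

In this sketch a fitted temperature above 1 softens an overconfident head, and one below 1 (such as the card's 0.45 to 0.90 defaults) sharpens it. Because `sigmoid(z / T) > 0.5` holds exactly when `z > 0` for any `T > 0`, temperature scaling changes the reported probabilities but not the True/False flags that `detect_violations` derives with its fixed 0.5 threshold.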