Pushkar27/GriceBench-Detector

@@ -1,100 +1,89 @@
----
 language:
-  - en
 license: apache-2.0
 library_name: transformers
 tags:
-  - text-classification
-  - multi-label-classification
-  - dialogue
-  - conversational-ai
-  - gricean-maxims
-  - cooperative-communication
-  - deberta
-  - nlp
-  - pragmatics
 datasets:
-  - topical_chat
 metrics:
-  - f1
-  - precision
-  - recall
-  - roc_auc
 pipeline_tag: text-classification
 base_model: microsoft/deberta-v3-base
 model-index:
-  - name: GriceBench-Detector
-    results:
-      - task:
-          type: text-classification
-          name: Multi-Label Gricean Maxim Violation Detection
-        dataset:
-          name: Topical-Chat (GriceBench held-out split)
-          type: custom
-          split: test
-        metrics:
-          - type: f1
-            value: 0.955
-            name: Macro F1
-          - type: f1
-            value: 1.000
-            name: Quantity F1
-          - type: f1
-            value: 0.928
-            name: Quality F1
-          - type: f1
-            value: 1.000
-            name: Relation F1
-          - type: f1
-            value: 0.891
-            name: Manner F1
 ---
-<div align="center">
-# 🔍 GriceBench-Detector
-**Detects cooperative communication failures in AI dialogue — one Gricean maxim at a time.**
-[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
-[![HuggingFace](https://img.shields.io/badge/🤗-GriceBench-yellow)](https://huggingface.co/Pushkar27)
-[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
-**Part of the GriceBench system** —
-[GitHub](https://github.com/PushkarPrabhath27/Research-Model) |
-[🔧 Repair Model](https://huggingface.co/Pushkar27/GriceBench-Repair) |
-[⚡ DPO Generator](https://huggingface.co/Pushkar27/GriceBench-DPO)
-</div>
----
-## What This Model Does
-GriceBench-Detector identifies which of Paul Grice's four conversational maxims
-a dialogue response violates. It returns four independent calibrated violation
-probabilities — one per maxim — enabling targeted, explainable repair downstream.
-| Output | Maxim | Violation Detected | Example |
-|--------|-------|-------------------|---------|
-| `quantity_prob` | Quantity | Response too short (<8 words) or too long (>38 words) | "Yes." to a detailed question |
-| `quality_prob` | Quality | Factually inconsistent with knowledge evidence | Wrong date, incorrect name |
-| `relation_prob` | Relation | Off-topic response | Jazz question answered with classical music facts |
-| `manner_prob` | Manner | Ambiguous, jargon-heavy, or disorganized | Unclear pronoun references |
----
-## Intended Use
-- **Primary Use:** Evaluating conversational AI for cooperative behavior.
-- **Filtering:** Post-generation filtering to flag responses for repair.
-- **Research:** Investigating pragmatics and Gricean maxim violations in LLMs.
-- **Out-of-Scope:** Not intended for high-stakes factual verification (e.g., medical/legal) or as a stand-alone truth-teller.
----
-## Quick Start
 ```python
 import torch
 import torch.nn as nn
@@ -124,6 +113,7 @@ class MaximDetector(nn.Module):
         return torch.cat([head(cls) for head in self.classifiers], dim=1)
 # ── Load model and calibration ──────────────────────────────────────────────
 tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
 model = MaximDetector()
 state_dict = torch.load("pytorch_model.pt", map_location="cpu")
@@ -150,7 +140,7 @@ def detect_violations(context: str, response: str, evidence: str = "") -> dict:
     ]
     with torch.no_grad():
-        logits = model(**inputs)
     probs, violations = {}, {}
     for i, (maxim, temp) in enumerate(zip(maxim_names, temp_values)):
@@ -163,51 +153,51 @@ def detect_violations(context: str, response: str, evidence: str = "") -> dict:
         "probabilities": probs,
         "is_cooperative": not any(violations.values())
     }
-```
----
-## Performance
-Evaluated on **1,000 held-out Topical-Chat dialogue turns** (500 violation-injected, 500 clean).
-| Maxim | F1 | Precision | Recall | AUC-ROC |
-|-------|-----|-----------|--------|---------|
-| Quantity | **1.000** | 1.000 | 1.000 | 1.000 |
-| Quality | 0.928 | 0.866 | 1.000 | 0.999 |
-| Relation | **1.000** | 1.000 | 1.000 | 1.000 |
-| Manner | 0.891 | 0.864 | 0.919 | 0.979 |
-| **Macro Avg** | **0.955** | — | — | — |
----
-## Limitations & Biases
-- **Subjectivity:** The "Manner" maxim is inherently subjective; detection reflects the labels in the training set.
-- **Domain Specificity:** Performance is optimized for general knowledge dialogue (Topical-Chat). Results may vary in specialized domains (e.g., highly technical or medical).
-- **English-Only:** This model is trained and evaluated exclusively on English dialogue.
-- **Prompt Sensitivity:** Detection results can be sensitive to the formatting of the "Evidence" field.
----
-## Environmental Impact
-- **Hardware Used:** 2x NVIDIA Tesla T4 GPUs (Kaggle).
-- **Training Time:** ~3 hours.
-- **Estimated Carbon Footprint:** ~0.45 kg CO2eq (based on average TDP and regional carbon intensity).
----
-## Architecture & Training
-- **Base model:** `microsoft/deberta-v3-base` (184M parameters)
-- **Heads:** 4 independent binary classification heads.
-- **Calibration:** Per-head temperature scaling (see `temperatures.json`).
----
-## Citation
 ```bibtex
  @article{prabhath2026gricebench,
   title={GriceBench: Operationalizing Gricean Maxims for Cooperative Dialogue Evaluation and Generation},
@@ -216,3 +206,15 @@ Evaluated on **1,000 held-out Topical-Chat dialogue turns** (500 violation-injec
   note={Under review, EMNLP 2026}
 }
 ```

+---
 language:
+- en
 license: apache-2.0
 library_name: transformers
 tags:
+- text-classification
+- multi-label-classification
+- dialogue
+- conversational-ai
+- gricean-maxims
+- cooperative-communication
+- deberta
+- nlp
+- pragmatics
 datasets:
+- topical_chat
 metrics:
+- f1
+- precision
+- recall
+- roc_auc
 pipeline_tag: text-classification
 base_model: microsoft/deberta-v3-base
 model-index:
+- name: GriceBench-Detector
+  results:
+  - task:
+      type: text-classification
+      name: Multi-Label Gricean Maxim Violation Detection
+    dataset:
+      name: Topical-Chat (GriceBench held-out split, N=1000)
+      type: custom
+      split: test
+    metrics:
+    - type: f1
+      value: 0.955
+      name: Macro F1
+    - type: f1
+      value: 1.000
+      name: Quantity F1
+    - type: f1
+      value: 0.928
+      name: Quality F1
+    - type: f1
+      value: 1.000
+      name: Relation F1
+    - type: f1
+      value: 0.891
+      name: Manner F1
 ---
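+# 🔍 GriceBench-Detector
+**Detects cooperative communication failures in AI dialogue, one Gricean maxim at a time.**
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![HuggingFace](https://img.shields.io/badge/🤗-GriceBench-yellow)](https://huggingface.co/Pushkar27)
+[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
+**Part of the GriceBench system:**
+[GitHub](https://github.com/PushkarPrabhath27/Research-Model) |
+[🔧 Repair Model](https://huggingface.co/Pushkar27/GriceBench-Repair) |
+[⚡ DPO Generator](https://huggingface.co/Pushkar27/GriceBench-DPO)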
+## What This Model Does
+GriceBench-Detector identifies which of Paul Grice's four conversational maxims a dialogue response violates. It returns four independent calibrated violation probabilities, one per maxim, enabling targeted, explainable repair downstream.
+| Output | Maxim | Violation Detected | Example |
+|--------|-------|-------------------|---------|
+| `quantity_prob` | Quantity | Response too short (<8 words) or too long (>38 words) | "Yes." to a detailed question |
+| `quality_prob` | Quality | Factually inconsistent with knowledge evidence | Wrong date, incorrect name |
+| `relation_prob` | Relation | Off-topic response | Jazz question answered with classical music facts |
+| `manner_prob` | Manner | Ambiguous, jargon-heavy, or disorganized | Unclear pronoun references |
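The Quantity thresholds in the table can be read as a plain word-count rule. The sketch below only illustrates that reading: the function name `quantity_rule` and the whitespace-based word count are assumptions, and the detector itself is a learned classifier rather than this rule.

```python
def quantity_rule(response: str, min_words: int = 8, max_words: int = 38) -> bool:
    """Return True when the word count falls outside [min_words, max_words].

    Hypothetical helper: thresholds come from the table above, while the
    whitespace tokenization is an assumption made for illustration.
    """
    n_words = len(response.split())
    return n_words < min_words or n_words > max_words


# A one-word reply is under the 8-word minimum.
print(quantity_rule("Yes."))  # True
```

A rule like this could serve as a cheap sanity check next to `quantity_prob`, since both should agree on clear-cut cases such as a one-word reply.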
+Used in the full GriceBench pipeline, this detector helps achieve a 95.0% cooperative rate, outperforming Mistral-7B-Instruct (89.1%) and Qwen2.5-7B-Instruct (84.2%).
+## Quick Start
 ```python
 import torch
 import torch.nn as nn
@@ -124,6 +113,7 @@ class MaximDetector(nn.Module):
         return torch.cat([head(cls) for head in self.classifiers], dim=1)
 # ── Load model and calibration ──────────────────────────────────────────────
+# Download pytorch_model.pt and temperatures.json from this repo first
 tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
 model = MaximDetector()
 state_dict = torch.load("pytorch_model.pt", map_location="cpu")
@@ -150,7 +140,7 @@ def detect_violations(context: str, response: str, evidence: str = "") -> dict:
     ]
     with torch.no_grad():
+        logits = model(**inputs)  # Shape: [1, 4]
     probs, violations = {}, {}
     for i, (maxim, temp) in enumerate(zip(maxim_names, temp_values)):
@@ -163,51 +153,51 @@ def detect_violations(context: str, response: str, evidence: str = "") -> dict:
         "probabilities": probs,
         "is_cooperative": not any(violations.values())
     }
+# ── Example ─────────────────────────────────────────────────────────────────
+result = detect_violations(
+    context="What do you think about the latest developments in AI?",
+    response="Yes.",   # Too short — Quantity violation
+    evidence="AI has seen rapid advancement in large language models during 2024-2025."
+)
+print(result)
+# {'violations': {'quantity': True, 'quality': False, 'relation': False, 'manner': False},
+#  'probabilities': {'quantity': 0.97, 'quality': 0.02, 'relation': 0.03, 'manner': 0.11},
+#  'is_cooperative': False}
+```
+## Performance
+Evaluated on **1,000 held-out Topical-Chat dialogue turns** (500 violation-injected, 500 clean).
+| Maxim | F1 | Precision | Recall | AUC-ROC |
+|-------|-----|-----------|--------|---------|
+| Quantity | **1.000** | 1.000 | 1.000 | 1.000 |
+| Quality | 0.928 | 0.866 | 1.000 | 0.999 |
+| Relation | **1.000** | 1.000 | 1.000 | 1.000 |
+| Manner | 0.891 | 0.864 | 0.919 | 0.979 |
+| **Macro Avg** | **0.955** | — | — | — |
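The F1 column follows from the precision and recall columns, and the macro average is the unweighted mean of the four per-maxim F1 scores. A small sketch that recomputes both from the table's values (the helper name `f1` and the dictionary layout are illustrative, not part of the released code):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the Performance table.
rows = {
    "quantity": (1.000, 1.000),
    "quality": (0.866, 1.000),
    "relation": (1.000, 1.000),
    "manner": (0.864, 0.919),
}

per_maxim_f1 = {name: f1(p, r) for name, (p, r) in rows.items()}
macro_f1 = sum(per_maxim_f1.values()) / len(per_maxim_f1)

for name, score in per_maxim_f1.items():
    print(f"{name}: {score:.3f}")
print(f"macro: {macro_f1:.3f}")
```

Recomputed this way, the values round to the reported 0.928 (Quality), 0.891 (Manner), and 0.955 (macro average).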
+## Architecture & Training
+- **Base model:** `microsoft/deberta-v3-base` (184M parameters)
+- **Heads:** 4 independent binary classification heads (one per maxim)
+- **Loss:** Focal Loss (α=0.25, γ=2.0) for class imbalance
+- **Calibration:** Per-head temperature scaling (see `temperatures.json`)
+- **Training data:** 4,012 examples (weak supervision + ~1,000 gold labels)
+- **Epochs:** 5 | **LR:** 2e-5 | **Hardware:** Kaggle T4 ×2, ~2–3 hours
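For readers unfamiliar with focal loss, here is a minimal per-example sketch for one binary head, using the α and γ values listed above. It follows the standard binary focal loss formulation; whether the training code uses exactly this variant (for example, how α is applied to negatives) is an assumption, and `binary_focal_loss` is a hypothetical name.

```python
import math


def binary_focal_loss(logit: float, target: int,
                      alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Focal loss for a single binary prediction.

    Assumed form: FL = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is
    the predicted probability of the true class.
    """
    p = 1.0 / (1.0 + math.exp(-logit))           # sigmoid
    p_t = p if target == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = max(p_t, 1e-12)                        # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction contributes far less than an uncertain one.
easy = binary_focal_loss(4.0, 1)
hard = binary_focal_loss(0.0, 1)
print(easy < hard)  # True
```

The `(1 - p_t) ** gamma` factor is what down-weights easy examples, which is why this loss is commonly chosen when one class dominates the training set.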
+**Calibrated temperatures:**
+| Maxim | Temperature | Effect |
+|-------|-------------|--------|
+| Quantity | 0.90 | Slightly sharper |
+| Quality | 0.55 | Conservative (fewer false positives) |
+| Relation | 0.75 | Balanced |
+| Manner | 0.45 | Most conservative (subjective maxim) |
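As a rough illustration of per-head temperature scaling, the sketch below applies the temperatures from the table to raw logits. It assumes the common convention of dividing each logit by its temperature before the sigmoid and a 0.5 decision threshold; the Quick Start code that actually does this is elided in the diff, so the released implementation may differ. The names `TEMPERATURES`, `calibrated_prob`, and `calibrate` are hypothetical, as are the example logits.

```python
import math

# Values copied from the table above; the dict layout is an assumption
# and may not match the structure of temperatures.json.
TEMPERATURES = {"quantity": 0.90, "quality": 0.55, "relation": 0.75, "manner": 0.45}


def calibrated_prob(logit: float, temperature: float) -> float:
    """Sigmoid of the temperature-scaled logit (assumed convention: logit / T)."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def calibrate(logits: dict, threshold: float = 0.5) -> dict:
    """Turn per-maxim logits into probabilities and thresholded violation flags."""
    probs = {m: calibrated_prob(logits[m], t) for m, t in TEMPERATURES.items()}
    violations = {m: p > threshold for m, p in probs.items()}
    return {"probabilities": probs, "violations": violations}


# Hypothetical logits for a single response.
example = calibrate({"quantity": 3.0, "quality": -2.0, "relation": -1.5, "manner": -0.4})
print(example["violations"])
```

Under this assumed convention, dividing by a temperature below 1 moves probabilities away from 0.5 without changing which side of 0.5 they fall on.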
+## Files
+| File | Description |
+|------|-------------|
+| `pytorch_model.pt` | Trained model weights |
+| `temperatures.json` | Per-maxim calibration temperatures |
+## Limitations & Biases
+- **Subjectivity:** The "Manner" maxim is inherently subjective; detection reflects the labels in the training set.
+- **Domain Specificity:** Performance is optimized for general knowledge dialogue (Topical-Chat). Results may vary in specialized domains.
+- **English-Only:** This model is trained and evaluated exclusively on English dialogue.
+- **Prompt Sensitivity:** Detection results can be sensitive to the formatting of the "Evidence" field.
+## Citation
 ```bibtex
  @article{prabhath2026gricebench,
   title={GriceBench: Operationalizing Gricean Maxims for Cooperative Dialogue Evaluation and Generation},
@@ -216,3 +206,15 @@ Evaluated on **1,000 held-out Topical-Chat dialogue turns** (500 violation-injec
   note={Under review, EMNLP 2026}
 }
 ```
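+## Related Models
+| Model | Role | Link |
+|-------|------|------|
+| GriceBench-Detector | Detects violations (this model) | You are here |
+| GriceBench-Repair | Repairs detected violations | [🔧 Repair](https://huggingface.co/Pushkar27/GriceBench-Repair) |
+| GriceBench-DPO | Generates cooperative responses | [⚡ DPO](https://huggingface.co/Pushkar27/GriceBench-DPO) |
+
+GitHub: https://github.com/PushkarPrabhath27/Research-Model
+## Environmental Impact
+| Aspect | Value |
+|--------|-------|
+| Hardware Used | 2x NVIDIA Tesla T4 GPUs (Kaggle) |
+| Training Time | ~3 hours |
+| Estimated Carbon Footprint | ~0.45 kg CO2eq |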