---
language:
- en
license: apache-2.0
library_name: peft
tags:
- text-generation
- dialogue
- gricean-maxims
- cooperative-communication
- lora
- dpo
- direct-preference-optimization
- peft
- gpt2
- nlp
datasets:
- topical-chat
metrics:
- cooperative_rate
pipeline_tag: text-generation
base_model: openai-community/gpt2-medium
model-index:
- name: GriceBench-DPO
  results:
  - task:
      type: text-generation
      name: Cooperative Dialogue Generation
    dataset:
      name: Topical-Chat (GriceBench test split)
      type: topical-chat
      split: test
    metrics:
    - type: cooperative_rate
      value: 0.832
      name: Standalone Cooperative Rate
    - type: cooperative_rate
      value: 0.950
      name: Full Pipeline Cooperative Rate
    - type: accuracy
      value: 0.750
      name: DPO Preference Accuracy
---
# ⚡ GriceBench-DPO

**GPT-2-medium fine-tuned with Direct Preference Optimization to generate cooperative dialogue.**

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![PEFT LoRA](https://img.shields.io/badge/🤗-PEFT%20LoRA-yellow)](https://huggingface.co/docs/peft) [![HuggingFace](https://img.shields.io/badge/🤗-GriceBench-yellow)](https://huggingface.co/Pushkar27)

**Part of the GriceBench system** — [GitHub](https://github.com/PushkarPrabhath27/Research-Model) | [🔍 Detector](https://huggingface.co/Pushkar27/GriceBench-Detector) | [🔧 Repair Model](https://huggingface.co/Pushkar27/GriceBench-Repair)
---

## What This Model Does

GriceBench-DPO is a LoRA-adapted GPT-2-medium model trained with Direct Preference Optimization (DPO) to generate dialogue responses that comply with Gricean conversational maxims. It is the **generation stage** of the GriceBench pipeline, producing responses that are more likely to be cooperative *before* any post-generation detection and repair is applied.

| Metric | Score | Context |
|--------|-------|---------|
| Standalone cooperative rate | 83.2% | Using this model alone |
| Full pipeline cooperative rate | **95.0%** | DPO + Detector + Repair |
| DPO preference accuracy | 75.0% | Held-out preference pairs |
| DPO eval loss | 0.5595 | End of training |

> **Important:** The 95.0% figure requires the full pipeline. This model alone achieves 83.2%, roughly on par with the un-tuned baseline (83.8%), but with Relation violations dramatically reduced (~62% → ~10%).

---

## Quick Start

```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the LoRA adapter on top of the GPT-2-medium base model
adapter_path = "Pushkar27/GriceBench-DPO"
config = PeftConfig.from_pretrained(adapter_path)
print(f"Base model: {config.base_model_name_or_path}")  # → openai-community/gpt2-medium

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.float32,
)
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

def generate_cooperative_response(context: str, max_new_tokens: int = 80) -> str:
    prompt = f"Context: {context}\nResponse:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.85,
            top_p=0.92,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.eos_token_id,
        )
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

context = "What do you think about the history of jazz music in New Orleans?"
print(generate_cooperative_response(context))
```

---

## Full Pipeline Usage (Recommended for Best Results)

```python
# For the 95.0% cooperative rate, use all three GriceBench models together:

# Step 1: Generate with this DPO model
response = generate_cooperative_response(context)

# Step 2: Detect any remaining violations
# `evidence` is whatever supporting text the detector expects for this context
evidence = ""
result = detect_violations(context, response, evidence)

# Step 3: Repair each flagged violation
for maxim, violated in result["violations"].items():
    if violated and maxim != "relation":
        response = repair_violation(context, response, maxim)

print(response)
```

Full pipeline implementation: [GitHub repository](https://github.com/PushkarPrabhath27/Research-Model). A rough, non-authoritative sketch of the `detect_violations` and `repair_violation` helpers appears after the ablation table below.

---

## Ablation Results (Why You Need the Full Pipeline)

| Configuration | Cooperative Rate | Notes |
|---------------|------------------|-------|
| Baseline (GPT-2, no tuning) | 83.8% | Reference |
| **This model (DPO only)** | **83.2%** | Relation violations −52 pp; Manner unchanged |
| Detect + Repair (no DPO) | 93.0% | Repair handles Manner |
| **Full system** | **95.0%** | DPO + Detect + Repair combined |

**Why DPO alone barely moves the overall number:** DPO dramatically reduces Relation violations (~62% → ~10%) but cannot address Manner violations (still ~64%), which remain the dominant failure mode. The repair model handles Manner, so the combined pipeline reaches 95.0%.
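The `detect_violations` and `repair_violation` helpers used in the Full Pipeline Usage snippet are defined in the GitHub repository. The sketch below is purely illustrative: it assumes the Detector loads as a multi-label text-classification model and the Repair model as a sequence-to-sequence model, and the input formats, label order, and threshold are placeholders rather than the canonical interfaces.

```python
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# NOTE: illustrative placeholders only. The real interfaces of the Detector and
# Repair models are defined in the GriceBench GitHub repository; the model
# classes, prompt formats, label order, and threshold below are assumptions.
MAXIMS = ["quantity", "quality", "relation", "manner"]

det_tok = AutoTokenizer.from_pretrained("Pushkar27/GriceBench-Detector")
det_model = AutoModelForSequenceClassification.from_pretrained("Pushkar27/GriceBench-Detector")

rep_tok = AutoTokenizer.from_pretrained("Pushkar27/GriceBench-Repair")
rep_model = AutoModelForSeq2SeqLM.from_pretrained("Pushkar27/GriceBench-Repair")

def detect_violations(context: str, response: str, evidence: str = "") -> dict:
    """Return {"violations": {maxim: bool}} for a context/response pair."""
    text = f"Context: {context}\nEvidence: {evidence}\nResponse: {response}"
    inputs = det_tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.sigmoid(det_model(**inputs).logits)[0]  # assumed multi-label head
    return {"violations": {m: bool(probs[i] > 0.5) for i, m in enumerate(MAXIMS)}}

def repair_violation(context: str, response: str, maxim: str) -> str:
    """Rewrite `response` to remove a violation of the given maxim."""
    prompt = f"repair {maxim}: Context: {context} Response: {response}"
    inputs = rep_tok(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out_ids = rep_model.generate(**inputs, max_new_tokens=96)
    return rep_tok.decode(out_ids[0], skip_special_tokens=True).strip()
```

If the published checkpoints use different head types or prompt templates, prefer the loading code in the GitHub repository over this sketch.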
---

## Training Details

### Model Architecture

| Parameter | Value |
|-----------|-------|
| Base model | `openai-community/gpt2-medium` (355M parameters) |
| Method | LoRA (Low-Rank Adaptation) |
| LoRA rank (r) | 128 |
| LoRA alpha (α) | 256 |
| Target modules | q, k, v, o attention projections |
| Adapter size | ~25 MB |

### DPO Training

| Hyperparameter | Value |
|----------------|-------|
| Algorithm | Direct Preference Optimization (DPO) |
| DPO β | 0.1 |
| Learning rate | 5e-7 |
| Batch size | 16 (gradient accumulation ×8) |
| Epochs | 3 |
| Training pairs | 1,970 filtered preference pairs |
| Hardware | Kaggle P100 (16 GB), ~24 minutes |

### DPO Loss (Plain Text)

The DPO loss maximizes the margin between the chosen (y_w) and rejected (y_l) responses relative to a reference model:

`L_DPO = -log sigmoid( beta * [ log(pi(y_w|x) / pi_ref(y_w|x)) - log(pi(y_l|x) / pi_ref(y_l|x)) ] )`

where beta = 0.1 controls preference strength, y_w is the cooperative (chosen) response, and y_l is the violating (rejected) response. A minimal PyTorch sketch of this loss is given in the appendix at the end of this card.

### Training Data

| Source | Pairs | Description |
|--------|-------|-------------|
| Human-labeled | 411 | Expert-verified cooperative/violating pairs |
| Repair-derived | ~1,200 | (original violation, T5-repaired output) pairs |
| Synthetic (LLM) | ~1,200 | Generated via the Groq API (llama-3.3-70b) |
| **Total (filtered)** | **1,970** | After conflict-detection filtering |

---

## Files

| File | Description |
|------|-------------|
| `adapter_config.json` | LoRA configuration (base model, rank, alpha) |
| `adapter_model.safetensors` | LoRA weights (~25 MB) |
| `tokenizer.json` | GPT-2 tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `special_tokens_map.json` | Special token mappings |

---

## Limitations

- **Manner violations persist standalone:** DPO reduces Relation violations but not Manner violations. The full pipeline is required for the headline 95.0% result.
- **Single domain:** Trained and evaluated on Topical-Chat only.
- **English only:** No multilingual support.
- **Preference accuracy (75.0%) vs. Phase 5 training accuracy (98.7%):** The 75.0% figure comes from the held-out Phase 7 evaluation and is the canonical number; the 98.7% figure came from in-distribution Phase 5 evaluation and is not representative.

---

## Citation

```bibtex
@article{prabhath2026gricebench,
  title={GriceBench: Operationalizing Gricean Maxims for Cooperative Dialogue Evaluation and Generation},
  author={Prabhath, Pushkar},
  year={2026},
  note={Under review, EMNLP 2026}
}
```

---

## Related Models

| Model | Role | Link |
|-------|------|------|
| GriceBench-Detector | Detects violations | [🔍 Detector](https://huggingface.co/Pushkar27/GriceBench-Detector) |
| GriceBench-Repair | Repairs violations | [🔧 Repair](https://huggingface.co/Pushkar27/GriceBench-Repair) |
| GriceBench-DPO | Generates cooperative responses (this model) | You are here |

**GitHub:** https://github.com/PushkarPrabhath27/Research-Model

---

## Environmental Impact

| Aspect | Value |
|--------|-------|
| Hardware Used | NVIDIA Tesla P100 GPU |
| Training Time | ~24 minutes |
| Estimated Carbon Footprint | ~0.05 kg CO2eq |
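---

## Appendix: DPO Loss Sketch

To connect the plain-text DPO loss above to code, here is a minimal PyTorch sketch. It assumes the sequence-level log-probabilities (summed over response tokens) have already been computed for both the policy and the frozen reference model; the variable names are illustrative and do not come from the actual training script.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log pi(y_w | x), summed over tokens
    policy_rejected_logp: torch.Tensor,  # log pi(y_l | x)
    ref_chosen_logp: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logp: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                   # preference strength used for this model
) -> torch.Tensor:
    # Log-ratios of policy vs. reference for the chosen (y_w) and rejected (y_l) responses
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # L_DPO = -log sigmoid( beta * (chosen_logratio - rejected_logratio) )
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy example with made-up log-probabilities for a batch of two preference pairs
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-12.0, -15.0]),
    policy_rejected_logp=torch.tensor([-14.0, -15.5]),
    ref_chosen_logp=torch.tensor([-13.0, -15.2]),
    ref_rejected_logp=torch.tensor([-13.5, -15.0]),
)
print(loss.item())
```

The gradient of this loss pushes the policy to assign relatively more probability to the cooperative response y_w than to the violating response y_l, compared with the reference model.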