⚡ GriceBench-DPO

GPT-2-medium fine-tuned with Direct Preference Optimization to generate cooperative dialogue.


Part of the GriceBench system: GitHub | 🔍 Detector | 🔧 Repair Model


What This Model Does

GriceBench-DPO is a LoRA-adapted GPT-2-medium model trained with Direct Preference Optimization (DPO) to generate dialogue responses that comply with Gricean conversational maxims. It is the generation stage of the GriceBench pipeline, producing responses that are more likely to be cooperative before any post-generation detection and repair is applied.

| Metric | Score | Context |
|---|---|---|
| Standalone cooperative rate | 83.2% | Using this model alone |
| Full pipeline cooperative rate | 95.0% | DPO + Detector + Repair |
| DPO preference accuracy | 75.0% | Held-out preference pairs |
| DPO eval loss | 0.5595 | End of training |

Important: The 95.0% figure requires the full pipeline. This model alone achieves 83.2%, roughly on par with the un-tuned baseline (83.8%), but with Relation violations dramatically reduced (~62% → ~10%).


Quick Start

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load LoRA adapter on GPT-2-medium base
adapter_path = "Pushkar27/GriceBench-DPO"
config = PeftConfig.from_pretrained(adapter_path)
print(f"Base model: {config.base_model_name_or_path}")
# → openai-community/gpt2-medium

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.float32,
)
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

def generate_cooperative_response(context: str, max_new_tokens: int = 80) -> str:
    prompt = f"Context: {context}\nResponse:"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.85,
            top_p=0.92,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.eos_token_id,
        )

    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


context = "What do you think about the history of jazz music in New Orleans?"
print(generate_cooperative_response(context))

Full Pipeline Usage (Recommended for Best Results)

# For the 95.0% cooperative rate, chain all three GriceBench models.
# detect_violations() and repair_violation() wrap the Detector and Repair
# models; their implementations live in the GitHub repository linked below.

# Step 1: Generate with this DPO model
response = generate_cooperative_response(context)

# Step 2: Detect any remaining violations ("evidence" is the grounding passage, if any)
result = detect_violations(context, response, evidence)

# Step 3: Repair each flagged violation (Relation is handled upstream by DPO)
for maxim, violated in result["violations"].items():
    if violated and maxim != "relation":
        response = repair_violation(context, response, maxim)

print(response)

Full pipeline implementation: GitHub repository


Ablation Results (Why You Need the Full Pipeline)

| Configuration | Cooperative Rate | Notes |
|---|---|---|
| Baseline (GPT-2, no tuning) | 83.8% | Reference |
| This model (DPO only) | 83.2% | Relation violations -52pp; Manner unchanged |
| Detect + Repair (no DPO) | 93.0% | Repair handles Manner |
| Full System | 95.0% | DPO + Detect + Repair combined |

Why DPO alone barely moves the overall number: DPO dramatically reduces Relation violations (62% → ~10%) but cannot address Manner violations (still ~64%), which are the dominant failure mode. The repair model handles Manner. Together: 95.0%.


Training Details

Model Architecture

| Parameter | Value |
|---|---|
| Base model | openai-community/gpt2-medium (355M) |
| Method | LoRA (Low-Rank Adaptation) |
| LoRA rank (r) | 128 |
| LoRA alpha (α) | 256 |
| Target modules | q, k, v, o attention projections |
| Adapter size | ~25 MB |
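
For reference, a minimal sketch of a matching LoRA configuration in peft. GPT-2 fuses the q/k/v projections into a single c_attn module, so the target_modules below are an assumption about how "q, k, v, o" maps onto GPT-2's layer names; the authoritative list is in adapter_config.json.

from peft import LoraConfig

# Sketch only: module names assume GPT-2's fused attention layout
# (c_attn = q/k/v, c_proj = output projection); verify against adapter_config.json.
lora_config = LoraConfig(
    r=128,                                 # LoRA rank from the table above
    lora_alpha=256,                        # LoRA alpha
    target_modules=["c_attn", "c_proj"],   # assumed mapping of the q/k/v/o projections
    task_type="CAUSAL_LM",
)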

DPO Training

| Hyperparameter | Value |
|---|---|
| Algorithm | Direct Preference Optimization (DPO) |
| DPO β | 0.1 |
| Learning rate | 5e-7 |
| Batch size | 16 (gradient accumulation ×8) |
| Epochs | 3 |
| Training pairs | 1,970 filtered preference pairs |
| Hardware | Kaggle P100-16GB, ~24 minutes |
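
These settings translate roughly into the following TRL setup. This is a sketch rather than the project's training script: the per-device batch size of 2 is an assumption (2 × 8 accumulation steps = effective batch 16), and the exact API varies across trl versions (this assumes a recent trl where DPOConfig carries beta).

from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="gricebench-dpo",
    beta=0.1,                        # DPO β from the table above
    learning_rate=5e-7,
    num_train_epochs=3,
    per_device_train_batch_size=2,   # assumed; effective batch 16 with accumulation
    gradient_accumulation_steps=8,
)

trainer = DPOTrainer(
    model=model,                 # the LoRA-wrapped policy model
    args=training_args,
    train_dataset=pref_dataset,  # pairs with "prompt", "chosen", "rejected" columns
    processing_class=tokenizer,
)
trainer.train()

With a PEFT policy, trl can treat the frozen base weights as the implicit reference model, so no separate ref_model copy is required.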

DPO Loss (Plain Text)

The DPO loss maximizes the margin between chosen (y_w) and rejected (y_l) responses relative to a reference model:

L_DPO = -log sigmoid( beta * [ log(pi(y_w|x)/pi_ref(y_w|x)) - log(pi(y_l|x)/pi_ref(y_l|x)) ] )

where beta = 0.1 controls preference strength, y_w = cooperative response, y_l = violating response.
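
A minimal PyTorch sketch of this loss. The argument names are illustrative; each is the summed token log-probability of a complete response under the policy (pi) or the frozen reference (pi_ref):

import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # log(pi(y_w|x)/pi_ref(y_w|x)) - log(pi(y_l|x)/pi_ref(y_l|x))
    margin = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    # -log sigmoid(beta * margin), averaged over the batch
    return -F.logsigmoid(beta * margin).mean()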

Training Data

| Source | Pairs | Description |
|---|---|---|
| Human-labeled | 411 | Expert-verified cooperative/violating pairs |
| Repair-derived | ~1,200 | (original violation, T5-repaired output) pairs |
| Synthetic (LLM) | ~1,200 | Generated via Groq API (llama-3.3-70b) |
| Total (filtered) | 1,970 | After conflict-detection filtering |
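
Each source reduces to the standard DPO triple. A hypothetical record for illustration, using the common prompt/chosen/rejected field names rather than a confirmed schema from this project:

preference_pair = {
    "prompt": "Context: What do you think about the history of jazz in New Orleans?\nResponse:",
    "chosen": "Jazz took shape in New Orleans around the turn of the 20th century...",  # y_w, cooperative
    "rejected": "I had pasta for lunch today.",  # y_l, off-topic (Relation violation)
}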

Files

| File | Description |
|---|---|
| adapter_config.json | LoRA configuration (base model, rank, alpha) |
| adapter_model.safetensors | LoRA weights (~25 MB) |
| tokenizer.json | GPT-2 tokenizer |
| tokenizer_config.json | Tokenizer configuration |
| special_tokens_map.json | Special token mappings |
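
To confirm the repo contents without downloading the weights, huggingface_hub can list the files directly:

from huggingface_hub import list_repo_files

# Prints the file names hosted in the adapter repository
for filename in list_repo_files("Pushkar27/GriceBench-DPO"):
    print(filename)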

Limitations

  • Manner violations persist standalone: DPO reduces Relation violations but not Manner. The full pipeline is required for the headline 95.0% result.
  • Single domain: Trained and evaluated on Topical-Chat only.
  • English only: No multilingual support.
  • Preference accuracy (75.0%) vs. Phase 5 training accuracy (98.7%): The 75.0% figure comes from held-out Phase 7 evaluation and is the canonical number; the 98.7% figure was measured in-distribution during Phase 5 and is not representative.

Citation

@article{prabhath2026gricebench,
  title={GriceBench: Operationalizing Gricean Maxims for Cooperative Dialogue Evaluation and Generation},
  author={Prabhath, Pushkar},
  year={2026},
  note={Under review, EMNLP 2026}
}

Related Models

| Model | Role | Link |
|---|---|---|
| GriceBench-Detector | Detects violations | 🔍 Detector |
| GriceBench-Repair | Repairs violations | 🔧 Repair |
| GriceBench-DPO | Generates cooperative responses (this model) | You are here |

GitHub: https://github.com/PushkarPrabhath27/Research-Model


Environmental Impact

| Aspect | Value |
|---|---|
| Hardware Used | NVIDIA Tesla P100 GPU |
| Training Time | ~24 minutes |
| Estimated Carbon Footprint | ~0.05 kg CO2eq |