---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B
tags:
- chat
library_name: transformers
---

## Links for Reference

- **Homepage:** https://cupid.kixlab.org
- **Repository:** https://github.com/kixlab/CUPID
- **Benchmark Dataset:** https://huggingface.co/datasets/kixlab/CUPID
- **Paper:** https://arxiv.org/abs/2508.01674
- **Point of Contact:** taesoo.kim@kaist.ac.kr

# TL;DR

**PrefMatcher-7B** instantiates the *Preference Match* metric proposed in the [CUPID benchmark](https://huggingface.co/datasets/kixlab/CUPID) (COLM 2025). Given a preference description and an evaluation checklist, the model assesses whether each checklist item matches or is covered by the preference. It was trained from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as its base model, and provides a high-fidelity, cost-efficient judge for automatic evaluation on the CUPID benchmark.

# Model Details

PrefMatcher-7B was fine-tuned with QLoRA for 1 epoch on 4k data samples (i.e., preference-checklist matches), and achieves a Krippendorff's alpha of 0.748 with human annotations. The samples were created with the synthesis pipeline for the CUPID benchmark and then matched by GPT-4o. The model was trained with the [torchtune](https://github.com/pytorch/torchtune) library.
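
For context on the agreement figure above, Krippendorff's alpha over binary coverage judgments can be computed with the open-source `krippendorff` package. This is a minimal sketch with made-up placeholder ratings, not the actual evaluation data or script:

```python
# Minimal sketch: inter-rater agreement on binary coverage judgments.
# The ratings below are illustrative placeholders only.
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = raters (model vs. human), columns = checklist items.
# 1 = "covered", 0 = "not covered", np.nan = missing judgment.
ratings = np.array([
    [1, 0, 1, 1, 0, 1, np.nan, 0],  # PrefMatcher-7B
    [1, 0, 1, 0, 0, 1, 1,      0],  # human annotator
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```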

## Model Description

- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** Apache 2.0

# Usage
Here is example code that uses the model with [vLLM](https://github.com/vllm-project/vllm) to predict the match between a preference and an evaluation checklist.
```python
from vllm import LLM, SamplingParams

model_name = "kixlab/prefmatcher-7b"

# Load the model
llm = LLM(
    model=model_name,
    load_format="safetensors",
    kv_cache_dtype="auto",
    max_model_len=2048  # must fit the prompt plus max_tokens of generated output
)

# Prepare example input
preference = "Analysis should focus exclusively on visible surface defects and their direct correlation to specific printer settings."
checklist = [
    "Does the training document provide a detailed framework?",
    "Does the training document provide a systematic framework?",
    "Does the framework link external and internal test cube measurements to specific diagnostics?",
    "Does the framework link external and internal test cube measurements to specific quality improvement actions?",
]

checklist_str = "\n".join([f"{i+1}. {item}" for i, item in enumerate(checklist)])
messages = [{
    "role": "system",
    "content": "You are an analytical and insightful assistant that can determine the similarity between **evaluation checklists** and **evaluation criteria**. A criterion describes an aspect of AI outputs that should be evaluated. A checklist contain questions that are used to evaluate more specific or fine-grained aspects of the AI outputs. You will be provided with pairs of checklists and criteria. For each pair, you should determine whether each entry in the checklist is **covered** by the criterion. **Covered** means that the criterion and the checklist entry will evaluate the same or similar aspects of an AI output, even if they use different wording or phrasing."
},
{
    "role": "user",
    "content": f"#### Criterion\n\n{preference}\n\n#### Checklist\n\n{checklist_str}"
}]

sampling_params = SamplingParams(
    max_tokens=512,
    temperature=0.7
)

# Generate the output
outputs = llm.chat(messages, sampling_params=sampling_params, use_tqdm=False)

# Print the output
print(outputs[0].outputs[0].text)
```
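
The benchmark's *Preference Match* score is the proportion of checklist items judged as covered by the preference. The model's exact response format follows its training data and is not documented here; assuming a hypothetical per-item verdict format such as "1. Yes" / "2. No", a parsing sketch might look like:

```python
import re

# Hypothetical parser: assumes one "N. Yes" or "N. No" verdict line per
# checklist item. Adjust the pattern to the model's actual output format.
def preference_match(response: str, num_items: int) -> float:
    verdicts = re.findall(r"^\s*\d+\.\s*(Yes|No)\b", response,
                          flags=re.MULTILINE | re.IGNORECASE)
    covered = sum(1 for v in verdicts if v.lower() == "yes")
    return covered / num_items

score = preference_match(outputs[0].outputs[0].text, len(checklist))
print(f"Preference Match: {score:.2f}")
```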

# Training Details
## Training hyperparameters

The following hyperparameters were used for training:
- learning_rate: 3e-4
- train_batch_size: 4
- gradient_accumulation_steps: 8
- weight_decay: 1e-2
- optimizer: AdamW
- lr_scheduler_type: Cosine with warmup
- num_warmup_steps: 100
- lora_rank: 64
- lora_alpha: 128
- lora_dropout: 0.0
- lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
- apply_lora_to_mlp: True
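
For readers more familiar with the Hugging Face ecosystem, the LoRA settings above roughly translate to the following `peft` configuration. This is an illustrative equivalent only (the actual training used torchtune); note that torchtune's `output_proj` corresponds to `o_proj` in Hugging Face's Qwen2.5 module naming:

```python
# Illustrative peft equivalent of the torchtune LoRA settings above;
# the actual training was done with torchtune, not peft.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,            # lora_rank
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "v_proj", "o_proj"],  # torchtune's output_proj == HF's o_proj
    # apply_lora_to_mlp=True in torchtune would additionally target
    # gate_proj, up_proj, and down_proj here.
    task_type="CAUSAL_LM",
)
```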

# Citation

If you find our work useful, please consider citing our paper!

**BibTeX:**

```bibtex
@article{kim2025cupid,
  title     = {CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions},
  author    = {Kim, Tae Soo and Lee, Yoonjoo and Park, Yoonah and Kim, Jiho and Kim, Young-Ho and Kim, Juho},
  journal   = {arXiv preprint arXiv:2508.01674},
  year      = {2025},
}
```