---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B
tags:
- chat
library_name: transformers
---
## Links for Reference
- **Homepage:** https://cupid.kixlab.org
- **Repository:** https://github.com/kixlab/CUPID
- **Benchmark Dataset:** https://huggingface.co/datasets/kixlab/CUPID
- **Paper:** https://arxiv.org/abs/2508.01674
- **Point of Contact:** taesoo.kim@kaist.ac.kr
# TL;DR
**PrefMatcher-7B** instantiates the *Preference Match* metric proposed in the [CUPID benchmark](https://huggingface.co/datasets/kixlab/CUPID) (COLM 2025). Given a preference description and an evaluation checklist, the model assesses whether each checklist item matches or is covered by the preference. It was trained from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as its base model. PrefMatcher provides a high-fidelity, cost-efficient judge for automatic evaluation on the CUPID benchmark.
# Model Details
PrefMatcher-7B was fine-tuned with QLoRA for 1 epoch on 4k data samples (i.e., preference-checklist matches). The samples were created through the synthesis pipeline for the CUPID benchmark and then labeled (i.e., matched) by GPT-4o. PrefMatcher achieves a Krippendorff's alpha of 0.748 with human annotations. The model was trained with the [torchtune](https://github.com/pytorch/torchtune) library.
## Model Description
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** Apache 2.0
# Usage
Here is example code that uses the model with [vLLM](https://github.com/vllm-project/vllm) to predict whether each item in an evaluation checklist is covered by a preference.
```python
from vllm import LLM, SamplingParams
model_name = "kixlab/prefmatcher-7b"
# Load the model
llm = LLM(
    model=model_name,
    load_format="safetensors",
    kv_cache_dtype="auto",
    max_model_len=512,
)
# Prepare example input
preference = "Analysis should focus exclusively on visible surface defects and their direct correlation to specific printer settings."
checklist = [
"Does the training document provide a detailed framework?",
"Does the training document provide a systematic framework?",
"Does the framework link external and internal test cube measurements to specific diagnostics?",
"Does the framework link external and internal test cube measurements to specific quality improvement actions?",
]
checklist_str = "\n".join([f"{i+1}. {item}" for i, item in enumerate(checklist)])
messages = [
    {
        "role": "system",
        "content": "You are an analytical and insightful assistant that can determine the similarity between **evaluation checklists** and **evaluation criteria**. A criterion describes an aspect of AI outputs that should be evaluated. A checklist contains questions that are used to evaluate more specific or fine-grained aspects of the AI outputs. You will be provided with pairs of checklists and criteria. For each pair, you should determine whether each entry in the checklist is **covered** by the criterion. **Covered** means that the criterion and the checklist entry will evaluate the same or similar aspects of an AI output, even if they use different wording or phrasing.",
    },
    {
        "role": "user",
        "content": f"#### Criterion\n\n{preference}\n\n#### Checklist\n\n{checklist_str}",
    },
]
sampling_params = SamplingParams(
    max_tokens=512,
    temperature=0.7,
)
# Generate the output
outputs = llm.chat(messages, sampling_params=sampling_params, use_tqdm=False)
# Print the output
print(outputs[0].outputs[0].text)
```
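When matching many preference-checklist pairs, the prompt construction above can be factored into a small helper. This is a convenience sketch, not part of the released model code; the function name `build_match_messages` is illustrative, and the section headers and 1-based checklist numbering follow the usage example above.

```python
def build_match_messages(system_prompt: str, preference: str, checklist: list[str]) -> list[dict]:
    """Assemble chat messages for one preference-checklist matching query.

    Mirrors the prompt layout of the usage example: checklist items are
    numbered from 1, and the user turn uses the "#### Criterion" /
    "#### Checklist" headers.
    """
    checklist_str = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(checklist))
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": f"#### Criterion\n\n{preference}\n\n#### Checklist\n\n{checklist_str}",
        },
    ]

# Example: pass the full system prompt from the usage example as `system_prompt`.
messages = build_match_messages(
    "You are an analytical and insightful assistant ...",  # abbreviated here
    "Analysis should focus exclusively on visible surface defects.",
    ["Does the framework link measurements to specific diagnostics?"],
)
```

The resulting `messages` list can be passed directly to `llm.chat(...)` as in the example above.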
# Training Details
## Training hyperparameters
The following hyperparameters were used for training:
- learning_rate: 3e-4
- train_batch_size: 4
- gradient_accumulation_steps: 8
- weight_decay: 1e-2
- optimizer: AdamW
- lr_scheduler_type: Cosine with warmup
- num_warmup_steps: 100
- lora_rank: 64
- lora_alpha: 128
- lora_dropout: 0.0
- lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
- apply_lora_to_mlp: True
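For reference, the settings above can be gathered into a single Python dict. The field names simply mirror the bullet list and are not a literal torchtune config file; note that `lora_alpha` follows the common 2x-rank heuristic, and the batch size and gradient accumulation together imply an effective batch size of 32 per optimizer step (assuming a single device, which is not stated here).

```python
# LoRA fine-tuning settings from the list above, collected in one place.
# Illustrative field names; not an exact torchtune config schema.
lora_config = {
    "learning_rate": 3e-4,
    "train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "weight_decay": 1e-2,
    "lora_rank": 64,
    "lora_alpha": 128,  # = 2 * lora_rank
    "lora_dropout": 0.0,
    "lora_attn_modules": ["q_proj", "v_proj", "output_proj"],
    "apply_lora_to_mlp": True,
}

# Effective batch size per optimizer step.
effective_batch_size = (
    lora_config["train_batch_size"] * lora_config["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 32
```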
# Citation
If you find our work useful, please consider citing our paper!
**BibTeX:**
```bibtex
@article{kim2025cupid,
title = {CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions},
author = {Kim, Tae Soo and Lee, Yoonjoo and Park, Yoonah and Kim, Jiho and Kim, Young-Ho and Kim, Juho},
journal = {arXiv preprint arXiv:2508.01674},
year = {2025},
}
```