---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B
tags:
- chat
library_name: transformers
---

## Links for Reference

- **Homepage:** https://cupid.kixlab.org
- **Repository:** https://github.com/kixlab/CUPID
- **Benchmark Dataset:** https://huggingface.co/datasets/kixlab/CUPID
- **Paper:** https://arxiv.org/abs/2508.01674
- **Point of Contact:** taesoo.kim@kaist.ac.kr

# TL;DR

**PrefMatcher-7B** instantiates the *Preference Match* metric proposed in the [CUPID benchmark](https://huggingface.co/datasets/kixlab/CUPID) (COLM 2025). Given a preference description and an evaluation checklist, the model assesses whether each checklist item matches or is covered by the preference. The model is trained using [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as its base model. PrefMatcher provides a high-fidelity, cost-efficient judge for automatic evaluation on the CUPID benchmark.

# Model Details

PrefMatcher-7B was fine-tuned with QLoRA for 1 epoch on 4k data samples (i.e., preference-checklist matches) using the [torchtune](https://github.com/pytorch/torchtune) library. The data samples were created with the synthesis pipeline for the CUPID benchmark and then evaluated (matched) by GPT-4o. PrefMatcher achieved a Krippendorff's alpha of 0.748 with human annotations.
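
For readers unfamiliar with the agreement statistic mentioned above, the following is a minimal sketch of Krippendorff's alpha for nominal labels. It is an illustration only: the card does not state which variant of alpha or which tooling was used, and the function name `krippendorff_alpha_nominal` and the toy labels are assumptions made up for this example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    `units` is a list of items; each item is the list of labels it received
    (one label per rater who judged it).
    """
    # Coincidence matrix: every ordered pair of labels within a unit counts
    # with weight 1 / (m - 1), where m is the number of labels in that unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable labels.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement use only the off-diagonal cells.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # undefined when only one label ever occurs
    return 1.0 - observed / expected


# Toy data (invented): [model label, human label] for four checklist items.
toy_units = [
    ["match", "match"],
    ["match", "match"],
    ["no match", "no match"],
    ["match", "no match"],
]
alpha = krippendorff_alpha_nominal(toy_units)
print(round(alpha, 3))  # 0.533
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the toy result of 0.533 reflects one disagreement out of four items.
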

## Model Description

- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** Apache 2.0

# Usage

The following example uses [vLLM](https://github.com/vllm-project/vllm) to predict the match between a preference and an evaluation checklist.

```python
from vllm import LLM, SamplingParams

model_name = "kixlab/prefmatcher-7b"

# Load the model
llm = LLM(
    model=model_name,
    load_format="safetensors",
    kv_cache_dtype="auto",
    max_model_len=512
)

# Prepare example input
preference = "Analysis should focus exclusively on visible surface defects and their direct correlation to specific printer settings."
checklist = [
    "Does the training document provide a detailed framework?",
    "Does the training document provide a systematic framework?",
    "Does the framework link external and internal test cube measurements to specific diagnostics?",
    "Does the framework link external and internal test cube measurements to specific quality improvement actions?",
]

checklist_str = "\n".join([f"{i+1}. {item}" for i, item in enumerate(checklist)])
messages = [{
    "role": "system",
    "content": "You are an analytical and insightful assistant that can determine the similarity between **evaluation checklists** and **evaluation criteria**. A criterion describes an aspect of AI outputs that should be evaluated. A checklist contain questions that are used to evaluate more specific or fine-grained aspects of the AI outputs. You will be provided with pairs of checklists and criteria. For each pair, you should determine whether each entry in the checklist is **covered** by the criterion. **Covered** means that the criterion and the checklist entry will evaluate the same or similar aspects of an AI output, even if they use different wording or phrasing."
},
{
    "role": "user",
    "content": f"#### Criterion\n\n{preference}\n\n#### Checklist\n\n{checklist_str}"
}]

sampling_params = SamplingParams(
    max_tokens=512,
    temperature=0.7
)

# Generate the output
outputs = llm.chat(messages, sampling_params=sampling_params, use_tqdm=False)

# Print the output
print(outputs[0].outputs[0].text)
```
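
When evaluating many preference-checklist pairs, it may help to factor the prompt construction from the example above into a small helper. The sketch below is an assumption-laden illustration, not part of the released code: `build_messages`, `pairs`, and the placeholder system prompt are hypothetical names and values, and the real system prompt from the example above should be passed in.

```python
def build_messages(system_prompt, preference, checklist):
    """Build one chat conversation in the same layout as the example above."""
    # Number the checklist entries as "1. ...", "2. ...", one per line.
    checklist_str = "\n".join(
        f"{i + 1}. {item}" for i, item in enumerate(checklist)
    )
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": f"#### Criterion\n\n{preference}\n\n#### Checklist\n\n{checklist_str}",
        },
    ]


# Placeholder only: substitute the full system prompt shown in the example above.
SYSTEM_PROMPT = "<system prompt from the usage example>"

# Hypothetical (preference, checklist) pairs made up for illustration.
pairs = [
    ("Responses should be concise.", ["Is the response brief?", "Does it cite sources?"]),
    ("Explanations should include examples.", ["Is an example provided?"]),
]
conversations = [build_messages(SYSTEM_PROMPT, p, c) for p, c in pairs]
print(conversations[0][1]["content"])
```

Recent vLLM releases accept a list of conversations in `llm.chat`, so `llm.chat(conversations, sampling_params=sampling_params)` should return one output per pair; if your installed version does not support this, call `llm.chat` once per conversation instead.
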

# Training Details

## Training hyperparameters

The following hyperparameters were used for training:

- learning_rate: 3e-4
- train_batch_size: 4
- gradient_accumulation_steps: 8
- weight_decay: 1e-2
- optimizer: AdamW
- lr_scheduler_type: Cosine with warmup
- num_warmup_steps: 100
- lora_rank: 64
- lora_alpha: 128
- lora_dropout: 0.0
- lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
- apply_lora_to_mlp: True
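
A few quantities follow directly from the hyperparameters listed above. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the card does not report the number of devices or the exact sample count, so `num_devices = 1` and `num_samples = 4000` (from "4k") are guesses, and the LoRA scaling uses the standard `alpha / rank` convention.

```python
import math

# Values taken from the hyperparameter list above.
train_batch_size = 4
gradient_accumulation_steps = 8
lora_rank = 64
lora_alpha = 128

# Assumptions (not stated in the card).
num_devices = 1      # assumed single-device training
num_samples = 4000   # "4k" samples; the exact count is not given

# Samples consumed per optimizer step.
effective_batch_size = train_batch_size * gradient_accumulation_steps * num_devices

# Optimizer steps in the single training epoch, under the assumptions above.
steps_per_epoch = math.ceil(num_samples / effective_batch_size)

# Standard LoRA scaling applied to the low-rank update.
lora_scale = lora_alpha / lora_rank

print(effective_batch_size, steps_per_epoch, lora_scale)
```

With more devices or a different sample count, the effective batch size and step count change accordingly, so treat these numbers as rough estimates rather than reported values.
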

# Citation

If you find our work useful, please consider citing our paper!

**BibTeX:**

```bibtex
@article{kim2025cupid,
  title = {CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions},
  author = {Kim, Tae Soo and Lee, Yoonjoo and Park, Yoonah and Kim, Jiho and Kim, Young-Ho and Kim, Juho},
  journal = {arXiv preprint arXiv:2508.01674},
  year = {2025},
}
```