---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B
tags:
- chat
library_name: transformers
---

## Links for Reference

- **Homepage:** https://cupid.kixlab.org
- **Repository:** https://github.com/kixlab/CUPID
- **Benchmark Dataset:** https://huggingface.co/datasets/kixlab/CUPID
- **Paper:** https://arxiv.org/abs/2508.01674
- **Point of Contact:** taesoo.kim@kaist.ac.kr

# TL;DR

**PrefMatcher-7B** instantiates the *Preference Match* metric proposed in the [CUPID benchmark](https://huggingface.co/datasets/kixlab/CUPID) (COLM 2025). The model takes a preference description and an evaluation checklist, and assesses whether each checklist item matches or is covered by the preference. It was trained using [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as its base model. PrefMatcher provides a high-fidelity, cost-efficient judge for automatic evaluation on the CUPID benchmark.

# Model Details

PrefMatcher-7B was fine-tuned through QLoRA for 1 epoch on 4k data samples (i.e., preference-checklist matches), and achieved a Krippendorff's alpha of 0.748 with human annotations. The data samples were created through the synthesis pipeline for the CUPID benchmark, and were then evaluated or matched by GPT-4o. The model was trained with the [torchtune](https://github.com/pytorch/torchtune) library.

## Model Description

- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** Apache 2.0

# Usage

Here is example code that uses the model with [vLLM](https://github.com/vllm-project/vllm) to predict the match between a preference and an evaluation checklist.
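The example depends on vLLM being installed; a minimal environment setup (package names as published on PyPI, pin versions as appropriate for your CUDA setup):

```shell
pip install vllm
```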
```python
from vllm import LLM, SamplingParams

model_name = "kixlab/prefmatcher-7b"

# Load the model
llm = LLM(
    model=model_name,
    load_format="safetensors",
    kv_cache_dtype="auto",
    max_model_len=512
)

# Prepare example input
preference = "Analysis should focus exclusively on visible surface defects and their direct correlation to specific printer settings."
checklist = [
    "Does the training document provide a detailed framework?",
    "Does the training document provide a systematic framework?",
    "Does the framework link external and internal test cube measurements to specific diagnostics?",
    "Does the framework link external and internal test cube measurements to specific quality improvement actions?",
]
checklist_str = "\n".join([f"{i+1}. {item}" for i, item in enumerate(checklist)])

messages = [
    {
        "role": "system",
        "content": "You are an analytical and insightful assistant that can determine the similarity between **evaluation checklists** and **evaluation criteria**. A criterion describes an aspect of AI outputs that should be evaluated. A checklist contain questions that are used to evaluate more specific or fine-grained aspects of the AI outputs. You will be provided with pairs of checklists and criteria. For each pair, you should determine whether each entry in the checklist is **covered** by the criterion. **Covered** means that the criterion and the checklist entry will evaluate the same or similar aspects of an AI output, even if they use different wording or phrasing."
    },
    {
        "role": "user",
        "content": f"#### Criterion\n\n{preference}\n\n#### Checklist\n\n{checklist_str}"
    }
]

sampling_params = SamplingParams(
    max_tokens=512,
    temperature=0.7
)

# Generate the output
outputs = llm.chat(messages, sampling_params=sampling_params, use_tqdm=False)

# Print the output
print(outputs[0].outputs[0].text)
```

# Training Details

## Training hyperparameters

The following hyperparameters were used for training:

- learning_rate: 3e-4
- train_batch_size: 4
- gradient_accumulation_steps: 8
- weight_decay: 1e-2
- optimizer: AdamW
- lr_scheduler_type: Cosine with warmup
- num_warmup_steps: 100
- lora_rank: 64
- lora_alpha: 128
- lora_dropout: 0.0
- lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
- apply_lora_to_mlp: True

# Citation

If you find our work useful, please consider citing our paper!

**BibTeX:**

```bibtex
@article{kim2025cupid,
  title   = {CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions},
  author  = {Kim, Tae Soo and Lee, Yoonjoo and Park, Yoonah and Kim, Jiho and Kim, Young-Ho and Kim, Juho},
  journal = {arXiv preprint arXiv:2508.01674},
  year    = {2025},
}
```
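To turn the judge's free-text output into per-item labels, some post-processing is needed. A minimal sketch, assuming (hypothetically) that the model answers with one numbered Yes/No verdict per checklist entry; `parse_matches` and `preference_match` are illustrative helpers rather than part of the released code, and the coverage-ratio aggregation is one plausible scoring choice, so check the model's actual outputs and the paper's metric definition before relying on it.

```python
import re

def parse_matches(output_text, num_items):
    """Parse the judge's output into per-item coverage labels.

    Assumes a hypothetical output format of one numbered Yes/No
    verdict per checklist entry, e.g. "1. Yes\n2. No"; adapt the
    pattern to the model's real outputs.
    """
    labels = {}
    for num, verdict in re.findall(r"(\d+)\.\s*(Yes|No)", output_text, re.IGNORECASE):
        labels[int(num)] = verdict.lower() == "yes"
    # Treat any item the model did not address as "not covered"
    return [labels.get(i + 1, False) for i in range(num_items)]

def preference_match(labels):
    """Fraction of checklist items judged as covered by the preference."""
    return sum(labels) / len(labels) if labels else 0.0
```

For example, `parse_matches("1. Yes\n2. No\n3. Yes\n4. Yes", 4)` yields one boolean per checklist item, and `preference_match` reduces those booleans to a single coverage score.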