---
language:
- en
- zh
license: apache-2.0
base_model: Kwai-Kolors/Keye-VL
tags:
- vision
- image-classification
- reward-model
- reinforcement-learning
- multimodal
- llama-factory
pipeline_tag: image-classification
library_name: transformers
---
# HUMOR-RM (Keye-VL Version)
**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-COT](https://huggingface.co/OpenDILabCommunity/HUMOR-COT-Qwen2.5-VL)**
## Model Summary
**HUMOR-RM** is a pairwise reward model designed to evaluate and rank the humor quality of internet memes. It serves as the preference model in the **HUMOR** (Hierarchical Understanding and Meme Optimization) framework.
This version is fine-tuned from **Keye-VL** on a dataset of pairwise meme comparisons ranked by human annotators. It takes two memes that share the same template as input and predicts which one is funnier, providing a consistent proxy for human preference.
## Requirements
This model was built with the **LLaMA-Factory** framework. To run inference, you must have `llamafactory` installed.
```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
```
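Before running the inference code, it can help to confirm that the editable install is visible to the Python interpreter you plan to use. The snippet below is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and not part of LLaMA-Factory.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package) is not None


# `llamafactory` is the package name installed by `pip install -e .` above.
print("llamafactory installed:", is_installed("llamafactory"))
```

If this prints `False`, the install most likely went into a different environment than the one running the script.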
## How to Use
Since this model uses a custom classification head on top of Keye-VL, we recommend using the provided wrapper class for inference.
### 1. Configuration (`config.yaml`)
Create a `config.yaml` file pointing to the base model and this adapter:
```yaml
model_name_or_path: Kwai-Kolors/Keye-VL
adapter_name_or_path: path_to_this_repo  # or a local path
template: keye # Important: Must match Keye-VL template
trust_remote_code: true
finetuning_type: lora
```
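A quick sanity check on the parsed configuration can catch a missing field before the base model is downloaded. The sketch below works on the dictionary you get from `yaml.safe_load`; the choice of which keys count as required (`REQUIRED_KEYS`) and the helper `missing_keys` are assumptions made for illustration, not something enforced by LLaMA-Factory.

```python
# Keys this sketch treats as mandatory (an assumption based on the config above).
REQUIRED_KEYS = ("model_name_or_path", "adapter_name_or_path", "template")


def missing_keys(config: dict) -> list:
    """Return the required keys that are absent or empty in the config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Same values as the YAML example, written as the dict that safe_load would return.
example = {
    "model_name_or_path": "Kwai-Kolors/Keye-VL",
    "adapter_name_or_path": "path_to_this_repo",
    "template": "keye",
    "trust_remote_code": True,
    "finetuning_type": "lora",
}

problems = missing_keys(example)
print("Missing keys:", problems if problems else "none")
```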
### 2. Python Inference Code
```python
import torch
import yaml
from llamafactory.hparams import get_infer_args
from llamafactory.model import load_tokenizer, get_template_and_fix_tokenizer
from llamafactory.model import AutoModelForBinaryClassification
from llamafactory.model.model_utils.classification_head import prepare_classification_model
from llamafactory.model.patcher import patch_classification_model
from transformers import AutoConfig, AutoModel


class MemeScorer:
    def __init__(self, config_path):
        with open(config_path) as f:
            config = yaml.safe_load(f)
        # Force RM configuration
        config.update({'stage': 'rm_class', 'finetuning_type': 'lora'})
        model_args, data_args, _, _ = get_infer_args(config)

        # 1. Load Tokenizer & Template
        tokenizer_mod = load_tokenizer(model_args)
        self.tokenizer = tokenizer_mod["tokenizer"]
        self.processor = tokenizer_mod.get("processor")
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args)

        # 2. Load Base Model
        self.model = AutoModel.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=True,
            device_map="auto",
            torch_dtype=torch.float16
        )

        # 3. Attach & Load Reward Head
        prepare_classification_model(self.model)
        self.model = AutoModelForBinaryClassification.from_pretrained(self.model)
        patch_classification_model(self.model)
        if model_args.adapter_name_or_path:
            self.model.load_classification_head(model_args.adapter_name_or_path[0])
            print("Loaded Humor Adapter.")
        self.model.eval()

    def score(self, img1_path, img2_path, prompt="Which meme is funnier?"):
        # Construct Input
        messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": ""}]
        images = [img1_path, img2_path]

        # Tokenize using Template
        proc_msgs = self.template.mm_plugin.process_messages(messages, images, [], [], self.processor)
        input_ids, _ = self.template.mm_plugin.process_token_ids([], [], images, [], [], self.tokenizer, self.processor)
        encoded = self.template.encode_multiturn(self.tokenizer, proc_msgs, None, None)
        input_ids += encoded[0][0]

        # Forward Pass
        inputs = {
            "input_ids": torch.tensor([input_ids]).to(self.model.device),
            "attention_mask": torch.tensor([[1]*len(input_ids)]).to(self.model.device),
            "images": [images]  # Image processor handling depends on Keye-VL version
        }
        with torch.no_grad():
            logits = self.model(**inputs).logits.cpu().numpy()[0]

        # Logits: [Score_Pair_0, Score_Pair_1] (Depends on exact head config, usually prob(A>B))
        return logits


# Usage
if __name__ == "__main__":
    scorer = MemeScorer("assets/config.yaml")
    scores = scorer.score("assets/meme_a.jpg", "assets/meme_b.jpg")
    print(f"Scores: {scores} (Winner: {'A' if scores[0] > scores[1] else 'B'})")
```
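The raw logits returned by `score` are easier to interpret as probabilities. The sketch below applies a numerically stable softmax to a two-element score vector. It assumes the two values are unnormalized class scores for "A is funnier" and "B is funnier"; as the comment in the inference code notes, the exact meaning depends on the head configuration, so treat this as an illustration rather than the model's documented output format. The helper name `preference_probs` is hypothetical.

```python
import math


def preference_probs(logits):
    """Convert a sequence of logits into probabilities with a stable softmax."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for illustration only; real values come from MemeScorer.score.
p_a, p_b = preference_probs([1.2, -0.3])
print(f"P(A funnier) = {p_a:.3f}, P(B funnier) = {p_b:.3f}")
```

Because softmax is monotonic, the winner picked from the probabilities is the same as the winner picked from the raw logits.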
## Intended Use
* **Group-wise Ranking:** Evaluating a set of generated captions for a single meme template to select the best punchline.
* **RLHF/RLAIF:** Providing reward signals for Reinforcement Learning training of meme generators.
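Both uses above can be built from pairwise comparisons. One simple option, sketched below, is a round-robin tournament: compare every pair of candidates, count wins, and use each candidate's win rate both as its rank key and as a scalar reward. This is an illustrative aggregation scheme, not necessarily the one used in the HUMOR paper. The `prefer` callable is a hypothetical stand-in for a wrapper around `MemeScorer.score`, and the toy comparator in the usage lines exists only to make the example runnable.

```python
from itertools import combinations


def rank_by_wins(candidates, prefer):
    """Rank candidates by round-robin win rate.

    `prefer(a, b)` must return True when `a` is judged funnier than `b`.
    Returns (ranking, rewards), where rewards maps candidate -> win rate in [0, 1].
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        winner = a if prefer(a, b) else b
        wins[winner] += 1
    games = max(len(candidates) - 1, 1)  # each candidate plays every other once
    rewards = {c: wins[c] / games for c in candidates}
    ranking = sorted(candidates, key=lambda c: rewards[c], reverse=True)
    return ranking, rewards


# Toy comparator for demonstration: the longer string "wins".
# In practice, prefer(a, b) would call the reward model on the two meme images.
ranking, rewards = rank_by_wins(["a", "bbb", "cc"], lambda a, b: len(a) > len(b))
print(ranking, rewards)
```

A full round-robin needs n(n-1)/2 model calls for n candidates, so it is practical mainly for small groups.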
## Training Data
The model was trained on the **HUMOR-Preference Dataset**, which consists of meme pairs in five difficulty tiers:
1. **Wrong Text:** Original vs. Random text.
2. **Wrong Location:** Correct text vs. Misplaced text box.
3. **Boring:** Original vs. Non-humorous description.
4. **Detailed Boring:** Subtle text changes that kill the joke.
5. **Generated:** Fine-grained comparison between model-generated memes.

## Citation
```bibtex
@article{li2025perception,
title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
journal={arXiv preprint arXiv:2512.24555},
year={2025}
}
```