---
language:
- en
- zh
license: apache-2.0
base_model: Kwai-Kolors/Keye-VL
tags:
- vision
- image-classification
- reward-model
- reinforcement-learning
- multimodal
- llama-factory
pipeline_tag: image-classification
library_name: transformers
---
# HUMOR-RM (Keye-VL Version)
<div align="center">
**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-COT](https://huggingface.co/OpenDILabCommunity/HUMOR-COT-Qwen2.5-VL)**
</div>
## Model Summary
**HUMOR-RM** is a pairwise reward model designed to evaluate and rank the humor quality of internet memes. It serves as the preference model in the **HUMOR** (Hierarchical Understanding and Meme Optimization) framework.
This specific version is fine-tuned on **Keye-VL**, utilizing a dataset of pairwise meme comparisons (ranked by human annotators). It takes two memes (sharing the same template) as input and predicts which one is funnier, providing a consistent proxy for human preference.
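Concretely, the model emits one score per meme in the pair, and the preference probability follows from a softmax over the two scores, which for two logits reduces to a sigmoid of their difference. A minimal sketch of this readout (the exact logit order depends on the head configuration, as noted in the inference code):

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry style probability that meme A is funnier than meme B,
    given the two scalar scores produced by the reward model."""
    # Softmax over two logits is equivalent to a sigmoid of their difference.
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Equal scores mean no preference (p = 0.5); a large gap means a confident pick.
p = preference_probability(1.2, -0.3)
```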
## Requirements
This model is built on the **LLaMA-Factory** framework. To run inference, you must have `llamafactory` installed:
```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
```
## How to Use
Since this model uses a custom classification head on top of Keye-VL, we recommend using the provided wrapper class for inference.
### 1. Configuration (`config.yaml`)
Create a `config.yaml` file pointing to the base model and this adapter:
```yaml
model_name_or_path: Kwai-Kolors/Keye-VL
adapter_name_or_path: path_to_this_repo # or Local Path
template: keye # Important: Must match Keye-VL template
trust_remote_code: true
finetuning_type: lora
```
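If you prefer to generate the config programmatically, the same fields can be written with PyYAML. A small sketch (the adapter path is a placeholder; point it at your local copy of this repo):

```python
import yaml

# Placeholder paths -- replace adapter_name_or_path with your local checkout.
config = {
    "model_name_or_path": "Kwai-Kolors/Keye-VL",
    "adapter_name_or_path": "path_to_this_repo",
    "template": "keye",  # must match the Keye-VL chat template
    "trust_remote_code": True,
    "finetuning_type": "lora",
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```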
### 2. Python Inference Code
```python
import torch
import yaml
from llamafactory.hparams import get_infer_args
from llamafactory.model import load_tokenizer, AutoModelForBinaryClassification
from llamafactory.data import get_template_and_fix_tokenizer
from llamafactory.model.model_utils.classification_head import prepare_classification_model
from llamafactory.model.patcher import patch_classification_model
from transformers import AutoModel


class MemeScorer:
    def __init__(self, config_path):
        with open(config_path) as f:
            config = yaml.safe_load(f)
        # Force reward-model configuration
        config.update({"stage": "rm_class", "finetuning_type": "lora"})
        model_args, data_args, _, _ = get_infer_args(config)

        # 1. Load tokenizer & template
        tokenizer_mod = load_tokenizer(model_args)
        self.tokenizer = tokenizer_mod["tokenizer"]
        self.processor = tokenizer_mod.get("processor")
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args)

        # 2. Load base model
        self.model = AutoModel.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=True,
            device_map="auto",
            torch_dtype=torch.float16,
        )

        # 3. Attach & load the reward head
        prepare_classification_model(self.model)
        self.model = AutoModelForBinaryClassification.from_pretrained(self.model)
        patch_classification_model(self.model)
        if model_args.adapter_name_or_path:
            self.model.load_classification_head(model_args.adapter_name_or_path[0])
            print("Loaded humor adapter.")
        self.model.eval()

    def score(self, img1_path, img2_path, prompt="Which meme is funnier?"):
        # Construct the chat input
        messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": ""}]
        images = [img1_path, img2_path]

        # Tokenize using the Keye-VL template
        proc_msgs = self.template.mm_plugin.process_messages(messages, images, [], [], self.processor)
        input_ids, _ = self.template.mm_plugin.process_token_ids([], [], images, [], [], self.tokenizer, self.processor)
        encoded = self.template.encode_multiturn(self.tokenizer, proc_msgs, None, None)
        input_ids += encoded[0][0]

        # Forward pass
        inputs = {
            "input_ids": torch.tensor([input_ids]).to(self.model.device),
            "attention_mask": torch.tensor([[1] * len(input_ids)]).to(self.model.device),
            "images": [images],  # image processor handling depends on the Keye-VL version
        }
        with torch.no_grad():
            logits = self.model(**inputs).logits.cpu().numpy()[0]
        # logits: [score_pair_0, score_pair_1] (depends on exact head config, usually prob(A > B))
        return logits


# Usage
if __name__ == "__main__":
    scorer = MemeScorer("assets/config.yaml")
    scores = scorer.score("assets/meme_a.jpg", "assets/meme_b.jpg")
    print(f"Scores: {scores} (Winner: {'A' if scores[0] > scores[1] else 'B'})")
```
## Intended Use
* **Group-wise Ranking:** Evaluating a set of generated captions for a single meme template to select the best punchline.
* **RLHF/RLAIF:** Providing reward signals for Reinforcement Learning training of meme generators.
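For group-wise ranking, one simple strategy is a round-robin tournament: score every pair in the group and rank candidates by their number of wins. A minimal sketch, where `score_fn` stands in for `MemeScorer.score` (its `(score_a, score_b)` return shape is an assumption about the interface):

```python
from itertools import combinations

def rank_by_round_robin(candidates, score_fn):
    """Rank candidates by pairwise wins.
    `score_fn(a, b)` returns (score_a, score_b); higher means funnier."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        score_a, score_b = score_fn(a, b)
        wins[a if score_a > score_b else b] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

# Toy example with a mock scorer that simply prefers longer captions.
mock_score = lambda a, b: (len(a), len(b))
ranking = rank_by_round_robin(["ok", "funny", "hilarious"], mock_score)
```

Round-robin costs O(n²) comparisons per group; for large candidate sets a single-elimination bracket over the same pairwise calls is cheaper.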
## Training Data
The model was trained on the **HUMOR-Preference Dataset**, which consists of 5 difficulty tiers of meme pairs:
1. **Wrong Text:** Original vs. Random text.
2. **Wrong Location:** Correct text vs. Misplaced text box.
3. **Boring:** Original vs. Non-humorous description.
4. **Detailed Boring:** Subtle text changes that kill the joke.
5. **Generated:** Fine-grained comparison between model-generated memes.
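Each tier reduces to a chosen/rejected pair sharing one template. A hypothetical record shape (the field names here are illustrative only, not the dataset's actual schema):

```python
# Hypothetical preference record; field names are illustrative, not the real schema.
pair = {
    "template_id": "example_template",
    "tier": 3,  # 1 = Wrong Text ... 5 = Generated
    "chosen": {"image": "meme_a.jpg", "caption": "original punchline"},
    "rejected": {"image": "meme_b.jpg", "caption": "non-humorous description"},
}

def is_valid_pair(p):
    """Minimal sanity check: both sides present and tier within the 5 levels."""
    return {"chosen", "rejected"} <= p.keys() and 1 <= p["tier"] <= 5
```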

## Citation
```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}
```