---
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
license: apache-2.0 |
|
|
base_model: Kwai-Kolors/Keye-VL |
|
|
tags: |
|
|
- vision |
|
|
- image-classification |
|
|
- reward-model |
|
|
- reinforcement-learning |
|
|
- multimodal |
|
|
- llama-factory |
|
|
pipeline_tag: image-classification |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# HUMOR-RM (Keye-VL Version) |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-COT](https://huggingface.co/OpenDILabCommunity/HUMOR-COT-Qwen2.5-VL)** |
|
|
|
|
|
</div> |
|
|
|
|
|
## Model Summary |
|
|
|
|
|
**HUMOR-RM** is a pairwise reward model designed to evaluate and rank the humor quality of internet memes. It serves as the preference model in the **HUMOR** (Hierarchical Understanding and Meme Optimization) framework. |
|
|
|
|
|
This version is fine-tuned from **Keye-VL** on a dataset of pairwise meme comparisons ranked by human annotators. Given two memes that share the same template, it predicts which one is funnier, providing a consistent proxy for human preference.
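For intuition, a pairwise reward model of this kind can be read through a Bradley-Terry lens: each meme receives a scalar humor score, and the probability that meme A beats meme B is the sigmoid of the score difference. The sketch below illustrates that idea only; it is not necessarily how HUMOR-RM's head is parameterized, and the scores are placeholders:

```python
import torch

def pairwise_preference(score_a: torch.Tensor, score_b: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: P(A funnier than B) = sigmoid(score_A - score_B)
    return torch.sigmoid(score_a - score_b)

# Example: scores 1.8 vs. 0.3 give P(A funnier) ~ 0.82
print(pairwise_preference(torch.tensor(1.8), torch.tensor(0.3)))
```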
|
|
|
|
|
## Requirements |
|
|
|
|
|
This model is built on the **LLaMA-Factory** framework. To run inference, you must have `llamafactory` installed:
|
|
|
|
|
```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
```
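After installing, you can verify the setup with the CLI that the editable install registers:

```bash
llamafactory-cli version
```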
|
|
|
|
|
## How to Use |
|
|
|
|
|
Since this model uses a custom classification head on top of Keye-VL, we recommend using the provided wrapper class for inference. |
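Conceptually, the reward head is a small linear probe that maps the backbone's final hidden state to two logits, one per candidate meme. The sketch below is an illustration of that idea, not the exact head shipped in this repo:

```python
import torch
import torch.nn as nn

class PairwiseRewardHead(nn.Module):
    """Illustrative binary head: pools the final token's hidden state and
    emits two logits, one per candidate meme. Hypothetical layout; the
    head actually used in this repo may differ."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the VLM backbone
        return self.score(hidden_states[:, -1, :])
```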
|
|
|
|
|
### 1. Configuration (`config.yaml`) |
|
|
|
|
|
Create a `config.yaml` file pointing to the base model and this adapter: |
|
|
|
|
|
```yaml
model_name_or_path: Kwai-Kolors/Keye-VL
adapter_name_or_path: path_to_this_repo  # or a local path
template: keye                           # important: must match the Keye-VL template
trust_remote_code: true
finetuning_type: lora
```
|
|
|
|
|
### 2. Python Inference Code |
|
|
|
|
|
```python
import torch
import yaml

from llamafactory.data import get_template_and_fix_tokenizer
from llamafactory.hparams import get_infer_args
from llamafactory.model import load_tokenizer
from llamafactory.model import AutoModelForBinaryClassification
from llamafactory.model.model_utils.classification_head import prepare_classification_model
from llamafactory.model.patcher import patch_classification_model
from transformers import AutoModel


class MemeScorer:
    def __init__(self, config_path):
        with open(config_path) as f:
            config = yaml.safe_load(f)

        # Force the reward-model configuration
        config.update({"stage": "rm_class", "finetuning_type": "lora"})
        model_args, data_args, _, _ = get_infer_args(config)

        # 1. Load tokenizer, processor, and chat template
        tokenizer_mod = load_tokenizer(model_args)
        self.tokenizer = tokenizer_mod["tokenizer"]
        self.processor = tokenizer_mod.get("processor")
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args)

        # 2. Load the base vision-language model
        self.model = AutoModel.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=True,
            device_map="auto",
            torch_dtype=torch.float16,
        )

        # 3. Attach the binary classification (reward) head and load its weights
        prepare_classification_model(self.model)
        self.model = AutoModelForBinaryClassification.from_pretrained(self.model)
        patch_classification_model(self.model)

        if model_args.adapter_name_or_path:
            self.model.load_classification_head(model_args.adapter_name_or_path[0])
            print("Loaded Humor Adapter.")

        self.model.eval()

    def score(self, img1_path, img2_path, prompt="Which meme is funnier?"):
        # Construct the chat-style input with both images attached
        messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": ""}]
        images = [img1_path, img2_path]

        # Tokenize using the Keye-VL template and its multimodal plugin
        proc_msgs = self.template.mm_plugin.process_messages(messages, images, [], [], self.processor)
        input_ids, _ = self.template.mm_plugin.process_token_ids([], [], images, [], [], self.tokenizer, self.processor)
        encoded = self.template.encode_multiturn(self.tokenizer, proc_msgs, None, None)
        input_ids += encoded[0][0]

        # Forward pass through the model and reward head
        inputs = {
            "input_ids": torch.tensor([input_ids]).to(self.model.device),
            "attention_mask": torch.tensor([[1] * len(input_ids)]).to(self.model.device),
            "images": [images],  # image preprocessing depends on the Keye-VL version
        }

        with torch.no_grad():
            logits = self.model(**inputs).logits.cpu().numpy()[0]

        # Logits: [score_pair_0, score_pair_1]; with the default head, a higher
        # first entry means meme A is preferred (roughly P(A > B))
        return logits


# Usage
if __name__ == "__main__":
    scorer = MemeScorer("assets/config.yaml")
    scores = scorer.score("assets/meme_a.jpg", "assets/meme_b.jpg")
    print(f"Scores: {scores} (Winner: {'A' if scores[0] > scores[1] else 'B'})")
```
|
|
|
|
|
## Intended Use |
|
|
|
|
|
* **Group-wise Ranking:** Evaluating a set of generated captions for a single meme template to select the best punchline (see the round-robin sketch after this list).
|
|
* **RLHF/RLAIF:** Providing reward signals for reinforcement learning training of meme generators.
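Group-wise ranking reduces to pairwise calls: run a round robin over the candidate set and count wins. A minimal sketch built on the `MemeScorer` wrapper above (the candidate paths are placeholders; for large sets, a tournament or sampling scheme would avoid the quadratic number of comparisons):

```python
from itertools import combinations

def rank_memes(scorer, meme_paths):
    """Round-robin ranking: every candidate is compared against every other,
    and each pairwise win earns one point."""
    wins = {path: 0 for path in meme_paths}
    for a, b in combinations(meme_paths, 2):
        scores = scorer.score(a, b)
        wins[a if scores[0] > scores[1] else b] += 1
    return sorted(meme_paths, key=wins.get, reverse=True)

# candidates = ["assets/meme_a.jpg", "assets/meme_b.jpg", "assets/meme_c.jpg"]
# best = rank_memes(MemeScorer("assets/config.yaml"), candidates)[0]
```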
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on the **HUMOR-Preference Dataset**, which consists of 5 difficulty tiers of meme pairs (an example record is sketched after the list):
|
|
|
|
|
1. **Wrong Text:** Original vs. Random text. |
|
|
2. **Wrong Location:** Correct text vs. Misplaced text box. |
|
|
3. **Boring:** Original vs. Non-humorous description. |
|
|
4. **Detailed Boring:** Subtle text changes that kill the joke. |
|
|
5. **Generated:** Fine-grained comparison between model-generated memes. |
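For reference, LLaMA-Factory's sharegpt-style preference format stores each comparison as a chosen/rejected pair. A hypothetical tier-1 record might look like the following; the field names follow that generic format, the actual HUMOR-Preference schema may differ, and the file paths are placeholders:

```python
# Hypothetical preference record (LLaMA-Factory sharegpt preference format);
# the HUMOR-Preference dataset's actual schema may differ.
record = {
    "conversations": [{"from": "human", "value": "<image><image>Which meme is funnier?"}],
    "chosen": {"from": "gpt", "value": "The first meme."},
    "rejected": {"from": "gpt", "value": "The second meme."},
    "images": ["memes/original.jpg", "memes/wrong_text.jpg"],
}
```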
|
|
|
|
|
 |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}
```