---
language:
- en
- zh
license: apache-2.0
base_model: Kwai-Kolors/Keye-VL
tags:
- vision
- image-classification
- reward-model
- reinforcement-learning
- multimodal
- llama-factory
pipeline_tag: image-classification
library_name: transformers
---

# HUMOR-RM (Keye-VL Version)

<div align="center">

**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-COT](https://huggingface.co/OpenDILabCommunity/HUMOR-COT-Qwen2.5-VL)**

</div>

## Model Summary

**HUMOR-RM** is a pairwise reward model designed to evaluate and rank the humor quality of internet memes. It serves as the preference model in the **HUMOR** (Hierarchical Understanding and Meme Optimization) framework.

This specific version is fine-tuned on **Keye-VL**, utilizing a dataset of pairwise meme comparisons (ranked by human annotators). It takes two memes (sharing the same template) as input and predicts which one is funnier, providing a consistent proxy for human preference.
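Conceptually, the reward head emits one logit per meme in the pair; a softmax over the two logits gives the probability that meme A is funnier than meme B. A minimal sketch, assuming a two-logit pairwise head as suggested by the inference code in this card:

```python
import math

def pairwise_preference(logit_a: float, logit_b: float) -> float:
    """Softmax over the two pairwise logits -> P(A is funnier than B)."""
    m = max(logit_a, logit_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logit_a - m), math.exp(logit_b - m)
    return ea / (ea + eb)

# Example: logits (1.2, -0.3) -> A preferred with ~82% probability
p = pairwise_preference(1.2, -0.3)
print(f"P(A > B) = {p:.3f}")
```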

## Requirements

This model follows the **LLaMA-Factory** framework structure; running inference requires `llamafactory` to be installed.

```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .

```

## How to Use

Since this model uses a custom classification head on top of Keye-VL, we recommend using the provided wrapper class for inference.

### 1. Configuration (`config.yaml`)

Create a `config.yaml` file pointing to the base model and this adapter:

```yaml
model_name_or_path: Kwai-Kolors/Keye-VL
adapter_name_or_path: path_to_this_repo  # HF repo ID or local path
template: keye  # Important: Must match Keye-VL template
trust_remote_code: true
finetuning_type: lora

```

### 2. Python Inference Code

```python
import torch
import yaml
from llamafactory.hparams import get_infer_args
from llamafactory.data import get_template_and_fix_tokenizer
from llamafactory.model import load_tokenizer
from llamafactory.model import AutoModelForBinaryClassification  # custom head used by this repo
from llamafactory.model.model_utils.classification_head import prepare_classification_model
from llamafactory.model.patcher import patch_classification_model
from transformers import AutoModel

class MemeScorer:
    def __init__(self, config_path):
        with open(config_path) as f:
            config = yaml.safe_load(f)
        
        # Force RM configuration
        config.update({'stage': 'rm_class', 'finetuning_type': 'lora'})
        model_args, data_args, _, _ = get_infer_args(config)
        
        # 1. Load Tokenizer & Template
        tokenizer_mod = load_tokenizer(model_args)
        self.tokenizer = tokenizer_mod["tokenizer"]
        self.processor = tokenizer_mod.get("processor")
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args)
        
        # 2. Load Base Model
        self.model = AutoModel.from_pretrained(
            model_args.model_name_or_path, 
            trust_remote_code=True, 
            device_map="auto", 
            torch_dtype=torch.float16
        )
        
        # 3. Attach & Load Reward Head
        prepare_classification_model(self.model)
        self.model = AutoModelForBinaryClassification.from_pretrained(self.model)
        patch_classification_model(self.model)
        
        if model_args.adapter_name_or_path:
            self.model.load_classification_head(model_args.adapter_name_or_path[0])
            print("Loaded Humor Adapter.")
            
        self.model.eval()

    def score(self, img1_path, img2_path, prompt="Which meme is funnier?"):
        # Construct Input
        messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": ""}]
        images = [img1_path, img2_path]
        
        # Tokenize using Template
        proc_msgs = self.template.mm_plugin.process_messages(messages, images, [], [], self.processor)
        input_ids, _ = self.template.mm_plugin.process_token_ids([], [], images, [], [], self.tokenizer, self.processor)
        encoded = self.template.encode_multiturn(self.tokenizer, proc_msgs, None, None)
        input_ids += encoded[0][0]
        
        # Forward Pass
        inputs = {
            "input_ids": torch.tensor([input_ids]).to(self.model.device),
            "attention_mask": torch.tensor([[1]*len(input_ids)]).to(self.model.device),
            "images": [images] # Image processor handling depends on Keye-VL version
        }
        
        with torch.no_grad():
            logits = self.model(**inputs).logits.cpu().numpy()[0]
            
        # Logits: [score_A, score_B]; interpretation depends on the exact head config (higher = funnier)
        return logits

# Usage
if __name__ == "__main__":
    scorer = MemeScorer("assets/config.yaml")
    scores = scorer.score("assets/meme_a.jpg", "assets/meme_b.jpg")
    print(f"Scores: {scores} (Winner: {'A' if scores[0] > scores[1] else 'B'})")

```

## Intended Use

* **Group-wise Ranking:** Evaluating a set of generated captions for a single meme template to select the best punchline.
* **RLHF/RLAIF:** Providing reward signals for Reinforcement Learning training of meme generators.
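Group-wise ranking can be built on top of the pairwise model as a round-robin tournament: compare every candidate against every other and rank by win count. A sketch assuming the `MemeScorer` wrapper above (the tie-breaking by insertion order is an illustrative choice, not part of the released code):

```python
from itertools import combinations

def rank_memes(scorer, image_paths):
    """Round-robin pairwise ranking: the meme winning the most
    head-to-head comparisons is ranked first."""
    wins = {path: 0 for path in image_paths}
    for a, b in combinations(image_paths, 2):
        scores = scorer.score(a, b)  # [score_A, score_B]
        wins[a if scores[0] > scores[1] else b] += 1
    return sorted(image_paths, key=lambda path: wins[path], reverse=True)
```

Note that ranking `n` candidates this way costs `n * (n - 1) / 2` forward passes, so for large candidate pools a single-elimination bracket is a cheaper approximation.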

## Training Data

The model was trained on the **HUMOR-Preference Dataset**, which consists of 5 difficulty tiers of meme pairs:

1. **Wrong Text:** Original vs. Random text.
2. **Wrong Location:** Correct text vs. Misplaced text box.
3. **Boring:** Original vs. Non-humorous description.
4. **Detailed Boring:** Subtle text changes that kill the joke.
5. **Generated:** Fine-grained comparison between model-generated memes.

![Training Data Examples](assets/datasets_with_different_tier.png)
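The on-disk schema of the preference dataset is not published in this card; for orientation only, a pairwise record for RM training typically looks like the following (hypothetical field names, illustrating a tier-5 pair, not the dataset's actual format):

```python
# Hypothetical pairwise preference record (field names are illustrative):
record = {
    "images": ["meme_original.jpg", "meme_generated.jpg"],  # same template
    "prompt": "Which meme is funnier?",
    "chosen": 0,  # index of the human-preferred meme in "images"
    "tier": 5,    # difficulty tier (1 = Wrong Text ... 5 = Generated)
}
print(record["images"][record["chosen"]])
```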

## Citation

```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}

```