---
language:
- en
- zh
license: apache-2.0
base_model: Kwai-Kolors/Keye-VL
tags:
- vision
- image-classification
- reward-model
- reinforcement-learning
- multimodal
- llama-factory
pipeline_tag: image-classification
library_name: transformers
---

# HUMOR-RM (Keye-VL Version)

<div align="center">

**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-COT](https://huggingface.co/OpenDILabCommunity/HUMOR-COT-Qwen2.5-VL)**

</div>

## Model Summary

**HUMOR-RM** is a pairwise reward model designed to evaluate and rank the humor quality of internet memes. It serves as the preference model in the **HUMOR** (Hierarchical Understanding and Meme Optimization) framework.

This specific version is fine-tuned on **Keye-VL**, utilizing a dataset of pairwise meme comparisons (ranked by human annotators). It takes two memes (sharing the same template) as input and predicts which one is funnier, providing a consistent proxy for human preference.
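Conceptually, the reward head emits one logit per meme in the pair; a softmax over the two logits gives the probability that meme A is funnier than meme B. A minimal sketch, assuming a two-logit pairwise head as suggested by the inference code in this card:

```python
import math

def pairwise_preference(logit_a: float, logit_b: float) -> float:
    """Softmax over the two pairwise logits -> P(A is funnier than B)."""
    m = max(logit_a, logit_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logit_a - m), math.exp(logit_b - m)
    return ea / (ea + eb)

# Example: logits (1.2, -0.3) -> A preferred with ~82% probability
p = pairwise_preference(1.2, -0.3)
print(f"P(A > B) = {p:.3f}")
```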

## Requirements

This model follows the **LLaMA-Factory** framework structure; running inference requires `llamafactory` to be installed.

```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .

```

## How to Use

Since this model uses a custom classification head on top of Keye-VL, we recommend using the provided wrapper class for inference.

### 1. Configuration (`config.yaml`)

Create a `config.yaml` file pointing to the base model and this adapter:

```yaml
model_name_or_path: Kwai-Kolors/Keye-VL
adapter_name_or_path: path_to_this_repo  # HF repo ID or local path
template: keye  # Important: Must match Keye-VL template
trust_remote_code: true
finetuning_type: lora

```

### 2. Python Inference Code

```python
import torch
import yaml
from llamafactory.hparams import get_infer_args
from llamafactory.data import get_template_and_fix_tokenizer
from llamafactory.model import load_tokenizer
from llamafactory.model import AutoModelForBinaryClassification  # custom head used by this repo
from llamafactory.model.model_utils.classification_head import prepare_classification_model
from llamafactory.model.patcher import patch_classification_model
from transformers import AutoModel

class MemeScorer:
    def __init__(self, config_path):
        with open(config_path) as f:
            config = yaml.safe_load(f)
        
        # Force RM configuration
        config.update({'stage': 'rm_class', 'finetuning_type': 'lora'})
        model_args, data_args, _, _ = get_infer_args(config)
        
        # 1. Load Tokenizer & Template
        tokenizer_mod = load_tokenizer(model_args)
        self.tokenizer = tokenizer_mod["tokenizer"]
        self.processor = tokenizer_mod.get("processor")
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args)
        
        # 2. Load Base Model
        self.model = AutoModel.from_pretrained(
            model_args.model_name_or_path, 
            trust_remote_code=True, 
            device_map="auto", 
            torch_dtype=torch.float16
        )
        
        # 3. Attach & Load Reward Head
        prepare_classification_model(self.model)
        self.model = AutoModelForBinaryClassification.from_pretrained(self.model)
        patch_classification_model(self.model)
        
        if model_args.adapter_name_or_path:
            self.model.load_classification_head(model_args.adapter_name_or_path[0])
            print("Loaded Humor Adapter.")
            
        self.model.eval()

    def score(self, img1_path, img2_path, prompt="Which meme is funnier?"):
        # Construct Input
        messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": ""}]
        images = [img1_path, img2_path]
        
        # Tokenize using Template
        proc_msgs = self.template.mm_plugin.process_messages(messages, images, [], [], self.processor)
        input_ids, _ = self.template.mm_plugin.process_token_ids([], [], images, [], [], self.tokenizer, self.processor)
        encoded = self.template.encode_multiturn(self.tokenizer, proc_msgs, None, None)
        input_ids += encoded[0][0]
        
        # Forward Pass
        inputs = {
            "input_ids": torch.tensor([input_ids]).to(self.model.device),
            "attention_mask": torch.tensor([[1]*len(input_ids)]).to(self.model.device),
            "images": [images] # Image processor handling depends on Keye-VL version
        }
        
        with torch.no_grad():
            logits = self.model(**inputs).logits.cpu().numpy()[0]
            
        # Logits: [score_A, score_B]; interpretation depends on the exact head config (higher = funnier)
        return logits

# Usage
if __name__ == "__main__":
    scorer = MemeScorer("assets/config.yaml")
    scores = scorer.score("assets/meme_a.jpg", "assets/meme_b.jpg")
    print(f"Scores: {scores} (Winner: {'A' if scores[0] > scores[1] else 'B'})")

```

## Intended Use

* **Group-wise Ranking:** Evaluating a set of generated captions for a single meme template to select the best punchline.
* **RLHF/RLAIF:** Providing reward signals for Reinforcement Learning training of meme generators.
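Group-wise ranking can be built on top of the pairwise model as a round-robin tournament: compare every candidate against every other and rank by win count. A sketch assuming the `MemeScorer` wrapper above (the tie-breaking by insertion order is an illustrative choice, not part of the released code):

```python
from itertools import combinations

def rank_memes(scorer, image_paths):
    """Round-robin pairwise ranking: the meme winning the most
    head-to-head comparisons is ranked first."""
    wins = {path: 0 for path in image_paths}
    for a, b in combinations(image_paths, 2):
        scores = scorer.score(a, b)  # [score_A, score_B]
        wins[a if scores[0] > scores[1] else b] += 1
    return sorted(image_paths, key=lambda path: wins[path], reverse=True)
```

Note that ranking `n` candidates this way costs `n * (n - 1) / 2` forward passes, so for large candidate pools a single-elimination bracket is a cheaper approximation.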

## Training Data

The model was trained on the **HUMOR-Preference Dataset**, which consists of 5 difficulty tiers of meme pairs:

1. **Wrong Text:** Original vs. Random text.
2. **Wrong Location:** Correct text vs. Misplaced text box.
3. **Boring:** Original vs. Non-humorous description.
4. **Detailed Boring:** Subtle text changes that kill the joke.
5. **Generated:** Fine-grained comparison between model-generated memes.

![Training Data Examples](assets/datasets_with_different_tier.png)
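The on-disk schema of the preference dataset is not published in this card; for orientation only, a pairwise record for RM training typically looks like the following (hypothetical field names, illustrating a tier-5 pair, not the dataset's actual format):

```python
# Hypothetical pairwise preference record (field names are illustrative):
record = {
    "images": ["meme_original.jpg", "meme_generated.jpg"],  # same template
    "prompt": "Which meme is funnier?",
    "chosen": 0,  # index of the human-preferred meme in "images"
    "tier": 5,    # difficulty tier (1 = Wrong Text ... 5 = Generated)
}
print(record["images"][record["chosen"]])
```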

## Citation

```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}

```