---
language:
- en
- zh
license: apache-2.0
base_model: Kwai-Kolors/Keye-VL
tags:
- vision
- image-classification
- reward-model
- reinforcement-learning
- multimodal
- llama-factory
pipeline_tag: image-classification
library_name: transformers
---

# HUMOR-RM (Keye-VL Version)
**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-COT](https://huggingface.co/OpenDILabCommunity/HUMOR-COT-Qwen2.5-VL)**
## Model Summary

**HUMOR-RM** is a pairwise reward model designed to evaluate and rank the humor quality of internet memes. It serves as the preference model in the **HUMOR** (Hierarchical Understanding and Meme Optimization) framework.

This version is fine-tuned from **Keye-VL** on a dataset of pairwise meme comparisons ranked by human annotators. It takes two memes sharing the same template as input and predicts which one is funnier, providing a consistent proxy for human preference.

## Requirements

This model is built on the **LLaMA-Factory** framework. To run inference, you must have `llamafactory` installed:

```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
```

## How to Use

Since this model uses a custom classification head on top of Keye-VL, we recommend using the provided wrapper class for inference.

### 1. Configuration (`config.yaml`)

Create a `config.yaml` file pointing to the base model and this adapter:

```yaml
model_name_or_path: Kwai-Kolors/Keye-VL
adapter_name_or_path: path_to_this_repo  # or a local path
template: keye  # Important: must match the Keye-VL template
trust_remote_code: true
finetuning_type: lora
```

### 2. Python Inference Code

```python
import torch
import yaml

from llamafactory.hparams import get_infer_args
from llamafactory.model import load_tokenizer, get_template_and_fix_tokenizer
from llamafactory.model import AutoModelForBinaryClassification
from llamafactory.model.model_utils.classification_head import prepare_classification_model
from llamafactory.model.patcher import patch_classification_model
from transformers import AutoModel


class MemeScorer:
    def __init__(self, config_path):
        with open(config_path) as f:
            config = yaml.safe_load(f)

        # Force RM configuration
        config.update({"stage": "rm_class", "finetuning_type": "lora"})
        model_args, data_args, _, _ = get_infer_args(config)

        # 1. Load tokenizer & template
        tokenizer_mod = load_tokenizer(model_args)
        self.tokenizer = tokenizer_mod["tokenizer"]
        self.processor = tokenizer_mod.get("processor")
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args)

        # 2. Load base model
        self.model = AutoModel.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=True,
            device_map="auto",
            torch_dtype=torch.float16,
        )

        # 3. Attach & load reward head
        prepare_classification_model(self.model)
        self.model = AutoModelForBinaryClassification.from_pretrained(self.model)
        patch_classification_model(self.model)
        if model_args.adapter_name_or_path:
            self.model.load_classification_head(model_args.adapter_name_or_path[0])
            print("Loaded Humor Adapter.")
        self.model.eval()

    def score(self, img1_path, img2_path, prompt="Which meme is funnier?"):
        # Construct input
        messages = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": ""},
        ]
        images = [img1_path, img2_path]

        # Tokenize using the template
        proc_msgs = self.template.mm_plugin.process_messages(
            messages, images, [], [], self.processor
        )
        input_ids, _ = self.template.mm_plugin.process_token_ids(
            [], [], images, [], [], self.tokenizer, self.processor
        )
        encoded = self.template.encode_multiturn(self.tokenizer, proc_msgs, None, None)
        input_ids += encoded[0][0]

        # Forward pass
        inputs = {
            "input_ids": torch.tensor([input_ids]).to(self.model.device),
            "attention_mask": torch.tensor([[1] * len(input_ids)]).to(self.model.device),
            "images": [images],  # Image preprocessing depends on the Keye-VL version
        }
        with torch.no_grad():
            logits = self.model(**inputs).logits.cpu().numpy()[0]

        # Logits: [score_pair_0, score_pair_1]; exact semantics depend on the head
        # config, usually the probability that meme A is funnier than meme B.
        return logits


# Usage
if __name__ == "__main__":
    scorer = MemeScorer("assets/config.yaml")
    scores = scorer.score("assets/meme_a.jpg", "assets/meme_b.jpg")
    print(f"Scores: {scores} (Winner: {'A' if scores[0] > scores[1] else 'B'})")
```

## Intended Use

* **Group-wise Ranking:** Evaluating a set of generated captions for a single meme template to select the best punchline.
* **RLHF/RLAIF:** Providing reward signals for reinforcement learning training of meme generators.

## Training Data

The model was trained on the **HUMOR-Preference Dataset**, which consists of five difficulty tiers of meme pairs:

1. **Wrong Text:** Original vs. random text.
2. **Wrong Location:** Correct text vs. misplaced text box.
3. **Boring:** Original vs. non-humorous description.
4. **Detailed Boring:** Subtle text changes that kill the joke.
5. **Generated:** Fine-grained comparison between model-generated memes.

![Training Data Examples](assets/datasets_with_different_tier.png)

## Citation

```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}
```
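For the group-wise ranking use case above, the pairwise scorer can drive a simple round-robin tournament: every meme is compared against every other, and wins are tallied. Below is a minimal sketch of that idea; `pairwise_winner` is a placeholder for a comparator such as a wrapper around `MemeScorer.score`, and the stub comparator shown is purely illustrative, not part of this repo.

```python
from itertools import combinations


def rank_memes(memes, pairwise_winner):
    """Round-robin ranking with a pairwise comparator.

    `pairwise_winner(a, b)` returns 0 if `a` is judged funnier, 1 if `b` is.
    Returns the memes sorted by number of pairwise wins, most wins first.
    """
    wins = {m: 0 for m in memes}
    for a, b in combinations(memes, 2):
        winner = a if pairwise_winner(a, b) == 0 else b
        wins[winner] += 1
    return sorted(memes, key=lambda m: wins[m], reverse=True)


# Stub comparator for illustration only: pretend "funnier" means a longer
# filename. In practice this would call the reward model on the two images.
def stub_winner(a, b):
    return 0 if len(a) >= len(b) else 1


print(rank_memes(["meme_a.jpg", "meme_bb.jpg", "meme_ccc.jpg"], stub_winner))
# → ['meme_ccc.jpg', 'meme_bb.jpg', 'meme_a.jpg']
```

Note that a full round-robin costs O(n²) model calls; for large candidate sets a single-elimination bracket over the same comparator reduces this to O(n).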