---
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
license: apache-2.0 |
|
|
base_model: Kwai-Kolors/Keye-VL |
|
|
tags: |
|
|
- vision |
|
|
- image-classification |
|
|
- reward-model |
|
|
- reinforcement-learning |
|
|
- multimodal |
|
|
- llama-factory |
|
|
pipeline_tag: image-classification |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# HUMOR-RM (Keye-VL Version) |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-COT](https://huggingface.co/OpenDILabCommunity/HUMOR-COT-Qwen2.5-VL)** |
|
|
|
|
|
</div> |
|
|
|
|
|
## Model Summary |
|
|
|
|
|
**HUMOR-RM** is a pairwise reward model designed to evaluate and rank the humor quality of internet memes. It serves as the preference model in the **HUMOR** (Hierarchical Understanding and Meme Optimization) framework. |
|
|
|
|
|
This version is fine-tuned from **Keye-VL** on a dataset of pairwise meme comparisons ranked by human annotators. Given two memes that share the same template, it predicts which one is funnier, providing a consistent proxy for human preference.
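For intuition, a pairwise reward model of this kind can be read through a Bradley-Terry lens: each meme receives a scalar humor score, and the probability that meme A beats meme B is the sigmoid of the score difference. The sketch below illustrates that idea only; it is not necessarily how HUMOR-RM's head is parameterized, and the scores are placeholders:

```python
import torch

def pairwise_preference(score_a: torch.Tensor, score_b: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: P(A funnier than B) = sigmoid(score_A - score_B)
    return torch.sigmoid(score_a - score_b)

# Example: scores 1.8 vs. 0.3 give P(A funnier) ~ 0.82
print(pairwise_preference(torch.tensor(1.8), torch.tensor(0.3)))
```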
|
|
|
|
|
## Requirements |
|
|
|
|
|
This model is built on the **LLaMA-Factory** framework. To run inference, you must have `llamafactory` installed:
|
|
|
|
|
```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
```
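After installing, you can verify the setup with the CLI that the editable install registers:

```bash
llamafactory-cli version
```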
|
|
|
|
|
## How to Use |
|
|
|
|
|
Since this model uses a custom classification head on top of Keye-VL, we recommend using the provided wrapper class for inference. |
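Conceptually, the reward head is a small linear probe that maps the backbone's final hidden state to two logits, one per candidate meme. The sketch below is an illustration of that idea, not the exact head shipped in this repo:

```python
import torch
import torch.nn as nn

class PairwiseRewardHead(nn.Module):
    """Illustrative binary head: pools the final token's hidden state and
    emits two logits, one per candidate meme. Hypothetical layout; the
    head actually used in this repo may differ."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the VLM backbone
        return self.score(hidden_states[:, -1, :])
```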
|
|
|
|
|
### 1. Configuration (`config.yaml`) |
|
|
|
|
|
Create a `config.yaml` file pointing to the base model and this adapter: |
|
|
|
|
|
```yaml
model_name_or_path: Kwai-Kolors/Keye-VL
adapter_name_or_path: path_to_this_repo  # or a local path
template: keye                           # important: must match the Keye-VL template
trust_remote_code: true
finetuning_type: lora
```
|
|
|
|
|
### 2. Python Inference Code |
|
|
|
|
|
```python
import torch
import yaml

from llamafactory.data import get_template_and_fix_tokenizer
from llamafactory.hparams import get_infer_args
from llamafactory.model import load_tokenizer
from llamafactory.model import AutoModelForBinaryClassification
from llamafactory.model.model_utils.classification_head import prepare_classification_model
from llamafactory.model.patcher import patch_classification_model
from transformers import AutoModel


class MemeScorer:
    def __init__(self, config_path):
        with open(config_path) as f:
            config = yaml.safe_load(f)

        # Force the reward-model configuration
        config.update({"stage": "rm_class", "finetuning_type": "lora"})
        model_args, data_args, _, _ = get_infer_args(config)

        # 1. Load tokenizer, processor, and chat template
        tokenizer_mod = load_tokenizer(model_args)
        self.tokenizer = tokenizer_mod["tokenizer"]
        self.processor = tokenizer_mod.get("processor")
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args)

        # 2. Load the base vision-language model
        self.model = AutoModel.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=True,
            device_map="auto",
            torch_dtype=torch.float16,
        )

        # 3. Attach the binary classification (reward) head and load its weights
        prepare_classification_model(self.model)
        self.model = AutoModelForBinaryClassification.from_pretrained(self.model)
        patch_classification_model(self.model)

        if model_args.adapter_name_or_path:
            self.model.load_classification_head(model_args.adapter_name_or_path[0])
            print("Loaded Humor Adapter.")

        self.model.eval()

    def score(self, img1_path, img2_path, prompt="Which meme is funnier?"):
        # Construct the chat-style input with both images attached
        messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": ""}]
        images = [img1_path, img2_path]

        # Tokenize using the Keye-VL template and its multimodal plugin
        proc_msgs = self.template.mm_plugin.process_messages(messages, images, [], [], self.processor)
        input_ids, _ = self.template.mm_plugin.process_token_ids([], [], images, [], [], self.tokenizer, self.processor)
        encoded = self.template.encode_multiturn(self.tokenizer, proc_msgs, None, None)
        input_ids += encoded[0][0]

        # Forward pass through the model and reward head
        inputs = {
            "input_ids": torch.tensor([input_ids]).to(self.model.device),
            "attention_mask": torch.tensor([[1] * len(input_ids)]).to(self.model.device),
            "images": [images],  # image preprocessing depends on the Keye-VL version
        }

        with torch.no_grad():
            logits = self.model(**inputs).logits.cpu().numpy()[0]

        # Logits: [score_pair_0, score_pair_1]; with the default head, a higher
        # first entry means meme A is preferred (roughly P(A > B))
        return logits


# Usage
if __name__ == "__main__":
    scorer = MemeScorer("assets/config.yaml")
    scores = scorer.score("assets/meme_a.jpg", "assets/meme_b.jpg")
    print(f"Scores: {scores} (Winner: {'A' if scores[0] > scores[1] else 'B'})")
```
|
|
|
|
|
## Intended Use |
|
|
|
|
|
* **Group-wise Ranking:** Evaluating a set of generated captions for a single meme template to select the best punchline (see the round-robin sketch after this list).
|
|
* **RLHF/RLAIF:** Providing reward signals for reinforcement learning training of meme generators.
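Group-wise ranking reduces to pairwise calls: run a round robin over the candidate set and count wins. A minimal sketch built on the `MemeScorer` wrapper above (the candidate paths are placeholders; for large sets, a tournament or sampling scheme would avoid the quadratic number of comparisons):

```python
from itertools import combinations

def rank_memes(scorer, meme_paths):
    """Round-robin ranking: every candidate is compared against every other,
    and each pairwise win earns one point."""
    wins = {path: 0 for path in meme_paths}
    for a, b in combinations(meme_paths, 2):
        scores = scorer.score(a, b)
        wins[a if scores[0] > scores[1] else b] += 1
    return sorted(meme_paths, key=wins.get, reverse=True)

# candidates = ["assets/meme_a.jpg", "assets/meme_b.jpg", "assets/meme_c.jpg"]
# best = rank_memes(MemeScorer("assets/config.yaml"), candidates)[0]
```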
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on the **HUMOR-Preference Dataset**, which consists of 5 difficulty tiers of meme pairs (an example record is sketched after the list):
|
|
|
|
|
1. **Wrong Text:** Original vs. Random text. |
|
|
2. **Wrong Location:** Correct text vs. Misplaced text box. |
|
|
3. **Boring:** Original vs. Non-humorous description. |
|
|
4. **Detailed Boring:** Subtle text changes that kill the joke. |
|
|
5. **Generated:** Fine-grained comparison between model-generated memes. |
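For reference, LLaMA-Factory's sharegpt-style preference format stores each comparison as a chosen/rejected pair. A hypothetical tier-1 record might look like the following; the field names follow that generic format, the actual HUMOR-Preference schema may differ, and the file paths are placeholders:

```python
# Hypothetical preference record (LLaMA-Factory sharegpt preference format);
# the HUMOR-Preference dataset's actual schema may differ.
record = {
    "conversations": [{"from": "human", "value": "<image><image>Which meme is funnier?"}],
    "chosen": {"from": "gpt", "value": "The first meme."},
    "rejected": {"from": "gpt", "value": "The second meme."},
    "images": ["memes/original.jpg", "memes/wrong_text.jpg"],
}
```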
|
|
|
|
|
 |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}
```