---
language:
- en
- zh
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- vision
- image-text-to-text
- multimodal
- meme-generation
- humor
- chain-of-thought
- qwen
pipeline_tag: image-text-to-text
library_name: vllm
---

# HUMOR-COT: Hierarchical Understanding and Meme Optimization with Chain-of-Thought

<div align="center">

**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-RM](https://huggingface.co/OpenDILabCommunity/HUMOR-RM-Keye-VL)**

</div>

## Model Summary

**HUMOR-COT** is a multimodal generative model capable of creating humorous, context-aware memes. It is fine-tuned from **Qwen2.5-VL-7B-Instruct** using a novel **Hierarchical Chain-of-Thought (CoT)** approach.

Unlike standard image captioning models that map images directly to text, HUMOR-COT mimics the human creative process in two stages (a small prompt-building sketch follows the list):

1. **Template-Level Reasoning:** Analyzes the image to infer latent intent, emotional tone, and layout.
2. **Context-Level Grounding:** Generates specific, humorous captions (punchlines) grounded in user-supplied keywords or contexts.
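
From the user's side, both stages are triggered by a single request that pairs an image with a tag and a context; the hierarchical CoT instructions themselves come from the system prompt (see `prompt/generate_meme.txt` under Prerequisites). A minimal helper for composing that request, following the field layout used in the inference example later in this card:

```python
def build_user_prompt(tag: str, context: str) -> str:
    """Compose the user prompt fed alongside the image.

    The Tag/Context fields mirror the inference example below; the
    hierarchical CoT instructions are carried by the system prompt
    (prompt/generate_meme.txt), not by this string.
    """
    return f"Generate a humorous meme caption.\nTag: {tag}\nContext: {context}"


print(build_user_prompt("Work Life", "Monday morning"))
```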

This model is the Supervised Fine-Tuning (SFT) stage of the HUMOR framework. In our evaluation it achieves the highest human-likeness score (91.5%) while matching GPT-4o and other strong VLMs on humor and readability (see Evaluation Results below).

## Uses

### Intended Use

* **Meme Generation:** Generating humorous captions for uploaded images based on specific topics or keywords.
* **Humor Understanding:** Analyzing the punchline mechanics of existing memes.
* **Creative Writing Assistance:** Brainstorming metaphorical associations for visual content.

### Out of Scope

* Generation of hate speech, violence, or harmful stereotypes. Such content was filtered during training, but guardrails are still recommended for deployment (a minimal sketch follows below).
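
For instance, a deployment might screen generated captions against a blocklist before display. The snippet below is only an illustrative sketch of the idea; the terms and policy are placeholders, not part of this release:

```python
# Illustrative output guardrail: a blocklist screen before captions are shown.
# The terms below are placeholders, not part of the HUMOR release.
BLOCKED_TERMS = {"slur_example", "violent_phrase_example"}


def passes_guardrail(caption: str) -> bool:
    """Reject captions containing any blocked term (case-insensitive)."""
    lowered = caption.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


caption = "When Monday hits harder than my alarm."
if passes_guardrail(caption):
    print(caption)
else:
    print("[caption withheld by content filter]")
```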

## How to Get Started

The model is designed to be used with `vllm` for efficient inference. Below is a minimal script that runs a single image-plus-prompt generation.

### Prerequisites

You need to set up the following environment variables and files (a loader sketch follows the list):

* `NLP_MODEL_PATH`: Path to your spaCy model (e.g., `en_core_web_sm`).
* `VLLM_MODEL_PATH`: Path to this model (local or HF Hub ID).
* `prompt/generate_meme.txt`: The text file containing the system prompt for CoT generation.
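
A small loader along these lines can resolve the configured paths before building the engine. This is a sketch that assumes the files above exist; the minimal example below hard-codes its paths instead:

```python
import os
from pathlib import Path

# Resolve the configured paths, falling back to documented defaults.
NLP_MODEL_PATH = os.environ.get("NLP_MODEL_PATH", "en_core_web_sm")
VLLM_MODEL_PATH = os.environ.get("VLLM_MODEL_PATH", "Your-HF-Org/HUMOR-COT")

# Load the CoT system prompt used for meme generation.
SYSTEM_PROMPT = Path("prompt/generate_meme.txt").read_text(encoding="utf-8")

print(f"spaCy model: {NLP_MODEL_PATH}")
print(f"vLLM model:  {VLLM_MODEL_PATH}")
print(f"System prompt loaded ({len(SYSTEM_PROMPT)} chars)")
```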

### Inference Code

```python
from vllm import LLM, SamplingParams

# 1. Configuration
# Replace with your actual model path or Hugging Face ID
MODEL_PATH = "Your-HF-Org/HUMOR-COT"

# 2. Initialize the vLLM engine
# Note: Qwen2.5-VL requires specific pixel arguments for the visual encoder
llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 1},
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
        "fps": 1,
    },
    gpu_memory_utilization=0.3,
)


def generate_meme(image_path, prompt):
    """Simple function to generate text from an image."""
    # Construct the message in Qwen2.5-VL format
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        },
    ]

    # Set sampling parameters
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

    # Run inference
    outputs = llm.chat(messages=messages, sampling_params=sampling_params)
    return outputs[0].outputs[0].text


if __name__ == "__main__":
    # 3. Minimal main execution
    image_file = "assets/test_image.jpg"
    user_prompt = "Generate a humorous meme caption.\nTag: Work Life\nContext: Monday morning."

    # Run and print
    caption = generate_meme(image_file, user_prompt)
    print(caption)
```
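
Note that `gpu_memory_utilization=0.3` reserves only 30% of GPU memory for vLLM's weights and KV cache, leaving headroom for other processes on the same device; on a dedicated GPU you can raise it (e.g., to 0.9) for larger batches and higher throughput.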

## Training Data & Methodology

The model was trained on a dataset of **3,713** high-quality, in-the-wild memes.

* **Data Processing:** We utilized a two-stage CoT synthesis pipeline (powered by Doubao-1.5-vision-pro) to reverse-engineer the "thought process" behind each meme.
* **Format:** The model is trained to output a reasoning trace followed by the final content in the form `box_1: text, box_2: text` (a parsing sketch follows below).
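
If your application needs the caption boxes without the reasoning trace, a simple regex can pull them out. This is a sketch assuming the literal `box_N: text` layout above, with boxes separated by commas or newlines; adjust the pattern if the trace uses different delimiters:

```python
import re


def extract_boxes(model_output: str) -> dict[str, str]:
    """Pull `box_N: text` fields out of the model output.

    Assumes the format described above; everything before the first
    `box_1:` is treated as the reasoning trace and ignored.
    """
    pattern = re.compile(r"box_(\d+):\s*([^,\n]+)")
    return {f"box_{n}": text.strip() for n, text in pattern.findall(model_output)}


sample = "The template contrasts calm vs. panic. box_1: Me at 9am, box_2: Me at 9:05am"
print(extract_boxes(sample))
# {'box_1': 'Me at 9am', 'box_2': 'Me at 9:05am'}
```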

## Evaluation Results

Evaluation was conducted against strong baselines (Qwen2.5-7B-Instruct, GPT-4o) using both human evaluation and automated metrics.

| Model | Humor (0-5) | Readability (0-5) | Human-Likeness Score (%) |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct (Base) | 2.39 | 3.35 | 75.7% |
| GPT-4o | **2.70** | **3.79** | 91.3% |
| **HUMOR-COT (Ours)** | 2.68 | 3.70 | **91.5%** |

*HUMOR-COT significantly outperforms the base model and achieves parity with closed-source SOTA models in human-likeness.*

![Evaluation Results](benchmark.jpeg)

## Citation

If you use this model in your research, please cite:

```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}
```