HUMOR-COT: Hierarchical Understanding and Meme Optimization with Chain-of-Thought
Model Summary
HUMOR-COT is a multimodal generative model capable of creating humorous, context-aware memes. It is fine-tuned from Qwen2.5-VL-7B-Instruct using a novel Hierarchical Chain-of-Thought (CoT) approach.
Unlike standard image captioning models that map images directly to text, HUMOR-COT mimics the human creative process in two stages:
- Template-Level Reasoning: Analyzes the image to infer latent intent, emotional tone, and layout.
- Context-Level Grounding: Generates specific, humorous captions (punchlines) grounded in user-supplied keywords or contexts.
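The two stages above can be illustrated as a pair of chained prompts. The wording below is a hypothetical sketch for intuition, not the model's actual training prompt (which ships in `prompt/generate_meme.txt`):

```python
def build_stage_prompts(tag: str, context: str) -> tuple:
    """Hypothetical prompts mirroring the two reasoning stages."""
    # Stage 1: template-level reasoning over the image alone.
    template_prompt = (
        "Analyze this meme image: describe its latent intent, "
        "emotional tone, and text-box layout."
    )
    # Stage 2: context-level grounding in user-supplied keywords.
    grounding_prompt = (
        "Using that analysis, write a humorous caption.\n"
        f"Tag: {tag}\nContext: {context}"
    )
    return template_prompt, grounding_prompt
```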
This model represents the Supervised Fine-Tuning (SFT) stage of the HUMOR framework. In evaluation it achieves humor and readability on par with GPT-4o and the highest human-likeness score (91.5%) among the compared VLMs.
Uses
Intended Use
- Meme Generation: Generating humorous captions for uploaded images based on specific topics or keywords.
- Humor Understanding: Analyzing the punchline mechanics of existing memes.
- Creative Writing Assist: Brainstorming metaphorical associations for visual content.
Out of Scope
- Generation of hate speech, violence, or harmful stereotypes (filtered during training, but guardrails recommended for deployment).
How to Get Started
The model is designed to be used with vLLM for efficient inference. Below is a minimal script that handles the generation process.
Prerequisites
You need to set up the following environment variables and files:
- NLP_MODEL_PATH: Path to your spaCy model (e.g., en_core_web_sm).
- VLLM_MODEL_PATH: Path to this model (a local directory or a Hugging Face Hub ID).
- prompt/generate_meme.txt: Text file containing the system prompt for CoT generation.
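A minimal sketch of reading this configuration at startup; the fallback defaults here are illustrative assumptions, not values shipped with the model:

```python
import os
from pathlib import Path

def load_config() -> dict:
    """Read the environment variables listed above, with illustrative fallbacks."""
    return {
        "nlp_model": os.environ.get("NLP_MODEL_PATH", "en_core_web_sm"),
        "vllm_model": os.environ.get("VLLM_MODEL_PATH", "Your-HF-Org/HUMOR-COT"),
        "prompt_file": "prompt/generate_meme.txt",
    }

def load_system_prompt(path: str) -> str:
    """Read the CoT system prompt file, falling back to a generic prompt."""
    p = Path(path)
    if p.is_file():
        return p.read_text(encoding="utf-8")
    return "You are a helpful assistant."
```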
Inference Code
import base64

from vllm import LLM, SamplingParams

# 1. Configuration
# Replace with your actual model path or Hugging Face ID
MODEL_PATH = "Your-HF-Org/HUMOR-COT"

# 2. Initialize the vLLM engine
# Note: Qwen2.5-VL requires explicit pixel bounds for the visual encoder
# (fps only matters for video inputs and is harmless here)
llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 1},
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
        "fps": 1,
    },
    gpu_memory_utilization=0.3,
)

def generate_meme(image_path: str, prompt: str) -> str:
    """Generate a meme caption for a single image."""
    # Encode the local image as a base64 data URL so it can be passed
    # through the OpenAI-style chat content format that llm.chat expects.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # Construct the message in OpenAI-style multimodal chat format
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": prompt},
        ]},
    ]

    # Set sampling parameters: moderate temperature keeps captions creative
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

    # Run inference
    outputs = llm.chat(messages, sampling_params=sampling_params)
    return outputs[0].outputs[0].text

if __name__ == "__main__":
    # 3. Minimal main execution
    image_file = "assets/test_image.jpg"
    user_prompt = (
        "Generate a humorous meme caption.\n"
        "Tag: Work Life\n"
        "Context: Monday morning."
    )

    # Run and print
    caption = generate_meme(image_file, user_prompt)
    print(caption)
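For captioning several images in one pass, the message construction can be factored out. Images are encoded here as base64 data URLs, one of the content formats accepted by vLLM's OpenAI-style chat interface; passing a list of conversations to `llm.chat` for batched decoding is supported in recent vLLM versions, but worth verifying against your installed release:

```python
import base64

def encode_image(path: str) -> str:
    """Base64-encode a local image for an OpenAI-style data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_batch(image_paths: list, prompts: list) -> list:
    """Build one conversation per (image, prompt) pair."""
    conversations = []
    for path, prompt in zip(image_paths, prompts):
        conversations.append([
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}},
                {"type": "text", "text": prompt},
            ]},
        ])
    return conversations
```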
Training Data & Methodology
The model was trained on a dataset of 3,713 high-quality, in-the-wild memes.
- Data Processing: We utilized a Two-Stage CoT synthesis pipeline (powered by Doubao-1.5-vision-pro) to reverse-engineer the "thought process" behind each meme.
- Format: The model is trained to output a reasoning trace followed by the final content in the form box_1: text, box_2: text.
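Assuming the final content takes the literal form box_1: ..., box_2: ... after the reasoning trace, a small parser can extract the per-box captions. The regex below is an illustrative assumption, not code shipped with the model:

```python
import re

def parse_meme_output(text: str) -> dict:
    """Extract box_N captions from a response ending in
    fields like 'box_1: ..., box_2: ...'."""
    # Match each 'box_<n>:' label and lazily capture text up to the
    # next label or the end of the string.
    pattern = re.compile(r"box_(\d+):\s*(.*?)(?=,?\s*box_\d+:|$)", re.DOTALL)
    return {f"box_{n}": v.strip().rstrip(",") for n, v in pattern.findall(text)}
```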
Evaluation Results
Evaluation was conducted against strong baselines (Qwen2.5-7B-Instruct, GPT-4o) using both human evaluation and automated metrics.
| Model | Humor (0-5) | Readability (0-5) | Human-Likeness Score (%) |
|---|---|---|---|
| Qwen2.5-7B-Instruct (Base) | 2.39 | 3.35 | 75.7% |
| GPT-4o | 2.70 | 3.79 | 91.3% |
| HUMOR-COT (Ours) | 2.68 | 3.70 | 91.5% |
HUMOR-COT significantly outperforms the base model and achieves parity with closed-source SOTA models in human-likeness.
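As a quick sanity check, the relative gains over the base model follow directly from the table (values copied from above):

```python
base = {"humor": 2.39, "readability": 3.35, "human_likeness": 75.7}
ours = {"humor": 2.68, "readability": 3.70, "human_likeness": 91.5}

# Percent improvement of HUMOR-COT over the Qwen2.5 base model.
gains = {k: round(100 * (ours[k] - base[k]) / base[k], 1) for k in base}
```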
Citation
If you use this model in your research, please cite:
@article{li2025perception,
title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
journal={arXiv preprint arXiv:2512.24555},
year={2025}
}
Base model: Qwen/Qwen2.5-VL-7B-Instruct