---
language:
- en
- zh
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- vision
- image-text-to-text
- multimodal
- meme-generation
- humor
- chain-of-thought
- qwen
pipeline_tag: image-text-to-text
library_name: vllm
---
# HUMOR-COT: Hierarchical Understanding and Meme Optimization with Chain-of-Thought
<div align="center">
**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-RM](https://huggingface.co/OpenDILabCommunity/HUMOR-RM-Keye-VL)**
</div>
## Model Summary
**HUMOR-COT** is a multimodal generative model capable of creating humorous, context-aware memes. It is fine-tuned from **Qwen2.5-VL-7B-Instruct** using a novel **Hierarchical Chain-of-Thought (CoT)** approach.
Unlike standard image captioning models that map images directly to text, HUMOR-COT mimics the human creative process in two stages:
1. **Template-Level Reasoning:** Analyzes the image to infer latent intent, emotional tone, and layout.
2. **Context-Level Grounding:** Generates specific, humorous captions (punchlines) grounded in user-supplied keywords or contexts.
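The two stages above can be sketched as a two-pass prompting pipeline. The function names and prompt wording below are illustrative assumptions, not the model's actual training templates:

```python
def template_level_prompt(image_description: str) -> str:
    """Stage 1: infer latent intent, emotional tone, and layout (illustrative)."""
    return (
        "Analyze this meme template.\n"
        f"Image: {image_description}\n"
        "Describe its latent intent, emotional tone, and text-box layout."
    )

def context_level_prompt(template_analysis: str, keywords: str) -> str:
    """Stage 2: ground a punchline in user-supplied keywords (illustrative)."""
    return (
        f"Template analysis: {template_analysis}\n"
        f"Keywords: {keywords}\n"
        "Write a humorous caption grounded in these keywords."
    )

def hierarchical_generate(model_call, image_description: str, keywords: str) -> str:
    """Hypothetical driver: the first call analyzes the template,
    the second generates the grounded punchline."""
    analysis = model_call(template_level_prompt(image_description))
    return model_call(context_level_prompt(analysis, keywords))
```

In the released model both stages are folded into a single CoT generation; the split here only visualizes the reasoning hierarchy.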
This model represents the Supervised Fine-Tuning (SFT) stage of the HUMOR framework. It achieves the highest human-likeness score (91.5%) among the evaluated models and is competitive with GPT-4o on humor and readability.
## Uses
### Intended Use
* **Meme Generation:** Generating humorous captions for uploaded images based on specific topics or keywords.
* **Humor Understanding:** Analyzing the punchline mechanics of existing memes.
* **Creative Writing Assist:** Brainstorming metaphorical associations for visual content.
### Out of Scope
* Generation of hate speech, violence, or harmful stereotypes (filtered during training, but guardrails recommended for deployment).
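As a lightweight example of such a guardrail, a deployment could pass generated captions through a blocklist filter before display. This is a minimal sketch with placeholder terms, not a substitute for a real moderation service:

```python
# Placeholder terms only -- a real deployment would use a maintained
# moderation list or a dedicated content-safety API.
BLOCKLIST = {"slur_example", "violence_example"}

def passes_guardrail(caption: str) -> bool:
    """Reject captions containing any blocklisted term (case-insensitive)."""
    lowered = caption.lower()
    return not any(term in lowered for term in BLOCKLIST)
```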
## How to Get Started
The model is designed to be used with `vllm` for efficient inference. Below is a minimal inference script illustrating single-image meme generation.
### Prerequisites
The full generation pipeline expects the following environment variables and files:
* `NLP_MODEL_PATH`: Path to your spaCy model (e.g., `en_core_web_sm`).
* `VLLM_MODEL_PATH`: Path to this model (local or HF hub ID).
* `prompt/generate_meme.txt`: The text file containing the system prompt for CoT generation.
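A small helper to load these settings might look like the following. `load_config` is an illustrative sketch, not part of the released code:

```python
import os

def load_config(prompt_path: str = "prompt/generate_meme.txt") -> dict:
    """Read the required environment variables and the CoT system prompt file."""
    config = {
        "nlp_model_path": os.environ.get("NLP_MODEL_PATH", "en_core_web_sm"),
        "vllm_model_path": os.environ.get("VLLM_MODEL_PATH"),
    }
    if config["vllm_model_path"] is None:
        raise RuntimeError("VLLM_MODEL_PATH is not set")
    with open(prompt_path, encoding="utf-8") as f:
        config["system_prompt"] = f.read()
    return config
```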
### Inference Code
```python
import base64

from vllm import LLM, SamplingParams

# 1. Configuration
# Replace with your actual model path or Hugging Face ID
MODEL_PATH = "Your-HF-Org/HUMOR-COT"

# 2. Initialize the vLLM engine
# Note: Qwen2.5-VL requires specific pixel arguments for the visual encoder
llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 1},
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
        "fps": 1,
    },
    gpu_memory_utilization=0.3,
)


def generate_meme(image_path, prompt):
    """Generate a meme caption from an image and a text prompt."""
    # Encode the local image as a base64 data URL, since llm.chat expects
    # OpenAI-style content parts ("image_url" / "text").
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": prompt},
        ]},
    ]

    # Set sampling parameters
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

    # Run inference
    outputs = llm.chat(messages=messages, sampling_params=sampling_params)
    return outputs[0].outputs[0].text


if __name__ == "__main__":
    # 3. Minimal main execution
    image_file = "assets/test_image.jpg"
    user_prompt = (
        "Generate a humorous meme caption.\n"
        "Tag: Work Life\n"
        "Context: Monday morning."
    )
    caption = generate_meme(image_file, user_prompt)
    print(caption)
```
## Training Data & Methodology
The model was trained on a dataset of **3,713** high-quality, in-the-wild memes.
* **Data Processing:** We utilized a Two-Stage CoT synthesis pipeline (powered by Doubao-1.5-vision-pro) to reverse-engineer the "thought process" behind each meme.
* **Format:** The model is trained to output a reasoning trace followed by the final caption content in the format `box_1: text, box_2: text`.
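Since the reasoning trace precedes the boxed captions, downstream code typically needs to extract the `box_N` fields. A hedged sketch using a regular expression, assuming the `box_1: text, box_2: text` layout described above:

```python
import re

def parse_boxes(model_output: str) -> dict:
    """Extract `box_N: text` fields from a generation (illustrative parser)."""
    # Each box's text runs until the next `box_N:` marker or end of string.
    pattern = re.compile(r"box_(\d+):\s*(.*?)(?=,\s*box_\d+:|$)", re.DOTALL)
    return {f"box_{n}": text.strip() for n, text in pattern.findall(model_output)}
```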
## Evaluation Results
Evaluation was conducted against strong baselines (Qwen2.5-7B-Instruct, GPT-4o) using both human evaluation and automated metrics.
| Model | Humor (0-5) | Readability (0-5) | Human-Likeness Score (%) |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct (Base) | 2.39 | 3.35 | 75.7% |
| GPT-4o | 2.70 | **3.79** | 91.3% |
| **HUMOR-COT (Ours)** | **2.68** | 3.70 | **91.5%** |
*HUMOR-COT significantly outperforms the base model and achieves parity with closed-source SOTA models in human-likeness.*
![Generation results of models trained with different CoT methods](assets/generation_results.png)
## Citation
If you use this model in your research, please cite:
```bibtex
@article{li2025perception,
title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
journal={arXiv preprint arXiv:2512.24555},
year={2025}
}
```