---
language:
- en
- zh
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- vision
- image-text-to-text
- multimodal
- meme-generation
- humor
- chain-of-thought
- qwen
pipeline_tag: image-text-to-text
library_name: vllm
---

# HUMOR-COT: Hierarchical Understanding and Meme Optimization with Chain-of-Thought

<div align="center">

**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-RM](https://huggingface.co/OpenDILabCommunity/HUMOR-RM-Keye-VL)**

</div>

## Model Summary

**HUMOR-COT** is a multimodal generative model capable of creating humorous, context-aware memes. It is fine-tuned from **Qwen2.5-VL-7B-Instruct** using a novel **Hierarchical Chain-of-Thought (CoT)** approach.

Unlike standard image captioning models that map images directly to text, HUMOR-COT mimics the human creative process in two stages (a small prompt-building sketch follows the list):

1. **Template-Level Reasoning:** Analyzes the image to infer latent intent, emotional tone, and layout.
2. **Context-Level Grounding:** Generates specific, humorous captions (punchlines) grounded in user-supplied keywords or contexts.
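
From the user's side, both stages are triggered by a single request that pairs an image with a tag and a context; the hierarchical CoT instructions themselves come from the system prompt (see `prompt/generate_meme.txt` under Prerequisites). A minimal helper for composing that request, following the field layout used in the inference example later in this card:

```python
def build_user_prompt(tag: str, context: str) -> str:
    """Compose the user prompt fed alongside the image.

    The Tag/Context fields mirror the inference example below; the
    hierarchical CoT instructions are carried by the system prompt
    (prompt/generate_meme.txt), not by this string.
    """
    return f"Generate a humorous meme caption.\nTag: {tag}\nContext: {context}"


print(build_user_prompt("Work Life", "Monday morning"))
```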

This model is the Supervised Fine-Tuning (SFT) stage of the HUMOR framework. In our evaluation it achieves the highest human-likeness score (91.5%) while matching GPT-4o and other strong VLMs on humor and readability (see Evaluation Results below).

## Uses

### Intended Use

* **Meme Generation:** Generating humorous captions for uploaded images based on specific topics or keywords.
* **Humor Understanding:** Analyzing the punchline mechanics of existing memes.
* **Creative Writing Assistance:** Brainstorming metaphorical associations for visual content.

### Out of Scope

* Generation of hate speech, violence, or harmful stereotypes. Such content was filtered during training, but guardrails are still recommended for deployment (a minimal sketch follows below).
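
For instance, a deployment might screen generated captions against a blocklist before display. The snippet below is only an illustrative sketch of the idea; the terms and policy are placeholders, not part of this release:

```python
# Illustrative output guardrail: a blocklist screen before captions are shown.
# The terms below are placeholders, not part of the HUMOR release.
BLOCKED_TERMS = {"slur_example", "violent_phrase_example"}


def passes_guardrail(caption: str) -> bool:
    """Reject captions containing any blocked term (case-insensitive)."""
    lowered = caption.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


caption = "When Monday hits harder than my alarm."
if passes_guardrail(caption):
    print(caption)
else:
    print("[caption withheld by content filter]")
```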

## How to Get Started

The model is designed to be used with `vllm` for efficient inference. Below is a minimal script that runs a single image-plus-prompt generation.

### Prerequisites

You need to set up the following environment variables and files (a loader sketch follows the list):

* `NLP_MODEL_PATH`: Path to your spaCy model (e.g., `en_core_web_sm`).
* `VLLM_MODEL_PATH`: Path to this model (local or HF Hub ID).
* `prompt/generate_meme.txt`: The text file containing the system prompt for CoT generation.
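
A small loader along these lines can resolve the configured paths before building the engine. This is a sketch that assumes the files above exist; the minimal example below hard-codes its paths instead:

```python
import os
from pathlib import Path

# Resolve the configured paths, falling back to documented defaults.
NLP_MODEL_PATH = os.environ.get("NLP_MODEL_PATH", "en_core_web_sm")
VLLM_MODEL_PATH = os.environ.get("VLLM_MODEL_PATH", "Your-HF-Org/HUMOR-COT")

# Load the CoT system prompt used for meme generation.
SYSTEM_PROMPT = Path("prompt/generate_meme.txt").read_text(encoding="utf-8")

print(f"spaCy model: {NLP_MODEL_PATH}")
print(f"vLLM model:  {VLLM_MODEL_PATH}")
print(f"System prompt loaded ({len(SYSTEM_PROMPT)} chars)")
```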

### Inference Code

```python
from vllm import LLM, SamplingParams

# 1. Configuration
# Replace with your actual model path or Hugging Face ID
MODEL_PATH = "Your-HF-Org/HUMOR-COT"

# 2. Initialize the vLLM engine
# Note: Qwen2.5-VL requires specific pixel arguments for the visual encoder
llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 1},
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
        "fps": 1,
    },
    gpu_memory_utilization=0.3,
)


def generate_meme(image_path, prompt):
    """Simple function to generate text from an image."""
    # Construct the message in Qwen2.5-VL format
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        },
    ]

    # Set sampling parameters
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

    # Run inference
    outputs = llm.chat(messages=messages, sampling_params=sampling_params)
    return outputs[0].outputs[0].text


if __name__ == "__main__":
    # 3. Minimal main execution
    image_file = "assets/test_image.jpg"
    user_prompt = "Generate a humorous meme caption.\nTag: Work Life\nContext: Monday morning."

    # Run and print
    caption = generate_meme(image_file, user_prompt)
    print(caption)
```
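
Note that `gpu_memory_utilization=0.3` reserves only 30% of GPU memory for vLLM's weights and KV cache, leaving headroom for other processes on the same device; on a dedicated GPU you can raise it (e.g., to 0.9) for larger batches and higher throughput.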

## Training Data & Methodology

The model was trained on a dataset of **3,713** high-quality, in-the-wild memes.

* **Data Processing:** We utilized a two-stage CoT synthesis pipeline (powered by Doubao-1.5-vision-pro) to reverse-engineer the "thought process" behind each meme.
* **Format:** The model is trained to output a reasoning trace followed by the final content in the form `box_1: text, box_2: text` (a parsing sketch follows below).
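
If your application needs the caption boxes without the reasoning trace, a simple regex can pull them out. This is a sketch assuming the literal `box_N: text` layout above, with boxes separated by commas or newlines; adjust the pattern if the trace uses different delimiters:

```python
import re


def extract_boxes(model_output: str) -> dict[str, str]:
    """Pull `box_N: text` fields out of the model output.

    Assumes the format described above; everything before the first
    `box_1:` is treated as the reasoning trace and ignored.
    """
    pattern = re.compile(r"box_(\d+):\s*([^,\n]+)")
    return {f"box_{n}": text.strip() for n, text in pattern.findall(model_output)}


sample = "The template contrasts calm vs. panic. box_1: Me at 9am, box_2: Me at 9:05am"
print(extract_boxes(sample))
# {'box_1': 'Me at 9am', 'box_2': 'Me at 9:05am'}
```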

## Evaluation Results

Evaluation was conducted against strong baselines (Qwen2.5-7B-Instruct, GPT-4o) using both human evaluation and automated metrics.

| Model | Humor (0-5) | Readability (0-5) | Human-Likeness Score (%) |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct (Base) | 2.39 | 3.35 | 75.7% |
| GPT-4o | **2.70** | **3.79** | 91.3% |
| **HUMOR-COT (Ours)** | 2.68 | 3.70 | **91.5%** |

*HUMOR-COT significantly outperforms the base model and achieves parity with closed-source SOTA models in human-likeness.*

![Evaluation Results](benchmark.jpeg)

## Citation

If you use this model in your research, please cite:

```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}
```