---
language:
- en
- zh
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- vision
- image-text-to-text
- multimodal
- meme-generation
- humor
- chain-of-thought
- qwen
pipeline_tag: image-text-to-text
library_name: vllm
---

# HUMOR-COT: Hierarchical Understanding and Meme Optimization with Chain-of-Thought

<div align="center">

**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-RM](https://huggingface.co/OpenDILabCommunity/HUMOR-RM-Keye-VL)**

</div>

## Model Summary

**HUMOR-COT** is a multimodal generative model capable of creating humorous, context-aware memes. It is fine-tuned from **Qwen2.5-VL-7B-Instruct** using a novel **Hierarchical Chain-of-Thought (CoT)** approach.

Unlike standard image captioning models that map images directly to text, HUMOR-COT mimics the human creative process in two stages:

1. **Template-Level Reasoning:** Analyzes the image to infer latent intent, emotional tone, and layout.
2. **Context-Level Grounding:** Generates specific, humorous captions (punchlines) grounded in user-supplied keywords or contexts.
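The two stages above can be sketched as a simple prompt chain. This is an illustrative sketch only: `run_model` is a hypothetical stand-in for any VLM inference call (it is not part of this repository's API), and the prompt wording is invented for illustration.

```python
# Hypothetical sketch of HUMOR-COT's two-stage hierarchical CoT chain.
# `run_model(image, prompt) -> str` stands in for any VLM call; the
# prompt text is illustrative, not the model's actual system prompt.

def hierarchical_generate(run_model, image, tag, context):
    # Stage 1: Template-Level Reasoning -- infer the template's latent
    # intent, emotional tone, and text-box layout from the image alone.
    analysis = run_model(
        image,
        "Describe this meme template: its latent intent, "
        "emotional tone, and text-box layout.",
    )
    # Stage 2: Context-Level Grounding -- write punchlines for the
    # user-supplied tag/context, conditioned on the stage-1 analysis.
    return run_model(
        image,
        f"Template analysis: {analysis}\n"
        f"Tag: {tag}\nContext: {context}\n"
        "Write a humorous caption for each text box.",
    )
```

The point of the chain is that stage 2 sees the stage-1 analysis, so punchlines stay grounded in the template's structure rather than in the raw image alone.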

This model represents the Supervised Fine-Tuning (SFT) stage of the HUMOR framework. In human evaluation it reaches a human-likeness score of **91.5%**, on par with GPT-4o (91.3%), while clearly outperforming the base Qwen2.5-VL model in humor and readability.

## Uses

### Intended Use

* **Meme Generation:** Generating humorous captions for uploaded images based on specific topics or keywords.
* **Humor Understanding:** Analyzing the punchline mechanics of existing memes.
* **Creative Writing Assist:** Brainstorming metaphorical associations for visual content.

### Out of Scope

* Generation of hate speech, violence, or harmful stereotypes (filtered during training, but guardrails recommended for deployment).

## How to Get Started

The model is designed to be used with `vllm` for efficient inference. Below is a minimal, self-contained inference example.

### Prerequisites

The full generation pipeline expects the following environment variables and files (the minimal example below hardcodes the model path instead):

* `NLP_MODEL_PATH`: Path to your Spacy model (e.g., `en_core_web_sm`).
* `VLLM_MODEL_PATH`: Path to this model (local or HF hub ID).
* `prompt/generate_meme.txt`: The text file containing the system prompt for CoT generation.
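One minimal way to pick these settings up is sketched below. The fallback values and the `load_system_prompt` helper are assumptions for illustration, not official defaults shipped with the model:

```python
import os

# Read the pipeline's configuration from the environment.
# The fallbacks here are illustrative, not official defaults.
NLP_MODEL_PATH = os.environ.get("NLP_MODEL_PATH", "en_core_web_sm")
VLLM_MODEL_PATH = os.environ.get("VLLM_MODEL_PATH", "Your-HF-Org/HUMOR-COT")

def load_system_prompt(path="prompt/generate_meme.txt"):
    """Load the CoT system prompt, falling back to a stub if the file is absent."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    return "You are a helpful assistant."  # placeholder when the file is missing
```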

### Inference Code

```python
import os

from vllm import LLM, SamplingParams

# 1. Configuration
# Replace with your actual model path or Hugging Face ID
MODEL_PATH = "Your-HF-Org/HUMOR-COT"

# 2. Initialize the vLLM engine
# Note: Qwen2.5-VL requires specific pixel arguments for the visual encoder
llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 1},
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
        "fps": 1,  # only relevant for video inputs; harmless for images
    },
    allowed_local_media_path="/",  # permit file:// image URLs in chat messages
    gpu_memory_utilization=0.3,  # raise this if the model is your only GPU workload
)

def generate_meme(image_path, prompt):
    """Generate a meme caption for a single image."""
    # vLLM's chat API expects OpenAI-style content parts; a local image
    # is passed as a file:// URL (hence allowed_local_media_path above).
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"file://{os.path.abspath(image_path)}"}},
            {"type": "text", "text": prompt},
        ]},
    ]

    # Set sampling parameters
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

    # Run inference
    outputs = llm.chat(messages=messages, sampling_params=sampling_params)
    return outputs[0].outputs[0].text

if __name__ == "__main__":
    # 3. Minimal main execution
    image_file = "assets/test_image.jpg"
    user_prompt = "Generate a humorous meme caption.\nTag: Work Life\nContext: Monday morning."

    caption = generate_meme(image_file, user_prompt)
    print(caption)
```

## Training Data & Methodology

The model was trained on a dataset of **3,713** high-quality, in-the-wild memes.

* **Data Processing:** We utilized a Two-Stage CoT synthesis pipeline (powered by Doubao-1.5-vision-pro) to reverse-engineer the "thought process" behind each meme.
* **Format:** The model is trained to output a reasoning trace followed by the final content `box_1: text, box_2: text`.
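A small helper (not part of the release; the regex is an assumption based on the `box_N: text` pattern described above) can split the model's final line into per-box captions:

```python
import re

def parse_boxes(output: str) -> dict:
    """Parse 'box_1: text, box_2: text' into {1: 'text', 2: 'text'}.

    Assumes the reasoning trace precedes the final content and that
    captions follow the 'box_N:' pattern described in this model card.
    """
    boxes = {}
    # Capture 'box_<n>:' then text lazily up to the next box label or the end.
    for m in re.finditer(r"box_(\d+):\s*(.*?)(?=,?\s*box_\d+:|$)", output, re.S):
        boxes[int(m.group(1))] = m.group(2).strip().rstrip(",")
    return boxes
```

For example, `parse_boxes("...reasoning...\nbox_1: Monday again, box_2: send coffee")` yields `{1: "Monday again", 2: "send coffee"}`.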



## Evaluation Results

Evaluation was conducted against strong baselines (Qwen2.5-7B-Instruct, GPT-4o) using both human evaluation and automated metrics.

| Model | Humor (0-5) | Readability (0-5) | Human-Likeness Score (%) |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct (Base) | 2.39 | 3.35 | 75.7% |
| GPT-4o | **2.70** | **3.79** | 91.3% |
| **HUMOR-COT (Ours)** | 2.68 | 3.70 | **91.5%** |

*HUMOR-COT significantly outperforms the base model and achieves parity with closed-source SOTA models in human-likeness.*

![Generation results of Models trained by different CoT methods](assets/generation_results.png)

## Citation

If you use this model in your research, please cite:

```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}
```