---
language:
- en
- zh
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- vision
- image-text-to-text
- multimodal
- meme-generation
- humor
- chain-of-thought
- qwen
pipeline_tag: image-text-to-text
library_name: vllm
---

# HUMOR-COT: Hierarchical Understanding and Meme Optimization with Chain-of-Thought

<div align="center">

**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-RM](https://huggingface.co/OpenDILabCommunity/HUMOR-RM-Keye-VL)**

</div>

## Model Summary

**HUMOR-COT** is a multimodal generative model capable of creating humorous, context-aware memes. It is fine-tuned from **Qwen2.5-VL-7B-Instruct** using a novel **Hierarchical Chain-of-Thought (CoT)** approach.

Unlike standard image captioning models that map images directly to text, HUMOR-COT mimics the human creative process in two stages:

1. **Template-Level Reasoning:** Analyzes the image to infer latent intent, emotional tone, and layout.
2. **Context-Level Grounding:** Generates specific, humorous captions (punchlines) grounded in user-supplied keywords or contexts.
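The two stages above can be sketched as a simple prompt chain. This is an illustrative sketch only: `run_model` is a hypothetical stand-in for any VLM inference call (it is not part of this repository's API), and the prompt wording is invented for illustration.

```python
# Hypothetical sketch of HUMOR-COT's two-stage hierarchical CoT chain.
# `run_model(image, prompt) -> str` stands in for any VLM call; the
# prompt text is illustrative, not the model's actual system prompt.

def hierarchical_generate(run_model, image, tag, context):
    # Stage 1: Template-Level Reasoning -- infer the template's latent
    # intent, emotional tone, and text-box layout from the image alone.
    analysis = run_model(
        image,
        "Describe this meme template: its latent intent, "
        "emotional tone, and text-box layout.",
    )
    # Stage 2: Context-Level Grounding -- write punchlines for the
    # user-supplied tag/context, conditioned on the stage-1 analysis.
    return run_model(
        image,
        f"Template analysis: {analysis}\n"
        f"Tag: {tag}\nContext: {context}\n"
        "Write a humorous caption for each text box.",
    )
```

The point of the chain is that stage 2 sees the stage-1 analysis, so punchlines stay grounded in the template's structure rather than in the raw image alone.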

This model represents the Supervised Fine-Tuning (SFT) stage of the HUMOR framework. In human evaluation it reaches a human-likeness score of **91.5%**, on par with GPT-4o (91.3%), while clearly outperforming the base Qwen2.5-VL model in humor and readability.

## Uses

### Intended Use

* **Meme Generation:** Generating humorous captions for uploaded images based on specific topics or keywords.
* **Humor Understanding:** Analyzing the punchline mechanics of existing memes.
* **Creative Writing Assist:** Brainstorming metaphorical associations for visual content.

### Out of Scope

* Generation of hate speech, violence, or harmful stereotypes (filtered during training, but guardrails recommended for deployment).

## How to Get Started

The model is designed to be used with `vllm` for efficient inference. Below is a minimal, self-contained inference example.

### Prerequisites

The full generation pipeline expects the following environment variables and files (the minimal example below hardcodes the model path instead):

* `NLP_MODEL_PATH`: Path to your Spacy model (e.g., `en_core_web_sm`).
* `VLLM_MODEL_PATH`: Path to this model (local or HF hub ID).
* `prompt/generate_meme.txt`: The text file containing the system prompt for CoT generation.
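One minimal way to pick these settings up is sketched below. The fallback values and the `load_system_prompt` helper are assumptions for illustration, not official defaults shipped with the model:

```python
import os

# Read the pipeline's configuration from the environment.
# The fallbacks here are illustrative, not official defaults.
NLP_MODEL_PATH = os.environ.get("NLP_MODEL_PATH", "en_core_web_sm")
VLLM_MODEL_PATH = os.environ.get("VLLM_MODEL_PATH", "Your-HF-Org/HUMOR-COT")

def load_system_prompt(path="prompt/generate_meme.txt"):
    """Load the CoT system prompt, falling back to a stub if the file is absent."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    return "You are a helpful assistant."  # placeholder when the file is missing
```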

### Inference Code

```python
import os

from vllm import LLM, SamplingParams

# 1. Configuration
# Replace with your actual model path or Hugging Face ID
MODEL_PATH = "Your-HF-Org/HUMOR-COT"

# 2. Initialize the vLLM engine
# Note: Qwen2.5-VL requires specific pixel arguments for the visual encoder
llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 1},
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
        "fps": 1,  # only relevant for video inputs; harmless for images
    },
    allowed_local_media_path="/",  # permit file:// image URLs in chat messages
    gpu_memory_utilization=0.3,  # raise this if the model is your only GPU workload
)

def generate_meme(image_path, prompt):
    """Generate a meme caption for a single image."""
    # vLLM's chat API expects OpenAI-style content parts; a local image
    # is passed as a file:// URL (hence allowed_local_media_path above).
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"file://{os.path.abspath(image_path)}"}},
            {"type": "text", "text": prompt},
        ]},
    ]

    # Set sampling parameters
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

    # Run inference
    outputs = llm.chat(messages=messages, sampling_params=sampling_params)
    return outputs[0].outputs[0].text

if __name__ == "__main__":
    # 3. Minimal main execution
    image_file = "assets/test_image.jpg"
    user_prompt = "Generate a humorous meme caption.\nTag: Work Life\nContext: Monday morning."

    caption = generate_meme(image_file, user_prompt)
    print(caption)
```

## Training Data & Methodology

The model was trained on a dataset of **3,713** high-quality, in-the-wild memes.

* **Data Processing:** We utilized a Two-Stage CoT synthesis pipeline (powered by Doubao-1.5-vision-pro) to reverse-engineer the "thought process" behind each meme.
* **Format:** The model is trained to output a reasoning trace followed by the final content `box_1: text, box_2: text`.
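A small helper (not part of the release; the regex is an assumption based on the `box_N: text` pattern described above) can split the model's final line into per-box captions:

```python
import re

def parse_boxes(output: str) -> dict:
    """Parse 'box_1: text, box_2: text' into {1: 'text', 2: 'text'}.

    Assumes the reasoning trace precedes the final content and that
    captions follow the 'box_N:' pattern described in this model card.
    """
    boxes = {}
    # Capture 'box_<n>:' then text lazily up to the next box label or the end.
    for m in re.finditer(r"box_(\d+):\s*(.*?)(?=,?\s*box_\d+:|$)", output, re.S):
        boxes[int(m.group(1))] = m.group(2).strip().rstrip(",")
    return boxes
```

For example, `parse_boxes("...reasoning...\nbox_1: Monday again, box_2: send coffee")` yields `{1: "Monday again", 2: "send coffee"}`.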



## Evaluation Results

Evaluation was conducted against strong baselines (Qwen2.5-7B-Instruct, GPT-4o) using both human evaluation and automated metrics.

| Model | Humor (0-5) | Readability (0-5) | Human-Likeness Score (%) |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct (Base) | 2.39 | 3.35 | 75.7% |
| GPT-4o | **2.70** | **3.79** | 91.3% |
| **HUMOR-COT (Ours)** | 2.68 | 3.70 | **91.5%** |

*HUMOR-COT significantly outperforms the base model and achieves parity with closed-source SOTA models in human-likeness.*

![Generation results of Models trained by different CoT methods](assets/generation_results.png)

## Citation

If you use this model in your research, please cite:

```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}
```