
HUMOR-COT: Hierarchical Understanding and Meme Optimization with Chain-of-Thought

Model Summary

HUMOR-COT is a multimodal generative model capable of creating humorous, context-aware memes. It is fine-tuned from Qwen2.5-VL-7B-Instruct using a novel Hierarchical Chain-of-Thought (CoT) approach.

Unlike standard image captioning models that map images directly to text, HUMOR-COT mimics the human creative process in two stages:

  1. Template-Level Reasoning: Analyzes the image to infer latent intent, emotional tone, and layout.
  2. Context-Level Grounding: Generates specific, humorous captions (punchlines) grounded in user-supplied keywords or contexts.
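The two stages above compose as a simple pipeline: the output of the template-level analysis is fed back in as context for the grounding step. The sketch below is illustrative only; the stage prompts and the hierarchical_generate helper are our assumptions for exposition, not the released training prompts.

```python
# Illustrative sketch of the two-stage hierarchical CoT flow. The stage
# prompts and the `hierarchical_generate` helper are assumptions of ours,
# not the released code.

TEMPLATE_PROMPT = (
    "Stage 1 (Template-Level Reasoning): describe the image's latent "
    "intent, emotional tone, and caption layout."
)

CONTEXT_PROMPT = (
    "Stage 2 (Context-Level Grounding): using the analysis below and the "
    "keywords '{keywords}', write a grounded punchline.\n\n"
    "Analysis:\n{analysis}"
)


def hierarchical_generate(generate, image, keywords):
    """Compose the two stages on top of any `generate(image, prompt)` callable."""
    analysis = generate(image, TEMPLATE_PROMPT)  # stage 1: template-level reasoning
    grounded_prompt = CONTEXT_PROMPT.format(keywords=keywords, analysis=analysis)
    return generate(image, grounded_prompt)      # stage 2: context-level grounding
```

Because the second stage conditions on the first stage's full analysis, the final caption stays consistent with the inferred template intent rather than being generated directly from pixels.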

This model represents the Supervised Fine-Tuning (SFT) stage of the HUMOR framework. In human evaluation it achieves a human-likeness score of 91.5%, slightly ahead of GPT-4o, while remaining competitive with GPT-4o and other VLMs in humor and readability.

Uses

Intended Use

  • Meme Generation: Generating humorous captions for uploaded images based on specific topics or keywords.
  • Humor Understanding: Analyzing the punchline mechanics of existing memes.
  • Creative Writing Assist: Brainstorming metaphorical associations for visual content.

Out of Scope

  • Generation of hate speech, violence, or harmful stereotypes (filtered during training, but guardrails recommended for deployment).

How to Get Started

The model is designed to be used with vLLM for efficient inference. Below is a minimal script that wraps the generation process.

Prerequisites

You need to set up the following environment variables and files:

  • NLP_MODEL_PATH: Path to your spaCy model (e.g., en_core_web_sm).
  • VLLM_MODEL_PATH: Path to this model (local or HF hub ID).
  • prompt/generate_meme.txt: The text file containing the system prompt for CoT generation.
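As an illustrative sketch of how a script might pick these up (the fallback defaults and the load_system_prompt helper are placeholders of ours, not part of the released code):

```python
import os

# Illustrative only: the environment-variable names follow the list above,
# but the fallback defaults and `load_system_prompt` are placeholders.
NLP_MODEL_PATH = os.environ.get("NLP_MODEL_PATH", "en_core_web_sm")
VLLM_MODEL_PATH = os.environ.get("VLLM_MODEL_PATH", "Your-HF-Org/HUMOR-COT")


def load_system_prompt(path="prompt/generate_meme.txt"):
    """Load the CoT system prompt, falling back to a generic assistant prompt."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read().strip()
    except FileNotFoundError:
        return "You are a helpful assistant."
```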

Inference Code

import base64

from vllm import LLM, SamplingParams

# 1. Configuration
# Replace with your actual model path or Hugging Face ID
MODEL_PATH = "Your-HF-Org/HUMOR-COT"

# 2. Initialize the vLLM engine
# Note: Qwen2.5-VL requires specific pixel arguments for the visual encoder
llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 1},
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
        "fps": 1,
    },
    gpu_memory_utilization=0.3,  # raise this if the engine has the GPU to itself
)

def encode_image(image_path):
    """Read a local image and return it as a base64 data URL, the format
    vLLM's OpenAI-style chat interface accepts for local files."""
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{data}"

def generate_meme(image_path, prompt):
    """Generate a meme caption for a local image."""
    # Construct the message in OpenAI/Qwen2.5-VL chat format.
    # Replace the placeholder system prompt with the contents of
    # prompt/generate_meme.txt to enable the hierarchical CoT behavior.
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
            {"type": "text", "text": prompt},
        ]},
    ]

    # Set sampling parameters
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

    # Run inference
    outputs = llm.chat(messages=messages, sampling_params=sampling_params)
    return outputs[0].outputs[0].text

if __name__ == "__main__":
    # 3. Minimal main execution
    image_file = "assets/test_image.jpg"
    user_prompt = "Generate a humorous meme caption.\nTag: Work Life\nContext: Monday morning."

    # Run and print
    caption = generate_meme(image_file, user_prompt)
    print(caption)

Training Data & Methodology

The model was trained on a dataset of 3,713 high-quality, in-the-wild memes.

  • Data Processing: We utilized a Two-Stage CoT synthesis pipeline (powered by Doubao-1.5-vision-pro) to reverse-engineer the "thought process" behind each meme.
  • Format: The model is trained to output a reasoning trace followed by the final caption content in the form box_1: text, box_2: text.
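Downstream code then needs to split the caption boxes off from the reasoning trace. A minimal sketch of ours (the parse_boxes helper is not part of the released code, and we assume boxes may be separated by commas or newlines):

```python
import re

# Sketch for splitting the final captions off the reasoning trace; we
# assume box entries are separated by commas or newlines.
BOX_PATTERN = re.compile(
    r"box_(\d+)\s*:\s*(.*?)(?=\s*,?\s*box_\d+\s*:|\s*$)", re.DOTALL
)


def parse_boxes(output):
    """Return {box_index: caption_text} parsed from a model completion."""
    return {int(i): text.strip() for i, text in BOX_PATTERN.findall(output)}
```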

Evaluation Results

Evaluation was conducted against strong baselines (Qwen2.5-7B-Instruct, GPT-4o) using both human evaluation and automated metrics.

Model                      | Humor (0-5) | Readability (0-5) | Human-Likeness (%)
Qwen2.5-7B-Instruct (Base) | 2.39        | 3.35              | 75.7
GPT-4o                     | 2.70        | 3.79              | 91.3
HUMOR-COT (Ours)           | 2.68        | 3.70              | 91.5

HUMOR-COT significantly outperforms the base model and achieves parity with closed-source SOTA models in human-likeness.

[Figure: Generation results of models trained by different CoT methods]

Citation

If you use this model in your research, please cite:

@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}