---
language:
- en
- zh
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- vision
- image-text-to-text
- multimodal
- meme-generation
- humor
- chain-of-thought
- qwen
pipeline_tag: image-text-to-text
library_name: vllm
---

# HUMOR-COT: Hierarchical Understanding and Meme Optimization with Chain-of-Thought
**[Paper](https://arxiv.org/abs/2512.24555)** | **[HUMOR-RM](https://huggingface.co/OpenDILabCommunity/HUMOR-RM-Keye-VL)**
## Model Summary

**HUMOR-COT** is a multimodal generative model that creates humorous, context-aware memes. It is fine-tuned from **Qwen2.5-VL-7B-Instruct** using a novel **Hierarchical Chain-of-Thought (CoT)** approach. Unlike standard image-captioning models that map images directly to text, HUMOR-COT mimics the human creative process in two stages:

1. **Template-Level Reasoning:** Analyzes the image to infer latent intent, emotional tone, and layout.
2. **Context-Level Grounding:** Generates specific, humorous captions (punchlines) grounded in user-supplied keywords or contexts.

This model is the Supervised Fine-Tuning (SFT) stage of the HUMOR framework, achieving state-of-the-art performance in humor, readability, and human-likeness (91.5%) compared to GPT-4o and other VLMs.

## Uses

### Intended Use

* **Meme Generation:** Generating humorous captions for uploaded images based on specific topics or keywords.
* **Humor Understanding:** Analyzing the punchline mechanics of existing memes.
* **Creative Writing Assist:** Brainstorming metaphorical associations for visual content.

### Out of Scope

* Generation of hate speech, violence, or harmful stereotypes (filtered during training, but guardrails are recommended for deployment).

## How to Get Started

The model is designed to be used with `vllm` for efficient inference. Below is a minimal example of the generation process.

### Prerequisites

Set up the following environment variables and files:

* `NLP_MODEL_PATH`: Path to your spaCy model (e.g., `en_core_web_sm`).
* `VLLM_MODEL_PATH`: Path to this model (a local path or Hugging Face Hub ID).
* `prompt/generate_meme.txt`: The text file containing the system prompt for CoT generation.

### Inference Code

```python
import base64

from vllm import LLM, SamplingParams

# 1. Configuration
# Replace with your actual model path or Hugging Face ID
MODEL_PATH = "Your-HF-Org/HUMOR-COT"

# 2. Initialize the vLLM engine
# Note: Qwen2.5-VL requires explicit pixel bounds for the visual encoder
llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 1},
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
        "fps": 1,
    },
    gpu_memory_utilization=0.3,
)


def generate_meme(image_path: str, prompt: str) -> str:
    """Generate a meme caption for a local image."""
    # vLLM's chat API expects OpenAI-style content parts, so the local
    # image is passed as a base64 data URL.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
                {"type": "text", "text": prompt},
            ],
        },
    ]

    # Set sampling parameters and run inference
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.chat(messages=messages, sampling_params=sampling_params)
    return outputs[0].outputs[0].text


if __name__ == "__main__":
    # 3. Minimal main execution
    image_file = "assets/test_image.jpg"
    user_prompt = (
        "Generate a humorous meme caption.\n"
        "Tag: Work Life\n"
        "Context: Monday morning."
    )
    print(generate_meme(image_file, user_prompt))
```

## Training Data & Methodology

The model was trained on a dataset of **3,713** high-quality, in-the-wild memes.

* **Data Processing:** We used a two-stage CoT synthesis pipeline (powered by Doubao-1.5-vision-pro) to reverse-engineer the "thought process" behind each meme.
* **Format:** The model is trained to output a reasoning trace followed by the final content in the form `box_1: text, box_2: text`.

## Evaluation Results

Evaluation was conducted against strong baselines (Qwen2.5-7B-Instruct, GPT-4o) using both human evaluation and automated metrics.
| Model | Humor (0-5) | Readability (0-5) | Human-Likeness Score (%) |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct (Base) | 2.39 | 3.35 | 75.7% |
| GPT-4o | 2.70 | **3.79** | 91.3% |
| **HUMOR-COT (Ours)** | **2.68** | 3.70 | **91.5%** |

*HUMOR-COT significantly outperforms the base model and achieves parity with closed-source SOTA models in human-likeness.*

![Generation results of models trained by different CoT methods](assets/generation_results.png)

## Citation

If you use this model in your research, please cite:

```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}
```
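## Parsing Model Output

As noted above, the model emits a reasoning trace followed by caption content in the `box_1: text, box_2: text` format. A minimal sketch of a downstream parser is shown below; the `parse_caption_boxes` helper and the sample output string are illustrative assumptions, not part of the released code.

```python
import re


def parse_caption_boxes(model_output: str) -> dict:
    """Extract `box_N: text` fields from a model response.

    Assumes the caption follows the `box_1: text, box_2: text`
    convention described in this card; any reasoning trace that
    precedes the first `box_` label is ignored.
    """
    boxes = {}
    # Match each `box_<n>:` label and lazily capture text up to the
    # next `box_` label (or the end of the string).
    for match in re.finditer(
        r"box_(\d+):\s*(.*?)(?=,\s*box_\d+:|$)", model_output, flags=re.DOTALL
    ):
        boxes[f"box_{match.group(1)}"] = match.group(2).strip()
    return boxes


# Hypothetical model response for illustration
output = "The image contrasts two moods. box_1: Me on Monday, box_2: Me on Friday"
print(parse_caption_boxes(output))
# {'box_1': 'Me on Monday', 'box_2': 'Me on Friday'}
```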