karroyan committed on
Commit df58226 · 1 Parent(s): 276cd53

feature(lxy): add readme and model

.gitattributes CHANGED
@@ -33,3 +33,18 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model-00001-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
+ model-00002-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
+ model-00003-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
+ model-00004-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
+ config.json filter=lfs diff=lfs merge=lfs -text
+ model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
+ preprocessor_config.json filter=lfs diff=lfs merge=lfs -text
+ tokenizer_config.json filter=lfs diff=lfs merge=lfs -text
+ vocab.json filter=lfs diff=lfs merge=lfs -text
+ added_tokens.json filter=lfs diff=lfs merge=lfs -text
+ generation_config.json filter=lfs diff=lfs merge=lfs -text
+ special_tokens_map.json filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ video_preprocessor_config.json filter=lfs diff=lfs merge=lfs -text
+ chat_template.jinja filter=lfs diff=lfs merge=lfs -text
Modelfile ADDED
@@ -0,0 +1,16 @@
# ollama modelfile auto-generated by llamafactory

FROM .

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ range .Messages }}{{ if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
<|im_start|>assistant
{{ else if eq .Role "assistant" }}{{ .Content }}<|im_end|>
{{ end }}{{ end }}"""

SYSTEM """You are a helpful assistant."""

PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 4096
README.md ADDED
@@ -0,0 +1,259 @@
---
language:
- en
- zh
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- vision
- image-text-to-text
- multimodal
- meme-generation
- humor
- chain-of-thought
- qwen
pipeline_tag: image-text-to-text
library_name: vllm
---

# HUMOR-COT: Hierarchical Understanding and Meme Optimization with Chain-of-Thought

<div align="center">

**[Paper](https://arxiv.org/abs/2512.24555)** | **[Project Page](https://github.com/karroyan/MemeGenerator)**

</div>

## Model Summary

**HUMOR-COT** is a multimodal generative model that creates humorous, context-aware memes. It is fine-tuned from **Qwen2.5-VL-7B-Instruct** using a novel **Hierarchical Chain-of-Thought (CoT)** approach.

Unlike standard image captioning models, which map images directly to text, HUMOR-COT mimics the human creative process in two stages:

1. **Template-Level Reasoning:** Analyzes the image to infer latent intent, emotional tone, and layout.
2. **Context-Level Grounding:** Generates specific, humorous captions (punchlines) grounded in user-supplied keywords or contexts (see the sketch after this list).

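To make the hierarchy concrete, here is a minimal sketch of how the two stages could be laid out as chat turns. The prompt wording and message schema below are illustrative assumptions, not the training prompts (those live in `prompt/generate_meme.txt`):

```python
# Illustrative only: hypothetical prompt wording for the two-stage CoT flow.
def build_two_stage_messages(image_path: str, keywords: str) -> list[dict]:
    # Stage 1: template-level reasoning over the raw image.
    stage1 = {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": (
                "Analyze this meme template: infer its latent intent, "
                "emotional tone, and text-box layout."
            )},
        ],
    }
    # Stage 2: context-level grounding of punchlines in user keywords.
    stage2 = {
        "role": "user",
        "content": [
            {"type": "text", "text": (
                f"Using that analysis, write humorous captions about "
                f"'{keywords}'. Answer as box_1: ..., box_2: ..."
            )},
        ],
    }
    return [stage1, stage2]
```
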
This model represents the Supervised Fine-Tuning (SFT) stage of the HUMOR framework. It performs on par with GPT-4o and other strong VLMs in humor and readability, and achieves the highest human-likeness score (91.5%) among the models evaluated (see the results below).

## Uses

### Intended Use

* **Meme Generation:** Generating humorous captions for uploaded images based on specific topics or keywords.
* **Humor Understanding:** Analyzing the punchline mechanics of existing memes.
* **Creative Writing Assistance:** Brainstorming metaphorical associations for visual content.

### Out of Scope

* Generation of hate speech, violence, or harmful stereotypes (filtered during training, but guardrails are recommended for deployment).

## How to Get Started

The model is designed to be used with `vllm` for efficient inference. Below is a custom wrapper class that handles the hierarchical generation process.

### Prerequisites

You need to set up the following environment variables and files:

* `NLP_MODEL_PATH`: Path to your spaCy model (e.g., `en_core_web_sm`).
* `VLLM_MODEL_PATH`: Path to this model (a local directory or HF Hub ID).
* `prompt/generate_meme.txt`: A text file containing the system prompt for CoT generation.

### Inference Code

```python
import os
import json
import logging
import spacy
from vllm import LLM, SamplingParams
from transformers import AutoProcessor

# Note: Boxclipper and tag_config are custom dependencies from your codebase
# from utils import Boxclipper, tag_config

logger = logging.getLogger(__name__)


class HumorMemeGenerator:
    def __init__(self, input_path, input_path_update, mask_api: bool = False, use_gemini_generate: bool = False):
        """
        Initializes the HUMOR-COT generator.

        Args:
            input_path (str): Path to the initial dataset/config JSON.
            input_path_update (str): Path to the updated labels JSON.
            mask_api (bool): Whether to mask API calls (for internal tools).
            use_gemini_generate (bool): Toggle to use an external API instead of local vLLM.
        """
        self.mask_api = mask_api
        self.use_gemini_generate = use_gemini_generate

        # Load configurations
        with open(input_path, 'r') as f:
            self.input_data = json.load(f)

        with open(input_path_update, 'r') as f:
            self.input_data_update = json.load(f)

        # Environment configuration
        self.nlp_path = os.getenv('NLP_MODEL_PATH', 'en_core_web_sm')
        self.model_path = os.getenv('VLLM_MODEL_PATH', 'Your-HF-Org/HUMOR-COT')  # Default to HF path

        self.nlp = spacy.load(self.nlp_path)

        # Initialize internal classifiers/tools (placeholder for custom logic)
        # self.scene_theme_classifier = self._init_scene_theme_classifier()
        # self.boxclipper = Boxclipper(mask_api=self.mask_api)

        # Load the prompt template
        try:
            with open('prompt/generate_meme.txt', 'r') as prompt_file:
                self.PROMPT = prompt_file.read()
        except FileNotFoundError:
            logger.warning("Prompt file not found. Using default prompt.")
            self.PROMPT = "Generate a humorous meme caption based on the image..."

        # Initialize Qwen2.5-VL via vLLM
        if not self.use_gemini_generate:
            logger.info(f"Loading Qwen2.5-VL from {self.model_path}...")
            self.processor = AutoProcessor.from_pretrained(
                self.model_path,
                trust_remote_code=True
            )

            # vLLM configuration for multimodal inference
            self.llm = LLM(
                model=self.model_path,
                trust_remote_code=True,
                dtype="bfloat16",
                max_model_len=4096,
                max_num_seqs=5,
                mm_processor_kwargs={
                    "min_pixels": 28 * 28,
                    "max_pixels": 1280 * 28 * 28,
                    "fps": 1,
                },
                limit_mm_per_prompt={"image": 1},
                tensor_parallel_size=1,
                gpu_memory_utilization=0.3,
            )
        else:
            logger.info("Using external API (Gemini/GPT) for generation.")
            self.llm = None

    def inference(self, tag, keywords, question, image_path, modify, detections, history):
        """
        Internal inference method wrapping the vLLM generation call.
        (Logic adapted for standalone usage; `modify`, `detections`, and
        `history` are kept for API compatibility with the full pipeline.)
        """
        if self.llm is None:
            return "External API logic needed here", []

        # Construct the prompt using the CoT structure
        prompt_text = self.PROMPT.format(
            tag=tag,
            keywords=keywords,
            question=question
        )

        # Construct vLLM chat inputs
        # Note: Qwen2.5-VL requires specific token formatting
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt_text}
            ]}
        ]

        # Prepare inputs using processor logic (simplified for vLLM);
        # the exact message schema depends on your vLLM version's Qwen2.5-VL support.
        outputs = self.llm.chat(messages=messages, sampling_params=SamplingParams(temperature=0.7, max_tokens=256))

        generated_text = outputs[0].outputs[0].text
        # Parse generated_text to extract the caption and bounding box (loc)
        # return text, loc
        return generated_text, []

    def text_generate(self, state, chose_image_path, initial_info):
        """
        Main entry point for generating meme text.

        Args:
            state: Object containing history and modification state.
            chose_image_path (dict): {'local_path': str, 'detections': ...}
            initial_info (dict): {'tag': str, 'Text Content Keywords': str, 'question': str, ...}
        """
        tag = initial_info.get('tag', '')
        keywords = initial_info.get('Text Content Keywords', '')
        question = initial_info.get('question', '') + '\n' + initial_info.get('answer', '')
        modify = state.modify

        # Call inference
        inference_result = self.inference(
            tag,
            keywords,
            question,
            chose_image_path['local_path'],
            modify,
            chose_image_path.get('detections'),
            state.history_text_loc_info
        )

        gemini_result = None

        # Handle the output tuple (external-API variants may return a third element)
        if len(inference_result) == 3:
            text, loc, gemini_result = inference_result
        else:
            text, loc = inference_result[:2]

        # Update state
        state.original_text_loc_info = {'text': text, 'loc': loc}

        if gemini_result:
            state.gemini_text_loc_info = {
                'text': gemini_result['text'],
                'loc': gemini_result['loc'],
                'image_path': gemini_result.get('image_path', chose_image_path['local_path'])
            }
        else:
            state.gemini_text_loc_info = None

        return text, loc
```
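
A minimal usage sketch for the wrapper above. All paths, the JSON files, and the `state` object here are hypothetical placeholders; the real pipeline supplies detections and history from its UI layer:

```python
from types import SimpleNamespace

# Hypothetical config files; the wrapper only requires that they parse as JSON.
generator = HumorMemeGenerator(
    input_path="data/templates.json",
    input_path_update="data/labels_update.json",
)

# Minimal stand-in for the pipeline's state object.
state = SimpleNamespace(modify=False, history_text_loc_info=[])

text, loc = generator.text_generate(
    state,
    chose_image_path={"local_path": "examples/cat.jpg", "detections": None},
    initial_info={
        "tag": "work-from-home",
        "Text Content Keywords": "Monday, coffee",
        "question": "Why is the cat staring at the laptop?",
        "answer": "",
    },
)
print(text)
```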

## Training Data & Methodology

The model was trained on a dataset of **3,713** high-quality, in-the-wild memes.

* **Data Processing:** We used a two-stage CoT synthesis pipeline (powered by Doubao-1.5-vision-pro) to reverse-engineer the "thought process" behind each meme.
* **Format:** The model is trained to output a reasoning trace followed by the final captions in the form `box_1: text, box_2: text` (a parsing sketch follows this list).
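
Since the exact output grammar is not published here, the following is a minimal parsing sketch for the `box_N: text` tail, assuming the reasoning trace precedes it and boxes are separated by commas or newlines (a hypothetical helper, not part of the released code):

```python
import re

def parse_meme_output(generated_text: str) -> dict[int, str]:
    """Extract box_N captions from a model response.

    Hypothetical helper: assumes the reasoning trace comes first and the
    final content contains entries like 'box_1: top text, box_2: bottom text'.
    Adjust the pattern to the grammar your checkpoint actually emits.
    """
    # Capture 'box_<n>:' up to the next 'box_<m>:' or the end of the string.
    pattern = re.compile(r"box_(\d+)\s*:\s*(.*?)(?=,?\s*box_\d+\s*:|$)", re.DOTALL)
    return {int(n): text.strip() for n, text in pattern.findall(generated_text)}

# parse_meme_output("The joke contrasts... box_1: me at 9am, box_2: me at 5pm")
# -> {1: 'me at 9am', 2: 'me at 5pm'}
```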

## Evaluation Results

Evaluation was conducted against strong baselines (Qwen2.5-7B-Instruct, GPT-4o) using both human evaluation and automated metrics.

| Model | Humor (0-5) | Readability (0-5) | Human-Likeness Score (%) |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct (Base) | 2.39 | 3.35 | 75.7 |
| GPT-4o | **2.70** | **3.79** | 91.3 |
| **HUMOR-COT (Ours)** | 2.68 | 3.70 | **91.5** |

*HUMOR-COT significantly outperforms the base model, reaches parity with closed-source SOTA models on humor and readability, and posts the highest human-likeness score.*

## Citation

If you use this model in your research, please cite:

```bibtex
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}
```
added_tokens.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:58b54bbe36fc752f79a24a271ef66a0a0830054b4dfad94bde757d851968060b
size 605
chat_template.jinja ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a0bc6f6fc7a29a80017a433e8f03a1cc1236e838a944a2d034295a60c4f2fddb
size 1017
config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:41769145a1ac36f13c54710617f00143672d1bbc0d76792beec4c07d2d9f38c8
size 3219
generation_config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:319521f8c6ab944bb1e33d8879079b772b1d6dc8455be4ceecdf7e4c52688a52
size 214
merges.txt ADDED
The diff for this file is too large to render. See raw diff
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3809881f7c49314cb93194900c696f8ced3a0c658864d18c654406b26b708b28
size 4968243304
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:746487cc31287ced3eeba0694addbd35835c7a37ec77da784b2b3abc7b4d2d8c
size 4991495816
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bb159f9dc13963266af8fe514580f57cc4ecbdf624fbca705a7fb39b0e3d6b39
size 4932751040
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3cf368d6376058703680ca7701d5e5528991dc9e5449078a474d58afa3c5e264
size 1691924384
model.safetensors.index.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4be35310ddc165e46e88de4c3fec8c1210014b0b8717f4544d82cc740814ae0c
size 57655
preprocessor_config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:276e1dbe46dd567fce6e587665266ede535f42ab08d46f3d7febea17cb37abcd
size 791
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:76862e765266b85aa9459767e33cbaf13970f327a0e88d1c65846c2ddd3a1ecd
size 613
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
size 11421896
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8160131af9f1a4b44ace4fb7a707d6315f90efa6bcb4828a82972dfafed6a458
size 4756
video_preprocessor_config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:09e98526bcd1b8584217418253badf2824ecf2815933b0583cdceb2e8f79ebb0
size 907
vocab.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ca10d7e9fb3ed18575dd1e277a2579c16d108e32f27439684afa0e10b1440910
size 2776833