| --- |
| license: apache-2.0 |
| base_model: Qwen/Qwen3-VL-2B-Thinking |
| library_name: transformers |
| tags: |
| - vision-language |
| - new-yorker |
| - humor |
| - rlhf |
| - sft-thinking |
| datasets: |
| - yguooo/newyorker_caption_ranking |
| language: |
| - en |
| --- |
| |
| # humor-r1 — SFT, with thinking (Qwen3-VL-2B-Thinking + LoRA, merged) (E1b) |
|
|
| LoRA-adapted Qwen3-VL-2B-Thinking supervised fine-tuned on (image, synthetic thinking, chosen caption) triples, then merged. Output format: `{thinking}</think>\n\n<caption>X</caption>`. |
|
|
| ## Training data |
|
|
| - 271 New Yorker contests, top-rated caption per contest |
| (`yguooo/newyorker_caption_ranking`). |
| - The 60k Bradley-Terry preference pairs underlying the reward model |
| (separate split). |
| - We deliberately do NOT use the dataset's GPT-4o-generated |
| Scene/Twist/Location/Entities descriptions in the prompt, since they |
| hand-feed scene content to a vision-language model that can already |
| see the image; this makes the policy and reward model usable on any |
| single-panel cartoon, not just the curated subset. |
|
|
| ## How it fits the project |
|
|
| Part of a 2x2 ablation over training method (SFT, GRPO) and output |
| format (no thinking, thinking) for humor caption generation. See |
| `HumorR1/rm-qwen25vl-3b-nodesc` for the reward model used to train (and |
| score) this policy. |
|
|
| ## Inference |
|
|
| Backbone: `Qwen/Qwen3-VL-2B-Thinking`. |
| This repo is a merged full model; load with `transformers.AutoModelForCausalLM.from_pretrained`. |
|
|
| ```python |
| from PIL import Image |
| from transformers import AutoProcessor |
| from vllm import LLM, SamplingParams |
| |
| |
| processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Thinking", trust_remote_code=True) |
| llm = LLM(model="Qwen/Qwen3-VL-2B-Thinking", trust_remote_code=True, dtype="bfloat16", |
| max_model_len=4096) |
| |
| # Caption format: <caption>X</caption>; thinking variant prefixes <think>...</think>. |
| ``` |
|
|
| ## Reward model used during training |
|
|
| - `HumorR1/rm-qwen25vl-3b-nodesc` (held-out pairwise accuracy 0.6635). |
|
|