Instructions to use HumorR1/policy-e1a-sft-no-thinking with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use HumorR1/policy-e1a-sft-no-thinking with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-VL-2B-Instruct") model = PeftModel.from_pretrained(base_model, "HumorR1/policy-e1a-sft-no-thinking") - Notebooks
- Google Colab
- Kaggle
| license: apache-2.0 | |
| base_model: Qwen/Qwen3-VL-2B-Instruct | |
| library_name: peft | |
| tags: | |
| - vision-language | |
| - new-yorker | |
| - humor | |
| - rlhf | |
| - sft-no-thinking | |
| datasets: | |
| - yguooo/newyorker_caption_ranking | |
| language: | |
| - en | |
| # humor-r1 — SFT, no thinking (Qwen3-VL-2B-Instruct + LoRA) (E1a) | |
| LoRA-adapted Qwen3-VL-2B-Instruct supervised fine-tuned on the chosen captions of 271 New Yorker contests. The model emits captions directly inside `<caption>...</caption>` tags, with no chain-of-thought. | |
| ## Training data | |
| - 271 New Yorker contests, top-rated caption per contest | |
| (`yguooo/newyorker_caption_ranking`). | |
| - The 60k Bradley-Terry preference pairs underlying the reward model | |
| (separate split). | |
| - We deliberately do NOT use the dataset's GPT-4o-generated | |
| Scene/Twist/Location/Entities descriptions in the prompt, since they | |
| hand-feed scene content to a vision-language model that can already | |
| see the image; this makes the policy and reward model usable on any | |
| single-panel cartoon, not just the curated subset. | |
| ## How it fits the project | |
| Part of a 2x2 ablation over training method (SFT, GRPO) and output | |
| format (no thinking, thinking) for humor caption generation. See | |
| `HumorR1/rm-qwen25vl-3b-nodesc` for the reward model used to train (and | |
| score) this policy. | |
| ## Inference | |
| Backbone: `Qwen/Qwen3-VL-2B-Instruct`. | |
| This repo is a LoRA adapter; load with `peft.PeftModel.from_pretrained`. | |
| ```python | |
| from PIL import Image | |
| from transformers import AutoProcessor | |
| from vllm import LLM, SamplingParams | |
| from vllm.lora.request import LoRARequest | |
| processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct", trust_remote_code=True) | |
| llm = LLM(model="Qwen/Qwen3-VL-2B-Instruct", trust_remote_code=True, dtype="bfloat16", | |
| enable_lora=True, max_lora_rank=32, max_model_len=4096) | |
| # Caption format: <caption>X</caption>; thinking variant prefixes <think>...</think>. | |
| ``` | |
| ## Reward model used during training | |
| - `HumorR1/rm-qwen25vl-3b-nodesc` (held-out pairwise accuracy 0.6635). | |