CycleGRPO-4B

CycleGRPO-4B is post-trained from zhouyik/Qwen3-VL-4B-SAMTok with caption ↔ grounding cycle-consistent reinforcement learning: a caption is rewarded by how well the model can ground it back to the region it describes (cycle IoU) plus GT-free regularizers — no reference-caption supervision in the RL stage. It produces descriptions with interleaved segmentation masks for the corresponding parts of the answer, decoded through the SAMTok mask tokenizer.

Code: github.com/devinxzhang/CycleGRPO

Quickstart

CycleGRPO-4B is a Qwen3-VL-4B that emits SAMTok mask tokens (<|mt_...|>). Plain text generation works with 🤗 Transformers directly; turning the mask tokens into segmentation masks needs the VQ-SAM2 decoder from the CycleGRPO repo (projects.transformers.vq_sam2), so clone and install it first:

pip install "transformers>=4.57"
git clone https://github.com/devinxzhang/CycleGRPO.git
cd CycleGRPO            # run from the repo root so `projects.transformers.vq_sam2` imports
pip install -e .

Generate (text + mask tokens)

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_id = "XinNUS/CycleGRPO-4B"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, dtype="auto", device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "figs/totoro.jpg"},
        {"type": "text", "text": "Describe the image with interleaved segmentation "
                                 "masks for the corresponding parts of the answer."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
text = processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(text)   # answer text interleaved with <|mt_start|><|mt_XXXX|><|mt_YYYY|><|mt_end|> mask tokens

Decode mask tokens → segmentation masks

The <|mt_...|> tokens are decoded to masks by the VQ-SAM2 mask tokenizer. Use the reference implementation in the CycleGRPO repo rather than re-deriving it — see evaluation/groundingsuite/qwen3vl_groundingsuite_infer.py (or evaluation/dlc_bench/inference.py), which build the decoder and run the decode loop:

from projects.transformers.vq_sam2 import VQ_SAM2, VQ_SAM2Config, SAM2Config
# Those scripts also contain the `DirectResize` preprocessor, the mt-token parsing
# (extract_mt_token_ids / fix_mt_format), and the `VQ_SAM2.forward_with_codes(...)`
# decode step (codebook size 256, depth 2). Reuse them directly.

The decoder weights — mask_tokenizer_256x2.pth and sam2.1_hiera_large.pt — come from the base model Qwen3-VL-4B-SAMTok.

License

Released under Apache-2.0. Derived from Qwen3-VL-4B-SAMTok; use is also subject to the base model's license and terms.

Downloads last month
-
Safetensors
Model size
5B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for XinNUS/CycleGRPO-4B

Finetuned
(1)
this model
Quantizations
1 model