Instructions for using XinNUS/CycleGRPO-4B with libraries and local apps.

Libraries

How to use XinNUS/CycleGRPO-4B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="XinNUS/CycleGRPO-4B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("XinNUS/CycleGRPO-4B")
model = AutoModelForMultimodalLM.from_pretrained("XinNUS/CycleGRPO-4B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use XinNUS/CycleGRPO-4B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "XinNUS/CycleGRPO-4B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "XinNUS/CycleGRPO-4B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
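
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send` are hypothetical, and it assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style response shape.

```python
import json
import urllib.request


def build_chat_request(image_url, prompt, model="XinNUS/CycleGRPO-4B",
                       base_url="http://localhost:8000"):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and parse the JSON reply; needs a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Example (assumes the server is up and replies in the OpenAI chat format):
# reply = send(build_chat_request(
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
#     "Describe this image in one sentence."))
# print(reply["choices"][0]["message"]["content"])
```

Passing a different `base_url` (for example `http://localhost:30000`) points the same helper at another OpenAI-compatible server.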

Use Docker

docker model run hf.co/XinNUS/CycleGRPO-4B

SGLang

How to use XinNUS/CycleGRPO-4B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "XinNUS/CycleGRPO-4B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "XinNUS/CycleGRPO-4B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "XinNUS/CycleGRPO-4B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "XinNUS/CycleGRPO-4B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use XinNUS/CycleGRPO-4B with Docker Model Runner:
```
docker model run hf.co/XinNUS/CycleGRPO-4B
```
Quantized versions of this model can be used in llama.cpp, Ollama, LM Studio, or any compatible app.

CycleGRPO-4B

CycleGRPO-4B is post-trained from zhouyik/Qwen3-VL-4B-SAMTok with caption ↔ grounding cycle-consistent reinforcement learning. A caption is rewarded by how well the model can ground it back to the region it describes (cycle IoU), plus ground-truth-free (GT-free) regularizers; the RL stage uses no reference-caption supervision. The model produces descriptions with interleaved segmentation masks for the corresponding parts of the answer, decoded through the SAMTok mask tokenizer.
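
To make the "cycle IoU" idea concrete, here is a minimal sketch of intersection-over-union between the region a caption was written for and the region the model grounds that caption back to. It is an illustration only: the function name `mask_iou`, the nested-list mask format, and the empty-mask convention are assumptions, and the actual reward (including the GT-free regularizers) is defined in the CycleGRPO repo.

```python
def mask_iou(pred, target):
    """IoU between two same-shaped binary masks given as nested lists of 0/1."""
    if len(pred) != len(target) or any(
        len(p) != len(t) for p, t in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")
    inter = union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks score 0 rather than dividing by zero.
    return inter / union if union else 0.0


# Region the caption was written for, and the region grounded back from it.
source_region = [[1, 1, 0],
                 [1, 1, 0],
                 [0, 0, 0]]
regrounded = [[0, 1, 1],
              [0, 1, 1],
              [0, 0, 0]]

print(mask_iou(source_region, regrounded))  # 2 overlapping pixels / 6 in the union
```

A caption that lets the model recover the original region exactly would score 1.0 under this measure.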

Code: github.com/devinxzhang/CycleGRPO

Quickstart

CycleGRPO-4B is a Qwen3-VL-4B model that emits SAMTok mask tokens (<|mt_...|>). Plain text generation works directly with 🤗 Transformers. Turning the mask tokens into segmentation masks requires the VQ-SAM2 decoder from the CycleGRPO repo (projects.transformers.vq_sam2), so clone and install the repo first:

pip install "transformers>=4.57"
git clone https://github.com/devinxzhang/CycleGRPO.git
cd CycleGRPO            # run from the repo root so `projects.transformers.vq_sam2` imports
pip install -e .

Generate (text + mask tokens)

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_id = "XinNUS/CycleGRPO-4B"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, dtype="auto", device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "figs/totoro.jpg"},
        {"type": "text", "text": "Describe the image with interleaved segmentation "
                                 "masks for the corresponding parts of the answer."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
text = processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(text)   # answer text interleaved with <|mt_start|><|mt_XXXX|><|mt_YYYY|><|mt_end|> mask tokens

Decode mask tokens → segmentation masks

The <|mt_...|> tokens are decoded to masks by the VQ-SAM2 mask tokenizer. Use the reference implementation in the CycleGRPO repo rather than re-deriving it. See evaluation/groundingsuite/qwen3vl_groundingsuite_infer.py (or evaluation/dlc_bench/inference.py); these scripts build the decoder and run the decode loop:

from projects.transformers.vq_sam2 import VQ_SAM2, VQ_SAM2Config, SAM2Config
# Those scripts also contain the `DirectResize` preprocessor, the mt-token parsing
# (extract_mt_token_ids / fix_mt_format), and the `VQ_SAM2.forward_with_codes(...)`
# decode step (codebook size 256, depth 2). Reuse them directly.

The decoder weights (mask_tokenizer_256x2.pth and sam2.1_hiera_large.pt) come from the base model Qwen3-VL-4B-SAMTok.
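
For a quick look at the generated text before running the decoder, the mask-token spans can be pulled out with a regular expression. This is a rough sketch, not the repo's `extract_mt_token_ids`: the helper names are hypothetical, it assumes each code token has the form `<|mt_` followed by decimal digits and `|>`, and the token ids in the example are made up.

```python
import re

# One span per mask: <|mt_start|> ... <|mt_end|>
SPAN_RE = re.compile(r"<\|mt_start\|>(.*?)<\|mt_end\|>", re.DOTALL)
# Code tokens inside a span; the decimal-digit form is an assumption.
CODE_RE = re.compile(r"<\|mt_(\d+)\|>")
# Any mask-related token, used to recover the plain caption.
ANY_MT_RE = re.compile(r"<\|mt_[^|]*\|>")


def parse_mask_spans(text, depth=2):
    """Return one list of integer codes per well-formed mask span.

    Spans whose number of codes differs from `depth` are skipped.
    """
    spans = []
    for match in SPAN_RE.finditer(text):
        codes = [int(c) for c in CODE_RE.findall(match.group(1))]
        if len(codes) == depth:
            spans.append(codes)
    return spans


def strip_mask_tokens(text):
    """Remove every mask token, leaving only the caption text."""
    return ANY_MT_RE.sub("", text)


sample = ("A cat<|mt_start|><|mt_0012|><|mt_0200|><|mt_end|> on a sofa"
          "<|mt_start|><|mt_0007|><|mt_end|>.")
print(parse_mask_spans(sample))   # [[12, 200]]  (the second span has one code, so it is skipped)
print(strip_mask_tokens(sample))  # A cat on a sofa.
```

The default `depth=2` mirrors the codebook depth mentioned above; turning the codes into actual masks still requires the VQ-SAM2 decoder and the weights listed here.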

License

Released under Apache-2.0. Derived from Qwen3-VL-4B-SAMTok; use is also subject to the base model's license and terms.

Model size: 5B params (Safetensors)
Tensor type: BF16

Model tree for XinNUS/CycleGRPO-4B

Base model: zhouyik/Qwen3-VL-4B-SAMTok
Finetuned (1): this model
Quantizations: 1 model