Instructions to use fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

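# Load model directly
# Note: the Inference Example section of this card loads the model with
# BardVLForConditionalGeneration from nemo_automodel rather than an Auto class.
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct", trust_remote_code=True, dtype="auto")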

Local Apps

vLLM

How to use fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
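
The same request can be sent from Python. The sketch below mirrors the curl body using only the standard library; the helper names `build_payload` and `chat` are illustrative and not part of vLLM or this repository, and the response parsing assumes the usual OpenAI-compatible `choices[0].message.content` layout.

```python
import json
import urllib.request

MODEL_ID = "fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct"


def build_payload(prompt: str, image_url: str, model: str = MODEL_ID) -> dict:
    """Build the same chat-completions request body as the curl example."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to a running server and return the first reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes the OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat(build_payload("Describe this image in one sentence.", "<image URL>")))
```

For a server listening on another port, such as the SGLang examples on port 30000, pass `base_url="http://localhost:30000"`.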

Use Docker

docker model run hf.co/fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct

SGLang

How to use fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct with Docker Model Runner:
```
docker model run hf.co/fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct
```

BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

Baoyou Chen¹,³ · Hanchen Xia¹ · Peng Tu¹ · Haojun Shi¹ · Liwei Zhang¹ · Weihao Yuan⁴ · Siyu Zhu¹,²,³,†

¹Shanghai Academy of AI for Science · ²Shanghai Innovation Institute · ³Fudan University · ⁴Nanjing University

🤗 Model | 🏠 Project Page | 📑 Paper | ✨ Code

Bard-VL-B4-Mask-2B-Instruct

Bard-VL-B4-Mask-2B-Instruct is a 2B-class vision-language instruction model with masked discrete-diffusion decoding.

It is part of the Bard-VL family and is designed to bridge autoregressive and diffusion-style vision-language models through Progressive Block Merging (PBM) and Stage-Wise Distillation (SWD).
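
To make the idea of a progressive block-size increase concrete, here is a minimal sketch of a stage schedule. The doubling rule and the function name `block_size_schedule` are assumptions made for illustration; the card only states that the block size grows progressively, not how.

```python
def block_size_schedule(target_block_size: int, start: int = 1) -> list[int]:
    """Return block sizes from `start` up to `target_block_size`.

    Assumption: each stage doubles the previous block size (capped at the
    target). The actual Bard-VL schedule may differ.
    """
    if start < 1 or target_block_size < start:
        raise ValueError("need 1 <= start <= target_block_size")
    sizes = [start]
    while sizes[-1] < target_block_size:
        sizes.append(min(sizes[-1] * 2, target_block_size))
    return sizes


print(block_size_schedule(4))   # [1, 2, 4]
print(block_size_schedule(32))  # [1, 2, 4, 8, 16, 32]
```

A block size of 1 corresponds to ordinary token-by-token decoding, so the first stage starts from the autoregressive regime.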

Compared with a standard autoregressive VLM, Bard-VL emphasizes:

- parallel block-wise decoding instead of token-by-token generation
- controllable response generation through block-wise denoising
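
As a rough way to compare the two decoding styles, the sketch below counts model forward passes. It assumes that each denoising step costs one forward pass and that every block needs at least one step and at most `denoising_steps` steps; these are simplifying assumptions, not measurements of Bard-VL.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Token-by-token decoding: one forward pass per generated token."""
    return num_tokens


def blockwise_pass_bounds(num_tokens: int, block_size: int, denoising_steps: int) -> tuple[int, int]:
    """Return (lower, upper) bounds on forward passes for block-wise decoding.

    Lower bound: every block finishes in a single step.
    Upper bound: every block uses all `denoising_steps` steps.
    """
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks, num_blocks * denoising_steps


print(autoregressive_passes(1024))        # 1024
print(blockwise_pass_bounds(1024, 4, 4))  # (256, 1024)
```

Under these assumptions, block-wise decoding never needs more passes than autoregressive decoding when `denoising_steps <= block_size`, and it needs fewer whenever some blocks finish early.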

✨ Highlights

- Progressive Block Merging: Bard-VL increases the decoding block size progressively instead of jumping directly from autoregressive decoding to large-block diffusion.
- Stage-Wise dVLM Distillation: Bard-VL distills from a small-block diffusion anchor in the same denoising regime, reducing autoregressive-to-diffusion transfer mismatch.
- Packed Multimodal Attention Mask: the packed attention layout reuses shared multimodal context across clean and noisy branches to reduce redundant computation.
- Mixed-Noise Training: Bard-VL combines masked-token and uniform token corruption to support both token completion and visible-token revision.
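
The mixed-noise idea can be illustrated with a small corruption routine. This is a minimal sketch, not the training code: the function name, the `noise_level` and `uniform_fraction` parameters, and the per-token independent sampling are all assumptions chosen for clarity.

```python
import random


def mixed_noise_corrupt(tokens, noise_level, uniform_fraction, vocab_size, mask_id, rng):
    """Corrupt a token sequence with a mix of mask and uniform noise.

    Each token is corrupted independently with probability `noise_level`.
    A corrupted token becomes a uniformly sampled vocabulary id with
    probability `uniform_fraction`, otherwise it becomes `mask_id`.
    Returns the corrupted sequence and a per-position label.
    """
    corrupted, kinds = [], []
    for token in tokens:
        if rng.random() >= noise_level:
            corrupted.append(token)
            kinds.append("clean")
        elif rng.random() < uniform_fraction:
            # A uniform draw may coincide with the original token by chance.
            corrupted.append(rng.randrange(vocab_size))
            kinds.append("uniform")
        else:
            corrupted.append(mask_id)
            kinds.append("mask")
    return corrupted, kinds


rng = random.Random(0)
tokens = [11, 42, 7, 99, 3, 56]
noisy, kinds = mixed_noise_corrupt(
    tokens, noise_level=0.5, uniform_fraction=0.3,
    vocab_size=100, mask_id=100, rng=rng,  # mask_id lies outside the sampled range
)
print(noisy)
print(kinds)
```

Masked positions train the model to complete missing tokens, while uniformly replaced positions look like ordinary visible tokens and so train it to revise tokens that are already present.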

🧭 Method Structure

Figure: Bard-VL method overview. Pipeline, block-wise attention mask, and mixed-noise scheduler used by Bard-VL.
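
For readers unfamiliar with block-wise attention, the sketch below builds the generic block-causal pattern: a position attends to every position in its own block and in earlier blocks. This is an assumption about the general shape only; the packed multimodal mask used by Bard-VL, with its clean and noisy branches, is more involved and is not reproduced here.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Attention is bidirectional inside a block and causal across blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in block_causal_mask(seq_len=6, block_size=2):
    print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` the pattern reduces to the ordinary causal (lower-triangular) mask of an autoregressive model.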

📊 Evaluation Results

AutoRegressive Vision-Language Models

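| Model | Parameters | MMMU_val | MMMU-Pro_standard | MME_sum | RealWorldQA | MMStar | AI2D | ChartQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-VL | 4B | 47.9 | 35.0 | 2297 | 70.5 | 56.9 | 81.0 | 80.9 |
| Qwen3-VL | 8B | 53.0 | 36.0 | 2379 | 69.5 | 59.9 | 83.5 | 84.0 |
| InternVL3.5 | 4B | 57.4 | 38.2 | 2236 | 66.7 | 65.6 | 80.6 | 86.2 |
| InternVL3.5 | 8B | 57.2 | 41.0 | 2359 | 63.1 | 66.3 | 82.1 | 87.0 |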

Diffusion Vision-Language Models

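| Model | Parameters | MMMU_val | MMMU-Pro_standard | MME_sum | RealWorldQA | MMStar | AI2D | ChartQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaDA-V | 8B | 48.8 | 35.4 | 1998 | 63.4 | 60.4 | 77.8 | 78.2 |
| Dream-VL | 7B | 51.6 | 25.0 | 2179 | 67.7 | 59.9 | 80.4 | 86.2 |
| LaviDa | 8B | 44.2 | 28.6 | 1711 | 40.3 | 47.0 | 70.1 | 64.6 |
| SDAR-VL | 8B | 44.0 | 28.2 | 2142 | 66.1 | 53.3 | 79.6 | 82.4 |
| MMaDA | 8B | 30.2 | 21.5 | 1287 | 28.2 | 25.7 | 54.9 | 43.2 |
| Dimple-VL | 7B | 46.4 | 24.1 | 1924 | 51.9 | 47.7 | 74.2 | 58.4 |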

Bard-VL Converted from Qwen3-VL

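| Model | Parameters | MMMU_val | MMMU-Pro_standard | MME_sum | RealWorldQA | MMStar | AI2D | ChartQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bard-VL (B = 32) | 2B | 42.0 | 27.9 | 2045 | 64.6 | 53.1 | 72.6 | 76.8 |
| Bard-VL (B = 32) | 4B | 53.0 | 34.2 | 2305 | 71.9 | 63.6 | 82.8 | 80.2 |
| Bard-VL (B = 32) | 8B | 54.6 | 37.6 | 2393 | 70.7 | 65.0 | 83.2 | 84.6 |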
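
To read the tables side by side, the following sketch computes per-benchmark differences between Bard-VL (B = 32) and the Qwen3-VL model of the same size. The numbers are copied from the tables above; the helper name `score_deltas` is illustrative. Note that these rows report the B = 32 configuration, and the tables list no 2B Qwen3-VL baseline, so only the 4B and 8B sizes are compared.

```python
BENCHMARKS = ["MMMU_val", "MMMU-Pro_standard", "MME_sum", "RealWorldQA", "MMStar", "AI2D", "ChartQA"]

# Scores copied from the tables above, in BENCHMARKS order.
QWEN3_VL = {
    "4B": [47.9, 35.0, 2297, 70.5, 56.9, 81.0, 80.9],
    "8B": [53.0, 36.0, 2379, 69.5, 59.9, 83.5, 84.0],
}
BARD_VL_B32 = {
    "4B": [53.0, 34.2, 2305, 71.9, 63.6, 82.8, 80.2],
    "8B": [54.6, 37.6, 2393, 70.7, 65.0, 83.2, 84.6],
}


def score_deltas(size: str) -> dict:
    """Bard-VL score minus Qwen3-VL score for each benchmark at one model size."""
    return {
        name: round(bard - base, 1)
        for name, bard, base in zip(BENCHMARKS, BARD_VL_B32[size], QWEN3_VL[size])
    }


for size in ("4B", "8B"):
    print(size, score_deltas(size))
```

Positive values mean the converted diffusion model scores higher than its autoregressive source at that size.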

🛠️ Environment

Make sure your environment matches the repository's requirements.txt:

python>=3.10
torch==2.8.0
torchvision==0.23.0
transformers==4.57.3
diffusers==0.36.0
accelerate==1.12.0
deepspeed==0.17.0

Recommended runtime settings in the local repository:

dtype = bfloat16
attn_implementation = sdpa
block_size = 4
denoising_steps = 4

🚀 Inference Example

The official inference flow is implemented in the repository's inference.py. A minimal image-understanding example aligned with that script is shown below.

import torch
from transformers import AutoProcessor

from qwen_vl_utils import process_vision_info
from nemo_automodel.components.models.bard_vl import BardVLForConditionalGeneration

model_id = "fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = BardVLForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    _attn_implementation="sdpa",
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/puzzle.jpg", "min_pixels": 256 * 256, "max_pixels": 2048 * 2048},
            {"type": "text", "text": "Please describe this image."},
        ],
    },
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages,
    return_video_kwargs=True,
    return_video_metadata=False,
    image_patch_size=processor.image_processor.patch_size,
)

batch = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=False,
    return_tensors="pt",
    **video_kwargs,
).to(device)

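# Decode with the recommended runtime settings listed in the Environment section
# (block_size=4, denoising_steps=4).
response_ids = model.generate(
    batch,
    max_new_tokens=1024,
    block_size=4,
    denoising_steps=4,
    temperature=0.0,
    top_k=0,
    top_p=1.0,
    remasking_strategy="low_confidence_dynamic",
    confidence_threshold=0.5,
    return_step_stats=False,
)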

print(processor.tokenizer.batch_decode(response_ids, skip_special_tokens=True)[0].strip())

For video understanding, replace the image message with the video example in inference.py.
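
The `remasking_strategy` and `confidence_threshold` arguments suggest confidence-based unmasking. The sketch below shows one generic form of that technique: in each denoising step, masked positions whose predicted token meets the threshold are committed, and if none qualifies the single most confident one is committed so the block always makes progress. Whether `low_confidence_dynamic` in the repository behaves exactly this way is an assumption; `unmask_step` and the toy predictions are hypothetical.

```python
MASK = None  # placeholder for a still-masked position


def unmask_step(block, predictions, confidence_threshold=0.5):
    """Run one denoising step over a block.

    block: list of tokens, with MASK at undecided positions.
    predictions: dict mapping each masked position to (token, confidence).
    Returns a new block with the selected positions filled in.
    """
    masked = [i for i, tok in enumerate(block) if tok is MASK]
    if not masked:
        return list(block)
    keep = [i for i in masked if predictions[i][1] >= confidence_threshold]
    if not keep:
        # Guarantee progress: commit the most confident masked position.
        keep = [max(masked, key=lambda i: predictions[i][1])]
    updated = list(block)
    for i in keep:
        updated[i] = predictions[i][0]
    return updated


block = [MASK, MASK, MASK, MASK]
step1 = unmask_step(block, {0: ("A", 0.9), 1: ("B", 0.4), 2: ("C", 0.6), 3: ("D", 0.1)})
print(step1)  # ['A', None, 'C', None]
step2 = unmask_step(step1, {1: ("B", 0.3), 3: ("D", 0.2)})
print(step2)  # ['A', 'B', 'C', None]
```

In this form a block can finish in fewer steps when the model is confident, which is the sense in which the number of tokens committed per step is dynamic.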

📚 Citation

@article{chen2026bard,
  title={BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation},
  author={Baoyou Chen and Hanchen Xia and Peng Tu and Haojun Shi and Liwei Zhang and Weihao Yuan and Siyu Zhu},
  journal={arXiv preprint arXiv:2604.16514},
  year={2026}
}


Safetensors · Model size: 2B params · Tensor type: BF16


Paper for fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct

BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

Paper • 2604.16514 • Published Apr 22