# SleepVLM-3B-W4A16
Quantized Version — Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model
Paper (coming soon) | GitHub | Full-Precision Version | MASS-EX Dataset | Collection
Associated Paper: Guifeng Deng, Pan Wang, Jiquan Wang, Tao Li, Haiteng Jiang. "SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model." In preparation. This repository will be made public upon release of the preprint.
## Overview
SleepVLM-3B-W4A16 is the 4-bit weight-quantized version of SleepVLM-3B, a rule-grounded vision-language model for explainable automated sleep staging from polysomnography (PSG) recordings. This quantized variant achieves 2.2x faster inference and 55% model size reduction with minimal performance degradation (kappa drop ≤1.6 pp), enabling deployment on a single consumer-grade GPU (e.g., NVIDIA RTX 4090, 24 GB).
The quantization was performed using Intel AutoRound (W4A16: 4-bit weights, 16-bit activations) on the language model layers only. The vision encoder and lm_head are retained in float16 precision.
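The group-wise W4A16 scheme (group size 128) can be illustrated with a toy NumPy sketch. This is a simplified round-to-nearest illustration, not the actual AutoRound procedure, which additionally learns the rounding decisions; the helper names here are hypothetical.

```python
import numpy as np

# Illustrative W4A16 group-wise weight quantization (round-to-nearest,
# NOT the learned-rounding AutoRound algorithm): weights are split into
# groups of 128, each group gets its own scale, values are stored as
# 4-bit integers, and activations stay in 16-bit precision.

def quantize_w4_groupwise(w, group_size=128):
    """Symmetric 4-bit quantization with one scale per group."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 128)).astype(np.float32)
q, scale = quantize_w4_groupwise(w)
w_hat = dequantize(q, scale).reshape(w.shape)
err = np.abs(w - w_hat).max()
print(f"max abs reconstruction error: {err:.4f}")
```

The per-group scale is what keeps the 4-bit error small: each block of 128 weights is normalized by its own maximum magnitude before rounding, so outliers in one group do not degrade the precision of the others.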
For full details about the SleepVLM framework and training pipeline, see the full-precision model card.
## Model Details
| Property | Value |
|---|---|
| Base model | SleepVLM-3B (fine-tuned from Qwen2.5-VL-3B-Instruct) |
| Model size | 3.2 GB (vs 7.1 GB full-precision, -54.9%) |
| Inference speed | 4.15 epoch/s (vs 1.89 epoch/s, +2.20x) |
| Precision | W4A16 (4-bit weights, 16-bit activations) |
| Quantization method | Intel AutoRound v0.9.2 |
| Quantized layers | model.language_model.layers (36 transformer blocks) |
| Non-quantized layers | Vision encoder + lm_head (float16) |
| Group size | 128 |
| Calibration samples | 5,000 (stratified by sleep stage) |
| Input | Three consecutive 30-s PSG epoch images (448 x 224 px) |
| PSG channels | F4-M1, C4-M1, O2-M1, LOC, ROC, Chin EMG |
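The derived figures in the table follow directly from the raw numbers, as a quick arithmetic check shows:

```python
# Verify the derived size-reduction and speedup columns from the
# table's raw numbers (taken directly from the model card above).
full_size_gb, quant_size_gb = 7.1, 3.2
full_speed, quant_speed = 1.89, 4.15  # 30-s epochs scored per second

size_reduction = (full_size_gb - quant_size_gb) / full_size_gb
speedup = quant_speed / full_speed
print(f"size reduction: {size_reduction:.1%}")  # ~54.9%
print(f"inference speedup: {speedup:.2f}x")     # ~2.20x
```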
## Intended Use
- Primary use: Research on explainable automated sleep staging, especially in resource-constrained settings.
- Intended users: Sleep medicine researchers, clinical informatics researchers, and AI/ML researchers working on interpretable medical AI.
- Deployment scenario: Single consumer-grade GPU inference (e.g., NVIDIA RTX 4090, 24 GB).
- Clinical note: This model is intended for research purposes. It has not been validated for clinical diagnostic use and should not replace professional sleep technologist scoring in clinical settings.
## Citation
If you use SleepVLM in your research, please cite:
```bibtex
@article{deng2026sleepvlm,
  author  = {Deng, Guifeng and Wang, Pan and Wang, Jiquan and Li, Tao and Jiang, Haiteng},
  title   = {{SleepVLM}: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model},
  journal = {}, % TODO: update after publication
  year    = {2026}
}
```
## License
This model is released under the Apache 2.0 License.
## Evaluation Results

Self-reported results on the MASS-SS1 and ZUMS test sets:

| Metric | MASS-SS1 | ZUMS |
|---|---|---|
| Accuracy | 0.827 | 0.798 |
| Macro-F1 | 0.788 | 0.751 |
| Cohen's Kappa | 0.758 | 0.727 |
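Cohen's kappa, the agreement metric reported above, corrects raw accuracy for chance agreement between the model's stages and the reference scoring. A minimal sketch of the standard computation from a confusion matrix (the 3x3 matrix here is made-up toy data, not the paper's results):

```python
# Cohen's kappa from a square confusion matrix
# (rows: reference stages, columns: predicted stages).

def cohens_kappa(cm):
    k = len(cm)
    n = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(k)) / n                  # observed agreement
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy 3-stage confusion matrix for illustration only.
cm = [[50, 5, 2],
      [4, 60, 6],
      [1, 7, 65]]
print(f"kappa = {cohens_kappa(cm):.3f}")
```

Kappa is the natural headline metric for sleep staging because stage prevalence is highly imbalanced (N2 dominates a typical night), so raw accuracy overstates agreement.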