# SleepVLM-3B-W4A16
Quantized Version — Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model
Paper (coming soon) | GitHub | Full-Precision Version | MASS-EX Dataset | Collection
Associated Paper: Guifeng Deng, Pan Wang, Jiquan Wang, Tao Li, Haiteng Jiang. "SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model." In preparation. This repository will be made public upon release of the preprint.
## Overview
SleepVLM-3B-W4A16 is the 4-bit weight-quantized version of SleepVLM-3B, a rule-grounded vision-language model for explainable automated sleep staging from polysomnography (PSG) recordings. This quantized variant achieves 2.2x faster inference and 55% model size reduction with minimal performance degradation (kappa drop ≤1.6 pp), enabling deployment on a single consumer-grade GPU (e.g., NVIDIA RTX 4090, 24 GB).
The quantization was performed using Intel AutoRound (W4A16: 4-bit weights, 16-bit activations) on the language model layers only. The vision encoder and lm_head are retained in float16 precision.
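The group-wise W4A16 scheme (group size 128) can be illustrated with a toy NumPy sketch. This is a simplified round-to-nearest illustration, not the actual AutoRound procedure, which additionally learns the rounding decisions; the helper names here are hypothetical.

```python
import numpy as np

# Illustrative W4A16 group-wise weight quantization (round-to-nearest,
# NOT the learned-rounding AutoRound algorithm): weights are split into
# groups of 128, each group gets its own scale, values are stored as
# 4-bit integers, and activations stay in 16-bit precision.

def quantize_w4_groupwise(w, group_size=128):
    """Symmetric 4-bit quantization with one scale per group."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 128)).astype(np.float32)
q, scale = quantize_w4_groupwise(w)
w_hat = dequantize(q, scale).reshape(w.shape)
err = np.abs(w - w_hat).max()
print(f"max abs reconstruction error: {err:.4f}")
```

The per-group scale is what keeps the 4-bit error small: each block of 128 weights is normalized by its own maximum magnitude before rounding, so outliers in one group do not degrade the precision of the others.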
For full details about the SleepVLM framework and training pipeline, see the full-precision model card.
## Model Details
| Property | Value |
|---|---|
| Base model | SleepVLM-3B (fine-tuned from Qwen2.5-VL-3B-Instruct) |
| Model size | 3.2 GB (vs 7.1 GB full-precision, -54.9%) |
| Inference speed | 4.15 epoch/s (vs 1.89 epoch/s, +2.20x) |
| Precision | W4A16 (4-bit weights, 16-bit activations) |
| Quantization method | Intel AutoRound v0.9.2 |
| Quantized layers | model.language_model.layers (36 transformer blocks) |
| Non-quantized layers | Vision encoder + lm_head (float16) |
| Group size | 128 |
| Calibration samples | 5,000 (stratified by sleep stage) |
| Input | Three consecutive 30-s PSG epoch images (448 x 224 px) |
| PSG channels | F4-M1, C4-M1, O2-M1, LOC, ROC, Chin EMG |
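The derived figures in the table follow directly from the raw numbers, as a quick arithmetic check shows:

```python
# Verify the derived size-reduction and speedup columns from the
# table's raw numbers (taken directly from the model card above).
full_size_gb, quant_size_gb = 7.1, 3.2
full_speed, quant_speed = 1.89, 4.15  # 30-s epochs scored per second

size_reduction = (full_size_gb - quant_size_gb) / full_size_gb
speedup = quant_speed / full_speed
print(f"size reduction: {size_reduction:.1%}")  # ~54.9%
print(f"inference speedup: {speedup:.2f}x")     # ~2.20x
```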
## Intended Use
- Primary use: Research on explainable automated sleep staging, especially in resource-constrained settings.
- Intended users: Sleep medicine researchers, clinical informatics researchers, and AI/ML researchers working on interpretable medical AI.
- Deployment scenario: Single consumer-grade GPU inference (e.g., NVIDIA RTX 4090, 24 GB).
- Clinical note: This model is intended for research purposes. It has not been validated for clinical diagnostic use and should not replace professional sleep technologist scoring in clinical settings.
## Citation
If you use SleepVLM in your research, please cite:
```bibtex
@article{deng2026sleepvlm,
  author  = {Deng, Guifeng and Wang, Pan and Wang, Jiquan and Li, Tao and Jiang, Haiteng},
  title   = {{SleepVLM}: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model},
  journal = {}, % TODO: update after publication
  year    = {2026}
}
```
## License
This model is released under the Apache 2.0 License.
## Evaluation Results

Self-reported results on the MASS-SS1 and ZUMS test sets:

| Metric | MASS-SS1 | ZUMS |
|---|---|---|
| Accuracy | 0.827 | 0.798 |
| Macro-F1 | 0.788 | 0.751 |
| Cohen's Kappa | 0.758 | 0.727 |
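Cohen's kappa, the agreement metric reported above, corrects raw accuracy for chance agreement between the model's stages and the reference scoring. A minimal sketch of the standard computation from a confusion matrix (the 3x3 matrix here is made-up toy data, not the paper's results):

```python
# Cohen's kappa from a square confusion matrix
# (rows: reference stages, columns: predicted stages).

def cohens_kappa(cm):
    k = len(cm)
    n = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(k)) / n                  # observed agreement
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy 3-stage confusion matrix for illustration only.
cm = [[50, 5, 2],
      [4, 60, 6],
      [1, 7, 65]]
print(f"kappa = {cohens_kappa(cm):.3f}")
```

Kappa is the natural headline metric for sleep staging because stage prevalence is highly imbalanced (N2 dominates a typical night), so raw accuracy overstates agreement.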