reed-meyerson committed on
Commit 4257ccf · verified · 1 Parent(s): 0e6061f

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -31,9 +31,9 @@ This model is a quantized version of [Qwen/Qwen3.5-9B](https://huggingface.co/Qw
 
 This model was obtained by quantizing the weights and activations of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) to INT8 data type, ready for inference with vLLM.
 
-This optimization reduces the model weights from 19.3 GB to 14.0 GB on disk (~27% reduction). The reduction is less than the theoretical 50% because the vision encoder remains in BF16.
+This optimization reduces the model weights from 19.3 GB to 14.0 GB on disk (~27% reduction). The reduction is less than the theoretical 50% because the vision encoder, token embeddings, and linear attention layers remain in BF16.
 
-Only the weights and activations of the linear operators within transformer blocks are quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor). The vision encoder is not quantized.
+Only the weights and activations of the linear operators within transformer blocks are quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor). The vision encoder, token embeddings, and linear attention layers are not quantized.
 
 ## Deployment
 
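
The size figures quoted in the changed README lines can be sanity-checked with a few lines of arithmetic. The sketch below is not part of the commit. It assumes that every quantized weight shrinks from 2 bytes (BF16) to exactly 1 byte (INT8) and ignores the small overhead of quantization scales; the function names and the `bytes_ratio` parameter are hypothetical. Under those assumptions it recovers the ~27% figure and estimates what share of the original checkpoint bytes must have been quantized to land at 14.0 GB.

```python
def size_reduction(before_gb: float, after_gb: float) -> float:
    """Relative on-disk reduction, e.g. 0.27 for a 27% smaller checkpoint."""
    return (before_gb - after_gb) / before_gb


def quantized_fraction(before_gb: float, after_gb: float, bytes_ratio: float = 0.5) -> float:
    """Share of the original bytes that was quantized.

    Solves  before * (1 - f) + before * f * bytes_ratio = after  for f,
    where bytes_ratio is the assumed size of a quantized weight relative
    to the original (0.5 for BF16 -> INT8, scales ignored).
    """
    return (1 - after_gb / before_gb) / (1 - bytes_ratio)


# Figures taken from the README text in the diff.
before_gb, after_gb = 19.3, 14.0

reduction = size_reduction(before_gb, after_gb)
fraction = quantized_fraction(before_gb, after_gb)

print(f"on-disk reduction:        {reduction:.1%}")
print(f"share of bytes quantized: {fraction:.1%}")
print(f"left in BF16 (estimate):  {before_gb * (1 - fraction):.1f} GB")
```

With these inputs, roughly 55% of the original bytes were quantized and the rest stayed in BF16, which is consistent with the README's explanation that the vision encoder, token embeddings, and linear attention layers are excluded from quantization.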