---
language:
- en
- zh
tags:
- fp8
- quantization
- dynamic
- vision-language
- multimodal
- vLLM
- llm-compressor
- skywork_chat
- Skywork R1V
pipeline_tag: image-text-to-text
inference: false
license: mit
base_model:
- Skywork/Skywork-R1V3-38B
---

# 🔥 Skywork-R1V3-38B-FP8-Dynamic: Optimized Vision-Language Model 🔥

This is an **FP8 dynamic quantized** version of [Skywork/Skywork-R1V3-38B](https://huggingface.co/Skywork/Skywork-R1V3-38B), optimized for high-performance inference with vLLM.

The model uses **dynamic FP8 quantization**, which requires no calibration data and is therefore straightforward to deploy, achieving a significant inference speedup with minimal accuracy degradation on vision-language tasks.

## 🚀 Key Features

- **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
- **vLLM Ready**: Seamless integration with vLLM for production deployment
- **Memory Efficient**: ~50% memory reduction compared to the FP16 original
- **Performance Boost**: Significantly faster inference on H100/L40S GPUs

## 📊 Model Details

- **Original Model**: [Skywork/Skywork-R1V3-38B](https://huggingface.co/Skywork/Skywork-R1V3-38B)
- **Source Model**: Skywork/Skywork-R1V3-38B
- **Quantized Model**: Skywork-R1V3-38B-FP8-Dynamic
- **Quantization Method**: FP8 Dynamic (W8A8)
- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.6.1a20250708
- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)

## 🔧 Usage

### With vLLM (Recommended)

```python
from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM(
    model="brandonbeiler/Skywork-R1V3-38B-FP8-Dynamic",
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    limit_mm_per_prompt={"image": 20},
    trust_remote_code=True,  # required for older versions of vLLM
    max_model_len=32768,  # Decrease if you run into memory issues
    gpu_memory_utilization=0.8,  # Adjust based on your GPU memory
)

# Generate a response (see the multi-modal example below for supplying
# actual image data for the <image> placeholder)
sampling_params = SamplingParams(temperature=0.0, max_tokens=8000)  # adjust temperature as desired
response = model.generate("Describe this image: <image>", sampling_params)
print(response[0].outputs[0].text)
```

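The `<image>` placeholder only marks where image tokens are inserted; the pixels themselves are supplied through vLLM's multi-modal inputs. Here is a minimal sketch of passing an actual image (the file path is hypothetical):

```python
from PIL import Image

image = Image.open("example.jpg")  # hypothetical local image file

outputs = model.generate(
    {
        "prompt": "<image>\nDescribe this image in detail.",
        "multi_modal_data": {"image": image},
    },
    sampling_params,
)
print(outputs[0].outputs[0].text)
```
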
## 🏗️ Technical Specifications

### Hardware Requirements

- **Inference**: ~40GB VRAM for the FP8 weights alone (38B parameters at 1 byte each), plus additional VRAM for context/KV cache
- **Supported GPUs**: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism)
- **GPU Architecture**: Ada Lovelace, Hopper (for optimal FP8 performance)

### Quantization Details

- **Weights**: FP8 E4M3 with static per-channel scales
- **Activations**: FP8 E4M3 with dynamic per-token scales
- **Preserved Components**: Vision tower, embeddings, normalization layers, and the multimodal projector (`mlp1`) remain in original precision

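For reference, a recipe of this shape can be reproduced with llm-compressor's `FP8_DYNAMIC` scheme. This is a minimal sketch rather than the exact script used for this checkpoint; the `ignore` regex patterns are assumptions based on the preserved components listed above (embeddings and normalization layers are untouched anyway, since only `Linear` modules are targeted):

```python
from transformers import AutoModel, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Skywork/Skywork-R1V3-38B"

model = AutoModel.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# FP8 W8A8 on all Linear layers; skip the vision tower, the multimodal
# projector (mlp1), and the LM head (module name regexes are assumptions).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:.*vision_model.*", "re:.*mlp1.*"],
)

# Dynamic quantization needs no calibration data, so oneshot runs without a dataset.
oneshot(model=model, recipe=recipe, output_dir="Skywork-R1V3-38B-FP8-Dynamic")
tokenizer.save_pretrained("Skywork-R1V3-38B-FP8-Dynamic")
```
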
## 🔬 Package Versions

This model was created using:

```
llmcompressor==0.6.1a20250708
compressed-tensors==latest
transformers==4.52.4
torch==2.7.0
vllm==0.9.2
```

## vLLM Workaround for FP8

See: https://github.com/vllm-project/vllm/issues/19876

Currently, the Skywork chat config (https://github.com/vllm-project/vllm/blob/e8cc53af5e17205470c04f442e67f276e08623a1/vllm/transformers_utils/configs/skyworkr1v.py#L14)
is a custom config class rather than a standard `AutoConfig` from transformers, so it does not inherit the default values that `AutoConfig` supplies.
When the raw model is loaded via transformers, then quantized and saved, transformers omits default values from the saved config, leaving it missing
critical fields such as `tie_word_embeddings`. This was patched in vLLM for InternVL models (https://github.com/vllm-project/vllm/pull/19992) but
remains unresolved for Skywork and will hopefully be fixed soon.

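Until that fix lands, one workaround is to re-add the missing defaults to the saved `config.json` by hand. A minimal sketch, assuming `tie_word_embeddings` is the missing key (nested under the text sub-config on InternVL-style checkpoints) and that the quantized model is downloaded locally:

```python
import json

config_path = "Skywork-R1V3-38B-FP8-Dynamic/config.json"  # hypothetical local path

with open(config_path) as f:
    config = json.load(f)

# transformers omits default values when saving, and vLLM's custom
# Skywork config does not fall back to them; restore them explicitly.
text_config = config.get("llm_config", config)  # text sub-config if present
text_config.setdefault("tie_word_embeddings", False)

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```
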
## vLLM Reasoning Parsing Issues

See: https://github.com/vllm-project/vllm/pull/21041

See: https://github.com/SkyworkAI/Skywork-R1V/issues/42

Because Skywork models do not represent `<think>` and `</think>` as single tokens in the tokenizer, vLLM struggles to parse out the reasoning. Additionally,
the Skywork chat template ends its generation prompt with `'<|im_start|>assistant\n<think>\n'`, which already includes the opening `<think>` tag, so your
generation output may not contain the first `<think>` at all and only emit `</think>`. There is ongoing work to add a string-based reasoning parser to vLLM
that can match the `<think></think>` delimiters as strings (multi-token sequences) as a workaround for this issue.

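Until that parser ships, the reasoning can be separated client-side. A minimal sketch that tolerates the missing opening tag:

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split Skywork R1V output into (reasoning, answer).

    The chat template already injects the opening <think> tag, so the
    generated text often contains only the closing </think> marker.
    """
    reasoning, sep, answer = text.partition("</think>")
    if not sep:
        # No closing marker found: treat the entire output as the answer.
        return "", text.strip()
    return reasoning.removeprefix("<think>").strip(), answer.strip()
```
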
The Skywork team has mentioned that they will use a single-token `<think>` in the next model version, so this won't be an issue going forward.

*Quantized with ❤️ using LLM Compressor for the open-source community*