---
language:
- en
- zh
tags:
- fp8
- quantization
- dynamic
- vision-language
- multimodal
- vLLM
- llm-compressor
- skywork_chat
- Skywork R1V
pipeline_tag: image-text-to-text
inference: false
license: mit
base_model:
- Skywork/Skywork-R1V3-38B
---

# 🔥 Skywork-R1V3-38B-FP8-Dynamic: Optimized Vision-Language Model 🔥

This is an **FP8 dynamic quantized** version of [Skywork/Skywork-R1V3-38B](https://huggingface.co/Skywork/Skywork-R1V3-38B), optimized for high-performance inference with vLLM.

The model uses **dynamic FP8 quantization**, which requires no calibration data and is therefore straightforward to deploy, achieving a significant inference speedup with minimal accuracy degradation on vision-language tasks.

## 🚀 Key Features

- **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
- **vLLM Ready**: Seamless integration with vLLM for production deployment
- **Memory Efficient**: ~50% memory reduction compared to the FP16 original
- **Performance Boost**: Significantly faster inference on H100/L40S GPUs

## 📊 Model Details

- **Original Model**: [Skywork/Skywork-R1V3-38B](https://huggingface.co/Skywork/Skywork-R1V3-38B)
- **Source Model**: Skywork/Skywork-R1V3-38B
- **Quantized Model**: Skywork-R1V3-38B-FP8-Dynamic
- **Quantization Method**: FP8 Dynamic (W8A8)
- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.6.1a20250708
- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)

## 🔧 Usage

### With vLLM (Recommended)

```python
from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM(
    model="brandonbeiler/Skywork-R1V3-38B-FP8-Dynamic",
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    limit_mm_per_prompt={"image": 20},
    trust_remote_code=True,  # required for older versions of vLLM
    max_model_len=32768,  # Decrease if you run into memory issues
    gpu_memory_utilization=0.8,  # Adjust based on your GPU memory
)

# Generate a response (see the multi-modal example below for supplying
# actual image data for the <image> placeholder)
sampling_params = SamplingParams(temperature=0.0, max_tokens=8000)  # adjust temperature as desired
response = model.generate("Describe this image: <image>", sampling_params)
print(response[0].outputs[0].text)
```

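The `<image>` placeholder only marks where image tokens are inserted; the pixels themselves are supplied through vLLM's multi-modal inputs. Here is a minimal sketch of passing an actual image (the file path is hypothetical):

```python
from PIL import Image

image = Image.open("example.jpg")  # hypothetical local image file

outputs = model.generate(
    {
        "prompt": "<image>\nDescribe this image in detail.",
        "multi_modal_data": {"image": image},
    },
    sampling_params,
)
print(outputs[0].outputs[0].text)
```
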
## 🏗️ Technical Specifications

### Hardware Requirements

- **Inference**: ~40GB VRAM for the FP8 weights alone (38B parameters at 1 byte each), plus additional VRAM for context/KV cache
- **Supported GPUs**: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism)
- **GPU Architecture**: Ada Lovelace, Hopper (for optimal FP8 performance)

### Quantization Details

- **Weights**: FP8 E4M3 with static per-channel scales
- **Activations**: FP8 E4M3 with dynamic per-token scales
- **Preserved Components**: Vision tower, embeddings, normalization layers, and the multimodal projector (`mlp1`) remain in original precision

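For reference, a recipe of this shape can be reproduced with llm-compressor's `FP8_DYNAMIC` scheme. This is a minimal sketch rather than the exact script used for this checkpoint; the `ignore` regex patterns are assumptions based on the preserved components listed above (embeddings and normalization layers are untouched anyway, since only `Linear` modules are targeted):

```python
from transformers import AutoModel, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Skywork/Skywork-R1V3-38B"

model = AutoModel.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# FP8 W8A8 on all Linear layers; skip the vision tower, the multimodal
# projector (mlp1), and the LM head (module name regexes are assumptions).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:.*vision_model.*", "re:.*mlp1.*"],
)

# Dynamic quantization needs no calibration data, so oneshot runs without a dataset.
oneshot(model=model, recipe=recipe, output_dir="Skywork-R1V3-38B-FP8-Dynamic")
tokenizer.save_pretrained("Skywork-R1V3-38B-FP8-Dynamic")
```
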
## 🔬 Package Versions

This model was created using:

```
llmcompressor==0.6.1a20250708
compressed-tensors==latest
transformers==4.52.4
torch==2.7.0
vllm==0.9.2
```

## vLLM Workaround for FP8

See: https://github.com/vllm-project/vllm/issues/19876

Currently, the Skywork chat config (https://github.com/vllm-project/vllm/blob/e8cc53af5e17205470c04f442e67f276e08623a1/vllm/transformers_utils/configs/skyworkr1v.py#L14)
is a custom config class rather than a standard `AutoConfig` from transformers, so it does not inherit the default values that `AutoConfig` supplies.
When the raw model is loaded via transformers, then quantized and saved, transformers omits default values from the saved config, leaving it missing
critical fields such as `tie_word_embeddings`. This was patched in vLLM for InternVL models (https://github.com/vllm-project/vllm/pull/19992) but
remains unresolved for Skywork and will hopefully be fixed soon.

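Until that fix lands, one workaround is to re-add the missing defaults to the saved `config.json` by hand. A minimal sketch, assuming `tie_word_embeddings` is the missing key (nested under the text sub-config on InternVL-style checkpoints) and that the quantized model is downloaded locally:

```python
import json

config_path = "Skywork-R1V3-38B-FP8-Dynamic/config.json"  # hypothetical local path

with open(config_path) as f:
    config = json.load(f)

# transformers omits default values when saving, and vLLM's custom
# Skywork config does not fall back to them; restore them explicitly.
text_config = config.get("llm_config", config)  # text sub-config if present
text_config.setdefault("tie_word_embeddings", False)

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```
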
## vLLM Reasoning Parsing Issues

See: https://github.com/vllm-project/vllm/pull/21041

See: https://github.com/SkyworkAI/Skywork-R1V/issues/42

Because Skywork models do not represent `<think>` and `</think>` as single tokens in the tokenizer, vLLM struggles to parse out the reasoning. Additionally,
the Skywork chat template ends its generation prompt with `'<|im_start|>assistant\n<think>\n'`, which already includes the opening `<think>` tag, so your
generation output may not contain the first `<think>` at all and only emit `</think>`. There is ongoing work to add a string-based reasoning parser to vLLM
that can match the `<think></think>` delimiters as strings (multi-token sequences) as a workaround for this issue.

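Until that parser ships, the reasoning can be separated client-side. A minimal sketch that tolerates the missing opening tag:

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split Skywork R1V output into (reasoning, answer).

    The chat template already injects the opening <think> tag, so the
    generated text often contains only the closing </think> marker.
    """
    reasoning, sep, answer = text.partition("</think>")
    if not sep:
        # No closing marker found: treat the entire output as the answer.
        return "", text.strip()
    return reasoning.removeprefix("<think>").strip(), answer.strip()
```
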
The Skywork team has mentioned that they will use a single-token `<think>` in the next model version, so this won't be an issue going forward.

*Quantized with ❤️ using LLM Compressor for the open-source community*