Update README.md
README.md
CHANGED
|
@@ -14,26 +14,14 @@ tags:
|
|
| 14 |
pipeline_tag: image-text-to-text
|
| 15 |
inference: false
|
| 16 |
license: mit
|
| 17 |
---
|
| 18 |
-
# WIP: This FP8 Quantization does not yet work in vLLM
|
| 19 |
|
| 20 |
# 🔥 InternVL3-8B-FP8-Dynamic: Optimized Vision-Language Model 🔥
|
| 21 |
This is an **FP8 dynamic quantized** version of [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B), optimized for high-performance inference with vLLM.
|
| 22 |
The model utilizes **dynamic FP8 quantization** for optimal ease of use and deployment, achieving significant speedup with minimal accuracy degradation on vision-language tasks.
|
| 23 |
-
|
| 24 |
-
- **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
|
| 25 |
-
- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
|
| 26 |
-
- **vLLM Ready**: Seamless integration with vLLM for production deployment
|
| 27 |
-
- **Memory Efficient**: ~50% memory reduction compared to FP16 original
|
| 28 |
-
- **Performance Boost**: Significantly faster inference on H100/L40S GPUs
|
| 29 |
-
- **Easy Deployment**: No calibration dataset needed for quantization
|
| 30 |
-
## 📊 Model Details
|
| 31 |
-
- **Original Model**: [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B)
|
| 32 |
-
- **Source Model**: OpenGVLab/InternVL3-8B
|
| 33 |
-
- **Quantized Model**: InternVL3-8B-FP8-Dynamic
|
| 34 |
-
- **Quantization Method**: FP8 Dynamic (W8A8)
|
| 35 |
-
- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.5.2.dev112+g6800f811
|
| 36 |
-
- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)
|
| 37 |
## 🔧 Usage
|
| 38 |
### With vLLM (Recommended)
|
| 39 |
```python
|
|
@@ -52,6 +40,19 @@ response = model.generate("Describe this image: <image>", sampling_params)
|
|
| 52 |
print(response[0].outputs[0].text)
|
| 53 |
```
|
| 54 |
|
| 55 |
## 🏗️ Technical Specifications
|
| 56 |
### Hardware Requirements
|
| 57 |
- **Inference**: 7.8GB VRAM (+ Context)
|
|
@@ -71,4 +72,4 @@ torch==2.7.0
|
|
| 71 |
vllm==0.9.1
|
| 72 |
```
|
| 73 |
|
| 74 |
-
*Quantized with ❤️ using LLM Compressor for the open-source community*
|
| 14 |
pipeline_tag: image-text-to-text
|
| 15 |
inference: false
|
| 16 |
license: mit
|
| 17 |
+
base_model:
|
| 18 |
+
- OpenGVLab/InternVL3-8B
|
| 19 |
---
|
| 20 |
|
| 21 |
# 🔥 InternVL3-8B-FP8-Dynamic: Optimized Vision-Language Model 🔥
|
| 22 |
This is an **FP8 dynamic quantized** version of [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B), optimized for high-performance inference with vLLM.
|
| 23 |
The model utilizes **dynamic FP8 quantization** for optimal ease of use and deployment, achieving significant speedup with minimal accuracy degradation on vision-language tasks.
|
| 24 |
+
|
| 25 |
## 🔧 Usage
|
| 26 |
### With vLLM (Recommended)
|
| 27 |
```python
|
|
|
|
| 40 |
print(response[0].outputs[0].text)
|
| 41 |
```
|
| 42 |
|
| 43 |
+
## 🚀 Key Features
|
| 44 |
+
- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
|
| 45 |
+
- **vLLM Ready**: Seamless integration with vLLM for production deployment
|
| 46 |
+
- **Memory Efficient**: ~50% memory reduction compared to FP16 original
|
| 47 |
+
- **Performance Boost**: Significantly faster inference on H100/L40S GPUs
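
The "~50% memory reduction" bullet above follows from storage width alone: FP16 uses 2 bytes per parameter and FP8 uses 1. A minimal back-of-the-envelope sketch, assuming the "8B" in the model name means roughly 8e9 parameters and that every weight is stored in FP8 (layers left unquantized, the KV cache, and activations are not counted):

```python
def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Bytes needed to store the weights alone (no KV cache, no activations)."""
    return num_params * bytes_per_param

# Assumption: "8B" in the model name is read as 8e9 parameters.
NUM_PARAMS = 8e9

fp16_bytes = weight_bytes(NUM_PARAMS, 2)  # FP16: 2 bytes per parameter
fp8_bytes = weight_bytes(NUM_PARAMS, 1)   # FP8: 1 byte per parameter
reduction = 1 - fp8_bytes / fp16_bytes

print(f"FP16 weights: {fp16_bytes / 1e9:.1f} GB")
print(f"FP8 weights:  {fp8_bytes / 1e9:.1f} GB")
print(f"Reduction:    {reduction:.0%}")
```

This is an estimate only; measured usage depends on which layers the quantization recipe actually converts and on the context length.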
|
| 48 |
+
## 📊 Model Details
|
| 49 |
+
- **Original Model**: [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B)
|
| 50 |
+
- **Source Model**: OpenGVLab/InternVL3-8B
|
| 51 |
+
- **Quantized Model**: InternVL3-8B-FP8-Dynamic
|
| 52 |
+
- **Quantization Method**: FP8 Dynamic (W8A8)
|
| 53 |
+
- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.5.2.dev112+g6800f811
|
| 54 |
+
- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)
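
"FP8 Dynamic (W8A8)" in the list above means 8-bit weights and 8-bit activations, with scales derived from the data at run time rather than from a calibration set. The sketch below illustrates that idea in plain Python. It is not LLM Compressor's implementation: the per-row granularity, the helper names, and the simplified rounding (subnormals and NaN are ignored) are all assumptions made for illustration.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def round_to_e4m3(v: float) -> float:
    """Round v to an E4M3-like value (1 implicit + 3 explicit mantissa bits).

    Simplification (assumption): subnormals and NaN are not handled.
    """
    if v == 0.0:
        return 0.0
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))  # saturate to the FP8 range
    m, e = math.frexp(v)        # v == m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16      # keep 4 significant binary digits
    return math.ldexp(m, e)

def quantize_dynamic(row):
    """Hypothetical helper: derive the scale from the row itself."""
    absmax = max(abs(x) for x in row)
    scale = absmax / FP8_E4M3_MAX if absmax > 0 else 1.0
    return [round_to_e4m3(x / scale) for x in row], scale

def dequantize(q, scale):
    """Map quantized values back to the original range."""
    return [x * scale for x in q]

row = [0.1, -2.5, 7.0]
q, scale = quantize_dynamic(row)
restored = dequantize(q, scale)
print(scale, q, restored)
```

Because the scale comes from the largest magnitude in each row, no calibration data is needed; the cost is a small rounding error per value (at most 1/16 relative error for the values in this sketch).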
|
| 55 |
+
|
| 56 |
## 🏗️ Technical Specifications
|
| 57 |
### Hardware Requirements
|
| 58 |
- **Inference**: 7.8GB VRAM (+ Context)
|
|
|
|
| 72 |
vllm==0.9.1
|
| 73 |
```
|
| 74 |
|
| 75 |
+
*Quantized with ❤️ using LLM Compressor for the open-source community*
|