Update README.md

README.md +17 -16

@@ -14,26 +14,14 @@ tags:
 pipeline_tag: image-text-to-text
 inference: false
 license: mit
 ---
-# WIP: This FP8 Quantization does not yet work in vLLM
 # 🔥 InternVL3-8B-FP8-Dynamic: Optimized Vision-Language Model 🔥
 This is a **FP8 dynamic quantized** version of [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B), optimized for high-performance inference with vLLM.
 The model utilizes **dynamic FP8 quantization** for optimal ease of use and deployment, achieving significant speedup with minimal accuracy degradation on vision-language tasks.
-## 🚀 Key Features
-- **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
-- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
-- **vLLM Ready**: Seamless integration with vLLM for production deployment
-- **Memory Efficient**: ~50% memory reduction compared to FP16 original
-- **Performance Boost**: Significantly faster inference on H100/L40S GPUs
-- **Easy Deployment**: No calibration dataset needed for quantization
-## 📊 Model Details
-- **Original Model**: [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B)
-- **Source Model**: OpenGVLab/InternVL3-8B
-- **Quantized Model**: InternVL3-8B-FP8-Dynamic
-- **Quantization Method**: FP8 Dynamic (W8A8)
-- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.5.2.dev112+g6800f811
-- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)
 ## 🔧 Usage
 ### With vLLM (Recommended)
 ```python
@@ -52,6 +40,19 @@ response = model.generate("Describe this image: <image>", sampling_params)
 print(response[0].outputs[0].text)
 ```
 ## 🏗️ Technical Specifications
 ### Hardware Requirements
 - **Inference**: 7.8GB VRAM (+ Context)
@@ -71,4 +72,4 @@ torch==2.7.0
 vllm==0.9.1
 ```
-*Quantized with ❤️ using LLM Compressor for the open-source community*

 pipeline_tag: image-text-to-text
 inference: false
 license: mit
+base_model:
+- OpenGVLab/InternVL3-8B
 ---
 # 🔥 InternVL3-8B-FP8-Dynamic: Optimized Vision-Language Model 🔥
 This is a **FP8 dynamic quantized** version of [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B), optimized for high-performance inference with vLLM.
 The model utilizes **dynamic FP8 quantization** for optimal ease of use and deployment, achieving significant speedup with minimal accuracy degradation on vision-language tasks.
 ## 🔧 Usage
 ### With vLLM (Recommended)
 ```python
 print(response[0].outputs[0].text)
 ```
+## 🚀 Key Features
+- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
+- **vLLM Ready**: Seamless integration with vLLM for production deployment
+- **Memory Efficient**: ~50% memory reduction compared to FP16 original
+- **Performance Boost**: Significantly faster inference on H100/L40S GPUs
+## 📊 Model Details
+- **Original Model**: [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B)
+- **Source Model**: OpenGVLab/InternVL3-8B
+- **Quantized Model**: InternVL3-8B-FP8-Dynamic
+- **Quantization Method**: FP8 Dynamic (W8A8)
+- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.5.2.dev112+g6800f811
+- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)
 ## 🏗️ Technical Specifications
 ### Hardware Requirements
 - **Inference**: 7.8GB VRAM (+ Context)
 vllm==0.9.1
 ```
+*Quantized with ❤️ using LLM Compressor for the open-source community*