@@ -16,54 +16,81 @@ pipeline_tag: image-text-to-text
 inference: false
 license: mit
 ---
-# 🔥 InternVL3_5-38B-FP8-Dynamic 🔥
-This is a **fp8 dynamic (w8a8)** version of [OpenGVLab/InternVL3_5-38B](https://huggingface.co/OpenGVLab/InternVL3_5-38B), optimized for high-performance inference with vLLM.
-The model utilizes **fp8 dynamic (w8a8)** for optimal performance and deployment.
-## 🚀 Key Features
-- **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
-- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
-- **vLLM Ready**: Seamless integration with vLLM for production deployment
-- **Memory Efficient**: ~50% memory reduction compared to FP16 original
-- **Performance Boost**: Significant faster inference on H100/L40S GPUs
-## 📊 Model Details
-- **Original Model**: [OpenGVLab/InternVL3_5-38B](https://huggingface.co/OpenGVLab/InternVL3_5-38B)
-- **Source Model**: OpenGVLab/InternVL3_5-38B
-- **Quantized Model**: InternVL3_5-38B-FP8-Dynamic
-- **Quantization Method**: FP8 Dynamic (W8A8)
-- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1
-- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)
-## 🔧 Usage
-### With vLLM (Recommended)
 ```python
 from vllm import LLM, SamplingParams
 # Load the quantized model
 model = LLM(
     model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
     trust_remote_code=True,
-    max_model_len=32768, # internvl 3.5 is 32k max context
-    tensor_parallel_size=1,  # Adjust based on your GPU setup
 )
-# Generate response
-sampling_params = SamplingParams(temperature=0.6, max_tokens=512) # internvl 3.5 recommends temp 0.6, especially for thinking mode
-response = model.generate("Describe this image: <image>", sampling_params)
 print(response[0].outputs[0].text)
 ```
-## 🏗️ Technical Specifications
 ### Hardware Requirements
-- **Inference**: 47GB VRAM (+ VRAM for context)
-   - 10k token context: ~1.3GB
-   - 32k token context: ~4GB
-   - 32k token context + fp8 kv cache: ~2GB
-- **Supported GPUs**: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism)
-- **GPU Architecture**: Latest NVIDIA GPUs (Ada Lovelace, Hopper and later) and latest AMD GPUs. Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell)
 ### Quantization Details
-- **Weights**: FP8 E4M3 with dynamic per-tensor scales
-- **Activations**: FP8 E4M3 with dynamic per-tensor scales
-- **Preserved Components**: Vision tower, embeddings, mlp1
-## 🔬 Package Versions
-This model was created using:
 ```
 llmcompressor==0.7.1
 compressed-tensors==latest
@@ -72,4 +99,4 @@ torch==2.7.1
 vllm==0.10.1.1
 ```
-*Quantized with ❤️ using LLM Compressor for the open-source community*

 inference: false
 license: mit
 ---
+# InternVL3.5 38B FP8
+This is an FP8 dynamically quantized (W8A8) version of `OpenGVLab/InternVL3_5-38B`, optimized for high-performance inference with vLLM.
+The quantization recipe leaves the vision components unquantized to preserve the model's visual understanding while reducing the memory footprint by roughly 40%.
+## Key Features
+*   **Calibration-Free FP8:** Dynamic W8A8 quantization. Weights are pre-quantized, and activations are quantized on the fly.
+*   **Vision-Language Optimized:** The vision tower, embeddings, and the `mlp1` vision-language projector are left unquantized to maintain quality on vision-language tasks.
+*   **vLLM Ready:** Designed for seamless integration with vLLM for high-throughput serving.
+*   **Memory Efficient:** ~40% memory reduction compared to the original FP16 model.
+*   **Performance Boost:** Accelerated inference on FP8-compatible hardware (e.g., NVIDIA H100, L40S).
+## Model Details
+| Attribute | Value |
+| :--- | :--- |
+| **Original Model** | [OpenGVLab/InternVL3_5-38B](https://huggingface.co/OpenGVLab/InternVL3_5-38B) |
+| **Quantized Model** | `brandonbeiler/InternVL3_5-38B-FP8-Dynamic` |
+| **Quantization Method** | FP8 Dynamic (W8A8) |
+| **Quantization Library** | [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1 |
+| **Quantized By** | [brandonbeiler](https://huggingface.co/brandonbeiler) |
+## Usage with vLLM
+The following snippet demonstrates inference using the vLLM library.
 ```python
 from vllm import LLM, SamplingParams
 # Load the quantized model
+# trust_remote_code is required to load the custom model architecture.
 model = LLM(
     model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
     trust_remote_code=True,
+    max_model_len=32768,          # InternVL 3.5 supports a 32k context length.
+    tensor_parallel_size=1,       # Adjust for your hardware setup.
 )
+# Set sampling parameters
+# A temperature of 0.6 is recommended for this model.
+sampling_params = SamplingParams(temperature=0.6, max_tokens=512)
+# Generate a response
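+# Note: "<image>" is a placeholder that marks where the image goes in the prompt.
+# The image data itself must be supplied separately (e.g. through vLLM's multi-modal inputs).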
+prompt = "Describe this image: <image>"
+response = model.generate(prompt, sampling_params)
 print(response[0].outputs[0].text)
 ```
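The snippet above sends text only, so no image reaches the model. The following is a minimal sketch of how a request carrying an image could be assembled, assuming vLLM's dictionary prompt format with a `multi_modal_data` field. `build_image_request` is a hypothetical helper, `photo.jpg` is a placeholder path, and the bare `<image>` prompt layout is taken from the snippet above rather than from the model's chat template, so it may need adjusting.

```python
from typing import Any, Dict


def build_image_request(question: str, image: Any) -> Dict[str, Any]:
    """Bundle a prompt and an image into one request dict (hypothetical helper).

    `image` is expected to be a PIL.Image.Image when handed to vLLM; it is
    typed as Any here so the sketch has no third-party dependencies.
    """
    return {
        # The <image> placeholder stays in the prompt text.
        "prompt": f"<image>\n{question}",
        # The image data travels alongside the prompt.
        "multi_modal_data": {"image": image},
    }


# Usage with `model` and `sampling_params` from the snippet above (not run here):
#   from PIL import Image
#   request = build_image_request("Describe this image.", Image.open("photo.jpg"))
#   response = model.generate(request, sampling_params)
#   print(response[0].outputs[0].text)
```

Keeping the request construction in a small function makes it easy to check the prompt layout without loading the 38B model.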
+## Technical Specifications
 ### Hardware Requirements
+*   **Base VRAM:** ~47GB (for model weights)
+*   **Context VRAM:**
+    *   \+ ~1.3GB for 10k token context
+    *   \+ ~2GB for 32k token context with FP8 KV cache
+*   **Recommended GPUs:** NVIDIA H100, L40S
+*   **Supported GPUs:** NVIDIA A100 (80GB), 2x RTX 4090 (with tensor parallelism), latest AMD GPUs.
+*   **Optimal Performance:** NVIDIA GPUs with Compute Capability >= 9.0 (Hopper, Blackwell).
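As a rough planning aid, the figures above can be combined into a VRAM budget. This sketch only adds up the numbers quoted in this section; the 16-bit baseline is an assumption derived from the model name (38e9 parameters at 2 bytes each), not a measured value.

```python
# Figures quoted in the hardware requirements above, in GB.
BASE_VRAM_GB = 47.0
CONTEXT_VRAM_GB = {
    "10k": 1.3,          # 10k token context
    "32k_fp8_kv": 2.0,   # 32k token context with FP8 KV cache
}

# Assumption: 38e9 parameters (from the model name) stored at 2 bytes each.
FP16_WEIGHTS_GB = 38e9 * 2 / 1e9


def total_vram_gb(context: str) -> float:
    """Base weights plus the context overhead for the chosen setting."""
    return BASE_VRAM_GB + CONTEXT_VRAM_GB[context]


def weight_memory_reduction() -> float:
    """Fraction of weight memory saved relative to the assumed 16-bit baseline."""
    return 1.0 - BASE_VRAM_GB / FP16_WEIGHTS_GB


for name in CONTEXT_VRAM_GB:
    print(f"{name}: {total_vram_gb(name):.1f} GB")
print(f"weight memory reduction: {weight_memory_reduction():.0%}")
```

Activation memory and framework overhead are not included, so real usage will be somewhat higher than these totals.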
 ### Quantization Details
+*   **Weights:** FP8 E4M3 with per-tensor scales.
+*   **Activations:** Dynamically quantized to FP8 E4M3 with per-tensor scales.
+*   **Preserved Modules (Unquantized):** Vision tower, embeddings, and the vision-language projector (`mlp1`).
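To illustrate what "FP8 E4M3 with per-tensor scales" means numerically, here is a minimal pure-Python sketch. It follows the per-tensor description above and is not the code path used by LLM Compressor or vLLM, which run fused GPU kernels. The function names are hypothetical. The E4M3 properties used (3 mantissa bits, largest finite value 448, smallest normal exponent -6) are those of the common `e4m3fn` variant.

```python
import math
from typing import List, Tuple

FP8_E4M3_MAX = 448.0  # largest finite value in E4M3 (fn variant)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binary exponent is e - 1.
    exponent = max(math.frexp(mag)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_per_tensor(values: List[float]) -> Tuple[List[float], float]:
    """Compute one scale for the whole tensor and quantize every element with it."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized: List[float], scale: float) -> List[float]:
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


# "Dynamic" means the scale is computed from the tensor at run time,
# as done here, rather than from a calibration dataset.
activations = [0.5, -2.0, 1.0, 0.013]
q, scale = quantize_per_tensor(activations)
print(q, scale, dequantize(q, scale))
```

With a single scale per tensor, the largest-magnitude element maps to the edge of the FP8 range, and smaller elements share the remaining precision.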
+## Package Versions
+This model was quantized using the following environment:
 ```
 llmcompressor==0.7.1
 compressed-tensors==latest
 vllm==0.10.1.1
 ```
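To compare a local environment against the pinned versions, a small check along these lines may help. It is a sketch using the standard library's `importlib.metadata`. The helper names are hypothetical, and `compressed-tensors` is left out because the list above gives no concrete version for it.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# Concrete pins taken from the list above.
EXPECTED_VERSIONS = {
    "llmcompressor": "0.7.1",
    "vllm": "0.10.1.1",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_mismatches(
    expected: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatching package to (expected version, installed version)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


for package, (wanted, found) in report_mismatches(EXPECTED_VERSIONS).items():
    print(f"{package}: expected {wanted}, found {found}")
```

A mismatch does not necessarily mean the model will fail to load; the pins record the environment used for quantization.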
+*Quantized with ❤️ using [LLM Compressor](https://github.com/vllm-project/llm-compressor) for the open-source community.*