brandonbeiler committed · Commit e6e579b · verified · 1 Parent(s): 9485dee

Update README.md

Files changed (1)
  1. README.md +63 -36
README.md CHANGED
@@ -16,54 +16,81 @@ pipeline_tag: image-text-to-text
16
  inference: false
17
  license: mit
18
  ---
19
- # 🔥 InternVL3_5-38B-FP8-Dynamic 🔥
20
- This is a **fp8 dynamic (w8a8)** version of [OpenGVLab/InternVL3_5-38B](https://huggingface.co/OpenGVLab/InternVL3_5-38B), optimized for high-performance inference with vLLM.
21
- The model utilizes **fp8 dynamic (w8a8)** for optimal performance and deployment.
22
- ## 🚀 Key Features
23
- - **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
24
- - **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
25
- - **vLLM Ready**: Seamless integration with vLLM for production deployment
26
- - **Memory Efficient**: ~50% memory reduction compared to FP16 original
27
- - **Performance Boost**: Significant faster inference on H100/L40S GPUs
28
- ## 📊 Model Details
29
- - **Original Model**: [OpenGVLab/InternVL3_5-38B](https://huggingface.co/OpenGVLab/InternVL3_5-38B)
30
- - **Source Model**: OpenGVLab/InternVL3_5-38B
31
- - **Quantized Model**: InternVL3_5-38B-FP8-Dynamic
32
- - **Quantization Method**: FP8 Dynamic (W8A8)
33
- - **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1
34
- - **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)
35
- ## 🔧 Usage
36
- ### With vLLM (Recommended)
 
 
 
 
 
 
 
 
 
 
 
37
  ```python
38
  from vllm import LLM, SamplingParams
39
 
40
  # Load the quantized model
 
41
  model = LLM(
42
  model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
43
  trust_remote_code=True,
44
- max_model_len=32768, # internvl 3.5 is 32k max context
45
- tensor_parallel_size=1, # Adjust based on your GPU setup
46
  )
47
- # Generate response
48
- sampling_params = SamplingParams(temperature=0.6, max_tokens=512) # internvl 3.5 recommends temp 0.6, especially for thinking mode
49
- response = model.generate("Describe this image: <image>", sampling_params)
 
 
 
 
 
 
 
50
  print(response[0].outputs[0].text)
51
  ```
52
 
53
- ## 🏗️ Technical Specifications
 
54
  ### Hardware Requirements
55
- - **Inference**: 47GB VRAM (+ VRAM for context)
56
- - 10k token context: ~1.3GB
57
- - 32k token context: ~4GB
58
- - 32k token context + fp8 kv cache: ~2GB
59
- - **Supported GPUs**: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism)
60
- - **GPU Architecture**: Latest NVIDIA GPUs (Ada Lovelace, Hopper and later) and latest AMD GPUs. Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell)
 
 
 
61
  ### Quantization Details
62
- - **Weights**: FP8 E4M3 with dynamic per-tensor scales
63
- - **Activations**: FP8 E4M3 with dynamic per-tensor scales
64
- - **Preserved Components**: Vision tower, embeddings, mlp1
65
- ## 🔬 Package Versions
66
- This model was created using:
 
 
 
 
67
  ```
68
  llmcompressor==0.7.1
69
  compressed-tensors==latest
@@ -72,4 +99,4 @@ torch==2.7.1
72
  vllm==0.10.1.1
73
  ```
74
 
75
- *Quantized with ❤️ using LLM Compressor for the open-source community*
 
16
  inference: false
17
  license: mit
18
  ---
19
+
20
+ # InternVL3.5 38B FP8
21
+
22
+ This is an FP8 dynamically quantized (W8A8) version of `OpenGVLab/InternVL3_5-38B`, optimized for high-performance inference with *vLLM*.
23
+
24
+ The quantization process uses a specialized recipe that preserves the model's core visual understanding capabilities while reducing the memory footprint by roughly 40%.
25
+
26
+ ## Key Features
27
+
28
+ * **Calibration-Free FP8:** Dynamic W8A8 quantization. Weights are pre-quantized, and activations are quantized on the fly.
29
+ * **Vision-Language Optimized:** The vision tower, embeddings, and the `mlp1` vision-to-language projector are kept in their original, unquantized precision to maintain high performance on vision-language tasks.
30
+ * **vLLM Ready:** Designed for seamless integration with vLLM for high-throughput serving.
31
+ * **Memory Efficient:** ~40% memory reduction compared to the original FP16 model.
32
+ * **Performance Boost:** Accelerated inference on FP8-compatible hardware (e.g., NVIDIA H100, L40S).
33
+
34
+ ## Model Details
35
+
36
+ | Attribute | Value |
37
+ | :--- | :--- |
38
+ | **Original Model** | [OpenGVLab/InternVL3_5-38B](https://huggingface.co/OpenGVLab/InternVL3_5-38B) |
39
+ | **Quantized Model** | `brandonbeiler/InternVL3_5-38B-FP8-Dynamic` |
40
+ | **Quantization Method** | FP8 Dynamic (W8A8) |
41
+ | **Quantization Library** | [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1 |
42
+ | **Quantized By** | [brandonbeiler](https://huggingface.co/brandonbeiler) |
43
+
44
+ ## Usage with vLLM
45
+
46
+ The following snippet demonstrates inference using the vLLM library.
47
+
48
  ```python
49
  from vllm import LLM, SamplingParams
50
 
51
  # Load the quantized model
52
+ # trust_remote_code is required to load the custom model architecture.
53
  model = LLM(
54
  model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
55
  trust_remote_code=True,
56
+ max_model_len=32768, # InternVL 3.5 supports a 32k context length.
57
+ tensor_parallel_size=1, # Adjust for your hardware setup.
58
  )
59
+
60
+ # Set sampling parameters
61
+ # A temperature of 0.6 is recommended for this model.
62
+ sampling_params = SamplingParams(temperature=0.6, max_tokens=512)
63
+
64
+ # Generate a response
65
+ # Note: "<image>" is a placeholder token; the image data itself must be passed to vLLM separately.
66
+ prompt = "Describe this image: <image>"
67
+ response = model.generate(prompt, sampling_params)
68
+
69
  print(response[0].outputs[0].text)
70
  ```
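
The snippet above sends only a text prompt, so the `<image>` placeholder has no image attached to it. A minimal sketch of how an image could be supplied, assuming vLLM's dictionary prompt format with a `multi_modal_data` entry, is shown below. The helper names `build_image_request` and `describe_image` are hypothetical, and the raw prompt skips the model's chat template, which a real deployment would likely want to apply.

```python
from typing import Any


def build_image_request(question: str, image: Any) -> dict:
    """Pair a prompt containing one <image> placeholder with the image itself."""
    prompt = f"<image>\n{question}"
    return {"prompt": prompt, "multi_modal_data": {"image": image}}


def describe_image(image_path: str) -> str:
    """Run one image-description request (needs vLLM, Pillow, and a suitable GPU)."""
    # Imported lazily so build_image_request can be used without the GPU stack.
    from PIL import Image
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
        trust_remote_code=True,
        max_model_len=32768,
    )
    image = Image.open(image_path).convert("RGB")
    request = build_image_request("Describe this image.", image)
    outputs = llm.generate(request, SamplingParams(temperature=0.6, max_tokens=512))
    return outputs[0].outputs[0].text
```

Calling `describe_image("photo.jpg")` (the file name is a placeholder) would load the model and return the generated description. Nothing heavy is imported until that function is called.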
71
 
72
+ ## Technical Specifications
73
+
74
  ### Hardware Requirements
75
+
76
+ * **Base VRAM:** ~47GB (for model weights)
77
+ * **Context VRAM:**
78
+ * \+ ~1.3GB for 10k token context
79
+ * \+ ~2GB for 32k token context with FP8 KV cache
80
+ * **Recommended GPUs:** NVIDIA H100, L40S
81
+ * **Supported GPUs:** NVIDIA A100 (80GB), 2x RTX 4090 (with tensor parallelism), latest AMD GPUs.
82
+ * **Optimal Performance:** NVIDIA GPUs with Compute Capability >= 9.0 (Hopper, Blackwell).
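
As a rough planning aid, the figures above can be turned into a back-of-the-envelope estimate. The sketch below assumes that KV-cache memory grows linearly with context length and that an FP8 KV cache halves it. Both are assumptions inferred from the numbers in this README, not measured guarantees. The function name `estimate_vram_gb` is hypothetical, and the estimate ignores activation memory and runtime overhead.

```python
BASE_WEIGHTS_GB = 47.0      # "Base VRAM" figure quoted in the README
KV_GB_PER_10K_TOKENS = 1.3  # "~1.3GB for 10k token context"


def estimate_vram_gb(context_tokens: int, fp8_kv_cache: bool = False) -> float:
    """Estimate VRAM as model weights plus a linearly scaled KV cache."""
    kv_gb = KV_GB_PER_10K_TOKENS * context_tokens / 10_000
    if fp8_kv_cache:
        # Assumption: an 8-bit KV cache takes half the space of a 16-bit one.
        kv_gb /= 2
    return BASE_WEIGHTS_GB + kv_gb


if __name__ == "__main__":
    for tokens, fp8 in [(10_000, False), (32_768, False), (32_768, True)]:
        print(f"{tokens} tokens, fp8_kv={fp8}: {estimate_vram_gb(tokens, fp8):.1f} GB")
```

Under these assumptions a full 32k context adds about 4GB with a 16-bit KV cache and about 2GB with an FP8 KV cache, which lines up with the figures quoted in both versions of the README.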
83
+
84
  ### Quantization Details
85
+
86
+ * **Weights:** FP8 E4M3 with per-tensor scales.
87
+ * **Activations:** Dynamically quantized to FP8 E4M3 with per-tensor scales.
88
+ * **Preserved Modules (Unquantized):** Vision tower, embeddings, and the `mlp1` projector between the vision tower and the language model.
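
To make "dynamic per-tensor FP8" concrete, here is a small pure-Python sketch of the idea: take the largest absolute value in a tensor, derive one scale so that value maps to the E4M3 maximum of 448, then round each scaled value to the nearest representable E4M3 number. This illustrates the arithmetic only. It is not the LLM Compressor or vLLM implementation, and all function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(value: float) -> float:
    """Round a value to the nearest number representable in FP8 E4M3."""
    if value == 0.0:
        return 0.0
    clipped = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value))
    # frexp gives |clipped| = m * 2**e with 0.5 <= m < 1, so floor(log2) is e - 1.
    # -6 is the smallest normal exponent; below it the spacing stays fixed (subnormals).
    exponent = max(math.frexp(abs(clipped))[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits give 8 steps per power of two
    return round(clipped / step) * step


def quantize_per_tensor(values):
    """Return (quantized values, scale) using one scale for the whole tensor."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    activations = [0.5, -2.0, 4.0, 0.013]
    q, s = quantize_per_tensor(activations)
    print(q, s, dequantize(q, s))
```

"Dynamic" here means the scale for activations is recomputed from the live tensor at inference time, which is why no calibration dataset is needed. Weight scales can be computed once, when the checkpoint is created.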
89
+
90
+ ## Package Versions
91
+
92
+ This model was quantized using the following environment:
93
+
94
  ```
95
  llmcompressor==0.7.1
96
  compressed-tensors==latest
 
99
  vllm==0.10.1.1
100
  ```
101
 
102
+ *Quantized with ❤️ using [LLM Compressor](https://github.com/vllm-project/llm-compressor) for the open-source community.*