rockylynnstein committed on
Commit 6cdf035 · verified · 1 Parent(s): 3c73bd8

Update README.md

Files changed (1): README.md +206 -34

README.md CHANGED
@@ -15,70 +15,242 @@ pipeline_tag: text-generation
 # granite-8b-code-instruct-4k-FP8

-This is an FP8 quantized version of [granite-8b-code-instruct-4k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k) for efficient inference.
-
-## Model Description
-
-- **Base Model:** [granite-8b-code-instruct-4k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k)
-- **Quantization:** FP8 (E4M3 format)
-- **Quantization Method:** llm-compressor oneshot with FP8 scheme
-- **Calibration Dataset:** open_platypus (512 samples)
-- **Quantization Time:** 21.6 minutes
-
-## Usage
-
-### With Transformers
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch

 model = AutoModelForCausalLM.from_pretrained(
     "TevunahAi/granite-8b-code-instruct-4k-FP8",
-    torch_dtype=torch.bfloat16,
     device_map="auto",
     low_cpu_mem_usage=True,
 )
 tokenizer = AutoTokenizer.from_pretrained("TevunahAi/granite-8b-code-instruct-4k-FP8")

 # Generate
 prompt = "Write a Python function to calculate fibonacci numbers:"
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 outputs = model.generate(**inputs, max_new_tokens=256)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```

-### With vLLM (Recommended for production)
-```python
-from vllm import LLM, SamplingParams
-
-llm = LLM(model="TevunahAi/granite-8b-code-instruct-4k-FP8")
-sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
-
-prompts = ["Write a Python function to calculate fibonacci numbers:"]
-outputs = llm.generate(prompts, sampling_params)
-```
-
-## Quantization Details
-
-- **Target Layers:** All Linear layers except lm_head
-- **Precision:** FP8 (E4M3 format)
-- **Hardware Requirements:** NVIDIA Ada Lovelace or Hopper (native FP8), or Ampere with emulation

 ### Quantization Infrastructure

-Quantized on professional hardware to ensure quality and reliability:
-- **CPUs:** Dual Intel Xeon Max 9480 (224 threads, 128GB HBM2e)
-- **GPU:** NVIDIA RTX 5000 Ada Generation (32GB VRAM) with native FP8 support
-- **Memory:** 256GB DDR5 + 128GB HBM2e = 384GB total
-- **Software:** Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13 | llm-compressor
-
-## License
-
-Apache 2.0 (same as the original model)
-
-## Credits
-
-- Original model by [IBM Granite](https://huggingface.co/ibm-granite)
-- Quantized by [TevunahAi](https://huggingface.co/TevunahAi)
-- Quantization powered by [llm-compressor](https://github.com/vllm-project/llm-compressor)
 # granite-8b-code-instruct-4k-FP8

+**FP8 quantized version of IBM's Granite 8B Code model for efficient inference**
+
+This is an FP8 (E4M3) quantized version of [ibm-granite/granite-8b-code-instruct-4k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k) in the compressed_tensors format. Quantized by [TevunahAi](https://huggingface.co/TevunahAi) on enterprise-grade hardware.
+
+## 🎯 Recommended Usage: vLLM
+
+For optimal performance with **full FP8 benefits** (2x memory savings + faster inference), use **vLLM** or **TensorRT-LLM**:
+
+### Quick Start with vLLM
+
+```bash
+pip install vllm
+```
+
+**Python API:**
+
+```python
+from vllm import LLM, SamplingParams
+
+# vLLM auto-detects FP8 from the model config
+llm = LLM(model="TevunahAi/granite-8b-code-instruct-4k-FP8", dtype="auto")
+
+# Generate
+prompt = "Write a Python function to calculate fibonacci numbers:"
+sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
+
+outputs = llm.generate([prompt], sampling_params)
+for output in outputs:
+    print(output.outputs[0].text)
+```
+
+**OpenAI-Compatible API Server:**
+
+```bash
+vllm serve TevunahAi/granite-8b-code-instruct-4k-FP8 \
+    --dtype auto \
+    --max-model-len 4096
+```
+
+Then use it with the OpenAI client:
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="token-abc123",  # dummy key
+)
+
+response = client.chat.completions.create(
+    model="TevunahAi/granite-8b-code-instruct-4k-FP8",
+    messages=[
+        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
+    ],
+    temperature=0.7,
+    max_tokens=256,
+)
+
+print(response.choices[0].message.content)
+```
+
+### vLLM Benefits
+
+- ✅ **Weights, activations, and KV cache in FP8**
+- ✅ **~8GB VRAM** (50% reduction vs BF16)
+- ✅ **Native FP8 tensor core acceleration** on Ada/Hopper GPUs
+- ✅ **Faster inference** with optimized CUDA kernels
+- ✅ **Runs on consumer GPUs** (RTX 4070, RTX 4060 Ti 16GB, RTX 5000 Ada)
+
+## ⚙️ Alternative: Transformers
+
+This model can also be loaded with `transformers`. **Note:** Transformers decompresses FP8 → BF16 during inference, so there is no runtime memory saving; at 8B parameters the footprint is still manageable (~16GB VRAM).
+
+<details>
+<summary>Transformers example (click to expand)</summary>
+
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch

+# Loads FP8 weights but decompresses to BF16 during compute
 model = AutoModelForCausalLM.from_pretrained(
     "TevunahAi/granite-8b-code-instruct-4k-FP8",
     device_map="auto",
+    torch_dtype="auto",
     low_cpu_mem_usage=True,
 )
 tokenizer = AutoTokenizer.from_pretrained("TevunahAi/granite-8b-code-instruct-4k-FP8")

 # Generate
 prompt = "Write a Python function to calculate fibonacci numbers:"
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
 outputs = model.generate(**inputs, max_new_tokens=256)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
+
+**Requirements:**
+
+```bash
+pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors
+```
+
+**System Requirements:**
+
+- **~16GB VRAM** (weights decompressed to BF16)
+- CUDA 11.8 or newer
+- PyTorch 2.1+ with CUDA support
+
+</details>
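Before loading through either path, it can help to confirm that a downloaded checkpoint really declares a quantization config. A minimal sketch, assuming the standard Hugging Face `config.json` layout with a `quantization_config` section (the field names and sample values below are illustrative of the compressed-tensors convention, not read from this exact checkpoint):

```python
import json

def quantization_summary(config_path: str) -> dict:
    """Report whether a local config.json declares a quantization config."""
    with open(config_path) as f:
        cfg = json.load(f)
    qc = cfg.get("quantization_config")
    if qc is None:
        return {"quantized": False}
    # quant_method is the field transformers inspects to pick a loader backend
    return {"quantized": True, "method": qc.get("quant_method")}

# Demo with a hand-written config (illustrative values):
sample = {"model_type": "granite", "quantization_config": {"quant_method": "compressed-tensors"}}
with open("sample_config.json", "w") as f:
    json.dump(sample, f)
print(quantization_summary("sample_config.json"))
# {'quantized': True, 'method': 'compressed-tensors'}
```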
 
 
+## 📊 Quantization Details
+
+| Property | Value |
+|----------|-------|
+| **Base Model** | [ibm-granite/granite-8b-code-instruct-4k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k) |
+| **Quantization Method** | FP8 E4M3 weight-only |
+| **Framework** | llm-compressor + compressed_tensors |
+| **Calibration Dataset** | open_platypus (512 samples) |
+| **Storage Size** | ~8GB (sharded safetensors) |
+| **VRAM (vLLM)** | ~8GB |
+| **VRAM (Transformers)** | ~16GB (decompressed to BF16) |
+| **Target Hardware** | NVIDIA Ada (RTX 4000/5000) or Hopper (H100/GH200) |
+| **Quantization Time** | 21.6 minutes |
+
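The VRAM and storage figures above follow from simple arithmetic on the parameter count: one byte per parameter in FP8 versus two in BF16. A back-of-envelope sketch (the 8.05e9 parameter count is an assumed round figure for an "8B" model, not read from the checkpoint, and activations/KV cache are ignored):

```python
def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Weight storage in GiB; ignores activations and KV cache."""
    return n_params * bytes_per_param / 2**30

N_PARAMS = 8.05e9  # assumed round figure for an "8B" model

print(f"FP8 weights:  {weight_gib(N_PARAMS, 1):.1f} GiB")  # ~7.5, behind the "~8GB" rows
print(f"BF16 weights: {weight_gib(N_PARAMS, 2):.1f} GiB")  # ~15.0, behind the "~16GB" rows
```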
 ### Quantization Infrastructure

+Professional hardware ensures consistent, high-quality quantization:
+
+- **CPUs:** Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
+- **GPU:** NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
+- **Memory:** 256GB DDR5 + 128GB HBM2e = 384GB total system memory
+- **Software Stack:** Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor
+
+## 🔧 Why FP8?
+
+### With vLLM/TensorRT-LLM:
+
+- ✅ **50% memory reduction** vs BF16 (weights + activations + KV cache)
+- ✅ **Faster inference** via native FP8 tensor cores
+- ✅ **Better throughput** with optimized kernels
+- ✅ **Minimal quality loss** for code generation tasks
+- ✅ **Accessible on consumer GPUs** (RTX 4060 Ti 16GB+)
+
+### With Transformers:
+
+- ✅ **Smaller download size** (~8GB vs ~16GB BF16)
+- ✅ **Compatible** with the standard transformers workflow
+- ⚠️ **Decompresses to BF16** during inference (no runtime memory benefit)
+
+**For production inference, use vLLM to realize the full FP8 benefits.**
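For a concrete picture of what E4M3 can represent: the format keeps a sign bit, 4 exponent bits (bias 7), and 3 mantissa bits, and the `e4m3fn` variant used for ML inference reserves a single NaN pattern and has no infinities, giving a largest finite value of 448. A small pure-Python decoder sketch of that bit layout (illustrative only; real kernels do this in hardware):

```python
def decode_e4m3fn(byte: int) -> float:
    """Decode one FP8 E4M3 ('fn' variant) bit pattern to a Python float."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF  # 4 exponent bits, bias 7
    mant = byte & 0x7        # 3 mantissa bits
    if exp == 0xF and mant == 0x7:
        return float("nan")  # e4m3fn: single NaN pattern, no infinities
    if exp == 0:
        return sign * (mant / 8) * 2.0 ** -6  # subnormals, down to 2**-9
    return sign * (1 + mant / 8) * 2.0 ** (exp - 7)

print(decode_e4m3fn(0x38))  # 1.0
print(decode_e4m3fn(0x7E))  # 448.0, the largest finite E4M3FN value
```

With only 3 mantissa bits the relative step between neighboring values is about 6%, which is why per-tensor (or per-channel) scales from calibration matter for quality.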
+
+## 💾 Model Files
+
+This model is sharded into multiple safetensors files; all shards are required for inference. The compressed format enables efficient storage and faster downloads.
+
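For scripting around the shards, the `model.safetensors.index.json` that accompanies a sharded checkpoint maps each tensor name to the shard file holding it. A short sketch of reading that index (the layout is the standard Hugging Face convention; the tensor and file names in the demo are made up):

```python
import json

def shard_files(index_path: str) -> list[str]:
    """Return the sorted set of shard files a safetensors index references."""
    with open(index_path) as f:
        index = json.load(f)
    # weight_map: tensor name -> shard file containing that tensor
    return sorted(set(index["weight_map"].values()))

# Demo with a hand-written index (illustrative names):
demo = {"weight_map": {"model.embed_tokens.weight": "model-00001-of-00002.safetensors",
                       "lm_head.weight": "model-00002-of-00002.safetensors"}}
with open("demo_index.json", "w") as f:
    json.dump(demo, f)
print(shard_files("demo_index.json"))
# ['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors']
```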
+## 🔬 IBM Granite Code Models
+
+Granite Code models are trained specifically for code generation, editing, and explanation tasks. This 8B-parameter version offers strong performance on:
+
+- Code completion and generation
+- Bug fixing and refactoring
+- Code explanation and documentation
+- Multiple programming languages
+- A 4K-token context window
+
+**Granite 8B vs larger models:**
+
+- ✅ **Fast iteration** - quick response times
+- ✅ **Accessible** - runs on consumer GPUs
+- ✅ **Good quality** - suitable for most coding tasks
+- ⚠️ **Trade-off** - less capable on very complex reasoning than the 20B/34B variants
+
+## 📚 Original Model
+
+This quantization is based on [ibm-granite/granite-8b-code-instruct-4k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k) by IBM.
+
+For comprehensive information on:
+
+- Model architecture and training methodology
+- Supported programming languages
+- Evaluation benchmarks and results
+- Ethical considerations and responsible AI guidelines
+
+please refer to the [original model card](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k).
+
+## 🔧 Hardware Requirements
+
+### Minimum (vLLM):
+
+- **GPU:** NVIDIA RTX 4060 Ti (16GB) or better
+- **VRAM:** 8GB minimum, 12GB+ recommended
+- **CUDA:** 11.8 or newer
+
+### Recommended (vLLM):
+
+- **GPU:** NVIDIA RTX 4070 / 4090 / RTX 5000 Ada
+- **VRAM:** 12GB+
+- **CUDA:** 12.0+
+
+### Transformers:
+
+- **GPU:** any CUDA-capable GPU with 16GB+ VRAM
+- Works, but not optimal for performance
+
+## 📖 Additional Resources
+
+- **vLLM documentation:** [docs.vllm.ai](https://docs.vllm.ai/)
+- **TensorRT-LLM:** [github.com/NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
+- **TevunahAi models:** [huggingface.co/TevunahAi](https://huggingface.co/TevunahAi)
+- **llm-compressor:** [github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
+- **IBM Granite:** [github.com/ibm-granite](https://github.com/ibm-granite)
+
+## 📄 License
+
+This model inherits the **Apache 2.0 license** from the original Granite model.
+
+## 🙏 Acknowledgments
+
+- **Original model:** IBM Granite team
+- **Quantization framework:** Neural Magic's llm-compressor
+- **Quantized by:** [TevunahAi](https://huggingface.co/TevunahAi)
+
+## 📝 Citation
+
+If you use this model, please cite the original Granite work:
+
+```bibtex
+@misc{granite2024,
+  title={Granite Code Models},
+  author={IBM Research},
+  year={2024},
+  url={https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k}
+}
+```
+
+---
+
+<div align="center">
+
+**Professional AI Model Quantization by TevunahAi**
+
+*Enterprise-grade quantization on specialized hardware*
+
+[View all models](https://huggingface.co/TevunahAi) | [Contact for custom quantization](https://huggingface.co/TevunahAi)
+
+</div>