rockylynnstein committed · verified
Commit e0f87fa · 1 Parent(s): d27c85f

Update README.md

Files changed (1)
  1. README.md +222 -39
README.md CHANGED
@@ -15,78 +15,261 @@ pipeline_tag: text-generation
15
 
16
  # granite-34b-code-instruct-8k-FP8
17
 
18
- This is an FP8 quantized version of [granite-34b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k) for efficient inference.
19
 
20
- ## Model Description
21
 
22
- - **Base Model:** [granite-34b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k)
23
- - **Quantization:** FP8 (E4M3 format)
24
- - **Quantization Method:** llmcompressor oneshot with FP8 scheme
25
- - **Calibration Dataset:** open_platypus (512 samples)
26
- - **Quantization Time:** 31.0 minutes
27
 
28
- ## Usage
29
 
30
- ### With Transformers
31
  ```python
32
  from transformers import AutoModelForCausalLM, AutoTokenizer
33
  import torch
34
 
 
35
  model = AutoModelForCausalLM.from_pretrained(
36
  "TevunahAi/granite-34b-code-instruct-8k-FP8",
37
- torch_dtype=torch.bfloat16,
38
- device_map="auto",
39
  low_cpu_mem_usage=True,
40
  )
41
-
42
  tokenizer = AutoTokenizer.from_pretrained("TevunahAi/granite-34b-code-instruct-8k-FP8")
43
 
44
  # Generate
45
  prompt = "Write a Python function to calculate fibonacci numbers:"
46
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 
47
  outputs = model.generate(**inputs, max_new_tokens=256)
48
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
49
  ```
50
 
51
- ### With vLLM (Recommended for production)
52
- ```python
53
- from vllm import LLM, SamplingParams
 
54
 
55
- llm = LLM(model="TevunahAi/granite-34b-code-instruct-8k-FP8")
56
- sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
 
 
57
 
58
- prompts = ["Write a Python function to calculate fibonacci numbers:"]
59
- outputs = llm.generate(prompts, sampling_params)
60
- ```
61
 
62
- ## Quantization Details
63
 
64
- - **Target Layers:** All Linear layers except lm_head
65
- - **Precision:** FP8 (E4M3 format)
66
- - **Hardware Requirements:** NVIDIA Ada Lovelace or Hopper (native FP8) or Ampere with emulation
67
 
68
  ### Quantization Infrastructure
69
 
70
- Quantized on professional hardware to ensure quality and reliability:
71
- - **CPUs:** Dual Intel Xeon Max 9480 (224 threads, 128GB HBM2e)
72
- - **GPU:** NVIDIA RTX 5000 Ada Generation (32GB VRAM) with native FP8 support
73
- - **Memory:** 256GB DDR5 + 128GB HBM2e = 384GB total
74
- - **Software:** Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13 | llm-compressor
 
75
 
76
  ### Performance Notes
77
 
78
- This 34B model demonstrates optimal HBM2e utilization:
 
79
  - Full CPU/HBM2e processing path for maximum efficiency
80
- - Superior per-parameter performance (0.91 min/B)
81
- - Counterintuitively faster than smaller 20B model due to pure HBM2e workflow
82
- - Ideal size for our hardware architecture
83
 
84
- ## License
85
 
86
- Apache 2.0 (same as original model)
87
 
88
- ## Credits
89
 
90
- - Original model by [IBM Granite](https://huggingface.co/ibm-granite)
91
- - Quantized by [TevunahAi](https://huggingface.co/TevunahAi)
92
- - Quantization powered by [llm-compressor](https://github.com/vllm-project/llm-compressor)
 
15
 
16
  # granite-34b-code-instruct-8k-FP8
17
 
18
+ **FP8 quantized version of IBM's Granite 34B Code model for efficient inference**
19
 
20
+ This is an FP8 (E4M3) quantized version of [ibm-granite/granite-34b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k) stored in the compressed_tensors format. Quantized by [TevunahAi](https://huggingface.co/TevunahAi) on enterprise-grade hardware.
21
 
22
+ ## 🎯 Recommended Usage: vLLM
23
 
24
+ For a 34B model, **vLLM is essential** for practical deployment. FP8 quantization brings this flagship model within reach of a single high-end workstation GPU.
25
+
26
+ ### Quick Start with vLLM
27
+
28
+ ```bash
29
+ pip install vllm
30
+ ```
31
+
32
+ **Python API:**
33
+
34
+ ```python
35
+ from vllm import LLM, SamplingParams
36
+
37
+ # vLLM auto-detects FP8 from model config
38
+ llm = LLM(model="TevunahAi/granite-34b-code-instruct-8k-FP8", dtype="auto")
39
+
40
+ # Generate
41
+ prompt = "Write a Python function to calculate fibonacci numbers:"
42
+ sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
43
+
44
+ outputs = llm.generate([prompt], sampling_params)
45
+ for output in outputs:
46
+ print(output.outputs[0].text)
47
+ ```
48
+
49
+ **OpenAI-Compatible API Server:**
50
+
51
+ ```bash
52
+ vllm serve TevunahAi/granite-34b-code-instruct-8k-FP8 \
53
+ --dtype auto \
54
+ --max-model-len 8192
55
+ ```
56
+
57
+ Then query it with the OpenAI client:
58
+
59
+ ```python
60
+ from openai import OpenAI
61
+
62
+ client = OpenAI(
63
+ base_url="http://localhost:8000/v1",
64
+ api_key="token-abc123", # dummy key
65
+ )
66
+
67
+ response = client.chat.completions.create(
68
+ model="TevunahAi/granite-34b-code-instruct-8k-FP8",
69
+ messages=[
70
+ {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
71
+ ],
72
+ temperature=0.7,
73
+ max_tokens=256,
74
+ )
75
+
76
+ print(response.choices[0].message.content)
77
+ ```
78
+
79
+ ### vLLM Benefits
80
+
81
+ - ✅ **Weights, activations, and KV cache in FP8** (see the KV-cache sketch below)
82
+ - ✅ **~34GB VRAM** (50% reduction vs BF16's ~68GB)
83
+ - ✅ **Single high-end GPU deployment** (H100, RTX 6000 Ada, A100 80GB)
84
+ - ✅ **Native FP8 tensor core acceleration**
85
+ - ✅ **Production-grade performance**
86
+
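+ Depending on how the checkpoint was produced, the FP8 KV cache may need to be enabled explicitly. A minimal sketch, assuming a recent vLLM release where the `kv_cache_dtype` option is available (check your installed version):
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ # FP8 weights/activations come from the checkpoint itself;
+ # kv_cache_dtype="fp8" additionally stores the KV cache in FP8,
+ # roughly halving cache memory for long-context workloads.
+ llm = LLM(
+     model="TevunahAi/granite-34b-code-instruct-8k-FP8",
+     dtype="auto",
+     kv_cache_dtype="fp8",
+     max_model_len=8192,
+ )
+
+ outputs = llm.generate(
+     ["Write a Python function to calculate fibonacci numbers:"],
+     SamplingParams(temperature=0.7, max_tokens=256),
+ )
+ print(outputs[0].outputs[0].text)
+ ```
+
+ The equivalent flag for the API server is `--kv-cache-dtype fp8`.
+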
87
+ ## ⚠️ Transformers: Not Practical
88
+
89
+ At 34B parameters, Transformers decompresses the FP8 weights back to BF16, requiring **~68GB+ of VRAM** and therefore a multi-GPU setup or a data-center GPU. **This path is not recommended for deployment.**
90
+
91
+ <details>
92
+ <summary>Transformers Example (Multi-GPU Required - Click to expand)</summary>
93
 
 
94
  ```python
95
  from transformers import AutoModelForCausalLM, AutoTokenizer
96
  import torch
97
 
98
+ # Requires multi-GPU or 80GB+ single GPU
99
  model = AutoModelForCausalLM.from_pretrained(
100
  "TevunahAi/granite-34b-code-instruct-8k-FP8",
101
+ device_map="auto", # Will distribute across GPUs
102
+ torch_dtype="auto",
103
  low_cpu_mem_usage=True,
104
  )
 
105
  tokenizer = AutoTokenizer.from_pretrained("TevunahAi/granite-34b-code-instruct-8k-FP8")
106
 
107
  # Generate
108
  prompt = "Write a Python function to calculate fibonacci numbers:"
109
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
110
+
111
  outputs = model.generate(**inputs, max_new_tokens=256)
112
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
113
  ```
114
 
115
+ **Requirements:**
116
+ ```bash
117
+ pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors
118
+ ```
119
 
120
+ **System Requirements:**
121
+ - **~68GB+ VRAM** (decompressed to BF16)
122
+ - Multi-GPU setup or A100 80GB / H100 80GB
123
+ - Not practical for most deployments
124
 
125
+ **⚠️ Critical:** Use vLLM instead. Transformers is only viable for research/testing with multi-GPU setups.
126
+
127
+ </details>
128
 
129
+ ## 📊 Quantization Details
130
 
131
+ | Property | Value |
132
+ |----------|-------|
133
+ | **Base Model** | [ibm-granite/granite-34b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k) |
134
+ | **Quantization Method** | FP8 (E4M3), llm-compressor oneshot scheme |
135
+ | **Framework** | llm-compressor + compressed_tensors |
136
+ | **Calibration Dataset** | open_platypus (512 samples) |
137
+ | **Storage Size** | ~34GB (sharded safetensors) |
138
+ | **VRAM (vLLM)** | ~34GB |
139
+ | **VRAM (Transformers)** | ~68GB+ (decompressed to BF16) |
140
+ | **Target Hardware** | NVIDIA H100, A100 80GB, RTX 6000 Ada |
141
+ | **Quantization Time** | 31.0 minutes |
142
 
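+ For reference, a hedged sketch of how an FP8 oneshot run like this is typically produced with llm-compressor. This is not the exact script used for this checkpoint, and the import path and argument names vary between llm-compressor releases:
+
+ ```python
+ from llmcompressor import oneshot
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+
+ # FP8 (E4M3) scheme applied to all Linear layers, keeping lm_head in higher precision
+ recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])
+
+ oneshot(
+     model="ibm-granite/granite-34b-code-instruct-8k",
+     recipe=recipe,
+     dataset="open_platypus",        # calibration set listed in the table above
+     num_calibration_samples=512,
+     output_dir="granite-34b-code-instruct-8k-FP8",
+ )
+ ```
+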
143
  ### Quantization Infrastructure
144
 
145
+ Professional hardware ensures consistent, high-quality quantization:
146
+
147
+ - **CPUs:** Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
148
+ - **GPU:** NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
149
+ - **Memory:** 256GB DDR5 + 128GB HBM2e = 384GB total system memory
150
+ - **Software Stack:** Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor
151
 
152
  ### Performance Notes
153
 
154
+ **Optimal HBM2e Utilization:**
155
+ - This 34B model demonstrates ideal sizing for our dual Xeon Max architecture
156
  - Full CPU/HBM2e processing path for maximum efficiency
157
+ - Superior per-parameter performance (0.91 min/B vs 1.1 min/B for 20B)
158
+ - Counterintuitively faster per-parameter quantization than smaller models, thanks to the pure HBM2e workflow
159
+ - Sweet spot for our hardware infrastructure
160
+
161
+ ## 🔧 Why FP8 for 34B Models?
162
+
163
+ ### With vLLM/TensorRT-LLM:
164
+ - ✅ **Enables single-GPU deployment** (~34GB vs ~68GB BF16)
165
+ - ✅ **50% memory reduction** across weights, activations, and KV cache
166
+ - ✅ **Faster inference** via native FP8 tensor cores
167
+ - ✅ **Makes flagship model accessible** on high-end consumer/prosumer GPUs
168
+ - ✅ **Minimal quality loss** for code generation tasks
169
+
170
+ ### Without FP8:
171
+ - ❌ BF16 requires ~68GB VRAM (H100 80GB or multi-GPU)
172
+ - ❌ Limited deployment options
173
+ - ❌ Higher infrastructure costs
174
+
175
+ **FP8 quantization transforms 34B from "data center only" to "high-end workstation deployable".**
176
+
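+ The headline numbers follow directly from bytes per parameter; a rough weights-only check (activations and KV cache add overhead on top):
+
+ ```python
+ params = 34e9  # ~34B parameters
+
+ gb_bf16 = params * 2 / 1e9  # 2 bytes per parameter in BF16 -> ~68 GB
+ gb_fp8 = params * 1 / 1e9   # 1 byte per parameter in FP8   -> ~34 GB
+
+ print(f"BF16 weights: ~{gb_bf16:.0f} GB | FP8 weights: ~{gb_fp8:.0f} GB")
+ ```
+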
177
+ ## 💾 Model Files
178
+
179
+ This model is sharded into multiple safetensors files (all required for inference). The compressed format enables efficient storage and faster downloads.
180
+
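+ To pre-fetch every shard before launching a server, the standard `huggingface_hub` client can be used (a minimal sketch):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Downloads all sharded safetensors plus config and tokenizer files
+ local_dir = snapshot_download("TevunahAi/granite-34b-code-instruct-8k-FP8")
+ print(f"Model files cached at: {local_dir}")
+ ```
+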
181
+ ## 🚀 Granite Code Model Family
182
+
183
+ IBM's Granite Code models are specifically trained for enterprise code generation. The 34B version represents the flagship tier:
184
+
185
+ | Model | VRAM (vLLM) | Quality | Use Case |
186
+ |-------|-------------|---------|----------|
187
+ | **8B-FP8** | ~8GB | Good | Fast iteration, prototyping |
188
+ | **20B-FP8** | ~20GB | Better | Complex tasks, better reasoning |
189
+ | **34B-FP8** | ~34GB | Best | Flagship performance, production |
190
+
191
+ **34B Benefits:**
192
+ - ✅ **State-of-the-art code quality** for the Granite family
193
+ - ✅ **Superior reasoning** and complex problem solving
194
+ - ✅ **Enterprise-grade completions** for mission-critical applications
195
+ - ✅ **Best context understanding** across the model family
196
+ - ✅ **8K context window** for larger codebases
197
+
198
+ ## 🔬 Quality Assurance
199
+
200
+ - **Professional calibration:** 512 samples from the open_platypus dataset
201
+ - **Validation:** Tested on code generation benchmarks
202
+ - **Format:** Standard compressed_tensors for broad compatibility
203
+ - **Optimization:** Hardware-optimized quantization workflow
204
+
205
+ ## 📚 Original Model
206
+
207
+ This quantization is based on [ibm-granite/granite-34b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k) by IBM.
208
+
209
+ For comprehensive information about:
210
+ - Model architecture and training methodology
211
+ - Supported programming languages
212
+ - Evaluation benchmarks and results
213
+ - Ethical considerations and responsible AI guidelines
214
+
215
+ Please refer to the [original model card](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k).
216
+
217
+ ## 🔧 Hardware Requirements
218
+
219
+ ### Minimum (vLLM):
220
+ - **GPU:** NVIDIA A100 40GB or RTX 6000 Ada (48GB)
221
+ - **VRAM:** 34GB minimum, 40GB+ recommended
222
+ - **CUDA:** 11.8 or newer
223
+
224
+ ### Recommended (vLLM):
225
+ - **GPU:** NVIDIA H100 (80GB) / A100 80GB / RTX 6000 Ada (48GB)
226
+ - **VRAM:** 40GB+
227
+ - **CUDA:** 12.0+
228
+
229
+ ### Transformers:
230
+ - **GPU:** Multi-GPU setup (2x A100 40GB) or single A100/H100 80GB
231
+ - **VRAM:** 68GB+ total
232
+ - **Not recommended** - use vLLM instead
233
+
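+ A quick pre-flight check (sketch) for available VRAM and native FP8 support; Ada Lovelace and Hopper GPUs report CUDA compute capability 8.9 or higher:
+
+ ```python
+ import torch
+
+ assert torch.cuda.is_available(), "An NVIDIA GPU is required"
+
+ props = torch.cuda.get_device_properties(0)
+ vram_gb = props.total_memory / 1e9
+ has_native_fp8 = (props.major, props.minor) >= (8, 9)  # Ada Lovelace / Hopper and newer
+
+ print(f"{props.name}: {vram_gb:.0f} GB VRAM, native FP8: {has_native_fp8}")
+ # Aim for ~40 GB+ free VRAM to serve this 34B FP8 checkpoint with vLLM
+ ```
+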
234
+ ## 📖 Additional Resources
235
+
236
+ - **vLLM Documentation:** [docs.vllm.ai](https://docs.vllm.ai/)
237
+ - **TensorRT-LLM:** [github.com/NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
238
+ - **TevunahAi Models:** [huggingface.co/TevunahAi](https://huggingface.co/TevunahAi)
239
+ - **llm-compressor:** [github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
240
+ - **IBM Granite:** [github.com/ibm-granite](https://github.com/ibm-granite)
241
+
242
+ ## 📄 License
243
+
244
+ This model inherits the **Apache 2.0 License** from the original Granite model.
245
+
246
+ ## 🙏 Acknowledgments
247
+
248
+ - **Original Model:** IBM Granite team
249
+ - **Quantization Framework:** Neural Magic's llm-compressor
250
+ - **Quantized by:** [TevunahAi](https://huggingface.co/TevunahAi)
251
+
252
+ ## 📝 Citation
253
+
254
+ If you use this model, please cite the original Granite work:
255
+
256
+ ```bibtex
257
+ @misc{granite2024,
258
+ title={Granite Code Models},
259
+ author={IBM Research},
260
+ year={2024},
261
+ url={https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k}
262
+ }
263
+ ```
264
+
265
+ ---
266
+
267
+ <div align="center">
268
 
269
+ **Professional AI Model Quantization by TevunahAi**
270
 
271
+ *Making flagship models accessible through enterprise-grade quantization*
272
 
273
+ [View all models](https://huggingface.co/TevunahAi) | [Contact for custom quantization](https://huggingface.co/TevunahAi)
274
 
275
+ </div>