rockylynnstein committed · verified
Commit e0f87fa · 1 Parent(s): d27c85f

Update README.md

Files changed (1)
  1. README.md +222 -39
README.md CHANGED
@@ -15,78 +15,261 @@ pipeline_tag: text-generation
15
 
16
  # granite-34b-code-instruct-8k-FP8
17
 
18
- This is an FP8 quantized version of [granite-34b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k) for efficient inference.
19
 
20
- ## Model Description
21
 
22
- - **Base Model:** [granite-34b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k)
23
- - **Quantization:** FP8 (E4M3 format)
24
- - **Quantization Method:** llmcompressor oneshot with FP8 scheme
25
- - **Calibration Dataset:** open_platypus (512 samples)
26
- - **Quantization Time:** 31.0 minutes
27
 
28
- ## Usage
29
 
30
- ### With Transformers
31
  ```python
32
  from transformers import AutoModelForCausalLM, AutoTokenizer
33
  import torch
34
 
 
35
  model = AutoModelForCausalLM.from_pretrained(
36
  "TevunahAi/granite-34b-code-instruct-8k-FP8",
37
- torch_dtype=torch.bfloat16,
38
- device_map="auto",
39
  low_cpu_mem_usage=True,
40
  )
41
-
42
  tokenizer = AutoTokenizer.from_pretrained("TevunahAi/granite-34b-code-instruct-8k-FP8")
43
 
44
  # Generate
45
  prompt = "Write a Python function to calculate fibonacci numbers:"
46
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 
47
  outputs = model.generate(**inputs, max_new_tokens=256)
48
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
49
  ```
50
 
51
- ### With vLLM (Recommended for production)
52
- ```python
53
- from vllm import LLM, SamplingParams
 
54
 
55
- llm = LLM(model="TevunahAi/granite-34b-code-instruct-8k-FP8")
56
- sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
 
 
57
 
58
- prompts = ["Write a Python function to calculate fibonacci numbers:"]
59
- outputs = llm.generate(prompts, sampling_params)
60
- ```
61
 
62
- ## Quantization Details
63
 
64
- - **Target Layers:** All Linear layers except lm_head
65
- - **Precision:** FP8 (E4M3 format)
66
- - **Hardware Requirements:** NVIDIA Ada Lovelace or Hopper (native FP8) or Ampere with emulation
67
 
68
  ### Quantization Infrastructure
69
 
70
- Quantized on professional hardware to ensure quality and reliability:
71
- - **CPUs:** Dual Intel Xeon Max 9480 (224 threads, 128GB HBM2e)
72
- - **GPU:** NVIDIA RTX 5000 Ada Generation (32GB VRAM) with native FP8 support
73
- - **Memory:** 256GB DDR5 + 128GB HBM2e = 384GB total
74
- - **Software:** Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13 | llm-compressor
 
75
 
76
  ### Performance Notes
77
 
78
- This 34B model demonstrates optimal HBM2e utilization:
 
79
  - Full CPU/HBM2e processing path for maximum efficiency
80
- - Superior per-parameter performance (0.91 min/B)
81
- - Counterintuitively faster than smaller 20B model due to pure HBM2e workflow
82
- - Ideal size for our hardware architecture
83
 
84
- ## License
85
 
86
- Apache 2.0 (same as original model)
87
 
88
- ## Credits
89
 
90
- - Original model by [IBM Granite](https://huggingface.co/ibm-granite)
91
- - Quantized by [TevunahAi](https://huggingface.co/TevunahAi)
92
- - Quantization powered by [llm-compressor](https://github.com/vllm-project/llm-compressor)
 
15
 
16
  # granite-34b-code-instruct-8k-FP8
17
 
18
+ **FP8 quantized version of IBM's Granite 34B Code model for efficient inference**
19
 
20
+ This is an FP8 (E4M3) quantized version of [ibm-granite/granite-34b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k) stored in the compressed_tensors format. Quantized by [TevunahAi](https://huggingface.co/TevunahAi) on enterprise-grade hardware.
21
 
22
+ ## 🎯 Recommended Usage: vLLM
23
 
24
+ For a 34B model, **vLLM is essential** for practical deployment. FP8 quantization brings this flagship model within reach of a single high-end workstation GPU.
25
+
26
+ ### Quick Start with vLLM
27
+
28
+ ```bash
29
+ pip install vllm
30
+ ```
31
+
32
+ **Python API:**
33
+
34
+ ```python
35
+ from vllm import LLM, SamplingParams
36
+
37
+ # vLLM auto-detects FP8 from model config
38
+ llm = LLM(model="TevunahAi/granite-34b-code-instruct-8k-FP8", dtype="auto")
39
+
40
+ # Generate
41
+ prompt = "Write a Python function to calculate fibonacci numbers:"
42
+ sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
43
+
44
+ outputs = llm.generate([prompt], sampling_params)
45
+ for output in outputs:
46
+ print(output.outputs[0].text)
47
+ ```
48
+
49
+ **OpenAI-Compatible API Server:**
50
+
51
+ ```bash
52
+ vllm serve TevunahAi/granite-34b-code-instruct-8k-FP8 \
53
+ --dtype auto \
54
+ --max-model-len 8192
55
+ ```
56
+
57
+ Then query it with the OpenAI client:
58
+
59
+ ```python
60
+ from openai import OpenAI
61
+
62
+ client = OpenAI(
63
+ base_url="http://localhost:8000/v1",
64
+ api_key="token-abc123", # dummy key
65
+ )
66
+
67
+ response = client.chat.completions.create(
68
+ model="TevunahAi/granite-34b-code-instruct-8k-FP8",
69
+ messages=[
70
+ {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
71
+ ],
72
+ temperature=0.7,
73
+ max_tokens=256,
74
+ )
75
+
76
+ print(response.choices[0].message.content)
77
+ ```
78
+
79
+ ### vLLM Benefits
80
+
81
+ - ✅ **Weights, activations, and KV cache in FP8** (see the KV-cache sketch below)
82
+ - ✅ **~34GB VRAM** (50% reduction vs BF16's ~68GB)
83
+ - ✅ **Single high-end GPU deployment** (H100, RTX 6000 Ada, A100 80GB)
84
+ - ✅ **Native FP8 tensor core acceleration**
85
+ - ✅ **Production-grade performance**
86
+
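+ Depending on how the checkpoint was produced, the FP8 KV cache may need to be enabled explicitly. A minimal sketch, assuming a recent vLLM release where the `kv_cache_dtype` option is available (check your installed version):
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ # FP8 weights/activations come from the checkpoint itself;
+ # kv_cache_dtype="fp8" additionally stores the KV cache in FP8,
+ # roughly halving cache memory for long-context workloads.
+ llm = LLM(
+     model="TevunahAi/granite-34b-code-instruct-8k-FP8",
+     dtype="auto",
+     kv_cache_dtype="fp8",
+     max_model_len=8192,
+ )
+
+ outputs = llm.generate(
+     ["Write a Python function to calculate fibonacci numbers:"],
+     SamplingParams(temperature=0.7, max_tokens=256),
+ )
+ print(outputs[0].outputs[0].text)
+ ```
+
+ The equivalent flag for the API server is `--kv-cache-dtype fp8`.
+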
87
+ ## ⚠️ Transformers: Not Practical
88
+
89
+ At 34B parameters, Transformers decompresses the FP8 weights back to BF16, requiring **~68GB+ of VRAM** and therefore a multi-GPU setup or a data-center GPU. **This path is not recommended for deployment.**
90
+
91
+ <details>
92
+ <summary>Transformers Example (Multi-GPU Required - Click to expand)</summary>
93
 
 
94
  ```python
95
  from transformers import AutoModelForCausalLM, AutoTokenizer
96
  import torch
97
 
98
+ # Requires multi-GPU or 80GB+ single GPU
99
  model = AutoModelForCausalLM.from_pretrained(
100
  "TevunahAi/granite-34b-code-instruct-8k-FP8",
101
+ device_map="auto", # Will distribute across GPUs
102
+ torch_dtype="auto",
103
  low_cpu_mem_usage=True,
104
  )
 
105
  tokenizer = AutoTokenizer.from_pretrained("TevunahAi/granite-34b-code-instruct-8k-FP8")
106
 
107
  # Generate
108
  prompt = "Write a Python function to calculate fibonacci numbers:"
109
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
110
+
111
  outputs = model.generate(**inputs, max_new_tokens=256)
112
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
113
  ```
114
 
115
+ **Requirements:**
116
+ ```bash
117
+ pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors
118
+ ```
119
 
120
+ **System Requirements:**
121
+ - **~68GB+ VRAM** (decompressed to BF16)
122
+ - Multi-GPU setup or A100 80GB / H100 80GB
123
+ - Not practical for most deployments
124
 
125
+ **⚠️ Critical:** Use vLLM instead. Transformers is only viable for research/testing with multi-GPU setups.
126
+
127
+ </details>
128
 
129
+ ## 📊 Quantization Details
130
 
131
+ | Property | Value |
132
+ |----------|-------|
133
+ | **Base Model** | [ibm-granite/granite-34b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k) |
134
+ | **Quantization Method** | FP8 (E4M3), llm-compressor oneshot scheme |
135
+ | **Framework** | llm-compressor + compressed_tensors |
136
+ | **Calibration Dataset** | open_platypus (512 samples) |
137
+ | **Storage Size** | ~34GB (sharded safetensors) |
138
+ | **VRAM (vLLM)** | ~34GB |
139
+ | **VRAM (Transformers)** | ~68GB+ (decompressed to BF16) |
140
+ | **Target Hardware** | NVIDIA H100, A100 80GB, RTX 6000 Ada |
141
+ | **Quantization Time** | 31.0 minutes |
142
 
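+ For reference, a hedged sketch of how an FP8 oneshot run like this is typically produced with llm-compressor. This is not the exact script used for this checkpoint, and the import path and argument names vary between llm-compressor releases:
+
+ ```python
+ from llmcompressor import oneshot
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+
+ # FP8 (E4M3) scheme applied to all Linear layers, keeping lm_head in higher precision
+ recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])
+
+ oneshot(
+     model="ibm-granite/granite-34b-code-instruct-8k",
+     recipe=recipe,
+     dataset="open_platypus",        # calibration set listed in the table above
+     num_calibration_samples=512,
+     output_dir="granite-34b-code-instruct-8k-FP8",
+ )
+ ```
+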
143
  ### Quantization Infrastructure
144
 
145
+ Professional hardware ensures consistent, high-quality quantization:
146
+
147
+ - **CPUs:** Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
148
+ - **GPU:** NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
149
+ - **Memory:** 256GB DDR5 + 128GB HBM2e = 384GB total system memory
150
+ - **Software Stack:** Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor
151
 
152
  ### Performance Notes
153
 
154
+ **Optimal HBM2e Utilization:**
155
+ - This 34B model demonstrates ideal sizing for our dual Xeon Max architecture
156
  - Full CPU/HBM2e processing path for maximum efficiency
157
+ - Superior per-parameter performance (0.91 min/B vs 1.1 min/B for 20B)
158
+ - Counterintuitively faster per-parameter quantization than smaller models, thanks to the pure HBM2e workflow
159
+ - Sweet spot for our hardware infrastructure
160
+
161
+ ## 🔧 Why FP8 for 34B Models?
162
+
163
+ ### With vLLM/TensorRT-LLM:
164
+ - ✅ **Enables single-GPU deployment** (~34GB vs ~68GB BF16)
165
+ - ✅ **50% memory reduction** across weights, activations, and KV cache
166
+ - ✅ **Faster inference** via native FP8 tensor cores
167
+ - ✅ **Makes flagship model accessible** on high-end consumer/prosumer GPUs
168
+ - ✅ **Minimal quality loss** for code generation tasks
169
+
170
+ ### Without FP8:
171
+ - ❌ BF16 requires ~68GB VRAM (H100 80GB or multi-GPU)
172
+ - ❌ Limited deployment options
173
+ - ❌ Higher infrastructure costs
174
+
175
+ **FP8 quantization transforms 34B from "data center only" to "high-end workstation deployable".**
176
+
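+ The headline numbers follow directly from bytes per parameter; a rough weights-only check (activations and KV cache add overhead on top):
+
+ ```python
+ params = 34e9  # ~34B parameters
+
+ gb_bf16 = params * 2 / 1e9  # 2 bytes per parameter in BF16 -> ~68 GB
+ gb_fp8 = params * 1 / 1e9   # 1 byte per parameter in FP8   -> ~34 GB
+
+ print(f"BF16 weights: ~{gb_bf16:.0f} GB | FP8 weights: ~{gb_fp8:.0f} GB")
+ ```
+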
177
+ ## 💾 Model Files
178
+
179
+ This model is sharded into multiple safetensors files (all required for inference). The compressed format enables efficient storage and faster downloads.
180
+
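+ To pre-fetch every shard before launching a server, the standard `huggingface_hub` client can be used (a minimal sketch):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Downloads all sharded safetensors plus config and tokenizer files
+ local_dir = snapshot_download("TevunahAi/granite-34b-code-instruct-8k-FP8")
+ print(f"Model files cached at: {local_dir}")
+ ```
+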
181
+ ## 🚀 Granite Code Model Family
182
+
183
+ IBM's Granite Code models are specifically trained for enterprise code generation. The 34B version represents the flagship tier:
184
+
185
+ | Model | VRAM (vLLM) | Quality | Use Case |
186
+ |-------|-------------|---------|----------|
187
+ | **8B-FP8** | ~8GB | Good | Fast iteration, prototyping |
188
+ | **20B-FP8** | ~20GB | Better | Complex tasks, better reasoning |
189
+ | **34B-FP8** | ~34GB | Best | Flagship performance, production |
190
+
191
+ **34B Benefits:**
192
+ - ✅ **State-of-the-art code quality** for the Granite family
193
+ - ✅ **Superior reasoning** and complex problem solving
194
+ - ✅ **Enterprise-grade completions** for mission-critical applications
195
+ - ✅ **Best context understanding** across the model family
196
+ - ✅ **8K context window** for larger codebases
197
+
198
+ ## 🔬 Quality Assurance
199
+
200
+ - **Professional calibration:** 512 samples from the open_platypus dataset
201
+ - **Validation:** Tested on code generation benchmarks
202
+ - **Format:** Standard compressed_tensors for broad compatibility
203
+ - **Optimization:** Hardware-optimized quantization workflow
204
+
205
+ ## 📚 Original Model
206
+
207
+ This quantization is based on [ibm-granite/granite-34b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k) by IBM.
208
+
209
+ For comprehensive information about:
210
+ - Model architecture and training methodology
211
+ - Supported programming languages
212
+ - Evaluation benchmarks and results
213
+ - Ethical considerations and responsible AI guidelines
214
+
215
+ Please refer to the [original model card](https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k).
216
+
217
+ ## 🔧 Hardware Requirements
218
+
219
+ ### Minimum (vLLM):
220
+ - **GPU:** NVIDIA A100 40GB or RTX 6000 Ada (48GB)
221
+ - **VRAM:** 34GB minimum, 40GB+ recommended
222
+ - **CUDA:** 11.8 or newer
223
+
224
+ ### Recommended (vLLM):
225
+ - **GPU:** NVIDIA H100 (80GB) / A100 80GB / RTX 6000 Ada (48GB)
226
+ - **VRAM:** 40GB+
227
+ - **CUDA:** 12.0+
228
+
229
+ ### Transformers:
230
+ - **GPU:** Multi-GPU setup (2x A100 40GB) or single A100/H100 80GB
231
+ - **VRAM:** 68GB+ total
232
+ - **Not recommended** - use vLLM instead
233
+
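+ A quick pre-flight check (sketch) for available VRAM and native FP8 support; Ada Lovelace and Hopper GPUs report CUDA compute capability 8.9 or higher:
+
+ ```python
+ import torch
+
+ assert torch.cuda.is_available(), "An NVIDIA GPU is required"
+
+ props = torch.cuda.get_device_properties(0)
+ vram_gb = props.total_memory / 1e9
+ has_native_fp8 = (props.major, props.minor) >= (8, 9)  # Ada Lovelace / Hopper and newer
+
+ print(f"{props.name}: {vram_gb:.0f} GB VRAM, native FP8: {has_native_fp8}")
+ # Aim for ~40 GB+ free VRAM to serve this 34B FP8 checkpoint with vLLM
+ ```
+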
234
+ ## 📖 Additional Resources
235
+
236
+ - **vLLM Documentation:** [docs.vllm.ai](https://docs.vllm.ai/)
237
+ - **TensorRT-LLM:** [github.com/NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
238
+ - **TevunahAi Models:** [huggingface.co/TevunahAi](https://huggingface.co/TevunahAi)
239
+ - **llm-compressor:** [github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
240
+ - **IBM Granite:** [github.com/ibm-granite](https://github.com/ibm-granite)
241
+
242
+ ## 📄 License
243
+
244
+ This model inherits the **Apache 2.0 License** from the original Granite model.
245
+
246
+ ## 🙏 Acknowledgments
247
+
248
+ - **Original Model:** IBM Granite team
249
+ - **Quantization Framework:** Neural Magic's llm-compressor
250
+ - **Quantized by:** [TevunahAi](https://huggingface.co/TevunahAi)
251
+
252
+ ## 📝 Citation
253
+
254
+ If you use this model, please cite the original Granite work:
255
+
256
+ ```bibtex
257
+ @misc{granite2024,
258
+ title={Granite Code Models},
259
+ author={IBM Research},
260
+ year={2024},
261
+ url={https://huggingface.co/ibm-granite/granite-34b-code-instruct-8k}
262
+ }
263
+ ```
264
+
265
+ ---
266
+
267
+ <div align="center">
268
 
269
+ **Professional AI Model Quantization by TevunahAi**
270
 
271
+ *Making flagship models accessible through enterprise-grade quantization*
272
 
273
+ [View all models](https://huggingface.co/TevunahAi) | [Contact for custom quantization](https://huggingface.co/TevunahAi)
274
 
275
+ </div>