Update README.md

README.md (changed):

# granite-8b-code-instruct-4k-FP8

**FP8 quantized version of IBM's Granite 8B Code model for efficient inference**

This is an FP8 (E4M3) quantized version of [ibm-granite/granite-8b-code-instruct-4k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k) in the compressed_tensors format, quantized by [TevunahAi](https://huggingface.co/TevunahAi) on enterprise-grade hardware.

## Recommended Usage: vLLM

For optimal performance with **full FP8 benefits** (2x memory savings + faster inference), use **vLLM** or **TensorRT-LLM**:

### Quick Start with vLLM

```bash
pip install vllm
```

**Python API:**

```python
from vllm import LLM, SamplingParams

# vLLM auto-detects FP8 from the model config
llm = LLM(model="TevunahAi/granite-8b-code-instruct-4k-FP8", dtype="auto")

# Generate
prompt = "Write a Python function to calculate fibonacci numbers:"
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate([prompt], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

**OpenAI-Compatible API Server:**

```bash
vllm serve TevunahAi/granite-8b-code-instruct-4k-FP8 \
    --dtype auto \
    --max-model-len 4096
```

Then use the OpenAI client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # dummy key
)

response = client.chat.completions.create(
    model="TevunahAi/granite-8b-code-instruct-4k-FP8",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
```

### vLLM Benefits

- ✅ **Weights, activations, and KV cache in FP8** (see the KV cache sketch after this list)
- ✅ **~8GB VRAM** (50% reduction vs BF16)
- ✅ **Native FP8 tensor core acceleration** on Ada/Hopper GPUs
- ✅ **Faster inference** with optimized CUDA kernels
- ✅ **Runs on consumer GPUs** (RTX 4070, RTX 4060 Ti 16GB, RTX 5000 Ada)
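
One note on the first point: the FP8 weights come straight from the checkpoint, but the FP8 KV cache is an engine option you may need to turn on explicitly depending on your vLLM version. A minimal sketch, assuming a recent vLLM release where `kv_cache_dtype="fp8"` is supported on your GPU:

```python
from vllm import LLM, SamplingParams

# FP8 weights are detected from the checkpoint; kv_cache_dtype="fp8" additionally
# stores the KV cache in FP8, roughly halving cache memory per token.
llm = LLM(
    model="TevunahAi/granite-8b-code-instruct-4k-FP8",
    dtype="auto",
    kv_cache_dtype="fp8",  # assumption: requires FP8-capable hardware (Ada/Hopper)
    max_model_len=4096,
)

outputs = llm.generate(
    ["Write a Python function to calculate fibonacci numbers:"],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

The server exposes the same switch as `--kv-cache-dtype fp8` on `vllm serve`.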

## Alternative: Transformers

This model can also be loaded with `transformers`. **Note:** Transformers will decompress FP8 → BF16 during inference. However, at 8B parameters, this is manageable (~16GB VRAM).

<details>
<summary>Transformers Example (Click to expand)</summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loads FP8 weights but decompresses to BF16 during compute
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/granite-8b-code-instruct-4k-FP8",
    device_map="auto",
    torch_dtype="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/granite-8b-code-instruct-4k-FP8")

# Generate
prompt = "Write a Python function to calculate fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Requirements:**

```bash
pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors
```

**System Requirements:**

- **~16GB VRAM** (decompressed to BF16)
- CUDA 11.8 or newer
- PyTorch 2.1+ with CUDA support

</details>

## Quantization Details

| Property | Value |
|----------|-------|
| **Base Model** | [ibm-granite/granite-8b-code-instruct-4k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k) |
| **Quantization Method** | FP8 E4M3 weight-only |
| **Framework** | llm-compressor + compressed_tensors |
| **Calibration Dataset** | open_platypus (512 samples) |
| **Storage Size** | ~8GB (sharded safetensors) |
| **VRAM (vLLM)** | ~8GB |
| **VRAM (Transformers)** | ~16GB (decompressed to BF16) |
| **Target Hardware** | NVIDIA Ada (RTX 4000/5000) or Hopper (H100/GH200) |
| **Quantization Time** | 21.6 minutes |
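
The exact recipe used for this checkpoint is not published in the repository. For reference, the sketch below shows roughly how an FP8 compressed_tensors checkpoint is produced with llm-compressor's `oneshot` flow; the scheme name, output path, and sequence length here are illustrative assumptions, and older llm-compressor versions import `oneshot` from `llmcompressor.transformers` instead.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize the Linear layers to FP8 (E4M3), leaving the lm_head in higher
# precision; open_platypus supplies the calibration samples.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model="ibm-granite/granite-8b-code-instruct-4k",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="granite-8b-code-instruct-4k-FP8",  # illustrative output path
    max_seq_length=4096,                           # assumption: the model's 4K window
    num_calibration_samples=512,
)
```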

### Quantization Infrastructure

Professional hardware ensures consistent, high-quality quantization:

- **CPUs:** Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
- **GPU:** NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
- **Memory:** 256GB DDR5 + 128GB HBM2e = 384GB total system memory
- **Software Stack:** Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor

## Why FP8?

### With vLLM/TensorRT-LLM

- ✅ **50% memory reduction** vs BF16 (weights + activations + KV cache)
- ✅ **Faster inference** via native FP8 tensor cores
- ✅ **Better throughput** with optimized kernels
- ✅ **Minimal quality loss** for code generation tasks
- ✅ **Accessible on consumer GPUs** (RTX 4060 Ti 16GB+)
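
As a rough sanity check on the memory claim: 8B parameters at 1 byte per FP8 weight is about 8 GB, versus about 16 GB at 2 bytes per BF16 weight, before activations and the KV cache are counted.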

### With Transformers

- ✅ **Smaller download size** (~8GB vs ~16GB BF16)
- ✅ **Compatible** with the standard transformers workflow
- ⚠️ **Decompresses to BF16** during inference (no runtime memory benefit)

**For production inference, use vLLM to realize the full FP8 benefits.**

## Model Files

This model is sharded into multiple safetensors files, all of which are required for inference. The compressed format enables efficient storage and faster downloads.
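
If you want all shards on disk before loading (for example, to copy them to an offline machine), `huggingface_hub` can fetch the whole repository in one call. A minimal sketch; the local directory name is an arbitrary choice:

```python
from huggingface_hub import snapshot_download

# Downloads every weight shard plus the config and tokenizer files.
local_dir = snapshot_download(
    repo_id="TevunahAi/granite-8b-code-instruct-4k-FP8",
    local_dir="granite-8b-code-instruct-4k-FP8",  # arbitrary local path
)
print(local_dir)
```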

## IBM Granite Code Models

Granite Code models are trained specifically for code generation, editing, and explanation tasks. This 8B-parameter version offers strong performance on:

- Code completion and generation
- Bug fixing and refactoring
- Code explanation and documentation
- Multiple programming languages
- 4K context window

**Granite 8B vs Larger Models:**

- ✅ **Fast iteration** - quick response times
- ✅ **Accessible** - runs on consumer GPUs
- ✅ **Good quality** - suitable for most coding tasks
- ⚠️ **Trade-off:** less capable than the 20B/34B variants on very complex reasoning

## Original Model

This quantization is based on [ibm-granite/granite-8b-code-instruct-4k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k) by IBM.

For comprehensive information about:

- Model architecture and training methodology
- Supported programming languages
- Evaluation benchmarks and results
- Ethical considerations and responsible AI guidelines

please refer to the [original model card](https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k).

## Hardware Requirements

### Minimum (vLLM)

- **GPU:** NVIDIA RTX 4060 Ti (16GB) or better
- **VRAM:** 8GB minimum, 12GB+ recommended
- **CUDA:** 11.8 or newer

### Recommended (vLLM)

- **GPU:** NVIDIA RTX 4070 / 4090 / RTX 5000 Ada
- **VRAM:** 12GB+
- **CUDA:** 12.0+
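
On cards near the minimum, it can help to pin the context length to the model's 4K window and cap the GPU memory fraction explicitly. A minimal sketch using standard vLLM engine arguments; the 0.90 value is a starting point to tune, not a measured setting:

```python
from vllm import LLM

# Keep the engine inside a tight VRAM budget: limit the context to 4K tokens
# and leave headroom for other processes on the same GPU.
llm = LLM(
    model="TevunahAi/granite-8b-code-instruct-4k-FP8",
    dtype="auto",
    max_model_len=4096,
    gpu_memory_utilization=0.90,  # assumption: adjust for your card
)
```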

### Transformers

- **GPU:** Any CUDA-capable GPU
- **VRAM:** 16GB+
- Works but not optimal for performance

## Additional Resources

- **vLLM Documentation:** [docs.vllm.ai](https://docs.vllm.ai/)
- **TensorRT-LLM:** [github.com/NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
- **TevunahAi Models:** [huggingface.co/TevunahAi](https://huggingface.co/TevunahAi)
- **llm-compressor:** [github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
- **IBM Granite:** [github.com/ibm-granite](https://github.com/ibm-granite)

## License

This model inherits the **Apache 2.0 License** from the original Granite model.

## Acknowledgments

- **Original Model:** IBM Granite team
- **Quantization Framework:** Neural Magic's llm-compressor
- **Quantized by:** [TevunahAi](https://huggingface.co/TevunahAi)

## Citation

If you use this model, please cite the original Granite work:

```bibtex
@misc{granite2024,
  title={Granite Code Models},
  author={IBM Research},
  year={2024},
  url={https://huggingface.co/ibm-granite/granite-8b-code-instruct-4k}
}
```

---

<div align="center">

**Professional AI Model Quantization by TevunahAi**

*Enterprise-grade quantization on specialized hardware*

[View all models](https://huggingface.co/TevunahAi) | [Contact for custom quantization](https://huggingface.co/TevunahAi)

</div>