---
base_model: Steelskull/L3.3-Cu-Mai-R1-70b
tags:
- fp8
- vllm
- compressed-tensors
- quantized
- llmcompressor
license: apache-2.0
inference:
  parameters:
    temperature: 0.7
    top_p: 0.9
    max_new_tokens: 2048
library_name: transformers
pipeline_tag: text-generation
---

# L3.3-Cu-Mai-R1-70b - FP8 Dynamic Quantization

This is an FP8 quantized version of [Steelskull/L3.3-Cu-Mai-R1-70b](https://huggingface.co/Steelskull/L3.3-Cu-Mai-R1-70b), produced with `llmcompressor` using the FP8_DYNAMIC scheme.

## Model Details

- **Base Model**: Steelskull/L3.3-Cu-Mai-R1-70b
- **Quantization**: FP8_DYNAMIC (W8A8)
- **Format**: compressed-tensors (SafeTensors)
- **Memory**: roughly 50% of the original BF16 footprint
- **Quality**: typically under 2% degradation on standard benchmarks

## Quick Start

### vLLM (Recommended)

```bash
pip install vllm

# Serve the model behind an OpenAI-compatible API
vllm serve REPO_ID \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95
```

```python
# Offline inference via the vLLM Python API
from vllm import LLM

llm = LLM(model="REPO_ID")
outputs = llm.generate("Hello, how are you?")
print(outputs[0].outputs[0].text)
```
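
The generation settings advertised in the card metadata (temperature 0.7, top_p 0.9, up to 2048 new tokens) can be passed explicitly via `SamplingParams`; the values below simply mirror that metadata and are a starting point, not a requirement:

```python
from vllm import LLM, SamplingParams

# Sampling settings mirroring the card metadata; tune them for your workload.
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)

llm = LLM(model="REPO_ID")
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```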

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Loading this compressed-tensors checkpoint requires the `compressed-tensors` package.
model = AutoModelForCausalLM.from_pretrained(
    "REPO_ID",
    device_map="auto",
    torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("REPO_ID")

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Quantization Details

This model was quantized with the following setup (an example recipe is sketched after the list):
- **Tool**: [llmcompressor](https://github.com/vllm-project/llm-compressor)
- **Method**: FP8_DYNAMIC (round-to-nearest weights, dynamic per-token activation scales)
- **Targets**: all `Linear` layers except `lm_head`
- **Scheme**: W8A8 (8-bit weights and activations)
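
For reference, here is a minimal sketch of a data-free FP8_DYNAMIC run with llmcompressor, following the library's documented workflow; the exact script used for this upload may differ, the output directory is illustrative, and on newer llmcompressor versions `oneshot` may instead be imported from the top-level `llmcompressor` package:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "Steelskull/L3.3-Cu-Mai-R1-70b"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights + dynamic FP8 activations on all Linear layers, keeping lm_head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# FP8_DYNAMIC is data-free, so no calibration dataset is needed.
oneshot(model=model, recipe=recipe)

SAVE_DIR = "L3.3-Cu-Mai-R1-70b-FP8-Dynamic"  # illustrative output path
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```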

## Performance

### Memory Usage
- **Original BF16**: ~140 GB of weights (~70B parameters × 2 bytes)
- **FP8 Quantized**: ~70 GB of weights (~70B parameters × 1 byte)
- **Savings**: ~50% VRAM reduction on weights (activations and KV cache are extra); a quick estimate is sketched below
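
A back-of-the-envelope check, assuming roughly 70.6B parameters for the Llama 3.3 70B architecture (parameter count approximate; runtime overhead not included):

```python
# Weight-only memory estimate; KV cache and activations are not included.
params = 70.6e9                 # approximate parameter count of a Llama 3.3 70B model
bf16_gb = params * 2 / 1e9      # 2 bytes per parameter
fp8_gb = params * 1 / 1e9       # 1 byte per parameter
print(f"BF16 weights: ~{bf16_gb:.0f} GB")  # ~141 GB
print(f"FP8  weights: ~{fp8_gb:.0f} GB")   # ~71 GB
```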

### Inference Speed
- Expect roughly 1.3-1.8× faster inference vs BF16 on GPUs with native FP8 support (Ada/Hopper)
- Up to ~2× higher throughput, since the smaller weights leave more room for KV cache and larger batches

A simple way to measure throughput on your own hardware is sketched below.
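
A minimal throughput check with the vLLM Python API (the batch size, prompt, and token budget are arbitrary placeholders):

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="REPO_ID", max_model_len=4096)
prompts = ["Write a short poem about the sea."] * 32   # arbitrary batch for a rough measurement
params = SamplingParams(max_tokens=128)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"~{generated / elapsed:.1f} generated tokens/s")
```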

## Use Cases

Well suited for:
- ✅ Production inference on limited VRAM
- ✅ Running larger models on a single GPU
- ✅ Cost-effective API serving (a client example follows this list)
- ✅ High-throughput applications
- ✅ Extended context lengths (more room for KV cache)
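
The `vllm serve` command above exposes an OpenAI-compatible endpoint; a minimal client sketch using the `openai` Python package, where the base URL, API key, and prompt are placeholders for a local deployment:

```python
from openai import OpenAI

# Point the client at the local vLLM server started with `vllm serve`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="REPO_ID",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)
print(response.choices[0].message.content)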

## Hardware Requirements

**Minimum VRAM** (approximate):
- ~70 GB for the FP8 weights alone, plus headroom for KV cache and activations
- Fits on a single 80 GB GPU (A100 80GB, H100, H200) at moderate context lengths; smaller cards need tensor parallelism (see the example below)
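
If no single GPU has enough memory, vLLM can shard the model with tensor parallelism; a sketch for a hypothetical 2-GPU setup (GPU count and context length are illustrative):

```bash
# Shard the FP8 weights across 2 GPUs (e.g. 2× 48 GB cards).
vllm serve REPO_ID \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.90
```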

**Recommended**:
- H100/H200 for best performance (native FP8 tensor cores)
- vLLM for optimized serving
- FP8 KV cache for extended context

## Important Notes

⚠️ **Quantization Trade-offs**:
- Slight quality degradation (typically under 2%)
- Not suitable for fine-tuning (inference only)
- Best served with vLLM, which has optimized FP8 kernels

✅ **Best Practices** (combined in the example below):
- Use `--kv-cache-dtype fp8` for longer contexts
- Set `--gpu-memory-utilization` to 0.90-0.95
- Add `--enforce-eager` if you encounter compilation issues
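
Putting those flags together, an example invocation (the context length and memory fraction are reasonable starting points, not required values):

```bash
vllm serve REPO_ID \
    --max-model-len 32768 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --enforce-eager
```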

## Citation

If you use this model, please cite:

```bibtex
@misc{L33-Cu-Mai-R1-70b-FP8,
  author    = {author},
  title     = {L3.3-Cu-Mai-R1-70b FP8 Dynamic Quantization},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/repo_id}
}
```

## License

Inherits license from base model: [Steelskull/L3.3-Cu-Mai-R1-70b](https://huggingface.co/Steelskull/L3.3-Cu-Mai-R1-70b)

## Acknowledgments

- Base model by [Steelskull](https://huggingface.co/Steelskull)
- Quantization via [llmcompressor](https://github.com/vllm-project/llm-compressor)
- Serving optimized for [vLLM](https://github.com/vllm-project/vllm)

---

**Want more FP8 models?** Check out my other quantizations!