|
|
--- |
|
|
base_model: Steelskull/L3.3-Damascus-R1 |
|
|
tags: |
|
|
- fp8 |
|
|
- vllm |
|
|
- compressed-tensors |
|
|
- quantized |
|
|
- llmcompressor |
|
|
license: apache-2.0 |
|
|
inference: |
|
|
parameters: |
|
|
temperature: 0.7 |
|
|
top_p: 0.9 |
|
|
max_new_tokens: 2048 |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# L3.3-Damascus-R1 - FP8 Dynamic Quantization |
|
|
|
|
|
This is an FP8 quantized version of [Steelskull/L3.3-Damascus-R1](https://huggingface.co/Steelskull/L3.3-Damascus-R1) using `llmcompressor` with the FP8_DYNAMIC scheme. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model**: Steelskull/L3.3-Damascus-R1 |
|
|
- **Quantization**: FP8_DYNAMIC (W8A8) |
|
|
- **Format**: compressed-tensors (SafeTensors) |
|
|
- **Memory**: ~50% of original BF16 size |
|
|
- **Quality**: typically 1-2% or less degradation on common benchmarks
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### vLLM (Recommended) |
|
|
|
|
|
```bash
pip install vllm

# Serve the model (OpenAI-compatible API on port 8000 by default)
vllm serve REPO_ID \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95
```

```python
# Offline inference with the Python API
from vllm import LLM

llm = LLM(model="REPO_ID")
outputs = llm.generate("Hello, how are you?")
print(outputs[0].outputs[0].text)
```
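
Once the server is running, it exposes an OpenAI-compatible API, so any OpenAI client can query it. A minimal sketch, assuming the default endpoint (`http://localhost:8000/v1`) and the `openai` Python package; replace `REPO_ID` with the actual repository name used to launch the server:

```python
from openai import OpenAI

# vLLM ignores the API key, but the client requires one to be set.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="REPO_ID",  # must match the model name the server was started with
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```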
|
|
|
|
|
### Transformers |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Loading this compressed-tensors checkpoint may require: pip install compressed-tensors
model = AutoModelForCausalLM.from_pretrained(
    "REPO_ID",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("REPO_ID")

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # start the assistant turn
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
|
|
## Quantization Details |
|
|
|
|
|
This model was quantized with the following setup (a sketch of the quantization script follows the list):
|
|
- **Tool**: [llmcompressor](https://github.com/vllm-project/llm-compressor) |
|
|
- **Method**: FP8_DYNAMIC (round-to-nearest weight quantization with dynamic per-token activation scales)
|
|
- **Targets**: All Linear layers except `lm_head` |
|
|
- **Scheme**: W8A8 (8-bit weights and activations) |
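
For reference, a minimal sketch of how an FP8_DYNAMIC checkpoint like this is typically produced with `llmcompressor`. The exact script used for this repository may differ, the output directory name is illustrative, and import paths vary slightly between `llmcompressor` releases:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "Steelskull/L3.3-Damascus-R1"
OUTPUT_DIR = "L3.3-Damascus-R1-FP8-Dynamic"  # illustrative output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights and dynamic per-token FP8 activations; lm_head is left unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# FP8_DYNAMIC needs no calibration data, so oneshot runs without a dataset.
oneshot(model=model, recipe=recipe)

model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)
```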
|
|
|
|
|
|
|
|
## Performance |
|
|
|
|
|
### Memory Usage |
|
|
- **Original BF16**: ~2 bytes per parameter


- **FP8 quantized**: ~1 byte per parameter


- **Savings**: roughly 50% less VRAM for the weights (see the estimate below)
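
As a rough back-of-the-envelope check (weights only, ignoring KV cache and runtime overhead), assuming roughly 70B parameters for the base model:

```python
params = 70.6e9  # assumed approximate parameter count, not measured from this checkpoint

bf16_gib = params * 2 / 1024**3  # BF16: 2 bytes per parameter
fp8_gib = params * 1 / 1024**3   # FP8: 1 byte per parameter

print(f"BF16 weights: ~{bf16_gib:.0f} GiB")  # ~131 GiB
print(f"FP8 weights:  ~{fp8_gib:.0f} GiB")   # ~66 GiB
```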
|
|
|
|
|
### Inference Speed |
|
|
- Expect roughly 1.3-1.8× faster inference than BF16 on GPUs with native FP8 support (Hopper, Ada Lovelace)


- Up to ~2× higher throughput under concurrent load, since the freed VRAM leaves room for more KV cache
|
|
|
|
|
## Use Cases |
|
|
|
|
|
Well suited for:
|
|
- ✅ Production inference on limited VRAM |
|
|
- ✅ Running larger models on single GPU |
|
|
- ✅ Cost-effective API serving |
|
|
- ✅ High-throughput applications |
|
|
- ✅ Extended context lengths (more KV cache) |
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
**Minimum VRAM** (approximate): |
|
|
- ~70B model (this one): ~70 GB for the FP8 weights alone, so a single 80 GB GPU (A100 80GB, H100, H200) is the practical single-GPU minimum


- Smaller GPUs work when the model is sharded with tensor parallelism (e.g. `--tensor-parallel-size 2` across two 48 GB cards)
|
|
|
|
|
**Recommended**: |
|
|
- H100/H200 for best performance |
|
|
- vLLM for optimized serving |
|
|
- Enable FP8 KV cache for extended context |
|
|
|
|
|
## Important Notes |
|
|
|
|
|
⚠️ **Quantization Trade-offs**: |
|
|
- Slight quality degradation (typically 1-2% or less)
|
|
- Not suitable for fine-tuning (inference only) |
|
|
- Best served with vLLM, which provides optimized FP8 kernels
|
|
|
|
|
✅ **Best Practices**: |
|
|
- Use `--kv-cache-dtype fp8` for longer contexts |
|
|
- Set `--gpu-memory-utilization 0.90-0.95` |
|
|
- Add `--enforce-eager` if you encounter compilation issues (see the example below)
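
Putting these together, a minimal offline-inference sketch using the vLLM Python API; the keyword arguments mirror the CLI flags above, and `REPO_ID` is the placeholder used throughout this card:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="REPO_ID",
    max_model_len=32768,
    gpu_memory_utilization=0.95,
    kv_cache_dtype="fp8",   # FP8 KV cache frees VRAM for longer contexts
    enforce_eager=True,     # only needed if CUDA-graph/compilation issues appear
)

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)
outputs = llm.generate(["Hello, how are you?"], sampling)
print(outputs[0].outputs[0].text)
```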
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{l33-damascus-r1-fp8,
  author = {author},
  title = {L3.3-Damascus-R1 FP8 Dynamic Quantization},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/repo_id}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
Inherits license from base model: [Steelskull/L3.3-Damascus-R1](https://huggingface.co/Steelskull/L3.3-Damascus-R1) |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- Base model by [Steelskull](https://huggingface.co/Steelskull) |
|
|
- Quantization via [llmcompressor](https://github.com/vllm-project/llm-compressor) |
|
|
- Optimized for serving with [vLLM](https://github.com/vllm-project/vllm)
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
**Want more FP8 models?** Check out my other quantizations! |
|
|
|