sh0ck0r committed · Commit 8d95a1c · verified · 1 Parent(s): 57a2fa1

Upload FP8 quantized version of TheDrummer/Fallen-Command-A-111B-v1.1

Files changed (1): README.md (+147 -3)
README.md CHANGED
---
base_model: TheDrummer/Fallen-Command-A-111B-v1.1
tags:
- fp8
- vllm
- compressed-tensors
- quantized
- llmcompressor
license: apache-2.0
inference:
  parameters:
    temperature: 0.7
    top_p: 0.9
    max_new_tokens: 2048
library_name: transformers
pipeline_tag: text-generation
---

# Fallen-Command-A-111B-v1.1 - FP8 Dynamic Quantization

This is an FP8-quantized version of [TheDrummer/Fallen-Command-A-111B-v1.1](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1), produced with `llmcompressor` using the FP8_DYNAMIC scheme.

## Model Details

- **Base Model**: TheDrummer/Fallen-Command-A-111B-v1.1
- **Quantization**: FP8_DYNAMIC (W8A8)
- **Format**: compressed-tensors (SafeTensors)
- **Memory**: roughly 50% of the original BF16 footprint
- **Quality**: typically under 1-2% degradation on benchmarks

## Quick Start

### vLLM (Recommended)

```bash
pip install vllm

# Serve the model (REPO_ID is a placeholder for this repository's ID)
vllm serve REPO_ID \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95
```

```python
# Python API
from vllm import LLM

llm = LLM(model="REPO_ID")
outputs = llm.generate("Hello, how are you?")
print(outputs[0].outputs[0].text)
```
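
Once the server is up, it exposes an OpenAI-compatible API (on port 8000 by default). A request body might look like the following sketch; the endpoint path and defaults are standard vLLM behavior, and the actual POST is left commented out so the snippet runs without a live server:

```python
import json

# Chat-completions payload for vLLM's OpenAI-compatible server.
# "REPO_ID" must match the model name the server was launched with.
payload = {
    "model": "REPO_ID",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 2048,
}
body = json.dumps(payload)
print(body)

# With the server running, POST it to the chat endpoint:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```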

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "REPO_ID",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("REPO_ID")

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Quantization Details

This model was quantized using:
- **Tool**: [llmcompressor](https://github.com/vllm-project/llm-compressor)
- **Method**: FP8_DYNAMIC (round-to-nearest weights with dynamic per-token activation scales)
- **Targets**: all Linear layers except `lm_head`
- **Scheme**: W8A8 (8-bit weights and activations)
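
To give a feel for what round-to-nearest quantization does, here is a simplified NumPy sketch: it uses a single per-tensor scale and approximates the FP8 cast with uniform rounding (real E4M3 has a non-uniform grid, and `llmcompressor` uses finer-grained scales), with 448 being the E4M3 maximum:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_dequantize(w: np.ndarray) -> np.ndarray:
    """Simulate per-tensor round-to-nearest quantization (simplified)."""
    scale = np.abs(w).max() / FP8_E4M3_MAX  # map weights into FP8's range
    q = np.round(w / scale)                 # quantize (uniform-grid stand-in)
    return q * scale                        # dequantize back to float

np.random.seed(0)
w = np.random.randn(4, 4).astype(np.float32)
w_hat = quantize_dequantize(w)
print(np.abs(w - w_hat).max())  # reconstruction error is bounded by scale / 2
```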

## Performance

### Memory Usage
- **Original BF16**: ~222 GB of weights (2 bytes × 111B parameters)
- **FP8 Quantized**: ~111 GB of weights (1 byte per parameter), about 50% of the original
- **Savings**: ~50% VRAM reduction on weights
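
These ratios follow from simple arithmetic: BF16 stores 2 bytes per parameter and FP8 stores 1 (weights only; the KV cache and activations add overhead on top). A quick sketch:

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bytes_per_param

print(weight_memory_gb(111, 2))  # BF16
print(weight_memory_gb(111, 1))  # FP8
```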

### Inference Speed
- Expect roughly 1.3-1.8× faster inference vs BF16, depending on hardware
- Up to ~2× higher throughput, since the freed VRAM leaves room for more KV cache

## Use Cases

Well suited for:
- ✅ Production inference on limited VRAM
- ✅ Running larger models on a single GPU
- ✅ Cost-effective API serving
- ✅ High-throughput applications
- ✅ Extended context lengths (more room for KV cache)

## Hardware Requirements

**Minimum VRAM** (approximate, weights only):
- This 111B model in FP8 needs ~111 GB for weights, plus headroom for KV cache and activations
- Fits on a single H200 (141 GB), or on 2× A100 80GB / H100 80GB with tensor parallelism

**Recommended**:
- H100/H200 for best performance (native FP8 support)
- vLLM for optimized serving
- Enable FP8 KV cache for extended context

## Important Notes

⚠️ **Quantization Trade-offs**:
- Slight quality degradation (typically under 1-2%)
- Not suitable for fine-tuning (inference only)
- Works best with vLLM, which ships optimized FP8 kernels

✅ **Best Practices**:
- Use `--kv-cache-dtype fp8` for longer contexts
- Set `--gpu-memory-utilization` to 0.90-0.95
- Add `--enforce-eager` if you encounter compilation issues
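
Putting those flags together, a serving command might look like this sketch (`REPO_ID` is a placeholder, and the context length should be tuned to your hardware):

```bash
vllm serve REPO_ID \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8
```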

## Citation

If you use this model, please cite:

```bibtex
@misc{model_name-fp8,
  author = {author},
  title = {model_name FP8 Dynamic Quantization},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/repo_id}
}
```
133
+
134
+ ## License
135
+
136
+ Inherits license from base model: [TheDrummer/Fallen-Command-A-111B-v1.1](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1)

## Acknowledgments

- Base model by [TheDrummer](https://huggingface.co/TheDrummer)
- Quantization via [llmcompressor](https://github.com/vllm-project/llm-compressor)
- Serving optimized for [vLLM](https://github.com/vllm-project/vllm)

---

**Want more FP8 models?** Check out my other quantizations!