sh0ck0r committed · Commit 8d95a1c · verified · 1 Parent(s): 57a2fa1

Upload FP8 quantized version of TheDrummer/Fallen-Command-A-111B-v1.1

Files changed (1): README.md (+147 -3)
README.md CHANGED
---
base_model: TheDrummer/Fallen-Command-A-111B-v1.1
tags:
- fp8
- vllm
- compressed-tensors
- quantized
- llmcompressor
license: apache-2.0
inference:
  parameters:
    temperature: 0.7
    top_p: 0.9
    max_new_tokens: 2048
library_name: transformers
pipeline_tag: text-generation
---

# Fallen-Command-A-111B-v1.1 - FP8 Dynamic Quantization

This is an FP8-quantized version of [TheDrummer/Fallen-Command-A-111B-v1.1](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1), produced with `llmcompressor` using the FP8_DYNAMIC scheme.

## Model Details

- **Base Model**: TheDrummer/Fallen-Command-A-111B-v1.1
- **Quantization**: FP8_DYNAMIC (W8A8)
- **Format**: compressed-tensors (SafeTensors)
- **Memory**: roughly 50% of the original BF16 footprint
- **Quality**: typically under 1-2% degradation on benchmarks

## Quick Start

### vLLM (Recommended)

```bash
pip install vllm

# Serve the model (REPO_ID is a placeholder for this repository's ID)
vllm serve REPO_ID \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95
```

```python
# Python API
from vllm import LLM

llm = LLM(model="REPO_ID")
outputs = llm.generate("Hello, how are you?")
print(outputs[0].outputs[0].text)
```
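
Once the server is up, it exposes an OpenAI-compatible API (on port 8000 by default). A request body might look like the following sketch; the endpoint path and defaults are standard vLLM behavior, and the actual POST is left commented out so the snippet runs without a live server:

```python
import json

# Chat-completions payload for vLLM's OpenAI-compatible server.
# "REPO_ID" must match the model name the server was launched with.
payload = {
    "model": "REPO_ID",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 2048,
}
body = json.dumps(payload)
print(body)

# With the server running, POST it to the chat endpoint:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```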

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "REPO_ID",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("REPO_ID")

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Quantization Details

This model was quantized using:
- **Tool**: [llmcompressor](https://github.com/vllm-project/llm-compressor)
- **Method**: FP8_DYNAMIC (round-to-nearest weights with dynamic per-token activation scales)
- **Targets**: all Linear layers except `lm_head`
- **Scheme**: W8A8 (8-bit weights and activations)
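
To give a feel for what round-to-nearest quantization does, here is a simplified NumPy sketch: it uses a single per-tensor scale and approximates the FP8 cast with uniform rounding (real E4M3 has a non-uniform grid, and `llmcompressor` uses finer-grained scales), with 448 being the E4M3 maximum:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_dequantize(w: np.ndarray) -> np.ndarray:
    """Simulate per-tensor round-to-nearest quantization (simplified)."""
    scale = np.abs(w).max() / FP8_E4M3_MAX  # map weights into FP8's range
    q = np.round(w / scale)                 # quantize (uniform-grid stand-in)
    return q * scale                        # dequantize back to float

np.random.seed(0)
w = np.random.randn(4, 4).astype(np.float32)
w_hat = quantize_dequantize(w)
print(np.abs(w - w_hat).max())  # reconstruction error is bounded by scale / 2
```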

## Performance

### Memory Usage
- **Original BF16**: ~222 GB of weights (2 bytes × 111B parameters)
- **FP8 Quantized**: ~111 GB of weights (1 byte per parameter), about 50% of the original
- **Savings**: ~50% VRAM reduction on weights
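
These ratios follow from simple arithmetic: BF16 stores 2 bytes per parameter and FP8 stores 1 (weights only; the KV cache and activations add overhead on top). A quick sketch:

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bytes_per_param

print(weight_memory_gb(111, 2))  # BF16
print(weight_memory_gb(111, 1))  # FP8
```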

### Inference Speed
- Expect roughly 1.3-1.8× faster inference vs BF16, depending on hardware
- Up to ~2× higher throughput, since the freed VRAM leaves room for more KV cache

## Use Cases

Well suited for:
- ✅ Production inference on limited VRAM
- ✅ Running larger models on a single GPU
- ✅ Cost-effective API serving
- ✅ High-throughput applications
- ✅ Extended context lengths (more room for KV cache)

## Hardware Requirements

**Minimum VRAM** (approximate, weights only):
- This 111B model in FP8 needs ~111 GB for weights, plus headroom for KV cache and activations
- Fits on a single H200 (141 GB), or on 2× A100 80GB / H100 80GB with tensor parallelism

**Recommended**:
- H100/H200 for best performance (native FP8 support)
- vLLM for optimized serving
- Enable FP8 KV cache for extended context

## Important Notes

⚠️ **Quantization Trade-offs**:
- Slight quality degradation (typically under 1-2%)
- Not suitable for fine-tuning (inference only)
- Works best with vLLM, which ships optimized FP8 kernels

✅ **Best Practices**:
- Use `--kv-cache-dtype fp8` for longer contexts
- Set `--gpu-memory-utilization` to 0.90-0.95
- Add `--enforce-eager` if you encounter compilation issues
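
Putting those flags together, a serving command might look like this sketch (`REPO_ID` is a placeholder, and the context length should be tuned to your hardware):

```bash
vllm serve REPO_ID \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8
```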

## Citation

If you use this model, please cite:

```bibtex
@misc{model_name-fp8,
  author = {author},
  title = {model_name FP8 Dynamic Quantization},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/repo_id}
}
```
133
+
134
+ ## License
135
+
136
+ Inherits license from base model: [TheDrummer/Fallen-Command-A-111B-v1.1](https://huggingface.co/TheDrummer/Fallen-Command-A-111B-v1.1)

## Acknowledgments

- Base model by [TheDrummer](https://huggingface.co/TheDrummer)
- Quantization via [llmcompressor](https://github.com/vllm-project/llm-compressor)
- Serving optimized for [vLLM](https://github.com/vllm-project/vllm)

---

**Want more FP8 models?** Check out my other quantizations!