sh0ck0r committed · verified
Commit 866d18c · 1 Parent(s): 4cfb951

Upload FP8 quantized version of MaziyarPanahi/calme-3.2-instruct-78b
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,148 @@
+ ---
+ base_model: MaziyarPanahi/calme-3.2-instruct-78b
+ tags:
+ - fp8
+ - vllm
+ - compressed-tensors
+ - quantized
+ - llmcompressor
+ license: apache-2.0
+ inference:
+   parameters:
+     temperature: 0.7
+     top_p: 0.9
+     max_new_tokens: 2048
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # calme-3.2-instruct-78b - FP8 Dynamic Quantization
+
+ This is an FP8 quantized version of [MaziyarPanahi/calme-3.2-instruct-78b](https://huggingface.co/MaziyarPanahi/calme-3.2-instruct-78b), produced with `llmcompressor` using the FP8_DYNAMIC scheme.
+
+ ## Model Details
+
+ - **Base Model**: MaziyarPanahi/calme-3.2-instruct-78b
+ - **Quantization**: FP8_DYNAMIC (W8A8) - static per-channel weight scales, dynamic per-token activation scales (see the sketch below)
+ - **Format**: compressed-tensors (SafeTensors)
+ - **Memory**: roughly half the size of the original BF16 checkpoint
+ - **Quality**: typically under 1-2% degradation on common benchmarks
+
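+ For intuition, here is a toy sketch (not this repo's code) of how those scales are chosen, assuming symmetric scaling onto the float8 e4m3 range: weight scales are computed once per output channel, activation scales per token at runtime.
+
+ ```python
+ # Toy illustration of FP8_DYNAMIC (W8A8) scale selection; not this repo's code.
+ import torch
+
+ E4M3_MAX = 448.0  # largest finite magnitude representable in float8 e4m3
+
+ def fp8_scale(x: torch.Tensor, dim: int) -> torch.Tensor:
+     # Symmetric scale mapping the observed max magnitude onto the FP8 range.
+     return x.abs().amax(dim=dim, keepdim=True) / E4M3_MAX
+
+ w = torch.randn(4096, 4096)
+ w_scale = fp8_scale(w, dim=1)                    # static: one scale per output channel
+ w_fp8 = (w / w_scale).to(torch.float8_e4m3fn)    # stored in the checkpoint
+
+ a = torch.randn(8, 4096)
+ a_scale = fp8_scale(a, dim=1)                    # dynamic: one scale per token, at runtime
+ a_fp8 = (a / a_scale).to(torch.float8_e4m3fn)
+ ```
+
+ This mirrors the `strategy: channel` (weights) and `strategy: token` (activations) settings recorded in this repo's `config.json`.
+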
+ ## Quick Start
+
+ ### vLLM (Recommended)
+
+ ```bash
+ pip install vllm
+
+ # Serve an OpenAI-compatible API
+ vllm serve REPO_ID \
+     --max-model-len 32768 \
+     --gpu-memory-utilization 0.95
+ ```
+
+ ```python
+ # Offline inference via the Python API
+ from vllm import LLM
+
+ llm = LLM(model="REPO_ID")
+ outputs = llm.generate("Hello, how are you?")
+ print(outputs[0].outputs[0].text)
+ ```
+
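+ Once the server is up, any OpenAI-compatible client can query it; a minimal sketch assuming vLLM's default endpoint (`http://localhost:8000/v1`):
+
+ ```python
+ # Query the running vLLM server through its OpenAI-compatible API.
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ response = client.chat.completions.create(
+     model="REPO_ID",  # must match the name the server was launched with
+     messages=[{"role": "user", "content": "Hello, how are you?"}],
+     temperature=0.7,
+ )
+ print(response.choices[0].message.content)
+ ```
+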
+ ### Transformers
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "REPO_ID",
+     device_map="auto",
+     torch_dtype="auto"
+ )
+ tokenizer = AutoTokenizer.from_pretrained("REPO_ID")
+
+ messages = [{'role': 'user', 'content': 'Hello!'}]
+ # add_generation_prompt=True appends the assistant header so the model starts replying
+ inputs = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors='pt'
+ ).to(model.device)
+ outputs = model.generate(inputs, max_new_tokens=512)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ## Quantization Details
+
+ This model was quantized using:
+ - **Tool**: [llmcompressor](https://github.com/vllm-project/llm-compressor)
+ - **Method**: FP8_DYNAMIC (Round-to-Nearest)
+ - **Targets**: All Linear layers except `lm_head`
+ - **Scheme**: W8A8 (8-bit weights and activations)
+
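+ The `recipe.yaml` shipped in this repo records the exact modifier used. A minimal sketch of reproducing the quantization with llmcompressor (the output directory is illustrative, and exact imports may vary by llmcompressor version):
+
+ ```python
+ # Sketch of reproducing this checkpoint; mirrors recipe.yaml
+ # (QuantizationModifier, targets: [Linear], ignore: [lm_head], scheme: FP8_DYNAMIC).
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from llmcompressor import oneshot
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+
+ model_id = "MaziyarPanahi/calme-3.2-instruct-78b"
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # FP8_DYNAMIC needs no calibration data: weights are quantized ahead of time,
+ # activations are scaled per token at runtime.
+ recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+ oneshot(model=model, recipe=recipe)
+
+ save_dir = "calme-3.2-instruct-78b-FP8-Dynamic"  # illustrative output path
+ model.save_pretrained(save_dir, save_compressed=True)
+ tokenizer.save_pretrained(save_dir)
+ ```
+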
+ ## Performance
+
+ ### Memory Usage
+ - **Original BF16**: ~156 GB of weights (2 bytes per parameter)
+ - **FP8 Quantized**: ~80 GB of weights (the safetensors shards below sum to roughly this)
+ - **Savings**: ~50% VRAM for weights, freeing room for KV cache
+
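+ As a rough sanity check, those numbers follow directly from parameter count and bytes per weight (ignoring the unquantized `lm_head` and embeddings):
+
+ ```python
+ # Back-of-envelope size estimate for a 78B-parameter model.
+ params = 78e9
+ print(f"BF16: ~{params * 2 / 1e9:.0f} GB")  # 2 bytes/weight -> ~156 GB
+ print(f"FP8:  ~{params * 1 / 1e9:.0f} GB")  # 1 byte/weight  -> ~78 GB
+ # The 17 safetensors shards in this repo sum to ~80 GB, consistent with
+ # FP8 linear weights plus higher-precision lm_head and embeddings.
+ ```
+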
+ ### Inference Speed
+ - Roughly 1.3-1.8× faster inference vs BF16 on FP8-capable GPUs (Hopper/Ada)
+ - Up to ~2× higher throughput in memory-bound settings (more KV cache available)
+
+ ## Use Cases
+
+ Perfect for:
+ - ✅ Production inference on limited VRAM
+ - ✅ Running larger models on a single GPU node
+ - ✅ Cost-effective API serving
+ - ✅ High-throughput applications
+ - ✅ Extended context lengths (more KV cache)
+
+ ## Hardware Requirements
+
+ **Minimum VRAM** (approximate):
+ - The FP8 weight shards in this repo total ~80 GB, so a single 80 GB GPU is not enough once KV cache and activations are included
+ - Plan for a single H200 (141 GB), or two 80 GB GPUs (A100/H100) with tensor parallelism
+
+ **Recommended**:
+ - H100/H200 for native FP8 kernel support and best performance
+ - vLLM for optimized serving
+ - Enable FP8 KV cache for extended context
+
+ ## Important Notes
+
+ ⚠️ **Quantization Trade-offs**:
+ - Slight quality degradation (typically under 1-2%)
+ - Inference only; not suitable for further fine-tuning
+ - Works best with vLLM, which has optimized FP8 kernels
+
+ ✅ **Best Practices** (combined in the example command below):
+ - Use `--kv-cache-dtype fp8` for longer contexts
+ - Set `--gpu-memory-utilization 0.90-0.95`
+ - Add `--enforce-eager` if you encounter compilation issues
+
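+ A sketch of a serving command combining these flags (`REPO_ID` is the placeholder used throughout; tune values for your hardware):
+
+ ```bash
+ vllm serve REPO_ID \
+     --max-model-len 32768 \
+     --gpu-memory-utilization 0.95 \
+     --kv-cache-dtype fp8
+ ```
+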
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{model_name-fp8,
+   author = {author},
+   title = {model_name FP8 Dynamic Quantization},
+   year = {2025},
+   publisher = {HuggingFace},
+   url = {https://huggingface.co/repo_id}
+ }
+ ```
+
+ ## License
+
+ Inherits the license of the base model: [MaziyarPanahi/calme-3.2-instruct-78b](https://huggingface.co/MaziyarPanahi/calme-3.2-instruct-78b)
+
+ ## Acknowledgments
+
+ - Base model by [MaziyarPanahi](https://huggingface.co/MaziyarPanahi)
+ - Quantization via [llmcompressor](https://github.com/vllm-project/llm-compressor)
+ - Optimized for serving with [vLLM](https://github.com/vllm-project/vllm)
+
+ ---
+
+ **Want more FP8 models?** Check out my other quantizations!
added_tokens.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "<|endoftext|>": 151643,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644
+ }
chat_template.jinja ADDED
@@ -0,0 +1,4 @@
+ {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '
+ ' + message['content'] + '<|im_end|>' + '
+ '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
+ ' }}{% endif %}
config.json ADDED
@@ -0,0 +1,161 @@
+ {
+   "architectures": [
+     "Qwen2ForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "eos_token_id": 151645,
+   "hidden_act": "silu",
+   "hidden_size": 8192,
+   "initializer_range": 0.02,
+   "intermediate_size": 29568,
+   "layer_types": [
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention"
+   ],
+   "max_position_embeddings": 32768,
+   "max_window_layers": 80,
+   "model_type": "qwen2",
+   "num_attention_heads": 64,
+   "num_hidden_layers": 86,
+   "num_key_value_heads": 8,
+   "quantization_config": {
+     "config_groups": {
+       "group_0": {
+         "format": "float-quantized",
+         "input_activations": {
+           "actorder": null,
+           "block_structure": null,
+           "dynamic": true,
+           "group_size": null,
+           "num_bits": 8,
+           "observer": null,
+           "observer_kwargs": {},
+           "strategy": "token",
+           "symmetric": true,
+           "type": "float"
+         },
+         "output_activations": null,
+         "targets": [
+           "Linear"
+         ],
+         "weights": {
+           "actorder": null,
+           "block_structure": null,
+           "dynamic": false,
+           "group_size": null,
+           "num_bits": 8,
+           "observer": "minmax",
+           "observer_kwargs": {},
+           "strategy": "channel",
+           "symmetric": true,
+           "type": "float"
+         }
+       }
+     },
+     "format": "float-quantized",
+     "global_compression_ratio": null,
+     "ignore": [
+       "lm_head"
+     ],
+     "kv_cache_scheme": null,
+     "quant_method": "compressed-tensors",
+     "quantization_status": "compressed",
+     "sparsity_config": {},
+     "transform_config": {},
+     "version": "0.11.0"
+   },
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": null,
+   "rope_theta": 1000000.0,
+   "sliding_window": null,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.55.2",
+   "use_cache": false,
+   "use_sliding_window": false,
+   "vocab_size": 151646
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "eos_token_id": 151645,
+   "transformers_version": "4.55.2",
+   "use_cache": false
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:58fb23b7f6d640fd8758d311c4f190c565ac052c27bacd53f6924a8752b91e70
+ size 4875952840
model-00002-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:77b353471b4b0529197fa5a0cb5ad34c45b2333e84a85451c4594c15ced6c4fb
+ size 4782749264
model-00003-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:890a91921810569afc03130a4cf0a8c762c01f209c341b028cfa4ab8d890596d
+ size 4873985992
model-00004-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5c73ba5900e96dad41a646a3751fa9f19ee74634618724c174129b431709da78
+ size 4782749368
model-00005-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:116b72d494d3aa27a99e9bd1658939a792fd185d7768534887cba92ef1fd1fef
+ size 4873986032
model-00006-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:740b69c6a5bce08796b72446261eea2b7a978c1d62bd5bef113e78750a2077ea
+ size 4782749368
model-00007-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b260c8a5430689830923771eac76c678a61ca7de339c8f13ac590f0e2c791e02
+ size 4873986032
model-00008-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cea9a3e6d0dc58753c3d64733076a9bfe079f39a46793a4e7ae2b08938347968
+ size 4782749368
model-00009-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:71646681a467673f1a55467657e7f2e7d9c94c3f4a8a0f7e7d19f2b8de4b1160
+ size 4873986032
model-00010-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f3e9efb684caf69851c020d0241689eafc368cfffd1ff4b4cb5b34eccc7c36aa
+ size 4782749368
model-00011-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0a18a5fe9274130d11edf24c690ece9d4f53ea31ba070866eb736f67c88345c2
+ size 4873986032
model-00012-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a2176a61ef966e934a14d0ca056db3b01070df5defe88294732683ecb1d4b5c2
+ size 4782749368
model-00013-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d091f82e6dd088f5f70e1c1a45c8a4a4c43a8c8daf182b0365c8fb117cdf7a62
+ size 4873986032
model-00014-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9dda5f237f94344d893a8d5be4cfa68e62e63647aabc04d03e76941fbcdd82be
+ size 4782749368
model-00015-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dfba1fb0f8b3e554a5380db0c8c8a2660d85c30e21cd88273e8584921adf947e
+ size 4873986032
model-00016-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f061042a6be5c5926503fc6400356095ce574942a213bc832e84ac2ce4fcec65
+ size 4782749368
model-00017-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:79927f27c4c65629f891da9c7de2a3789d061ebe9fbcaf44af2767b78053b2d4
+ size 3211416200
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
recipe.yaml ADDED
@@ -0,0 +1,6 @@
+ default_stage:
+   default_modifiers:
+     QuantizationModifier:
+       targets: [Linear]
+       ignore: [lm_head]
+       scheme: FP8_DYNAMIC
special_tokens_map.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bcfe42da0a4497e8b2b172c1f9f4ec423a46dc12907f4349c55025f670422ba9
+ size 11418266
tokenizer_config.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff