sh0ck0r committed on
Commit 88381da · verified · 1 Parent(s): 4c64bc8

Upload FP8 quantized version of deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct

README.md ADDED
@@ -0,0 +1,148 @@
+ ---
+ base_model: deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
+ tags:
+ - fp8
+ - vllm
+ - compressed-tensors
+ - quantized
+ - llmcompressor
+ license: apache-2.0
+ inference:
+   parameters:
+     temperature: 0.7
+     top_p: 0.9
+     max_new_tokens: 2048
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # DeepSeek-Coder-V2-Lite-Instruct - FP8 Dynamic Quantization
+
+ This is an FP8-quantized version of [deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct), produced with `llmcompressor` using the `FP8_DYNAMIC` scheme.
+
+ ## Model Details
+
+ - **Base Model**: deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
+ - **Quantization**: FP8_DYNAMIC (W8A8: static per-channel FP8 weights, dynamic per-token FP8 activations)
+ - **Format**: compressed-tensors (SafeTensors)
+ - **Memory**: ~50% of original BF16 size (about 16.1 GB of weights across four shards)
+ - **Quality**: <1-2% degradation on benchmarks (a typical figure for FP8, not a measurement of this checkpoint)
+
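+ ## Quick Start
+
+ ### vLLM (Recommended)
+
+ Replace `REPO_ID` with this repository's Hugging Face ID.
+
+ ```bash
+ pip install vllm
+
+ # Serve the model
+ vllm serve REPO_ID \
+   --max-model-len 32768 \
+   --gpu-memory-utilization 0.95
+ ```
+
+ ```python
+ # Python API (offline inference)
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(model="REPO_ID")
+ # Without max_tokens, vLLM stops after its small default token budget
+ params = SamplingParams(temperature=0.3, top_p=0.95, max_tokens=512)
+ outputs = llm.generate("Hello, how are you?", params)
+ print(outputs[0].outputs[0].text)
+ ```
+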
+ ### Transformers
+
+ Loading this checkpoint with Transformers requires the `compressed-tensors` package.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "REPO_ID",
+     device_map="auto",
+     torch_dtype="auto",
+ )
+ tokenizer = AutoTokenizer.from_pretrained("REPO_ID")
+
+ messages = [{"role": "user", "content": "Hello!"}]
+ # add_generation_prompt=True appends "Assistant:" so the model answers the user turn
+ inputs = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+ outputs = model.generate(inputs, max_new_tokens=512)
+ # Decode only the newly generated tokens, not the echoed prompt
+ print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
+ ```
+
+ ## Quantization Details
+
+ This model was quantized using:
+ - **Tool**: [llmcompressor](https://github.com/vllm-project/llm-compressor)
+ - **Method**: FP8_DYNAMIC (round-to-nearest, no calibration data required)
+ - **Targets**: All `Linear` layers except `lm_head`
+ - **Scheme**: W8A8 (8-bit float weights with static per-channel scales; 8-bit float activations with dynamic per-token scales)
+
+ ## Performance
+
+ ### Memory Usage
+ - **Original BF16**: ~2× the size of the FP8 checkpoint
+ - **FP8 Quantized**: ~16.1 GB of weights (four SafeTensors shards), ~50% of the original
+ - **Savings**: ~50% less VRAM for weights
+
+ ### Inference Speed
+ - Expect roughly 1.3-1.8× faster inference than BF16 on GPUs with native FP8 support
+ - Up to ~2× higher throughput, since the smaller weights leave more VRAM for KV cache
+
+ ## Use Cases
+
+ Perfect for:
+ - ✅ Production inference on limited VRAM
+ - ✅ Running larger models on a single GPU
+ - ✅ Cost-effective API serving
+ - ✅ High-throughput applications
+ - ✅ Extended context lengths (more KV cache)
+
+ ## Hardware Requirements
+
+ **Minimum VRAM** (approximate):
+ - ~16.1 GB for the FP8 weights alone, plus headroom for KV cache and activations, so a 24 GB GPU is a plausible minimum
+
+ **Recommended**:
+ - H100/H200 for best performance
+ - vLLM for optimized serving
+ - Enable FP8 KV cache for extended context
+
+ ## Important Notes
+
+ ⚠️ **Quantization Trade-offs**:
+ - Slight quality degradation (typically <1-2%)
+ - Not suitable for fine-tuning (inference only)
+ - Best with vLLM, which has optimized FP8 kernels
+
+ ✅ **Best Practices**:
+ - Use `--kv-cache-dtype fp8` for longer contexts
+ - Set `--gpu-memory-utilization` to a value between 0.90 and 0.95
+ - Add `--enforce-eager` if you encounter compilation issues
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{model_name-fp8,
+   author = {author},
+   title = {model_name FP8 Dynamic Quantization},
+   year = {2025},
+   publisher = {HuggingFace},
+   url = {https://huggingface.co/repo_id}
+ }
+ ```
+
+ ## License
+
+ Inherits license from base model: [deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct)
+
+ ## Acknowledgments
+
+ - Base model by [deepseek-ai](https://huggingface.co/deepseek-ai)
+ - Quantization via [llmcompressor](https://github.com/vllm-project/llm-compressor)
+ - Serving optimized for [vLLM](https://github.com/vllm-project/vllm)
+
+ ---
+
+ **Want more FP8 models?** Check out my other quantizations!
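Once the `vllm serve` command from the README's Quick Start is running, the model can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library. The address `localhost:8000` (vLLM's default), the helper names `build_chat_request` and `chat`, and the reuse of the sampling values from `generation_config.json` are assumptions made for illustration; `REPO_ID` is still the README's placeholder.

```python
import json
import urllib.request

# Assumption: vLLM's default bind address; adjust if you passed --host/--port.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat completion payload (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Sampling values mirror generation_config.json in this commit.
        "temperature": 0.3,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def chat(model: str, prompt: str) -> str:
    """POST the payload to a running server and return the reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With a server running, `print(chat("REPO_ID", "Write a function that reverses a string."))` would print the model's answer. Going through the chat endpoint lets the server apply `chat_template.jinja`, so the client never has to format the prompt itself.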
chat_template.jinja ADDED
@@ -0,0 +1,5 @@
+ {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}{{ 'User: ' + message['content'] + '
+
+ ' }}{% elif message['role'] == 'assistant' %}{{ 'Assistant: ' + message['content'] + eos_token }}{% elif message['role'] == 'system' %}{{ message['content'] + '
+
+ ' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
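The template is dense on one line, so it may help to see the prompt string it produces. The sketch below re-implements the same branching in plain Python rather than running Jinja; `render_chat` is a hypothetical helper, and the BOS/EOS strings are copied from `special_tokens_map.json` in this commit.

```python
BOS = "<|begin▁of▁sentence|>"  # bos_token from special_tokens_map.json
EOS = "<|end▁of▁sentence|>"    # eos_token from special_tokens_map.json


def render_chat(messages, add_generation_prompt=False, bos=BOS, eos=EOS):
    """Plain-Python mirror of chat_template.jinja (illustrative helper)."""
    parts = [bos]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            parts.append("User: " + content + "\n\n")
        elif role == "assistant":
            parts.append("Assistant: " + content + eos)
        elif role == "system":
            parts.append(content + "\n\n")
        # Any other role produces no output, as in the template.
    if add_generation_prompt:
        parts.append("Assistant:")
    return "".join(parts)


prompt = render_chat(
    [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    add_generation_prompt=True,
)
print(repr(prompt))
```

Note that `add_generation_prompt` defaults to false in the template, so a caller that omits it gets a prompt ending after the user turn, with no trailing `Assistant:` for the model to continue from.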
config.json ADDED
@@ -0,0 +1,108 @@
+ {
+   "architectures": [
+     "DeepseekV2ForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration_deepseek.DeepseekV2Config",
+     "AutoModel": "modeling_deepseek.DeepseekV2Model",
+     "AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
+   },
+   "aux_loss_alpha": 0.001,
+   "bos_token_id": 100000,
+   "eos_token_id": 100001,
+   "first_k_dense_replace": 1,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": 10944,
+   "kv_lora_rank": 512,
+   "max_position_embeddings": 163840,
+   "mlp_bias": false,
+   "model_type": "deepseek_v2",
+   "moe_intermediate_size": 1408,
+   "moe_layer_freq": 1,
+   "n_group": 1,
+   "n_routed_experts": 64,
+   "n_shared_experts": 2,
+   "norm_topk_prob": false,
+   "num_attention_heads": 16,
+   "num_experts_per_tok": 6,
+   "num_hidden_layers": 27,
+   "num_key_value_heads": 16,
+   "pretraining_tp": 1,
+   "q_lora_rank": null,
+   "qk_nope_head_dim": 128,
+   "qk_rope_head_dim": 64,
+   "quantization_config": {
+     "config_groups": {
+       "group_0": {
+         "format": "float-quantized",
+         "input_activations": {
+           "actorder": null,
+           "block_structure": null,
+           "dynamic": true,
+           "group_size": null,
+           "num_bits": 8,
+           "observer": null,
+           "observer_kwargs": {},
+           "strategy": "token",
+           "symmetric": true,
+           "type": "float"
+         },
+         "output_activations": null,
+         "targets": [
+           "Linear"
+         ],
+         "weights": {
+           "actorder": null,
+           "block_structure": null,
+           "dynamic": false,
+           "group_size": null,
+           "num_bits": 8,
+           "observer": "minmax",
+           "observer_kwargs": {},
+           "strategy": "channel",
+           "symmetric": true,
+           "type": "float"
+         }
+       }
+     },
+     "format": "float-quantized",
+     "global_compression_ratio": null,
+     "ignore": [
+       "lm_head"
+     ],
+     "kv_cache_scheme": null,
+     "quant_method": "compressed-tensors",
+     "quantization_status": "compressed",
+     "sparsity_config": {},
+     "transform_config": {},
+     "version": "0.11.0"
+   },
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": {
+     "beta_fast": 32,
+     "beta_slow": 1,
+     "factor": 40,
+     "mscale": 0.707,
+     "mscale_all_dim": 0.707,
+     "original_max_position_embeddings": 4096,
+     "rope_type": "yarn",
+     "type": "yarn"
+   },
+   "rope_theta": 10000,
+   "routed_scaling_factor": 1.0,
+   "scoring_func": "softmax",
+   "seq_aux": true,
+   "tie_word_embeddings": false,
+   "topk_group": 1,
+   "topk_method": "greedy",
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.55.2",
+   "use_cache": true,
+   "v_head_dim": 128,
+   "vocab_size": 102400
+ }
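The README recommends an FP8 KV cache for long contexts, and the fields in this config allow a rough estimate of what that saves. The sketch below assumes the serving engine caches only the compressed multi-head latent attention state per layer (`kv_lora_rank` plus `qk_rope_head_dim` values per token); an engine that stores full per-head keys and values would need considerably more. The function names are hypothetical.

```python
# Values copied from config.json in this commit.
NUM_HIDDEN_LAYERS = 27
KV_LORA_RANK = 512
QK_ROPE_HEAD_DIM = 64


def kv_cache_bytes_per_token(bytes_per_element: int) -> int:
    """Cache bytes per token, assuming only the compressed latent and the
    shared RoPE key are stored for each layer (an assumption, see above)."""
    per_layer = KV_LORA_RANK + QK_ROPE_HEAD_DIM
    return NUM_HIDDEN_LAYERS * per_layer * bytes_per_element


def kv_cache_gib(context_tokens: int, bytes_per_element: int) -> float:
    """Total cache size in GiB for one sequence of the given length."""
    return context_tokens * kv_cache_bytes_per_token(bytes_per_element) / 2**30


# Compare a BF16 cache (2 bytes/element) with an FP8 cache (1 byte/element).
for label, width in (("bf16", 2), ("fp8", 1)):
    print(f"{label}: {kv_cache_bytes_per_token(width)} B/token, "
          f"{kv_cache_gib(32768, width):.2f} GiB at 32768 tokens")
```

Under this assumption a single 32768-token sequence needs a little under 1 GiB of BF16 cache, and half that in FP8, which is the headroom that `--kv-cache-dtype fp8` trades for some extra quantization error.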
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 100000,
+   "do_sample": true,
+   "eos_token_id": 100001,
+   "temperature": 0.3,
+   "top_p": 0.95,
+   "transformers_version": "4.55.2"
+ }
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a079cb0d2e82d66714b9494d9c887a7e3ca124e929dd26a4233bad95e2c4e27
+ size 4998118952
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:910bb14e7858bd9ec6ecf98a1b4b860c483f40388cd2e3d4932a81977146316c
+ size 4999835464
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2cd0f9de0722d320f1a57e3f5c9383aa6c1b5360295cd1059c5d7deec23820cf
+ size 5000220496
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:768dccd2a8897004a344d576234c99f2497adce9b1dd7d7c73a1e1590fa3d4f8
+ size 1149746800
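The four shard entries are Git LFS pointer files, and their `size` fields give the real download size of the checkpoint. A small sketch of reading a pointer and totalling the shards follows; `parse_lfs_pointer` is a hypothetical helper, and the sizes are copied from the pointers above.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into a {key: value} dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])  # size is a byte count
    return fields


# Pointer text for the last shard, as it appears in this commit.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:768dccd2a8897004a344d576234c99f2497adce9b1dd7d7c73a1e1590fa3d4f8\n"
    "size 1149746800\n"
)

# Sizes in bytes of the four shards, in order.
SHARD_SIZES = [4998118952, 4999835464, 5000220496, 1149746800]
total = sum(SHARD_SIZES)
print(f"{total} bytes = {total / 1e9:.2f} GB = {total / 2**30:.2f} GiB")
```

The total is the weight footprint to budget for before any KV cache or activation memory is added.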
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
recipe.yaml ADDED
@@ -0,0 +1,6 @@
+ default_stage:
+   default_modifiers:
+     QuantizationModifier:
+       targets: [Linear]
+       ignore: [lm_head]
+       scheme: FP8_DYNAMIC
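The recipe names the scheme but not what it computes. Per `quantization_config` in `config.json`, weights get one static symmetric scale per output channel and activations get one dynamic symmetric scale per token. The sketch below shows only that scale computation on plain Python lists. It assumes the FP8 format is E4M3 with a largest finite value of 448, it does not perform the final rounding to FP8, and the helper names are hypothetical; it is not llmcompressor's implementation.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed FP8 format)


def absmax_scales(rows, eps=1e-12):
    """One symmetric scale per row: max|x| / FP8 max, floored to avoid 0."""
    return [max(max(abs(x) for x in row), eps) / FP8_E4M3_MAX for row in rows]


def to_fp8_range(rows, scales):
    """Divide each row by its scale and clamp to the FP8 range; a real
    kernel would then round each value to the nearest FP8 number."""
    return [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / s)) for x in row]
        for row, s in zip(rows, scales)
    ]


# Weights: rows are output channels; scales are computed once and stored.
weight = [[0.5, -2.0], [0.1, 0.05]]
weight_scales = absmax_scales(weight)
scaled_weight = to_fp8_range(weight, weight_scales)

# Activations: rows are tokens; scales are recomputed on every forward pass.
activations = [[3.0, -1.5], [0.25, 0.75]]
activation_scales = absmax_scales(activations)

print(weight_scales, activation_scales)
```

Because the activation scales depend only on the current input, no calibration dataset is needed, which is why the recipe consists of a single modifier.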
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "bos_token": {
+     "content": "<|begin▁of▁sentence|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|end▁of▁sentence|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|end▁of▁sentence|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,162 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "add_prefix_space": null,
+   "added_tokens_decoder": {
+     "100000": {
+       "content": "<|begin▁of▁sentence|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100001": {
+       "content": "<|end▁of▁sentence|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100002": {
+       "content": "<|fim▁hole|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100003": {
+       "content": "<|fim▁begin|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100004": {
+       "content": "<|fim▁end|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100005": {
+       "content": "<|completion|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100006": {
+       "content": "<|User|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100007": {
+       "content": "<|Assistant|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100008": {
+       "content": "<|EOT|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100009": {
+       "content": "<|tool▁calls▁begin|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100010": {
+       "content": "<|tool▁calls▁end|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100011": {
+       "content": "<|tool▁call▁begin|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100012": {
+       "content": "<|tool▁call▁end|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100013": {
+       "content": "<|tool▁outputs▁begin|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100014": {
+       "content": "<|tool▁outputs▁end|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100015": {
+       "content": "<|tool▁output▁begin|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100016": {
+       "content": "<|tool▁output▁end|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "100017": {
+       "content": "<|tool▁sep|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "bos_token": "<|begin▁of▁sentence|>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|end▁of▁sentence|>",
+   "extra_special_tokens": {},
+   "legacy": true,
+   "model_max_length": 16384,
+   "pad_token": "<|end▁of▁sentence|>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "LlamaTokenizerFast",
+   "unk_token": null,
+   "use_default_system_prompt": false
+ }