Upload FP8 quantized version of deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct

Files changed (13)

README.md +148 -0
chat_template.jinja +5 -0
config.json +108 -0
generation_config.json +9 -0
model-00001-of-00004.safetensors +3 -0
model-00002-of-00004.safetensors +3 -0
model-00003-of-00004.safetensors +3 -0
model-00004-of-00004.safetensors +3 -0
model.safetensors.index.json +0 -0
recipe.yaml +6 -0
special_tokens_map.json +23 -0
tokenizer.json +0 -0
tokenizer_config.json +162 -0

README.md ADDED

	@@ -0,0 +1,148 @@

+---
+base_model: deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
+tags:
+  - fp8
+  - vllm
+  - compressed-tensors
+  - quantized
+  - llmcompressor
+license: apache-2.0
+inference:
+  parameters:
+    temperature: 0.7
+    top_p: 0.9
+    max_new_tokens: 2048
+library_name: transformers
+pipeline_tag: text-generation
+---
+# DeepSeek-Coder-V2-Lite-Instruct - FP8 Dynamic Quantization
+This is an FP8 quantized version of [deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct) using `llmcompressor` with the FP8_DYNAMIC scheme.
+## Model Details
+- **Base Model**: deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
+- **Quantization**: FP8_DYNAMIC (W8A8)
+- **Format**: compressed-tensors (SafeTensors)
+- **Memory**: ~50% of original BF16 size
+- **Quality**: typically under 1-2% degradation on benchmarks for FP8 (no evaluation results are reported for this checkpoint)
+## Quick Start
+### vLLM (Recommended)
+```bash
+pip install vllm
+vllm serve REPO_ID \
+  --max-model-len 32768 \
+  --gpu-memory-utilization 0.95
+```
+```python
+from vllm import LLM  # offline Python API
+llm = LLM(model="REPO_ID")
+outputs = llm.generate("Hello, how are you?")
+print(outputs[0].outputs[0].text)
+```
+### Transformers
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+model = AutoModelForCausalLM.from_pretrained(
+    "REPO_ID",
+    device_map="auto",
+    torch_dtype="auto"
+)
+tokenizer = AutoTokenizer.from_pretrained("REPO_ID")
+messages = [{'role': 'user', 'content': 'Hello!'}]
+inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device)
+outputs = model.generate(inputs, max_new_tokens=512)
+print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))  # new tokens only
+```
+## Quantization Details
+This model was quantized using:
+- **Tool**: [llmcompressor](https://github.com/vllm-project/llm-compressor)
+- **Method**: FP8_DYNAMIC (Round-to-Nearest)
+- **Targets**: All Linear layers except `lm_head`
+- **Scheme**: W8A8 (8-bit float weights with static per-channel scales; 8-bit float activations with dynamic per-token scales)
+## Performance
+### Memory Usage
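+- **FP8 Quantized**: ~16.1 GB of weights (sum of the four safetensors shards in this commit)
+- **Original BF16**: roughly 2× the FP8 size
+- **Savings**: ~50% of weight memory
+### Inference Speed
+- Expect roughly 1.3-1.8× faster inference vs BF16; actual gains depend on hardware and batch size
+- Up to ~2× higher throughput when serving is memory-bound, because the freed VRAM can hold more KV cache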
+## Use Cases
+Well suited for:
+- ✅ Production inference on limited VRAM
+- ✅ Running larger models on a single GPU
+- ✅ Cost-effective API serving
+- ✅ High-throughput applications
+- ✅ Extended context lengths (more KV cache)
+## Hardware Requirements
+**Minimum VRAM** (approximate):
+- Weights: ~16 GB (the four safetensors shards in this commit total 16.1 GB)
+- Additional headroom for KV cache and activations, which grows with context length and batch size
+**Recommended**:
+- H100/H200 for best performance
+- vLLM for optimized serving
+- Enable FP8 KV cache for extended context
+## Important Notes
+⚠️ **Quantization Trade-offs**:
+- Slight quality degradation (typically under 1-2%)
+- Not suitable for fine-tuning (inference only)
+- Best with vLLM (has FP8 kernel optimizations)
+✅ **Best Practices**:
+- Use `--kv-cache-dtype fp8` for longer contexts
+- Set `--gpu-memory-utilization` to a value between 0.90 and 0.95
+- Add `--enforce-eager` if you encounter compilation issues
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{model_name-fp8,
+  author = {author},
+  title = {model_name FP8 Dynamic Quantization},
+  year = {2025},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/repo_id}
+}
+```
+## License
+Inherits license from base model: [deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct)
+## Acknowledgments
+- Base model by [deepseek-ai](https://huggingface.co/deepseek-ai)
+- Quantization via [llmcompressor](https://github.com/vllm-project/llm-compressor)
+- Serving optimized for [vLLM](https://github.com/vllm-project/vllm)
+---
+**Want more FP8 models?** Check out my other quantizations!

chat_template.jinja ADDED

	@@ -0,0 +1,5 @@

+{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}{{ 'User: ' + message['content'] + '
+
+' }}{% elif message['role'] == 'assistant' %}{{ 'Assistant: ' + message['content'] + eos_token }}{% elif message['role'] == 'system' %}{{ message['content'] + '
+
+' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
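
Note on chat_template.jinja: the template is dense, so here is a minimal plain-Python sketch of the string it appears to build. `render_chat` and its `sep` parameter are hypothetical names used only for illustration. The separator after user and system turns is assumed to be two newline characters, which is consistent with the five-line hunk header; `sep` is a parameter in case that reading is wrong.

```python
def render_chat(messages, bos_token, eos_token, add_generation_prompt=False, sep="\n\n"):
    """Mirror the Jinja chat template using plain string concatenation."""
    parts = [bos_token]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            parts.append("User: " + content + sep)
        elif role == "assistant":
            # Assistant turns are closed with the end-of-sentence token.
            parts.append("Assistant: " + content + eos_token)
        elif role == "system":
            # System text is emitted without a role label.
            parts.append(content + sep)
        # Any other role is skipped, as in the template.
    if add_generation_prompt:
        parts.append("Assistant:")
    return "".join(parts)


BOS = "<｜begin▁of▁sentence｜>"
EOS = "<｜end▁of▁sentence｜>"

prompt = render_chat(
    [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a hello world in C."},
    ],
    BOS,
    EOS,
    add_generation_prompt=True,
)
print(repr(prompt))
```

Without `add_generation_prompt=True` the rendered prompt stops after the user turn, so the trailing `Assistant:` cue is missing when generating a reply.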

config.json ADDED

	@@ -0,0 +1,108 @@

+{
+  "architectures": [
+    "DeepseekV2ForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_deepseek.DeepseekV2Config",
+    "AutoModel": "modeling_deepseek.DeepseekV2Model",
+    "AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
+  },
+  "aux_loss_alpha": 0.001,
+  "bos_token_id": 100000,
+  "eos_token_id": 100001,
+  "first_k_dense_replace": 1,
+  "head_dim": 64,
+  "hidden_act": "silu",
+  "hidden_size": 2048,
+  "initializer_range": 0.02,
+  "intermediate_size": 10944,
+  "kv_lora_rank": 512,
+  "max_position_embeddings": 163840,
+  "mlp_bias": false,
+  "model_type": "deepseek_v2",
+  "moe_intermediate_size": 1408,
+  "moe_layer_freq": 1,
+  "n_group": 1,
+  "n_routed_experts": 64,
+  "n_shared_experts": 2,
+  "norm_topk_prob": false,
+  "num_attention_heads": 16,
+  "num_experts_per_tok": 6,
+  "num_hidden_layers": 27,
+  "num_key_value_heads": 16,
+  "pretraining_tp": 1,
+  "q_lora_rank": null,
+  "qk_nope_head_dim": 128,
+  "qk_rope_head_dim": 64,
+  "quantization_config": {
+    "config_groups": {
+      "group_0": {
+        "format": "float-quantized",
+        "input_activations": {
+          "actorder": null,
+          "block_structure": null,
+          "dynamic": true,
+          "group_size": null,
+          "num_bits": 8,
+          "observer": null,
+          "observer_kwargs": {},
+          "strategy": "token",
+          "symmetric": true,
+          "type": "float"
+        },
+        "output_activations": null,
+        "targets": [
+          "Linear"
+        ],
+        "weights": {
+          "actorder": null,
+          "block_structure": null,
+          "dynamic": false,
+          "group_size": null,
+          "num_bits": 8,
+          "observer": "minmax",
+          "observer_kwargs": {},
+          "strategy": "channel",
+          "symmetric": true,
+          "type": "float"
+        }
+      }
+    },
+    "format": "float-quantized",
+    "global_compression_ratio": null,
+    "ignore": [
+      "lm_head"
+    ],
+    "kv_cache_scheme": null,
+    "quant_method": "compressed-tensors",
+    "quantization_status": "compressed",
+    "sparsity_config": {},
+    "transform_config": {},
+    "version": "0.11.0"
+  },
+  "rms_norm_eps": 1e-06,
+  "rope_scaling": {
+    "beta_fast": 32,
+    "beta_slow": 1,
+    "factor": 40,
+    "mscale": 0.707,
+    "mscale_all_dim": 0.707,
+    "original_max_position_embeddings": 4096,
+    "rope_type": "yarn",
+    "type": "yarn"
+  },
+  "rope_theta": 10000,
+  "routed_scaling_factor": 1.0,
+  "scoring_func": "softmax",
+  "seq_aux": true,
+  "tie_word_embeddings": false,
+  "topk_group": 1,
+  "topk_method": "greedy",
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.55.2",
+  "use_cache": true,
+  "v_head_dim": 128,
+  "vocab_size": 102400
+}
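
Note on config.json: two figures can be derived from the values above, sketched here in Python. The first checks that the YaRN `factor` times `original_max_position_embeddings` matches `max_position_embeddings`. The second estimates the per-token KV cache under the assumption that the runtime stores only the compressed latent (`kv_lora_rank`) plus the decoupled RoPE key (`qk_rope_head_dim`) per layer; a runtime that materializes full per-head keys and values would use more. The function names are hypothetical.

```python
# Values copied from the config.json diff above.
config = {
    "num_hidden_layers": 27,
    "kv_lora_rank": 512,
    "qk_rope_head_dim": 64,
    "max_position_embeddings": 163840,
    "rope_scaling": {"factor": 40, "original_max_position_embeddings": 4096},
}


def yarn_context_length(cfg):
    """Extended context implied by the YaRN scaling factor."""
    rope = cfg["rope_scaling"]
    return rope["factor"] * rope["original_max_position_embeddings"]


def latent_cache_bytes_per_token(cfg, bytes_per_element):
    """Cache size per token if only the latent and RoPE key are stored per layer."""
    per_layer = cfg["kv_lora_rank"] + cfg["qk_rope_head_dim"]
    return cfg["num_hidden_layers"] * per_layer * bytes_per_element


extended = yarn_context_length(config)
bf16_bytes = latent_cache_bytes_per_token(config, 2)  # 16-bit cache entries
fp8_bytes = latent_cache_bytes_per_token(config, 1)   # 8-bit cache entries

print(f"YaRN context: {extended} tokens")
print(f"cache per token: {bf16_bytes} B (16-bit), {fp8_bytes} B (8-bit)")
print(f"cache at 32768 tokens: {bf16_bytes * 32768 / 2**30:.2f} GiB (16-bit)")
```

Under that assumption an 8-bit cache halves the per-token footprint, which is the motivation behind the README's `--kv-cache-dtype fp8` suggestion.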

generation_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 100000,
+  "do_sample": true,
+  "eos_token_id": 100001,
+  "temperature": 0.3,
+  "top_p": 0.95,
+  "transformers_version": "4.55.2"
+}
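
Note on generation_config.json: with `do_sample` enabled, `temperature` and `top_p` shape the distribution that tokens are drawn from. The sketch below is a simplified pure-Python illustration of temperature scaling followed by nucleus (top-p) filtering; it is not the `transformers` implementation, and `top_p_filter` is a hypothetical name. It compares the values in this file (0.3, 0.95) with the ones listed in the README front matter (0.7, 0.9) on made-up logits.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return {token_index: probability} for the nucleus kept by top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]  # illustrative values, not model output
sharp = top_p_filter(logits, temperature=0.3, top_p=0.95)
flat = top_p_filter(logits, temperature=0.7, top_p=0.9)
print("temperature=0.3, top_p=0.95 keeps tokens:", sorted(sharp))
print("temperature=0.7, top_p=0.9 keeps tokens:", sorted(flat))
```

On these logits the lower temperature concentrates almost all probability on the top token, so the nucleus is smaller than with the README values.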

model-00001-of-00004.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4a079cb0d2e82d66714b9494d9c887a7e3ca124e929dd26a4233bad95e2c4e27
+size 4998118952

model-00002-of-00004.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:910bb14e7858bd9ec6ecf98a1b4b860c483f40388cd2e3d4932a81977146316c
+size 4999835464

model-00003-of-00004.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2cd0f9de0722d320f1a57e3f5c9383aa6c1b5360295cd1059c5d7deec23820cf
+size 5000220496

model-00004-of-00004.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:768dccd2a8897004a344d576234c99f2497adce9b1dd7d7c73a1e1590fa3d4f8
+size 1149746800
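
Note on the safetensors shards: each shard is committed as a Git LFS pointer (version, oid, size) rather than the tensor data itself. A minimal sketch, assuming the three-field layout shown in the four pointers above, that parses them and sums the sizes to get the checkpoint's on-disk footprint. `parse_lfs_pointer` is a hypothetical helper.

```python
# Pointer contents copied from the four shard diffs above.
POINTERS = [
    """version https://git-lfs.github.com/spec/v1
oid sha256:4a079cb0d2e82d66714b9494d9c887a7e3ca124e929dd26a4233bad95e2c4e27
size 4998118952""",
    """version https://git-lfs.github.com/spec/v1
oid sha256:910bb14e7858bd9ec6ecf98a1b4b860c483f40388cd2e3d4932a81977146316c
size 4999835464""",
    """version https://git-lfs.github.com/spec/v1
oid sha256:2cd0f9de0722d320f1a57e3f5c9383aa6c1b5360295cd1059c5d7deec23820cf
size 5000220496""",
    """version https://git-lfs.github.com/spec/v1
oid sha256:768dccd2a8897004a344d576234c99f2497adce9b1dd7d7c73a1e1590fa3d4f8
size 1149746800""",
]


def parse_lfs_pointer(text):
    """Split a pointer file into its 'key value' lines and type the fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


shards = [parse_lfs_pointer(p) for p in POINTERS]
total_bytes = sum(s["size"] for s in shards)
print(f"{len(shards)} shards, {total_bytes} bytes")
print(f"= {total_bytes / 1e9:.2f} GB = {total_bytes / 2**30:.2f} GiB")
```

The total, 16,147,921,712 bytes (about 16.1 GB), is the weight footprint to budget for before KV cache and activations.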

model.safetensors.index.json ADDED

The diff for this file is too large to render. See raw diff

recipe.yaml ADDED

	@@ -0,0 +1,6 @@

+default_stage:
+  default_modifiers:
+    QuantizationModifier:
+      targets: [Linear]
+      ignore: [lm_head]
+      scheme: FP8_DYNAMIC
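
Note on recipe.yaml: the `FP8_DYNAMIC` scheme pairs static per-channel weight scales with per-token activation scales computed at run time (see `strategy` and `dynamic` in config.json). The sketch below shows how such symmetric scales could be derived in plain Python. It is not llmcompressor's code: the helper names are hypothetical, the FP8 format is assumed to be E4M3 with a largest finite value of 448, and the final rounding onto the FP8 grid is omitted.

```python
FP8_E4M3_MAX = 448.0  # assumed: largest finite value of the E4M3 format


def symmetric_scales(rows):
    """One scale per row so the row's largest magnitude maps to FP8_E4M3_MAX."""
    scales = []
    for row in rows:
        amax = max(abs(v) for v in row)
        # Guard against all-zero rows to avoid a zero scale.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(rows, scales):
    """Divide each row by its scale; results lie within [-448, 448]."""
    return [[v / s for v in row] for row, s in zip(rows, scales)]


# Weights, shape [out_features, in_features]: scales are computed once and
# stored with the checkpoint (static, one per output channel).
weight = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]
weight_scales = symmetric_scales(weight)

# Activations, shape [tokens, in_features]: scales are recomputed for every
# forward pass (dynamic, one per token), so no calibration data is needed.
activations = [[3.0, -1.5, 0.0], [100.0, 20.0, -50.0]]
act_scales = symmetric_scales(activations)

print("weight scales:", weight_scales)
print("activation scales:", act_scales)
print("scaled weights:", scale_rows(weight, weight_scales))
```

Because each row gets its own scale, a token or channel with large values does not force coarse resolution on the others.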

special_tokens_map.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "bos_token": {
+    "content": "<｜begin▁of▁sentence｜>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<｜end▁of▁sentence｜>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<｜end▁of▁sentence｜>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,162 @@

+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "add_prefix_space": null,
+  "added_tokens_decoder": {
+    "100000": {
+      "content": "<｜begin▁of▁sentence｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100001": {
+      "content": "<｜end▁of▁sentence｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100002": {
+      "content": "<｜fim▁hole｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100003": {
+      "content": "<｜fim▁begin｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100004": {
+      "content": "<｜fim▁end｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100005": {
+      "content": "<｜completion｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100006": {
+      "content": "<｜User｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100007": {
+      "content": "<｜Assistant｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100008": {
+      "content": "<|EOT|>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100009": {
+      "content": "<｜tool▁calls▁begin｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100010": {
+      "content": "<｜tool▁calls▁end｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100011": {
+      "content": "<｜tool▁call▁begin｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100012": {
+      "content": "<｜tool▁call▁end｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100013": {
+      "content": "<｜tool▁outputs▁begin｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100014": {
+      "content": "<｜tool▁outputs▁end｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100015": {
+      "content": "<｜tool▁output▁begin｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100016": {
+      "content": "<｜tool▁output▁end｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100017": {
+      "content": "<｜tool▁sep｜>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "bos_token": "<｜begin▁of▁sentence｜>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<｜end▁of▁sentence｜>",
+  "extra_special_tokens": {},
+  "legacy": true,
+  "model_max_length": 16384,
+  "pad_token": "<｜end▁of▁sentence｜>",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "LlamaTokenizerFast",
+  "unk_token": null,
+  "use_default_system_prompt": false
+}