sh0ck0r committed · verified
Commit 866d18c · 1 Parent(s): 4cfb951

Upload FP8 quantized version of MaziyarPanahi/calme-3.2-instruct-78b
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,148 @@
+ ---
+ base_model: MaziyarPanahi/calme-3.2-instruct-78b
+ tags:
+ - fp8
+ - vllm
+ - compressed-tensors
+ - quantized
+ - llmcompressor
+ license: apache-2.0
+ inference:
+   parameters:
+     temperature: 0.7
+     top_p: 0.9
+     max_new_tokens: 2048
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # calme-3.2-instruct-78b - FP8 Dynamic Quantization
+
+ This is an FP8 quantized version of [MaziyarPanahi/calme-3.2-instruct-78b](https://huggingface.co/MaziyarPanahi/calme-3.2-instruct-78b), produced with `llmcompressor` using the FP8_DYNAMIC scheme.
+
+ ## Model Details
+
+ - **Base Model**: MaziyarPanahi/calme-3.2-instruct-78b
+ - **Quantization**: FP8_DYNAMIC (W8A8) - static per-channel weight scales, dynamic per-token activation scales (see the sketch below)
+ - **Format**: compressed-tensors (SafeTensors)
+ - **Memory**: roughly half the size of the original BF16 checkpoint
+ - **Quality**: typically under 1-2% degradation on common benchmarks
+
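+ For intuition, here is a toy sketch (not this repo's code) of how those scales are chosen, assuming symmetric scaling onto the float8 e4m3 range: weight scales are computed once per output channel, activation scales per token at runtime.
+
+ ```python
+ # Toy illustration of FP8_DYNAMIC (W8A8) scale selection; not this repo's code.
+ import torch
+
+ E4M3_MAX = 448.0  # largest finite magnitude representable in float8 e4m3
+
+ def fp8_scale(x: torch.Tensor, dim: int) -> torch.Tensor:
+     # Symmetric scale mapping the observed max magnitude onto the FP8 range.
+     return x.abs().amax(dim=dim, keepdim=True) / E4M3_MAX
+
+ w = torch.randn(4096, 4096)
+ w_scale = fp8_scale(w, dim=1)                    # static: one scale per output channel
+ w_fp8 = (w / w_scale).to(torch.float8_e4m3fn)    # stored in the checkpoint
+
+ a = torch.randn(8, 4096)
+ a_scale = fp8_scale(a, dim=1)                    # dynamic: one scale per token, at runtime
+ a_fp8 = (a / a_scale).to(torch.float8_e4m3fn)
+ ```
+
+ This mirrors the `strategy: channel` (weights) and `strategy: token` (activations) settings recorded in this repo's `config.json`.
+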
+ ## Quick Start
+
+ ### vLLM (Recommended)
+
+ ```bash
+ pip install vllm
+
+ # Serve an OpenAI-compatible API
+ vllm serve REPO_ID \
+     --max-model-len 32768 \
+     --gpu-memory-utilization 0.95
+ ```
+
+ ```python
+ # Offline inference via the Python API
+ from vllm import LLM
+
+ llm = LLM(model="REPO_ID")
+ outputs = llm.generate("Hello, how are you?")
+ print(outputs[0].outputs[0].text)
+ ```
+
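+ Once the server is up, any OpenAI-compatible client can query it; a minimal sketch assuming vLLM's default endpoint (`http://localhost:8000/v1`):
+
+ ```python
+ # Query the running vLLM server through its OpenAI-compatible API.
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ response = client.chat.completions.create(
+     model="REPO_ID",  # must match the name the server was launched with
+     messages=[{"role": "user", "content": "Hello, how are you?"}],
+     temperature=0.7,
+ )
+ print(response.choices[0].message.content)
+ ```
+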
+ ### Transformers
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "REPO_ID",
+     device_map="auto",
+     torch_dtype="auto"
+ )
+ tokenizer = AutoTokenizer.from_pretrained("REPO_ID")
+
+ messages = [{'role': 'user', 'content': 'Hello!'}]
+ # add_generation_prompt=True appends the assistant header so the model starts replying
+ inputs = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors='pt'
+ ).to(model.device)
+ outputs = model.generate(inputs, max_new_tokens=512)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ## Quantization Details
+
+ This model was quantized using:
+ - **Tool**: [llmcompressor](https://github.com/vllm-project/llm-compressor)
+ - **Method**: FP8_DYNAMIC (Round-to-Nearest)
+ - **Targets**: All Linear layers except `lm_head`
+ - **Scheme**: W8A8 (8-bit weights and activations)
+
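+ The `recipe.yaml` shipped in this repo records the exact modifier used. A minimal sketch of reproducing the quantization with llmcompressor (the output directory is illustrative, and exact imports may vary by llmcompressor version):
+
+ ```python
+ # Sketch of reproducing this checkpoint; mirrors recipe.yaml
+ # (QuantizationModifier, targets: [Linear], ignore: [lm_head], scheme: FP8_DYNAMIC).
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from llmcompressor import oneshot
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+
+ model_id = "MaziyarPanahi/calme-3.2-instruct-78b"
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # FP8_DYNAMIC needs no calibration data: weights are quantized ahead of time,
+ # activations are scaled per token at runtime.
+ recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+ oneshot(model=model, recipe=recipe)
+
+ save_dir = "calme-3.2-instruct-78b-FP8-Dynamic"  # illustrative output path
+ model.save_pretrained(save_dir, save_compressed=True)
+ tokenizer.save_pretrained(save_dir)
+ ```
+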
+ ## Performance
+
+ ### Memory Usage
+ - **Original BF16**: ~156 GB of weights (2 bytes per parameter)
+ - **FP8 Quantized**: ~80 GB of weights (the safetensors shards below sum to roughly this)
+ - **Savings**: ~50% VRAM for weights, freeing room for KV cache
+
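+ As a rough sanity check, those numbers follow directly from parameter count and bytes per weight (ignoring the unquantized `lm_head` and embeddings):
+
+ ```python
+ # Back-of-envelope size estimate for a 78B-parameter model.
+ params = 78e9
+ print(f"BF16: ~{params * 2 / 1e9:.0f} GB")  # 2 bytes/weight -> ~156 GB
+ print(f"FP8:  ~{params * 1 / 1e9:.0f} GB")  # 1 byte/weight  -> ~78 GB
+ # The 17 safetensors shards in this repo sum to ~80 GB, consistent with
+ # FP8 linear weights plus higher-precision lm_head and embeddings.
+ ```
+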
+ ### Inference Speed
+ - Roughly 1.3-1.8× faster inference vs BF16 on FP8-capable GPUs (Hopper/Ada)
+ - Up to ~2× higher throughput in memory-bound settings (more KV cache available)
+
+ ## Use Cases
+
+ Perfect for:
+ - ✅ Production inference on limited VRAM
+ - ✅ Running larger models on a single GPU node
+ - ✅ Cost-effective API serving
+ - ✅ High-throughput applications
+ - ✅ Extended context lengths (more KV cache)
+
+ ## Hardware Requirements
+
+ **Minimum VRAM** (approximate):
+ - The FP8 weight shards in this repo total ~80 GB, so a single 80 GB GPU is not enough once KV cache and activations are included
+ - Plan for a single H200 (141 GB), or two 80 GB GPUs (A100/H100) with tensor parallelism
+
+ **Recommended**:
+ - H100/H200 for native FP8 kernel support and best performance
+ - vLLM for optimized serving
+ - Enable FP8 KV cache for extended context
+
+ ## Important Notes
+
+ ⚠️ **Quantization Trade-offs**:
+ - Slight quality degradation (typically under 1-2%)
+ - Inference only; not suitable for further fine-tuning
+ - Works best with vLLM, which has optimized FP8 kernels
+
+ ✅ **Best Practices** (combined in the example command below):
+ - Use `--kv-cache-dtype fp8` for longer contexts
+ - Set `--gpu-memory-utilization 0.90-0.95`
+ - Add `--enforce-eager` if you encounter compilation issues
+
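+ A sketch of a serving command combining these flags (`REPO_ID` is the placeholder used throughout; tune values for your hardware):
+
+ ```bash
+ vllm serve REPO_ID \
+     --max-model-len 32768 \
+     --gpu-memory-utilization 0.95 \
+     --kv-cache-dtype fp8
+ ```
+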
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{model_name-fp8,
+   author = {author},
+   title = {model_name FP8 Dynamic Quantization},
+   year = {2025},
+   publisher = {HuggingFace},
+   url = {https://huggingface.co/repo_id}
+ }
+ ```
+
+ ## License
+
+ Inherits the license of the base model: [MaziyarPanahi/calme-3.2-instruct-78b](https://huggingface.co/MaziyarPanahi/calme-3.2-instruct-78b)
+
+ ## Acknowledgments
+
+ - Base model by [MaziyarPanahi](https://huggingface.co/MaziyarPanahi)
+ - Quantization via [llmcompressor](https://github.com/vllm-project/llm-compressor)
+ - Optimized for serving with [vLLM](https://github.com/vllm-project/vllm)
+
+ ---
+
+ **Want more FP8 models?** Check out my other quantizations!
added_tokens.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "<|endoftext|>": 151643,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644
+ }
chat_template.jinja ADDED
@@ -0,0 +1,4 @@
+ {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '
+ ' + message['content'] + '<|im_end|>' + '
+ '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
+ ' }}{% endif %}
config.json ADDED
@@ -0,0 +1,161 @@
+ {
+   "architectures": [
+     "Qwen2ForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "eos_token_id": 151645,
+   "hidden_act": "silu",
+   "hidden_size": 8192,
+   "initializer_range": 0.02,
+   "intermediate_size": 29568,
+   "layer_types": [
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention"
+   ],
+   "max_position_embeddings": 32768,
+   "max_window_layers": 80,
+   "model_type": "qwen2",
+   "num_attention_heads": 64,
+   "num_hidden_layers": 86,
+   "num_key_value_heads": 8,
+   "quantization_config": {
+     "config_groups": {
+       "group_0": {
+         "format": "float-quantized",
+         "input_activations": {
+           "actorder": null,
+           "block_structure": null,
+           "dynamic": true,
+           "group_size": null,
+           "num_bits": 8,
+           "observer": null,
+           "observer_kwargs": {},
+           "strategy": "token",
+           "symmetric": true,
+           "type": "float"
+         },
+         "output_activations": null,
+         "targets": [
+           "Linear"
+         ],
+         "weights": {
+           "actorder": null,
+           "block_structure": null,
+           "dynamic": false,
+           "group_size": null,
+           "num_bits": 8,
+           "observer": "minmax",
+           "observer_kwargs": {},
+           "strategy": "channel",
+           "symmetric": true,
+           "type": "float"
+         }
+       }
+     },
+     "format": "float-quantized",
+     "global_compression_ratio": null,
+     "ignore": [
+       "lm_head"
+     ],
+     "kv_cache_scheme": null,
+     "quant_method": "compressed-tensors",
+     "quantization_status": "compressed",
+     "sparsity_config": {},
+     "transform_config": {},
+     "version": "0.11.0"
+   },
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": null,
+   "rope_theta": 1000000.0,
+   "sliding_window": null,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.55.2",
+   "use_cache": false,
+   "use_sliding_window": false,
+   "vocab_size": 151646
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "eos_token_id": 151645,
+   "transformers_version": "4.55.2",
+   "use_cache": false
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:58fb23b7f6d640fd8758d311c4f190c565ac052c27bacd53f6924a8752b91e70
+ size 4875952840
model-00002-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:77b353471b4b0529197fa5a0cb5ad34c45b2333e84a85451c4594c15ced6c4fb
+ size 4782749264
model-00003-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:890a91921810569afc03130a4cf0a8c762c01f209c341b028cfa4ab8d890596d
+ size 4873985992
model-00004-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5c73ba5900e96dad41a646a3751fa9f19ee74634618724c174129b431709da78
+ size 4782749368
model-00005-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:116b72d494d3aa27a99e9bd1658939a792fd185d7768534887cba92ef1fd1fef
+ size 4873986032
model-00006-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:740b69c6a5bce08796b72446261eea2b7a978c1d62bd5bef113e78750a2077ea
+ size 4782749368
model-00007-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b260c8a5430689830923771eac76c678a61ca7de339c8f13ac590f0e2c791e02
+ size 4873986032
model-00008-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cea9a3e6d0dc58753c3d64733076a9bfe079f39a46793a4e7ae2b08938347968
+ size 4782749368
model-00009-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:71646681a467673f1a55467657e7f2e7d9c94c3f4a8a0f7e7d19f2b8de4b1160
+ size 4873986032
model-00010-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f3e9efb684caf69851c020d0241689eafc368cfffd1ff4b4cb5b34eccc7c36aa
+ size 4782749368
model-00011-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0a18a5fe9274130d11edf24c690ece9d4f53ea31ba070866eb736f67c88345c2
+ size 4873986032
model-00012-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a2176a61ef966e934a14d0ca056db3b01070df5defe88294732683ecb1d4b5c2
+ size 4782749368
model-00013-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d091f82e6dd088f5f70e1c1a45c8a4a4c43a8c8daf182b0365c8fb117cdf7a62
+ size 4873986032
model-00014-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9dda5f237f94344d893a8d5be4cfa68e62e63647aabc04d03e76941fbcdd82be
+ size 4782749368
model-00015-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dfba1fb0f8b3e554a5380db0c8c8a2660d85c30e21cd88273e8584921adf947e
+ size 4873986032
model-00016-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f061042a6be5c5926503fc6400356095ce574942a213bc832e84ac2ce4fcec65
+ size 4782749368
model-00017-of-00017.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:79927f27c4c65629f891da9c7de2a3789d061ebe9fbcaf44af2767b78053b2d4
+ size 3211416200
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
recipe.yaml ADDED
@@ -0,0 +1,6 @@
+ default_stage:
+   default_modifiers:
+     QuantizationModifier:
+       targets: [Linear]
+       ignore: [lm_head]
+       scheme: FP8_DYNAMIC
special_tokens_map.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bcfe42da0a4497e8b2b172c1f9f4ec423a46dc12907f4349c55025f670422ba9
+ size 11418266
tokenizer_config.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff