---
base_model: allenai/SERA-14B
base_model_relation: quantized
pipeline_tag: text-generation
library_name: transformers
language:
- en
license: mit
tags:
- fp8
- quantized
- llmcompressor
- vllm
datasets:
- allenai/Sera-4.5A-Lite-T2
---

# SERA-14B-FP8

FP8 quantization of [allenai/SERA-14B](https://huggingface.co/allenai/SERA-14B), produced with [llmcompressor](https://github.com/vllm-project/llm-compressor) and validated with vLLM.

## Quantization Details

| Parameter | Value |
|---|---|
| Method | FP8 (W8A8) via `llmcompressor` `oneshot` |
| Targets | All `Linear` layers except `lm_head` |
| Calibration dataset | `allenai/Sera-4.5A-Lite-T2` |
| Calibration samples | 512 |
| Calibration sequence length | 2048 tokens |
| `llmcompressor` version | 0.9.0.2 |
| Hardware | Local GPU (RTX 5080, 16 GB VRAM) |
| Model size (uploaded) | ~16.2 GB (4 safetensors shards) |
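
For reference, the following is a minimal sketch of how a checkpoint with these settings is typically produced with `llmcompressor` (not the exact script used here). The calibration split and any column mapping for `allenai/Sera-4.5A-Lite-T2` are assumptions and may need adjusting.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "allenai/SERA-14B"
NUM_SAMPLES = 512   # calibration samples, per the table above
MAX_SEQ_LEN = 2048  # calibration sequence length, per the table above

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data; the split name here is an assumption.
ds = load_dataset("allenai/Sera-4.5A-Lite-T2", split=f"train[:{NUM_SAMPLES}]")

# Static FP8 (W8A8): quantize every Linear layer, keep lm_head unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
    output_dir="SERA-14B-FP8",
)
```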

## GPU Stats

- 1x RTX 5080
- Total time: 1 hr

## Usage

```python
from vllm import LLM, SamplingParams

llm = LLM(model="bluetrace/SERA-14B-FP8", max_model_len=16384)
params = SamplingParams(temperature=0.7, max_tokens=512)

# Chat-style messages go through llm.chat, which applies the model's chat
# template; llm.generate expects plain-text prompts, not message dicts.
outputs = llm.chat(
    [{"role": "user", "content": "Explain quantum entanglement simply."}],
    params,
)
print(outputs[0].outputs[0].text)
```

## Validation

After quantization, the model was loaded into vLLM and a test chat completion request was sent as a basic smoke test to confirm the checkpoint loads and generates.
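As a rough guide, a smoke test of this kind can be reproduced against a vLLM OpenAI-compatible server. The server command and request below are a sketch of that flow, not the exact request that was sent.

```python
# Assumes a server started with:
#   vllm serve bluetrace/SERA-14B-FP8 --max-model-len 16384
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="bluetrace/SERA-14B-FP8",
    messages=[{"role": "user", "content": "Reply with a single short sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```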

## Limitations

- Quality degradation relative to the BF16 base model has not been formally benchmarked. FP8 quantization with 512 calibration samples is generally low-loss for instruction-tuned models, but edge cases may differ.
- The maximum recommended context length is 16,384 tokens on a single RTX 5080 GPU.
- The `lm_head` layer is kept in BF16 (not quantized) to preserve the output distribution; see the quick check below.
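As a quick sanity check (assuming the standard compressed-tensors layout in `config.json`), you can confirm that `lm_head` is excluded from quantization by inspecting the repo's quantization config:

```python
import json
from huggingface_hub import hf_hub_download

# Download only config.json and look at the quantization config's ignore list.
cfg_path = hf_hub_download("bluetrace/SERA-14B-FP8", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
print(cfg["quantization_config"].get("ignore"))  # expected to include "lm_head"
```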

## Related

- Base model: [allenai/SERA-14B](https://huggingface.co/allenai/SERA-14B)
- Quantization tooling: [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)