---
base_model: allenai/SERA-14B
base_model_relation: quantized
pipeline_tag: text-generation
library_name: transformers
language:
- en
license: mit
tags:
- fp8
- quantized
- llmcompressor
- vllm
datasets:
- allenai/Sera-4.5A-Lite-T2
---

# SERA-14B-FP8

FP8 quantization of [allenai/SERA-14B](https://huggingface.co/allenai/SERA-14B), produced with [llmcompressor](https://github.com/vllm-project/llm-compressor) and validated with vLLM.

## Quantization Details

| Parameter | Value |
|---|---|
| Method | FP8 (W8A8) via `llmcompressor` `oneshot` |
| Targets | All `Linear` layers except `lm_head` |
| Calibration dataset | `allenai/Sera-4.5A-Lite-T2` |
| Calibration samples | 512 |
| Calibration sequence length | 2048 tokens |
| llmcompressor version | 0.9.0.2 |
| Hardware | Local GPU (RTX 5080, 16 GB VRAM) |
| Model size (uploaded) | ~16.2 GB (4 safetensors shards) |

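The settings in the table map onto a one-shot `llmcompressor` recipe. A minimal sketch in recipe-YAML form (the stage name is arbitrary and the exact recipe used for this checkpoint was not published; treat this as an illustration, not the original script):

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]   # quantize all Linear layers...
      ignore: ["lm_head"]   # ...except the output head, kept in BF16
      scheme: FP8           # W8A8: float8 weights and activations
```

Such a recipe is passed to `oneshot(...)` together with the calibration dataset and the sample-count and sequence-length settings from the table.
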
## GPU Stats

- 1x RTX 5080
- Total time: 1 hr

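The ~16.2 GB upload size is roughly what FP8 storage predicts. A back-of-envelope check (the parameter split below is an assumed round number, not an exact SERA-14B count):

```python
# FP8 stores 1 byte per quantized weight; the parts left in BF16
# (lm_head, embeddings, norms) store 2 bytes per weight.
fp8_params = 14.0e9   # assumed ~14B quantized Linear-layer parameters
bf16_params = 1.0e9   # assumed ~1B parameters kept in BF16

size_gb = (fp8_params * 1 + bf16_params * 2) / 1e9  # decimal GB
print(f"~{size_gb:.1f} GB")  # ~16.0 GB, close to the ~16.2 GB observed
```
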
## Usage

```python
from vllm import LLM, SamplingParams

llm = LLM(model="bluetrace/SERA-14B-FP8", max_model_len=16384)
params = SamplingParams(temperature=0.7, max_tokens=512)

# Use chat() so the model's chat template is applied to the messages;
# generate() expects plain prompt strings, not role/content dicts.
outputs = llm.chat(
    [{"role": "user", "content": "Explain quantum entanglement simply."}],
    params,
)
print(outputs[0].outputs[0].text)
```

## Validation

After quantization, the model was loaded into vLLM and a test chat completion request was sent to confirm that the checkpoint loads and generates correctly.

## Limitations

- Quality degradation relative to the BF16 base model has not been formally benchmarked. FP8 quantization with 512 calibration samples is generally low-loss for instruction-tuned models, but edge cases may differ.
- The maximum recommended context length is 16,384 tokens on a single L40S GPU.
- The `lm_head` layer is kept in BF16 (not quantized) to preserve the output distribution.
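
For intuition behind the first bullet: FP8 quantization rescales each tensor so that its largest magnitude lands at E4M3's maximum normal value (448), so calibration mostly needs representative value ranges rather than task-specific data. A simplified pure-Python sketch of per-tensor scale selection (real kernels use hardware FP8 casts; `fp8_scale` is an illustrative helper, not an llmcompressor API):

```python
E4M3_MAX = 448.0  # largest normal value representable in FP8 E4M3

def fp8_scale(values):
    """Per-tensor scale mapping the largest |value| onto E4M3_MAX."""
    return max(abs(v) for v in values) / E4M3_MAX

weights = [0.12, -0.8, 0.448]
scale = fp8_scale(weights)

# Dividing by the scale fits every value into E4M3's range before casting.
assert all(abs(w / scale) <= E4M3_MAX + 1e-9 for w in weights)
```
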

## Related

- Base model: [allenai/SERA-14B](https://huggingface.co/allenai/SERA-14B)
- Quantization tooling: [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)