File size: 1,114 Bytes
a227c91
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
Quantization on Ascend.

To load already quantized models, simply load the model weights and config. Again, if the model has been quantized offline, there's no need to add `--quantization` argument when starting the engine. The quantization method will be automatically parsed from the downloaded `quant_model_description.json` or `config.json` config.

[ModelSlim on Ascend support](https://github.com/sgl-project/sglang/pull/14504):
- [x] W4A4 dynamic linear
- [x] W8A8 static linear
- [x] W8A8 dynamic linear
- [x] W4A8 dynamic MOE
- [x] W8A8 dynamic MOE

[AWQ on Ascend support](https://github.com/sgl-project/sglang/pull/10158):
- [x] W4A16 linear
- [x] W8A16 linear # Need to test
- [x] W4A16 MOE # Need to test

Compressed-tensors (LLM Compressor) on Ascend support:
- [x] [W4A8 dynamic MOE with/without activation clip](https://github.com/sgl-project/sglang/pull/14736) # Need to test
- [x] [W4A16 MOE](https://github.com/sgl-project/sglang/pull/12759)
- [x] [W8A8 dynamic linear](https://github.com/sgl-project/sglang/pull/14504)
- [x] [W8A8 dynamic MOE](https://github.com/sgl-project/sglang/pull/14504)