Quantization on Ascend.

To load a model that has already been quantized, simply load its weights and config. If the model was quantized offline, there is no need to pass the `--quantization` argument when starting the engine: the quantization method is parsed automatically from the downloaded `quant_model_description.json` or `config.json`.
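As a rough illustration of that kind of auto-detection, here is a minimal sketch that looks for quantization metadata in a local checkpoint directory. It is not SGLang's actual parsing code. The helper name `detect_quant_source`, the order in which the two files are checked, and the assumption that `config.json` carries a Hugging Face style `quantization_config` block with a `quant_method` field are all illustrative assumptions.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quant_source(model_dir: str) -> Optional[str]:
    """Return a label for where quantization metadata was found, or None.

    Hypothetical helper for illustration; not part of SGLang.
    """
    root = Path(model_dir)

    # Assumption: a standalone description file, if present, is checked first.
    if (root / "quant_model_description.json").is_file():
        return "quant_model_description.json"

    # Assumption: Hugging Face style checkpoints record the method under
    # `quantization_config` -> `quant_method` in config.json.
    config_path = root / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        quant_cfg = config.get("quantization_config")
        if isinstance(quant_cfg, dict):
            method = quant_cfg.get("quant_method", "unknown")
            return f"config.json ({method})"

    # No metadata found: the checkpoint was presumably not quantized offline.
    return None
```

Running the helper on a downloaded checkpoint directory before launching the engine is a quick way to check whether the `--quantization` argument can be omitted.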

[ModelSlim on Ascend support](https://github.com/sgl-project/sglang/pull/14504):
- [x] W4A4 dynamic linear
- [x] W8A8 static linear
- [x] W8A8 dynamic linear
- [x] W4A8 dynamic MOE
- [x] W8A8 dynamic MOE

[AWQ on Ascend support](https://github.com/sgl-project/sglang/pull/10158):
- [x] W4A16 linear
- [x] W8A16 linear (needs testing)
- [x] W4A16 MOE (needs testing)

Compressed-tensors (LLM Compressor) on Ascend support:
- [x] [W4A8 dynamic MOE with/without activation clip](https://github.com/sgl-project/sglang/pull/14736) (needs testing)
- [x] [W4A16 MOE](https://github.com/sgl-project/sglang/pull/12759)
- [x] [W8A8 dynamic linear](https://github.com/sgl-project/sglang/pull/14504)
- [x] [W8A8 dynamic MOE](https://github.com/sgl-project/sglang/pull/14504)