---
license: mit
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct-GPTQ-INT8
- Qwen/Qwen2.5-1.5B-Instruct-GPTQ-INT4
pipeline_tag: text-generation
library_name: transformers
tags:
- Context
- Qwen2.5-1.5B-Instruct-GPTQ-INT8
- Qwen2.5-1.5B-Instruct-GPTQ-INT4
---
# Qwen2.5-1.5B-Instruct-python
This version of Qwen2.5-1.5B-Instruct-python has been converted to run on the Axera NPU using w8a16 and w4a16 quantization (8-bit/4-bit weights with 16-bit activations).
Compatible with Pulsar2 version: 4.1
## Features
- Supports longer contexts; in this sample it is 2.5k
- Supports multi-turn, context-aware dialogue (a tokenizer-level sketch follows this list)
- Supports a KV cache for the system prompt
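As a rough illustration of what multi-turn context with a fixed system prompt looks like at the tokenizer level (the `transformers` calls below are standard; the checkpoint path and the sample conversation are assumptions, not part of this repo), Qwen2.5's chat template can be applied like this:

```python
from transformers import AutoTokenizer

# Assumption: the original Hugging Face checkpoint, used here only for its tokenizer.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8")

# The system prompt is identical on every turn, which is what allows a runtime
# to compute its KV cache once and reuse it for each subsequent request.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    # The follow-up question only works because the earlier turns stay in context.
    {"role": "user", "content": "How large is that city?"},
]

prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the full conversation rendered in Qwen2.5's chat format
```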
## Conversion tool links
If you are interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8
- Pulsar2 Link, How to Convert LLM from Huggingface to axmodel
- AXera NPU AXEngine LLM Runtime
## Conversion script
The following shows how to convert Qwen2.5-1.5B-Instruct-GPTQ-Int8.
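If the GPTQ checkpoint is not already available locally, it can be fetched first. A minimal sketch using `huggingface_hub` (`snapshot_download` is the library's standard call; the local directory name is an arbitrary choice):

```python
from huggingface_hub import snapshot_download

# Download the original GPTQ checkpoint into a local directory;
# the directory name here is an arbitrary choice.
snapshot_download(
    repo_id="Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8",
    local_dir="Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8",
)
```

Then run the conversion: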
```bash
pulsar2 llm_build --input_path Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8 \
--output_path Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8-ctx-ax650 \
--hidden_state_type bf16 --kv_cache_len 2047 --prefill_len 128 \
--last_kv_cache_len 128 \
--last_kv_cache_len 256 \
--last_kv_cache_len 384 \
--last_kv_cache_len 512 \
--last_kv_cache_len 640 \
--last_kv_cache_len 768 \
--last_kv_cache_len 896 \
--last_kv_cache_len 1024 \
--chip AX650 -c 1 --parallel 8
```
## Supported platforms
- AX650
  - AX650N DEMO Board
  - M4N-Dock (AXera-Pi Pro)
  - M.2 Accelerator card
- AX630C
  - TBD
## How to use
Download all files from this repository to the device.
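A minimal sketch, again with `huggingface_hub` (`<this-repo-id>` is a placeholder for this repository's id on Hugging Face; substitute the real one):

```python
from huggingface_hub import snapshot_download

# "<this-repo-id>" is a placeholder: substitute this repository's actual id.
snapshot_download(repo_id="<this-repo-id>", local_dir="Qwen2.5-1.5B-Instruct-python")
```

After downloading, the layout on the device should look like this: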
```bash
root@ax650:/mnt/qtang/llm-test/Qwen2.5-1.5B-Instruct-python# tree -L 1
.
├── chat.py
├── infer.py
├── infer_torch.py
├── Qwen2.5-1.5B-Instruct-GPTQ-Int8
├── Qwen2.5-1.5B-Instruct-GPTQ-Int8_axmodel
└── README.md

2 directories, 4 files
```