wli1995 committed on
Commit 987f503
· verified · 1 Parent(s): 10f34e8

update project

Files changed (1): README.md (+103 −54)
README.md CHANGED
@@ -9,6 +9,8 @@ tags:
 - vlm
 - en
 ---

 # FastVLM-1.5B-GPTQ-Int4

 This version of FastVLM-1.5B-GPTQ-Int4 has been converted to run on the Axera NPU using **w4a16** quantization.
@@ -79,12 +81,23 @@ hf download AXERA-TECH/FastVLM-1.5B-GPTQ-Int4 --local-dir .
 # structure of the downloaded files
 tree -L 3
 .
-└── AXERA-TECH
-    └── FastVLM-1.5B-GPTQ-Int4
-
-
-
-2 directories, 34 files
 ```

 ## Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board
@@ -92,21 +105,23 @@ tree -L 3
 ### Run (CLI)

 ```shell
-(base) root@ax650:~# axllm run AXERA-TECH/Qwen3-0.6B/
-[I][ Init][ 127]: LLM init start
-tokenizer_type = 1
-96% | ██████████████████████████████ | 30 / 31 [2.35s<2.42s, 12.79 count/s] init post axmodel ok,remain_cmm(8662 MB)
-[I][ Init][ 188]: max_token_len : 2559
-[I][ Init][ 191]: kv_cache_size : 1024, kv_cache_num: 2559
-[I][ Init][ 194]: prefill_token_num : 128
-[I][ Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
-[I][ Init][ 198]: grp: 2, prefill_max_kv_cache_num : 512
-[I][ Init][ 198]: grp: 3, prefill_max_kv_cache_num : 1024
-[I][ Init][ 198]: grp: 4, prefill_max_kv_cache_num : 1536
-[I][ Init][ 198]: grp: 5, prefill_max_kv_cache_num : 2048
-[I][ Init][ 203]: prefill_max_token_num : 2048
 [I][ Init][ 27]: LLaMaEmbedSelector use mmap
-100% | ████████████████████████████████ | 31 / 31 [2.35s<2.35s, 13.21 count/s] embed_selector init ok
 [I][ load_config][ 282]: load config:
 {
 "enable_repetition_penalty": false,
@@ -120,49 +135,83 @@ tokenizer_type = 1
 "top_p": 0.8
 }

-[I][ Init][ 224]: LLM init ok
 Type "q" to exit
 Ctrl+c to stop current running
 "reset" to reset kvcache
 "dd" to remove last conversation.
 "pp" to print history.
 ----------------------------------------
 prompt >> who are you
-[I][ SetKVCache][ 357]: prefill_grpid:2 kv_cache_num:512 precompute_len:0 input_num_token:22
-[I][ SetKVCache][ 359]: current prefill_max_token_num:2048
-[I][ SetKVCache][ 360]: first run
-[I][ Run][ 412]: input token num : 22, prefill_split_num : 1
-[I][ Run][ 474]: ttft: 586.40 ms
-<think>
-Okay, the user asked, "Who are you?" I need to respond appropriately. Since I'm an AI assistant, I should acknowledge their question and explain my purpose. I should mention that I'm here to help and that I can assist with various tasks. I should keep the response friendly and open-ended to encourage further interaction. Let me make sure the language is clear and natural.
-</think>
-
-I'm an AI assistant designed to help you with a wide range of questions and tasks. How can I assist you today? 😊
-
-[N][ Run][ 554]: hit eos,avg 15.63 token/s
-
-[I][ GetKVCache][ 331]: precompute_len:130, remaining:1918
 prompt >> q
 ```

 ### Start the Server (OpenAI compatible)

 ```shell
-(base) root@ax650:~# axllm serve AXERA-TECH/Qwen3-0.6B/
-[I][ Init][ 127]: LLM init start
-tokenizer_type = 1
-96% | ██████████████████████████████ | 30 / 31 [2.06s<2.13s, 14.58 count/s] init post axmodel ok,remain_cmm(8662 MB)
-[I][ Init][ 188]: max_token_len : 2559
-[I][ Init][ 191]: kv_cache_size : 1024, kv_cache_num: 2559
-[I][ Init][ 194]: prefill_token_num : 128
-[I][ Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
-[I][ Init][ 198]: grp: 2, prefill_max_kv_cache_num : 512
-[I][ Init][ 198]: grp: 3, prefill_max_kv_cache_num : 1024
-[I][ Init][ 198]: grp: 4, prefill_max_kv_cache_num : 1536
-[I][ Init][ 198]: grp: 5, prefill_max_kv_cache_num : 2048
-[I][ Init][ 203]: prefill_max_token_num : 2048
 [I][ Init][ 27]: LLaMaEmbedSelector use mmap
-100% | ████████████████████████████████ | 31 / 31 [2.06s<2.06s, 15.07 count/s] embed_selector init ok
 [I][ load_config][ 282]: load config:
 {
 "enable_repetition_penalty": false,
@@ -176,11 +225,11 @@ tokenizer_type = 1
 "top_p": 0.8
 }

-[I][ Init][ 224]: LLM init ok
-Starting server on port 8000 with model 'AXERA-TECH/Qwen3-0.6B'...
 OpenAI API Server starting on http://0.0.0.0:8000
 Max concurrency: 1
-Models: AXERA-TECH/Qwen3-0.6B
 ```

 ### OpenAI Client Example
@@ -189,7 +238,7 @@ Models: AXERA-TECH/Qwen3-0.6B
 from openai import OpenAI

 API_URL = "http://127.0.0.1:8000/v1"
-MODEL = "AXERA-TECH/Qwen3-0.6B"

 messages = [
     {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
@@ -212,7 +261,7 @@ print(completion.choices[0].message.content)
 from openai import OpenAI

 API_URL = "http://127.0.0.1:8000/v1"
-MODEL = "AXERA-TECH/Qwen3-0.6B"

 messages = [
     {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
 
 - vlm
 - en
 ---
+
+
 # FastVLM-1.5B-GPTQ-Int4

 This version of FastVLM-1.5B-GPTQ-Int4 has been converted to run on the Axera NPU using **w4a16** quantization.
 
 # structure of the downloaded files
 tree -L 3
 .
+`-- AXERA-TECH
+    `-- FastVLM-1.5B-GPTQ-Int4
+        |-- FastVLM_tokenizer.txt
+        |-- README.md
+        |-- config.json
+        |-- image.png
+        |-- image_encoder_1024x1024.axmodel
+        |-- image_encoder_512x512.axmodel
+        |-- llava_qwen2_p128_l0_together.axmodel
+        ...
+        |-- llava_qwen2_p128_l9_together.axmodel
+        |-- llava_qwen2_post.axmodel
+        |-- model.embed_tokens.weight.bfloat16.bin
+        |-- post_config.json
+        `-- vision_cache
+
+3 directories, 37 files
 ```
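Before running inference it can help to verify the download is complete. Below is a minimal sanity check of the layout shown in the tree above; the required file names are copied from that listing, and the glob pattern for the per-layer decoder models is an assumption based on the `l0` … `l9` names shown (the elided entries are not spelled out here).

```python
# Sanity-check a downloaded FastVLM-1.5B-GPTQ-Int4 directory against the
# tree listing above. REQUIRED names come from that listing; the glob for
# per-layer decoder models is inferred from the l0/l9 pattern shown.
from pathlib import Path

REQUIRED = [
    "FastVLM_tokenizer.txt",
    "config.json",
    "image_encoder_512x512.axmodel",
    "image_encoder_1024x1024.axmodel",
    "llava_qwen2_post.axmodel",
    "model.embed_tokens.weight.bfloat16.bin",
    "post_config.json",
]

def check_model_dir(root):
    """Return (missing_required_files, decoder_layer_count) for a model dir."""
    root = Path(root)
    missing = [name for name in REQUIRED if not (root / name).exists()]
    layers = sorted(root.glob("llava_qwen2_p128_l*_together.axmodel"))
    return missing, len(layers)
```

An empty `missing` list plus a plausible layer count is a quick signal that `hf download` finished cleanly.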

 ## Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board
 
 ### Run (CLI)

 ```shell
+root@ax650:~# axllm run AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/
+[I][ Init][ 138]: LLM init start
+tokenizer_type = 3
+96% | ██████████████████████████████ | 30 / 31 [3.66s<3.78s, 8.20 count/s] init post axmodel ok,remain_cmm(10593 MB)
+[I][ Init][ 199]: max_token_len : 1024
+[I][ Init][ 202]: kv_cache_size : 256, kv_cache_num: 1024
+[I][ Init][ 205]: prefill_token_num : 128
+[I][ Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
+[I][ Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
+[I][ Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
+[I][ Init][ 209]: grp: 4, prefill_max_kv_cache_num : 512
+[I][ Init][ 209]: grp: 5, prefill_max_kv_cache_num : 640
+[I][ Init][ 214]: prefill_max_token_num : 640
 [I][ Init][ 27]: LLaMaEmbedSelector use mmap
+100% | ████████████████████████████████ | 31 / 31 [3.66s<3.66s, 8.47 count/s] embed_selector init ok
+[W][ Init][ 480]: classic vision size override: cfg=448x448 -> model=1024x1024 (from input shape)
+[I][ Init][ 666]: VisionModule init ok: type=FastVLM, tokens_per_block=256, embed_size=1536, out_dtype=fp32
 [I][ load_config][ 282]: load config:
 {
 "enable_repetition_penalty": false,

 "top_p": 0.8
 }

+[I][ Init][ 272]: LLM init ok
 Type "q" to exit
 Ctrl+c to stop current running
 "reset" to reset kvcache
 "dd" to remove last conversation.
 "pp" to print history.
+VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
 ----------------------------------------
 prompt >> who are you
+image >>
+[I][ SetKVCache][ 406]: prefill_grpid:2 kv_cache_num:128 precompute_len:0 input_num_token:22
+[I][ SetKVCache][ 408]: current prefill_max_token_num:640
+[I][ SetKVCache][ 409]: first run
+[I][ Run][ 457]: input token num : 22, prefill_split_num : 1
+[I][ Run][ 497]: prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=22
+[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
+[I][ Run][ 627]: ttft: 137.01 ms
+I am an AI language model, I am here to help answer any questions you may have. How can I assist you today?
+
+[N][ Run][ 709]: hit eos,avg 14.77 token/s
+
+[I][ GetKVCache][ 380]: precompute_len:48, remaining:592
+prompt >> describe the image
+image >> ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
+[I][ EncodeForContent][ 994]: vision cache store: ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
+[I][ SetKVCache][ 406]: prefill_grpid:4 kv_cache_num:512 precompute_len:48 input_num_token:271
+[I][ SetKVCache][ 408]: current prefill_max_token_num:512
+[I][ Run][ 457]: input token num : 271, prefill_split_num : 3
+[I][ Run][ 497]: prefill chunk p=0 history_len=48 grpid=2 kv_cache_num=128 input_tokens=128
+[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
+[I][ Run][ 497]: prefill chunk p=1 history_len=176 grpid=3 kv_cache_num=256 input_tokens=128
+[I][ Run][ 519]: prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=0
+[I][ Run][ 497]: prefill chunk p=2 history_len=304 grpid=4 kv_cache_num=512 input_tokens=15
+[I][ Run][ 519]: prefill indices shape: p=2 idx_elems=128 idx_rows=1 pos_rows=0
+[I][ Run][ 627]: ttft: 403.77 ms
+The image depicts three astronauts standing in a forest, wearing full space suits with helmets. The scene is surreal and otherworldly, as the astronauts are dressed in space suits and are surrounded by a natural environment. The image is in black and white, which adds to the surreal and dreamlike quality of the scene. The astronauts appear to be exploring the forest, and the contrast between the natural environment and the space suits creates a striking and thought-provoking image.
+
+[N][ Run][ 709]: hit eos,avg 14.79 token/s
+
+[I][ GetKVCache][ 380]: precompute_len:412, remaining:228
+prompt >> how many people in the image?
+image >>
+[I][ EncodeForContent][ 926]: vision cache hit (mem): ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
+[I][ SetKVCache][ 406]: prefill_grpid:4 kv_cache_num:512 precompute_len:412 input_num_token:17
+[I][ SetKVCache][ 408]: current prefill_max_token_num:128
+[I][ Run][ 457]: input token num : 17, prefill_split_num : 1
+[I][ Run][ 497]: prefill chunk p=0 history_len=412 grpid=4 kv_cache_num=512 input_tokens=17
+[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
+[I][ Run][ 627]: ttft: 168.52 ms
+There are three people in the image.
+
+[N][ Run][ 709]: hit eos,avg 14.69 token/s
+
+[I][ GetKVCache][ 380]: precompute_len:437, remaining:203
 prompt >> q
 ```
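The prefill chunking in the CLI log above can be reproduced arithmetically. The sketch below is a reading of the log, not axllm source code: the constants come from the `Init` lines (`prefill_token_num : 128`, the per-group `prefill_max_kv_cache_num` values), and the rule "pick the first kv-cache group large enough to hold the tokens already cached" is inferred from the `grpid` lines.

```python
# Sketch of the prefill chunk schedule visible in the log above. Constants are
# read from the Init log; the group-selection rule is inferred from the
# "prefill chunk" lines, not taken from the axllm sources.
import math

PREFILL_TOKEN_NUM = 128               # [I][ Init][ 205]: prefill_token_num
GROUP_CAPS = [1, 128, 256, 512, 640]  # prefill_max_kv_cache_num for grp 1..5

def prefill_schedule(history_len, input_tokens):
    """Return (chunk_index, history_len, grpid, chunk_tokens) per prefill chunk."""
    split_num = math.ceil(input_tokens / PREFILL_TOKEN_NUM)
    chunks, done = [], 0
    for p in range(split_num):
        n = min(PREFILL_TOKEN_NUM, input_tokens - done)
        cached = history_len + done
        # first group whose capacity covers what is already in the kv cache
        grpid = next(i + 1 for i, cap in enumerate(GROUP_CAPS) if cap >= cached)
        chunks.append((p, cached, grpid, n))
        done += n
    return chunks

# 271 image+text tokens arriving after 48 cached tokens split into 3 chunks,
# matching the "prefill chunk" lines of the log:
print(prefill_schedule(48, 271))
# -> [(0, 48, 2, 128), (1, 176, 3, 128), (2, 304, 4, 15)]
```

The same function reproduces the text-only turn (`22` tokens, one chunk in `grp 1`) and the follow-up question (`17` tokens after `412` cached, one chunk in `grp 4`).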

 ### Start the Server (OpenAI compatible)

 ```shell
+root@ax650:~# axllm serve AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/
+[I][ Init][ 138]: LLM init start
+tokenizer_type = 3
+96% | ██████████████████████████████ | 30 / 31 [2.72s<2.81s, 11.02 count/s] init post axmodel ok,remain_cmm(10593 MB)
+[I][ Init][ 199]: max_token_len : 1024
+[I][ Init][ 202]: kv_cache_size : 256, kv_cache_num: 1024
+[I][ Init][ 205]: prefill_token_num : 128
+[I][ Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
+[I][ Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
+[I][ Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
+[I][ Init][ 209]: grp: 4, prefill_max_kv_cache_num : 512
+[I][ Init][ 209]: grp: 5, prefill_max_kv_cache_num : 640
+[I][ Init][ 214]: prefill_max_token_num : 640
 [I][ Init][ 27]: LLaMaEmbedSelector use mmap
+100% | ████████████████████████████████ | 31 / 31 [2.72s<2.72s, 11.38 count/s] embed_selector init ok
+[W][ Init][ 480]: classic vision size override: cfg=448x448 -> model=1024x1024 (from input shape)
+[I][ Init][ 666]: VisionModule init ok: type=FastVLM, tokens_per_block=256, embed_size=1536, out_dtype=fp32
 [I][ load_config][ 282]: load config:
 {
 "enable_repetition_penalty": false,

 "top_p": 0.8
 }

+[I][ Init][ 272]: LLM init ok
+Starting server on port 8000 with model 'AXERA-TECH/FastVLM-1.5B-GPTQ-Int4'...
 OpenAI API Server starting on http://0.0.0.0:8000
 Max concurrency: 1
+Models: AXERA-TECH/FastVLM-1.5B-GPTQ-Int4
 ```

 ### OpenAI Client Example

 from openai import OpenAI

 API_URL = "http://127.0.0.1:8000/v1"
+MODEL = "AXERA-TECH/FastVLM-1.5B-GPTQ-Int4"

 messages = [
     {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},

 from openai import OpenAI

 API_URL = "http://127.0.0.1:8000/v1"
+MODEL = "AXERA-TECH/FastVLM-1.5B-GPTQ-Int4"

 messages = [
     {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
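The client snippets in the diff are truncated at the hunk boundary, so here is a complete self-contained sketch of building such a request. The endpoint and model name come from the serve log above; sending the image as a base64 `image_url` content part follows the common OpenAI vision schema and is an assumption about what this server accepts, not something the diff confirms.

```python
# Build an OpenAI-style chat.completions payload for the local axllm server.
# API_URL and MODEL come from the serve log; the base64 image_url part is an
# assumption based on the common OpenAI vision schema.
import base64
import json

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/FastVLM-1.5B-GPTQ-Int4"

def build_request(prompt, image_bytes=None):
    """Return the request body; pass PNG bytes to attach an image."""
    user_content = [{"type": "text", "text": prompt}]
    if image_bytes is not None:
        data_uri = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
        user_content.append({"type": "image_url", "image_url": {"url": data_uri}})
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
            {"role": "user", "content": user_content},
        ],
    }

# With the openai package, this payload maps onto (hypothetical usage):
#   client = OpenAI(base_url=API_URL, api_key="none")
#   completion = client.chat.completions.create(**build_request("describe the image", png_bytes))
#   print(completion.choices[0].message.content)
print(json.dumps(build_request("who are you"))[:60])
```

The `api_key` value is a placeholder; an OpenAI-compatible local server typically ignores it but the client library requires one.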