SmolVLM2-500M-Video-Instruct

This version of SmolVLM2-500M-Video-Instruct has been converted to run on the Axera NPU using w8a16 quantization.

Compatible with Pulsar2 version: 5.0

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel from the original repo:

https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

Supported Platforms

AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card

Image Processing

| Chip | Input size | Number of images | Image encoder | TTFT (344 tokens) | w8a16 decode speed | CMM | Flash |
|---|---|---|---|---|---|---|---|
| AX650 | 512*512 | 1 | 537 ms | 510 ms | 35.23 tokens/sec | 773 MB | 813 MB |

Video Processing

| Chip | Input size | Number of images | Image encoder | TTFT (656 tokens) | w8a16 decode speed | CMM | Flash |
|---|---|---|---|---|---|---|---|
| AX650 | 512*512 | 8 | 832 ms | 1523 ms | 35.32 tokens/sec | 773 MB | 813 MB |

The CMM column gives the amount of CMM memory the model consumes. Make sure the CMM memory allocated on the development board is greater than this value.
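As a rough worked example of how the benchmark figures combine, the sketch below estimates the end-to-end response time of a single request. It assumes the image encoder, prefill (TTFT), and decode stages run back to back and that their times simply add. That additivity, the function name `estimate_latency_ms`, and the 64-token answer length are illustrative assumptions, not something the benchmark states. If the reported TTFT already includes encoder time, pass 0 for the encoder term.

```python
def estimate_latency_ms(encoder_ms: float, ttft_ms: float,
                        new_tokens: int, tokens_per_sec: float) -> float:
    """Rough end-to-end latency, assuming the three stages run sequentially."""
    decode_ms = new_tokens / tokens_per_sec * 1000.0
    return encoder_ms + ttft_ms + decode_ms


# Figures from the AX650 rows above; 64 generated tokens is an arbitrary choice.
image_ms = estimate_latency_ms(537, 510, new_tokens=64, tokens_per_sec=35.23)
video_ms = estimate_latency_ms(832, 1523, new_tokens=64, tokens_per_sec=35.32)
print(f"image request: ~{image_ms / 1000:.1f} s")
print(f"video request: ~{video_ms / 1000:.1f} s")
```

For short answers the encoder and prefill stages account for a large share of the estimate, while for long answers the decode term dominates.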

How to use

Install axllm

Option 1: Clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: Install with a single command (default branch: axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

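Option 3: Download the executable built by GitHub Actions CI (for users without a build environment):

If you do not have a build environment, go to https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm and download the latest CI-built executable (axllm), then: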

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Download the model (Hugging Face)

mkdir -p AXERA-TECH/SmolVLM2-500M-Video-Instruct
cd AXERA-TECH/SmolVLM2-500M-Video-Instruct
hf download AXERA-TECH/SmolVLM2-500M-Video-Instruct --local-dir .

# structure of the downloaded files
tree -L 3
.
└── AXERA-TECH
    └── SmolVLM2-500M-Video-Instruct
        ├── config.json
        ├── image.png
        ├── llama_p128_l0_together.axmodel
        ...
        ├── llama_p128_l9_together.axmodel
        ├── llama_post.axmodel
        ├── model.embed_tokens.weight.bfloat16.bin
        ├── post_config.json
        ├── README.md
        ├── smolvlm2_tokenizer.txt
        ├── vision_cache
        └── vision_model_1x3x512x512_NHwC_U8.axmodel

4 directories, 40 files
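Before starting the runtime it can be useful to confirm that the download is complete. The following is a minimal sketch that checks for the file names shown in the listing above. The per-layer `llama_p128_l*_together.axmodel` files are matched with a glob because the listing elides most of them, so only "at least one exists" is checked. `missing_model_files` is a hypothetical helper, not part of axllm.

```python
from pathlib import Path

# Names taken from the tree listing above.
REQUIRED_FILES = [
    "config.json",
    "post_config.json",
    "llama_post.axmodel",
    "model.embed_tokens.weight.bfloat16.bin",
    "smolvlm2_tokenizer.txt",
]
# Patterns for files whose exact names or count are not fully listed.
REQUIRED_GLOBS = ["llama_p128_l*_together.axmodel", "vision_model_*.axmodel"]


def missing_model_files(model_dir: str) -> list[str]:
    """Return the required names/patterns that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [pattern for pattern in REQUIRED_GLOBS if not any(root.glob(pattern))]
    return missing


if __name__ == "__main__":
    problems = missing_model_files("AXERA-TECH/SmolVLM2-500M-Video-Instruct")
    print("all expected files present" if not problems else f"missing: {problems}")
```

An empty result only means the named files exist; it does not verify their size or integrity.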

Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board

Run (CLI)

root@ax650:~# axllm run AXERA-TECH/SmolVLM2-500M-Video-Instruct/
[I][                            Init][ 138]: LLM init start
tokenizer_type = 1
 97% | ████████████████████████████████  |  34 /  35 [3.08s<3.17s, 11.03 count/s] init post axmodel ok,remain_cmm(11158 MB)
[I][                            Init][ 199]: max_token_len : 1023
[I][                            Init][ 202]: kv_cache_size : 320, kv_cache_num: 1023
[I][                            Init][ 205]: prefill_token_num : 128
[I][                            Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][                            Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][                            Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][                            Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][                            Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][                            Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][                            Init][ 214]: prefill_max_token_num : 768
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  35 /  35 [3.08s<3.08s, 11.35 count/s] embed_selector init ok
[W][                            Init][ 526]: SmolVLM2 vision size override: cfg=448x448 -> model=512x512
[I][                            Init][ 666]: VisionModule init ok: type=SmolVLM2, tokens_per_block=64, embed_size=960, out_dtype=fp32
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 272]: LLM init ok
Type "q" to exit
Ctrl+c to stop current running
"reset" to reset kvcache
"dd" to remove last conversation.
"pp" to print history.
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> who are you
image >>
[W][             apply_chat_template][  80]: system content is not supported
[I][                      SetKVCache][ 406]: prefill_grpid:2 kv_cache_num:128 precompute_len:0 input_num_token:11
[I][                      SetKVCache][ 408]: current prefill_max_token_num:768
[I][                      SetKVCache][ 409]: first run
[I][                             Run][ 457]: input token num : 11, prefill_split_num : 1
[I][                             Run][ 497]: prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=11
[I][                             Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
[I][                             Run][ 627]: ttft: 119.38 ms
 I'm an AI assistant, I don't have personal experiences or emotions, but I can provide information and answer questions based on my programming and knowledge base.

I was trained on a vast amount of text data, including books, articles, and websites, which enables me to understand and respond to a wide range of questions.

If you have a question or need help with a task, feel free to ask!

[N][                             Run][ 709]: hit eos,avg 29.32 token/s

[I][                      GetKVCache][ 380]: precompute_len:96, remaining:672
[W][             apply_chat_template][  80]: system content is not supported
prompt >> how many people in the image?
image >> ./AXERA-TECH/SmolVLM2-500M-Video-Instruct/image.png
[I][                EncodeForContent][ 994]: vision cache store: ./AXERA-TECH/SmolVLM2-500M-Video-Instruct/image.png
[W][             apply_chat_template][  80]: system content is not supported
[I][                      SetKVCache][ 406]: prefill_grpid:5 kv_cache_num:512 precompute_len:96 input_num_token:350
[I][                      SetKVCache][ 408]: current prefill_max_token_num:640
[I][                             Run][ 457]: input token num : 350, prefill_split_num : 3
[I][                             Run][ 497]: prefill chunk p=0 history_len=96 grpid=2 kv_cache_num=128 input_tokens=128
[I][                             Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
[I][                             Run][ 497]: prefill chunk p=1 history_len=224 grpid=3 kv_cache_num=256 input_tokens=128
[I][                             Run][ 519]: prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=0
[I][                             Run][ 497]: prefill chunk p=2 history_len=352 grpid=4 kv_cache_num=384 input_tokens=94
[I][                             Run][ 519]: prefill indices shape: p=2 idx_elems=128 idx_rows=1 pos_rows=0
[I][                             Run][ 627]: ttft: 681.38 ms
 There are three people in the image.

[N][                             Run][ 709]: hit eos,avg 30.19 token/s

[I][                      GetKVCache][ 380]: precompute_len:454, remaining:314
[W][             apply_chat_template][  80]: system content is not supported
prompt >> q
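The log above reports `prefill_token_num : 128` and shows the 350-token image prompt being prefilled in three chunks of 128, 128, and 94 tokens. After each turn, `remaining` equals `prefill_max_token_num` minus `precompute_len`. The sketch below reproduces that arithmetic. It is inferred from the log output only, not from the axllm source, and both function names are hypothetical.

```python
def split_prefill(input_tokens: int, chunk_size: int = 128) -> list[int]:
    """Split a prompt into full chunks plus one trailing partial chunk."""
    full, rest = divmod(input_tokens, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def remaining_budget(precompute_len: int, prefill_max_token_num: int = 768) -> int:
    """Tokens still available for prefill once the history is accounted for."""
    return prefill_max_token_num - precompute_len


print(split_prefill(11))      # text-only turn: [11]
print(split_prefill(350))     # image turn: [128, 128, 94]
print(remaining_budget(96))   # after the first turn: 672
print(remaining_budget(454))  # after the second turn: 314
```

This makes it easier to judge how many more turns, or how many more images, fit before a `reset` is needed.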

Start the server (OpenAI-compatible)

root@ax650:~# axllm serve AXERA-TECH/SmolVLM2-500M-Video-Instruct/
[I][                            Init][ 138]: LLM init start
tokenizer_type = 1
 97% | ████████████████████████████████  |  34 /  35 [1.91s<1.97s, 17.77 count/s] init post axmodel ok,remain_cmm(11158 MB)
[I][                            Init][ 199]: max_token_len : 1023
[I][                            Init][ 202]: kv_cache_size : 320, kv_cache_num: 1023
[I][                            Init][ 205]: prefill_token_num : 128
[I][                            Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][                            Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][                            Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][                            Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][                            Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][                            Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][                            Init][ 214]: prefill_max_token_num : 768
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  35 /  35 [1.91s<1.91s, 18.29 count/s] embed_selector init ok
[W][                            Init][ 526]: SmolVLM2 vision size override: cfg=448x448 -> model=512x512
[I][                            Init][ 666]: VisionModule init ok: type=SmolVLM2, tokens_per_block=64, embed_size=960, out_dtype=fp32
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/SmolVLM2-500M-Video-Instruct'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/SmolVLM2-500M-Video-Instruct
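Model initialization takes a few seconds, so a client script started right after `axllm serve` may connect too early. A minimal sketch of a readiness check follows. It only tests that the TCP port accepts connections and makes no assumption about which HTTP endpoints the server exposes. `wait_for_port` is a hypothetical helper.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False
```

For example, calling `wait_for_port("127.0.0.1", 8000)` before creating the client avoids a connection error while the server is still loading.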

OpenAI client example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/SmolVLM2-500M-Video-Instruct"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)
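The example above sends text only. Because the CLI accepts an image path, a natural experiment is to send an image through the API as well. The sketch below assumes that `axllm serve` accepts the OpenAI-style `image_url` content part carrying a base64 data URL. The source does not confirm this, so treat it as something to try rather than documented behavior. `build_image_messages` and `ask_about_image` are hypothetical helpers.

```python
import base64
import mimetypes
from pathlib import Path


def build_image_messages(prompt: str, image_path: str) -> list[dict]:
    """Build an OpenAI-style user message carrying one image as a data URL."""
    mime = mimetypes.guess_type(image_path)[0] or "application/octet-stream"
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }
    ]


def ask_about_image(prompt: str, image_path: str) -> str:
    """Send the prompt and image to the local server (assumed to accept image_url)."""
    # Imported here so build_image_messages stays usable without the openai package.
    from openai import OpenAI

    client = OpenAI(api_key="not-needed", base_url="http://127.0.0.1:8000/v1")
    completion = client.chat.completions.create(
        model="AXERA-TECH/SmolVLM2-500M-Video-Instruct",
        messages=build_image_messages(prompt, image_path),
    )
    return completion.choices[0].message.content
```

With the server running, `ask_about_image("how many people in the image?", "image.png")` would mirror the CLI session shown earlier, provided the server supports this content type.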

OpenAI streaming example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/SmolVLM2-500M-Video-Instruct"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("\n")
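When the streamed text is needed as a single string, or when you want a client-side view of time to first token, the loop above can be wrapped in a small helper. This is a sketch: `collect_stream` is a hypothetical name, and the timing starts when iteration begins rather than when the request was sent, so it only approximates the `ttft` value the server logs. The stand-in events exist so the helper can be tried without a running server.

```python
import time
from types import SimpleNamespace


def collect_stream(stream):
    """Return (full_text, seconds_to_first_content) for a chat-completion stream."""
    start = time.monotonic()
    first_token_s = None
    parts = []
    for ev in stream:
        if not ev.choices:
            continue  # some chunks may carry no choices
        content = getattr(ev.choices[0].delta, "content", None)
        if content:
            if first_token_s is None:
                first_token_s = time.monotonic() - start
            parts.append(content)
    return "".join(parts), first_token_s


# Stand-in events shaped like streamed chunks (not real server output).
fake_stream = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])
    for text in ("Hel", "lo", None, "!")
]
text, ttft_s = collect_stream(fake_stream)
print(text)  # Hello!
```

With a real server, pass the object returned by `client.chat.completions.create(..., stream=True)` in place of `fake_stream`.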
