Gemma 4 E2B on AXERA NPU
Ready-to-run deployment package for google/gemma-4-E2B-it on AX650 / NPU3.
- This release packages the w8a16 AXERA NPU runtime.
- Compatible with Pulsar2 5.2 and later.
- Includes the tokenizer/config files required at runtime.
- Includes compiled Gemma 4 text .axmodel files, Vision .axmodel files, and fixed-duration Audio .axmodel files.
- Supports text-only chat, single-image multimodal inference, and fixed-duration audio transcription in the legacy Python demo flow.
Supported Platform
- AX650 / NPU3
Validated Devices
This package has been validated on the following AX650-based devices:
- AX650N Demo Board
- M4N-Dock (爱芯派 Pro)
- M.2 Accelerator Card
Performance
All measurements below were taken on AX650 / NPU3. TTFT stands for time to first token.
- w8a16: TTFT is approximately 2175 ms (1152 tokens), with a decode throughput of approximately 7.99 tokens/s (theoretical maximum).
- w4a16: TTFT is approximately 1568 ms (1152 tokens), with a decode throughput of approximately 12.41 tokens/s (theoretical maximum).
The packaged text runtime in this release is the w8a16 build. Its text runtime files are packaged at the repository root. The w4a16 numbers are provided for reference only.
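The TTFT and throughput figures combine into a rough end-to-end latency estimate; a back-of-the-envelope sketch (not a benchmark tool shipped with this package):

```python
def estimate_latency_ms(ttft_ms, decode_tps, new_tokens):
    """Rough end-to-end latency: prefill time (TTFT) plus decode time
    at the steady-state throughput. Ignores sampling overhead."""
    return ttft_ms + new_tokens * 1000.0 / decode_tps

# w8a16 at 1152 prompt tokens, generating 128 new tokens:
print(round(estimate_latency_ms(2175, 7.99, 128)))  # ~18195 ms
```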
Startup Runtime Footprint
| Item | Value |
|---|---|
| Flash total (all 41 axmodels) | 3.84 GiB (3933.57 MiB) |
| Runtime CMM total (default config) | 3.80 GiB (3888.68 MiB) |
Vision Encoder Latency
| Model | Resolution | Soft Tokens | Time (ms) |
|---|---|---|---|
| gemma4_vision_h336_w480_t70.axmodel | 336×480 | 70 | 87.966 |
| gemma4_vision_h480_w672_t140.axmodel | 480×672 | 140 | 258.329 |
| gemma4_vision_h672_w960_t280.axmodel | 672×960 | 280 | 750.429 |
Audio Encoder Latency and Accuracy
Validated on AX650 / NPU3 with the packaged WAV sample clips and the legacy Python demo flow.
| Model | Audio Duration | Audio Tokens |
|---|---|---|
| gemma4_audio_5s.axmodel | 5 s | 125 |
| gemma4_audio_30s.axmodel | 30 s | 750 |
Single-run latency measured with /opt/bin/ax_run_model -w 1 -r 5:
| Model | CMM Size | Avg Latency |
|---|---|---|
| gemma4_audio_5s.axmodel | ~310 MiB | 28.930 ms |
| gemma4_audio_30s.axmodel | ~359 MiB | 170.978 ms |
Package Layout
.
├── README.md
├── config.json
├── post_config.json
├── infer_axmodel.py
├── gradio_demo.py
├── assets/
├── gemma_4_e2b_it_tokenizer/
├── gemma4_tokenizer.txt
├── gemma4_text_p128_l*.axmodel
├── gemma4_text_post.axmodel
├── gemma4_audio_5s.axmodel
├── gemma4_audio_30s.axmodel
├── gemma4_vision_h336_w480_t70.axmodel
├── gemma4_vision_h480_w672_t140.axmodel
├── gemma4_vision_h672_w960_t280.axmodel
├── model.embed_tokens_per_layer.weight.npy
├── model.embed_tokens.weight.bfloat16.bin
├── model.per_layer_model_projection.weight.npy
├── model.per_layer_projection_norm.weight.npy
├── vit_models/
└── utils/
This package uses a hybrid layout: the tokenizer stays in a subdirectory, the packaged text runtime files plus the Vision and Audio .axmodel files live at the repository root, and vit_models/ keeps the accompanying Vision metadata JSON files.
The Python demo scripts auto-detect the packaged paths above. If you keep this layout unchanged, you can run the Python examples later in this README without passing extra path arguments.
Sample Image
Both the axllm flow and the legacy Python demo flow below can use the packaged sample image:
assets/sample.png
Sample Audio
The package also includes three packaged WAV clips for board-side audio validation:
- assets/gemma4_audio_test_5s.wav
- assets/gemma4_audio_test_chunk0_30s.wav
- assets/gemma4_audio_test_chunk1_30s.wav
Direct Inference with axllm
The axllm workflow is still being refined. The instructions below reflect the current validated flow and may be adjusted as the packaging continues to evolve.
Download the Model Package
Download the release package from Hugging Face:
mkdir -p AXERA-TECH/gemma-4-E2B-it
cd AXERA-TECH/gemma-4-E2B-it
hf download AXERA-TECH/gemma-4-E2B-it --local-dir .
Install axllm
Option 1: clone the repository and run the installer:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 2: install with a one-line command (default branch: axllm):
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
Option 3: download the prebuilt binary from GitHub Actions CI:
If you do not have a local build environment, download the latest CI-generated axllm binary from GitHub Actions:
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
Then run:
chmod +x axllm
sudo mv axllm /usr/bin/axllm
Run on the Board
The package root is already arranged for axllm, so no extra runtime path arguments are required.
Note: the command below assumes you run it from the parent directory of AXERA-TECH/gemma-4-E2B-it. If you are already inside the package directory, use axllm run . instead.
For multimodal testing, you can use the sample image shown above: ./assets/sample.png.
$ axllm run AXERA-TECH/gemma-4-E2B-it
# output log example:
14:57:27.565 INF Init:1019 | LLM init start
14:57:27.565 INF Init:1034 | shared kv enabled: num_kv_shared_layers=20
14:57:27.565 INF Init:1050 | attention config: layers=35 sliding=28 full=7 linear=0 sliding_window=512 ref_full_layer_idx=0
tokenizer_type = 3
huggingface tokenizer mode = space_replace_bpe
31% | ########## | 12 / 38 [3.41s<10.80s, 3.52 count/s] init 10 axmodel ok,remain_cmm(7282 MB 34% | ########## | 13 / 38 [3.48s<10.18s, 3.73 count/s] init 11 axmodel ok,remain_cmm(7227 MB 97% | ############################### | 37 / 38 [6.57s<6.74s, 5.64 count/s] init post axmodel ok,remain_cmm(4868 MB)
14:57:34.130 INF Init:1196 | max_token_len : 2047
14:57:34.131 INF Init:1199 | kv_cache_size : 256, kv_cache_num: 2047
14:57:34.131 INF init_groups_from_model:702 | prefill_token_num : 128
14:57:34.131 INF init_groups_from_model:916 | decode grp: 0, gid: 0, max_token_len : 2047
14:57:34.131 INF init_groups_from_model:920 | prefill grp: 0, gid: 1, history_cap: 0, total_cap: 128, symbolic_cap: 1
14:57:34.131 INF init_groups_from_model:920 | prefill grp: 1, gid: 2, history_cap: 128, total_cap: 256, symbolic_cap: 128
14:57:34.131 INF init_groups_from_model:920 | prefill grp: 2, gid: 3, history_cap: 256, total_cap: 384, symbolic_cap: 256
14:57:34.131 INF init_groups_from_model:920 | prefill grp: 3, gid: 4, history_cap: 384, total_cap: 512, symbolic_cap: 384
14:57:34.131 INF init_groups_from_model:920 | prefill grp: 4, gid: 5, history_cap: 512, total_cap: 640, symbolic_cap: 512
14:57:34.131 INF init_groups_from_model:920 | prefill grp: 5, gid: 6, history_cap: 640, total_cap: 768, symbolic_cap: 640
14:57:34.131 INF init_groups_from_model:920 | prefill grp: 6, gid: 7, history_cap: 768, total_cap: 896, symbolic_cap: 768
14:57:34.131 INF init_groups_from_model:920 | prefill grp: 7, gid: 8, history_cap: 896, total_cap: 1024, symbolic_cap: 896
14:57:34.131 INF init_groups_from_model:920 | prefill grp: 8, gid: 9, history_cap: 1024, total_cap: 1152, symbolic_cap: 1024
14:57:34.131 INF init_groups_from_model:927 | prefill_max_token_num : 1152
14:57:34.131 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ | 38 / 38 [6.57s<6.57s, 5.78 count/s] embed_selector init ok
14:57:34.149 INF Init:475 | Gemma4 per-layer helper enabled: vocab=262144 hidden=1536 layers=35 per_layer=256 pad=0
14:57:39.791 INF init_audio_profile:245 | Gemma4 audio profile init ok: path=../../gemma-4-E2B-it/gemma4_audio_5s.axmodel duration=5.0s mel_frames=499 tokens=125 out_dtype=fp32
14:57:40.049 INF init_audio_profile:245 | Gemma4 audio profile init ok: path=../../gemma-4-E2B-it/gemma4_audio_30s.axmodel duration=30.0s mel_frames=2999 tokens=750 out_dtype=fp32
14:57:40.049 INF Init:914 | Gemma4-VL token ids: image_pad=258880 video_pad=258884 audio_pad=258881
14:57:40.049 INF Init:921 | VisionModule init ok: type=Gemma4VL, tokens_per_block=70, embed_size=1536, out_dtype=fp32
14:57:40.049 WRN Init:930 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
14:57:40.054 INF load_config:444 | load config:
14:57:40.054 INF load_config:444 | {
14:57:40.054 INF load_config:444 | "enable_repetition_penalty": false,
14:57:40.054 INF load_config:444 | "enable_temperature": false,
14:57:40.054 INF load_config:444 | "enable_top_k_sampling": false,
14:57:40.054 INF load_config:444 | "enable_top_p_sampling": false,
14:57:40.054 INF load_config:444 | "penalty_window": 64,
14:57:40.054 INF load_config:444 | "repetition_penalty": 1.0,
14:57:40.054 INF load_config:444 | "temperature": 1.0,
14:57:40.054 INF load_config:444 | "top_k": 64,
14:57:40.054 INF load_config:444 | "top_p": 0.95
14:57:40.054 INF load_config:444 | }
14:57:40.055 INF Init:1293 | LLM init ok
Commands:
/q, /exit quit
/reset reset the kv cache
/dd delete one dialogue turn
/pp print the dialogue history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input media path (empty = text-only). Use "video:<frames_dir>" for video, "audio:<file>" for audio.
----------------------------------------
prompt >> hello, who are you?
media >>
14:58:01.823 INF SetKVCache:1607 | decode_grpid:0 prefill_grpid:1 history_cap:0 total_cap:128 symbolic_cap:1 precompute_len:0 input_num_token:26 prefer_symbolic_group:0
14:58:01.823 INF SetKVCache:1628 | current prefill_max_token_num:1152
14:58:01.863 INF SetKVCache:1632 | first run
14:58:01.884 INF Run:1736 | input token num : 26, prefill_split_num : 1
14:58:02.014 INF Run:1819 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=26
14:58:02.014 INF Run:1843 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
14:58:02.267 INF Run:2028 | ttft: 383.05 ms
Hello! I am Gemma 4, a Large Language Model developed by Google DeepMind. I am an open weights model. How can I help you today?
14:58:07.195 NTC Run:2396 | hit eos,decode avg 6.29 token/s
14:58:07.198 INF GetKVCache:1572 | precompute_len:58, remaining:1094 (tracked)
prompt >> Okay, great!
media >>
14:58:20.884 INF SetKVCache:1607 | decode_grpid:0 prefill_grpid:2 history_cap:128 total_cap:256 symbolic_cap:128 precompute_len:58 input_num_token:14 prefer_symbolic_group:0
14:58:20.885 INF SetKVCache:1628 | current prefill_max_token_num:1024
14:58:20.892 INF Run:1736 | input token num : 14, prefill_split_num : 1
14:58:20.937 INF Run:1819 | prefill chunk p=0 history_len=58 grpid=2 kv_cache_num=128 input_tokens=14
14:58:20.938 INF Run:1843 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
14:58:21.210 INF Run:2028 | ttft: 318.07 ms
I'm happy to help! What can I do for you? Do you have a question, need some information, want to brainstorm some ideas, or anything else? 😊
14:58:26.377 NTC Run:2396 | hit eos,decode avg 6.58 token/s
14:58:26.383 INF GetKVCache:1572 | precompute_len:107, remaining:1045 (tracked)
prompt >> Please describe this image in detail.
media >> /root/your/workspace/gemma-4-E2B-it/assets/sample.png
15:01:41.584 INF EncodeForContent:1464 | vision cache store: /root/your/workspace/gemma-4-E2B-it/assets/sample.png
15:01:41.610 INF SetKVCache:1607 | decode_grpid:0 prefill_grpid:3 history_cap:256 total_cap:384 symbolic_cap:256 precompute_len:107 input_num_token:91 prefer_symbolic_group:1
15:01:41.610 INF SetKVCache:1628 | current prefill_max_token_num:1024
15:01:41.630 INF Run:1736 | input token num : 91, prefill_split_num : 1
15:01:41.922 INF Run:1819 | prefill chunk p=0 history_len=107 grpid=3 kv_cache_num=256 input_tokens=91
15:01:41.923 INF Run:1843 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
15:01:42.232 INF Run:2028 | ttft: 601.69 ms
This image is a cartoon illustration of a **red, stylized lobster**.
Here is a detailed description:
* **Subject:** The central subject is a lobster, depicted in a vibrant, bright red color.
* **Style:** The illustration is highly stylized and cartoonish, featuring thick outlines and exaggerated features, giving it a playful or energetic look.
* **Pose/Expression:** The lobster appears to be in an active or aggressive pose. It has prominent claws (chelipeds) that are raised, suggesting it might be waving, striking, or ready for action. Its body is curved dynamically.
* **Details:** You can clearly see the segmented body, the claws, and the legs. The overall design is bold, with clean lines and a glossy or slightly textured appearance typical of a sticker or graphic design.
* **Background:** The lobster is isolated on a plain white background, which makes the red color and the details of the illustration stand out prominently.
* **Overall Impression:** The image is energetic, bold, and clearly designed to be eye-catching, likely intended for use as a sticker, icon, or graphic element.
15:02:16.834 NTC Run:2396 | hit eos,decode avg 6.85 token/s
15:02:17.013 INF GetKVCache:1572 | precompute_len:436, remaining:716 (tracked)
prompt >>
Serve with axllm
To launch the packaged model through the local axllm service:
Note: the command below assumes you run it from the parent directory of AXERA-TECH/gemma-4-E2B-it. If you are already inside the package directory, use axllm serve . --port 8000 instead.
$ axllm serve AXERA-TECH/gemma-4-E2B-it --port 8000
# output log example:
15:05:32.638 INF Init:1019 | LLM init start
15:05:32.638 INF Init:1034 | shared kv enabled: num_kv_shared_layers=20
15:05:32.638 INF Init:1050 | attention config: layers=35 sliding=28 full=7 linear=0 sliding_window=512 ref_full_layer_idx=0
tokenizer_type = 3
huggingface tokenizer mode = space_replace_bpe
31% | ########## | 12 / 38 [3.32s<10.50s, 3.62 count/s] init 10 axmodel ok,remain_cmm(7282 MB 97% | ############################### | 37 / 38 [6.50s<6.67s, 5.69 count/s] init post axmodel ok,remain_cmm(4868 MB)
15:05:39.135 INF Init:1196 | max_token_len : 2047
15:05:39.135 INF Init:1199 | kv_cache_size : 256, kv_cache_num: 2047
15:05:39.135 INF init_groups_from_model:702 | prefill_token_num : 128
15:05:39.135 INF init_groups_from_model:916 | decode grp: 0, gid: 0, max_token_len : 2047
15:05:39.135 INF init_groups_from_model:920 | prefill grp: 0, gid: 1, history_cap: 0, total_cap: 128, symbolic_cap: 1
15:05:39.135 INF init_groups_from_model:920 | prefill grp: 1, gid: 2, history_cap: 128, total_cap: 256, symbolic_cap: 128
15:05:39.135 INF init_groups_from_model:920 | prefill grp: 2, gid: 3, history_cap: 256, total_cap: 384, symbolic_cap: 256
15:05:39.135 INF init_groups_from_model:920 | prefill grp: 3, gid: 4, history_cap: 384, total_cap: 512, symbolic_cap: 384
15:05:39.135 INF init_groups_from_model:920 | prefill grp: 4, gid: 5, history_cap: 512, total_cap: 640, symbolic_cap: 512
15:05:39.135 INF init_groups_from_model:920 | prefill grp: 5, gid: 6, history_cap: 640, total_cap: 768, symbolic_cap: 640
15:05:39.135 INF init_groups_from_model:920 | prefill grp: 6, gid: 7, history_cap: 768, total_cap: 896, symbolic_cap: 768
15:05:39.135 INF init_groups_from_model:920 | prefill grp: 7, gid: 8, history_cap: 896, total_cap: 1024, symbolic_cap: 896
15:05:39.135 INF init_groups_from_model:920 | prefill grp: 8, gid: 9, history_cap: 1024, total_cap: 1152, symbolic_cap: 1024
15:05:39.135 INF init_groups_from_model:927 | prefill_max_token_num : 1152
15:05:39.135 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ | 38 / 38 [6.50s<6.50s, 5.85 count/s] embed_selector init ok
15:05:39.151 INF Init:475 | Gemma4 per-layer helper enabled: vocab=262144 hidden=1536 layers=35 per_layer=256 pad=0
15:05:39.595 INF init_audio_profile:245 | Gemma4 audio profile init ok: path=../../gemma-4-E2B-it/gemma4_audio_5s.axmodel duration=5.0s mel_frames=499 tokens=125 out_dtype=fp32
15:05:39.850 INF init_audio_profile:245 | Gemma4 audio profile init ok: path=../../gemma-4-E2B-it/gemma4_audio_30s.axmodel duration=30.0s mel_frames=2999 tokens=750 out_dtype=fp32
15:05:39.850 INF Init:914 | Gemma4-VL token ids: image_pad=258880 video_pad=258884 audio_pad=258881
15:05:39.850 INF Init:921 | VisionModule init ok: type=Gemma4VL, tokens_per_block=70, embed_size=1536, out_dtype=fp32
15:05:39.850 WRN Init:930 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
15:05:39.852 INF load_config:444 | load config:
15:05:39.852 INF load_config:444 | {
15:05:39.852 INF load_config:444 | "enable_repetition_penalty": false,
15:05:39.852 INF load_config:444 | "enable_temperature": false,
15:05:39.852 INF load_config:444 | "enable_top_k_sampling": false,
15:05:39.852 INF load_config:444 | "enable_top_p_sampling": false,
15:05:39.852 INF load_config:444 | "penalty_window": 64,
15:05:39.852 INF load_config:444 | "repetition_penalty": 1.0,
15:05:39.852 INF load_config:444 | "temperature": 1.0,
15:05:39.852 INF load_config:444 | "top_k": 64,
15:05:39.852 INF load_config:444 | "top_p": 0.95
15:05:39.852 INF load_config:444 | }
15:05:39.852 INF Init:1293 | LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/gemma-4-E2B-it'...
API URLs:
GET http://127.0.0.1:8000/health
GET http://127.0.0.1:8000/v1/models
POST http://127.0.0.1:8000/v1/chat/completions
GET http://10.168.232.217:8000/health
GET http://10.168.232.217:8000/v1/models
POST http://10.168.232.217:8000/v1/chat/completions
GET http://172.17.0.1:8000/health
GET http://172.17.0.1:8000/v1/models
POST http://172.17.0.1:8000/v1/chat/completions
Aliases:
GET http://127.0.0.1:8000/models
POST http://127.0.0.1:8000/chat/completions
GET http://10.168.232.217:8000/models
POST http://10.168.232.217:8000/chat/completions
GET http://172.17.0.1:8000/models
POST http://172.17.0.1:8000/chat/completions
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/gemma-4-E2B-it
You can then send requests to the server using the API endpoints shown in the log. For example, to check the health status and list the available models:
$ curl http://127.0.0.1:8000/health
$ curl http://127.0.0.1:8000/v1/models
# Example output:
root@ax650 ~ # curl http://127.0.0.1:8000/health
{
"concurrency": 0,
"max_concurrency": 1,
"status": "healthy"
}
root@ax650 ~ # curl http://127.0.0.1:8000/v1/models
{
"data": [
{
"created": 1777019000,
"id": "AXERA-TECH/gemma-4-E2B-it",
"object": "model",
"owned_by": "openai-api"
}
],
"object": "list"
}
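Beyond curl, you can call the service from Python with only the standard library. A minimal client sketch against the /v1/chat/completions endpoint shown in the log; the response schema is assumed to follow the standard OpenAI chat-completions shape:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # board address from the serve log

def build_chat_request(prompt, model="AXERA-TECH/gemma-4-E2B-it"):
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt):
    """POST the payload and return the reply text (assumes the standard
    choices[0].message.content response layout)."""
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("hello, who are you?")  # requires the axllm server to be running
```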
Browser UI with lite_webui
If you want a browser UI for the OpenAI-compatible service started by axllm serve, use AXERA-TECH/lite_webui.
Set the OpenAI base URL to http://<board-ip>:8000 and the model name to AXERA-TECH/gemma-4-E2B-it.
Python Runtime Requirements
Install the following packages on the AX board:
- pyaxengine
- transformers>=5.5.0
- numpy
- ml_dtypes
- pillow
- torch
- gradio (for the web demo only)
Before running any Python demo command in this package, make sure the Python dependency overlay is visible in PYTHONPATH:
export PYTHONPATH=/path/to/your/gemma4_pydeps:$PYTHONPATH
If your board image ships with an older transformers stack, this pure-Python overlay is the recommended way to supply the required runtime dependencies.
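To confirm the overlay actually takes precedence over the system site-packages, a small check like the following can help (a hypothetical helper; the overlay path mirrors the placeholder used in the export above):

```python
import sys

def overlay_active(overlay_dir, path=None):
    """Return True if overlay_dir appears on the import path before any
    site-packages entry, i.e. the overlay's packages win at import time."""
    for entry in (sys.path if path is None else path):
        if entry == overlay_dir:
            return True   # overlay is seen first
        if "site-packages" in entry:
            return False  # system packages would shadow the overlay
    return False

# overlay_active("/path/to/your/gemma4_pydeps")
```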
Legacy Python Demo Flow
Enter the package directory on the board:
cd /path/to/your/gemma-4-E2B-it
Text-Only Inference
Run the following command:
python3 infer_axmodel.py \
--prompt "What is the capital of the United States?" \
--max_new_tokens 256
A typical output looks like this:
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 35/35 [00:02<00:00, 12.73it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
answer >> The capital of the United States is **Washington, D.C.**
Multimodal Inference
Use the sample image shown above: assets/sample.png
Recommended profile: 70 soft tokens at 336×480.
python3 infer_axmodel.py \
--image_path ./assets/sample.png \
--prompt "Describe this image in detail." \
--system_prompt "" \
--max_new_tokens 1024
A typical output looks like this:
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 35/35 [00:28<00:00, 1.22it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
answer >> This image is a vibrant, cartoon-style illustration of a **red crab**.
Here's a detailed description:
* **Subject:** The central subject is a crab, depicted in a bright, saturated red color.
* **Style:** The illustration is highly stylized and cartoonish, featuring thick outlines and exaggerated features, giving it a playful or energetic feel.
* **Appearance of the Crab:**
* **Color:** The crab is predominantly bright red.
* **Body:** It has a segmented body typical of a crab, with visible claws and legs.
* **Claws (Chelipeds):** The claws are prominent and appear muscular. The crab is shown with its claws raised, suggesting action or excitement.
* **Eyes/Face:** It has a somewhat expressive face, though simplified.
* **Composition:** The crab is positioned centrally and appears to be moving or posed dynamically.
* **Background:** The background is plain white, which makes the red crab stand out sharply.
* **Outline/Effect:** The illustration has a distinct, thick black outline, and there is a subtle white or light-colored outline effect around the edges, suggesting it might be a sticker, icon, or graphic element.
**Overall Impression:** The image is energetic, bold, and eye-catching, suitable for use as a mascot, icon, or graphic design element.
In addition to the default t70 profile, the package also includes two higher-resolution Vision models:
| VIT file | Resolution | Soft tokens |
|---|---|---|
| gemma4_vision_h336_w480_t70.axmodel | 336×480 | 70 |
| gemma4_vision_h480_w672_t140.axmodel | 480×672 | 140 |
| gemma4_vision_h672_w960_t280.axmodel | 672×960 | 280 |
To use a different profile, pass --vit_model_path explicitly. The runtime will infer the matching soft-token count from the filename:
python3 infer_axmodel.py \
--image_path ./assets/sample.png \
--prompt "Describe this image in detail." \
--system_prompt "" \
--vit_model_path ./gemma4_vision_h480_w672_t140.axmodel \
--max_new_tokens 256
python3 infer_axmodel.py \
--image_path ./assets/sample.png \
--prompt "Describe this image in detail." \
--system_prompt "" \
--vit_model_path ./gemma4_vision_h672_w960_t280.axmodel \
--max_new_tokens 1024
Example output with the 672x960 / t280 profile:
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 35/35 [00:27<00:00, 1.29it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
[WARN] Image token block (group_id=0, pos 5-284) spans 3 prefill slices. Bidirectional attention within earlier slices is partial (chunked prefill limitation).
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> This is a digital illustration of a cartoonish, anthropomorphic **red lobster**.
Here is a detailed description:
* **Subject:** The central subject is a lobster, depicted in a vibrant, glossy red color.
* **Style:** The illustration is rendered in a bold cartoon style, characterized by thick outlines and bright colors that give it a playful, energetic feel.
* **Expression and Pose:** The lobster has a cheerful, confident expression with a wide, toothy smile. It is posed dynamically, as if flexing or striking a pose, with its claws raised.
* Its claws (chelipeds) are prominent and muscular, and one of them appears to be flexed.
* Its body is curved, suggesting motion.
* **Details:** The lobster has visible antennae on its head. The overall design emphasizes its bright red color and gives it a strong, assertive personality.
* **Outline and Background:** The character is outlined in black, which helps define its shape against the plain white background and makes the red lobster stand out prominently.
* **Format:** The image resembles a sticker or clip-art graphic because of its clean, isolated presentation.
In summary, it is a cheerful, stylized, red cartoon lobster flexing its claws.
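As noted above, the runtime infers the soft-token count from the Vision model filename. That encoding can be parsed like this (a hypothetical helper mirroring the packaged filenames, not the demo's actual code):

```python
import re

def parse_vision_profile(filename):
    """Extract (height, width, soft_tokens) from a Vision axmodel name,
    e.g. gemma4_vision_h480_w672_t140.axmodel -> (480, 672, 140)."""
    m = re.search(r"_h(\d+)_w(\d+)_t(\d+)\.axmodel$", filename)
    if m is None:
        raise ValueError(f"unrecognized vision model name: {filename}")
    height, width, tokens = (int(g) for g in m.groups())
    return height, width, tokens

print(parse_vision_profile("gemma4_vision_h672_w960_t280.axmodel"))  # (672, 960, 280)
```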
Audio Inference
The package includes two fixed-duration audio encoders:
- gemma4_audio_5s.axmodel for 5 s / 125 audio tokens
- gemma4_audio_30s.axmodel for 30 s / 750 audio tokens
For board-side validation, use the packaged WAV clips listed in the sample-audio section above.
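Both profiles imply the same rate of 25 audio tokens per second of input, so the expected --audio_tokens value is easy to sanity-check:

```python
TOKENS_PER_SECOND = 25  # inferred from the 5 s / 125 and 30 s / 750 profiles

def audio_tokens(duration_sec):
    """Expected --audio_tokens value for a fixed-duration encoder profile."""
    return duration_sec * TOKENS_PER_SECOND

print(audio_tokens(5), audio_tokens(30))  # 125 750
```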
Example: 5s profile
python3 infer_axmodel.py \
--audio_path ./assets/gemma4_audio_test_5s.wav \
--audio_model_path ./gemma4_audio_5s.axmodel \
--audio_duration_sec 5 \
--audio_tokens 125 \
--system_prompt "" \
--prompt "Transcribe the speech in its original language. Output only the transcription." \
--max_new_tokens 128
Typical output:
answer >> When I was seventeen,I read a quote that went something like if you ...
Example: 30s profile
python3 infer_axmodel.py \
--audio_path ./assets/gemma4_audio_test_chunk0_30s.wav \
--audio_model_path ./gemma4_audio_30s.axmodel \
--audio_duration_sec 30 \
--audio_tokens 750 \
--system_prompt "" \
--prompt "Transcribe the speech in its original language. Output only the transcription." \
--max_new_tokens 256
Typical output:
answer >> No one wants to die. Even people who want to go to heaven don't want to die. Even people who want to go to heaven don't want to die. No one wants to die. Death is the destination we all share and yet death is the destination we all share and no one has ever escaped it and that is as it should be because death is the single best invention of life death is life change out the old make way for the new right now the new is you but someday not too long from now you will gradually become ...
Notes:
- The two commands above were validated on board with a gemma4_pydeps dependency overlay added to PYTHONPATH.
- The 30s / 750-token profile spans multiple 128-token prefill slices. The runtime will print a warning about partial bidirectional attention across earlier slices inside the same multimodal block. This is expected for chunked prefill.
- The Python demo loader handles WAV directly with the Python standard library. For mp3 / flac / m4a / ogg, install librosa on the board.
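The stdlib WAV path can be sketched roughly like this (a simplified 16-bit PCM reader for illustration, not the demo's actual loader):

```python
import wave
import array

def load_wav_pcm16(path):
    """Read a 16-bit PCM WAV file into float samples in [-1, 1],
    downmixing interleaved channels to mono. Resampling is left out."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        pcm = array.array("h", wf.readframes(wf.getnframes()))
        rate = wf.getframerate()
        channels = wf.getnchannels()
    samples = [s / 32768.0 for s in pcm]
    if channels > 1:  # average interleaved channels into one
        samples = [sum(samples[i:i + channels]) / channels
                   for i in range(0, len(samples), channels)]
    return samples, rate

# samples, rate = load_wav_pcm16("./assets/gemma4_audio_test_5s.wav")
```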
Gradio Demo
python3 gradio_demo.py \
--host 0.0.0.0 \
--port 7860
After the server starts, open http://<board-ip>:7860 in your browser.
Packaged Python Runtime Paths
The Python demo scripts use the following default paths:
- Tokenizer and config: ./gemma_4_e2b_it_tokenizer
- Text LLM runtime root: ./
- Vision axmodels: ./
- Audio axmodels: ./
If you move any of these directories, pass the new values with --hf_model, --axmodel_path, --vit_model_path, and --audio_model_path.
For the Python demo flow, --axmodel_path should point to the directory that contains the text runtime files such as gemma4_text_p128_l*.axmodel, gemma4_text_post.axmodel, model.embed_tokens.weight.bfloat16.bin, and the model.*per_layer*.npy files.
These path arguments apply to the Python demo flow only. The axllm flow reads the same root-level runtime files packaged in this repository.
Conversion References
If you need the original model files or want to rebuild the deployment artifacts, start with:
- Original Hugging Face model: google/gemma-4-E2B-it
- AXERA conversion and deployment workflow: AXERA-TECH/gemma-4-E2B-it.axera
Discussion
- GitHub Issues
- QQ group: 139953715