# Qwen3.5-0.8B LiteRT (Multimodal)

This repository contains a LiteRT (formerly TFLite) conversion of Qwen/Qwen3.5-0.8B for on-device inference, packaged in the LiteRT-LM `.litertlm` format. It includes the full multimodal pipeline: language model, vision encoder, and vision adapter for image understanding.
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-0.8B |
| Architecture | Hybrid attention (GatedDeltaNet + Full Attention) + ViT vision encoder |
| Parameters | 752M (language) + 675M (vision encoder) + 10M (vision adapter) |
| Quantization | Dynamic INT8 |
| KV Cache Length | 2048 |
| Prefill Signatures | 64, 128, 256, 512 |
| Vision Signatures | 256, 576, 1024, 2304 patches |
| Format | `.litertlm` (LiteRT-LM container) |
## Architecture

### Language Model
Qwen3.5-0.8B uses a hybrid attention architecture that combines:
- 18 GatedDeltaNet layers (linear attention with recurrent delta rule) at positions 0-2, 4-6, 8-10, 12-14, 16-18, 20-22
- 6 Full Attention layers (standard multi-head attention with output gating and partial RoPE) at positions 3, 7, 11, 15, 19, 23
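Assuming the layer positions listed above are fixed, the interleaving pattern (one full-attention layer after every three GatedDeltaNet layers) can be reproduced with a one-liner; `layer_types` is an illustrative name, not a config key from the model:

```python
# Reconstruct the 24-layer attention pattern described above:
# every 4th layer (0-indexed positions 3, 7, 11, 15, 19, 23) is full attention,
# the rest are GatedDeltaNet (linear attention).
layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
    for i in range(24)
]
```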
### Vision Encoder
The vision encoder is a 27-layer Vision Transformer (ViT):
- Patch embedding: Conv3d (3→1152, kernel=[2,16,16]) with learned position embeddings (bilinear interpolation from 48×48 grid)
- 27 VisionBlocks: LayerNorm → Self-Attention (16 heads, head_dim=72, 2D rotary pos emb) → MLP (1152→4304→1152, GELU)
- Patch merger (vision adapter): Groups 4 adjacent patches (spatial_merge_size=2) and projects to language model dimension (4608→1024)
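As a sketch of the patch-merger shape arithmetic (the random projection below stands in for the real adapter weights, and all variable names are illustrative): each 2×2 group of 1152-dim patch features is concatenated into 4608 dims and projected to the 1024-dim language-model space.

```python
import numpy as np

# Illustrative patch merge: spatial_merge_size=2 groups 2x2 adjacent patches.
h = w = 16                      # patch grid for a 256x256 image (256 patches)
vision_dim, lm_dim = 1152, 1024
patches = np.random.randn(h, w, vision_dim).astype(np.float32)

# Concatenate each 2x2 block -> (h/2, w/2, 4*vision_dim) = (8, 8, 4608)
merged = patches.reshape(h // 2, 2, w // 2, 2, vision_dim)
merged = merged.transpose(0, 2, 1, 3, 4).reshape(h // 2, w // 2, 4 * vision_dim)

# Project to the language-model dimension (random weights, for shape only)
proj = np.random.randn(4 * vision_dim, lm_dim).astype(np.float32)
tokens = merged.reshape(-1, 4 * vision_dim) @ proj   # (64, 1024)
```

A 256-patch image thus becomes 64 language-model tokens, matching the `adapt_64` signature below.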
The model was re-authored from scratch using the LiteRT Generative API. The vision encoder and adapter are exported as separate TFLite models bundled alongside the language model.
## Files
| File | Size | Description |
|---|---|---|
| `qwen35_mm_q8_ekv2048.litertlm` | ~1.2 GB | LiteRT-LM bundle (LM + vision encoder + vision adapter + tokenizer) |
| `qwen35_mm_q8_ekv2048.tflite` | ~757 MB | Language model TFLite |
| `qwen35_vision_encoder_q8.tflite` | ~88 MB | Vision encoder TFLite |
| `qwen35_vision_adapter_q8.tflite` | ~12 MB | Vision adapter TFLite |
| `qwen35_embedder_q8.tflite` | ~245 MB | Text embedder TFLite |
| `tokenizer.json` | ~11 MB | HuggingFace tokenizer |
| `tokenizer_config.json` | ~2 KB | Tokenizer configuration |
## Signatures

### Language Model
| Signature | Input Length | Outputs |
|---|---|---|
| `prefill_64` | 64 tokens | Updated KV cache |
| `prefill_128` | 128 tokens | Updated KV cache |
| `prefill_256` | 256 tokens | Updated KV cache |
| `prefill_512` | 512 tokens | Updated KV cache |
| `decode` | 1 token | Logits + updated KV cache |
### Vision Encoder

| Signature | Patches | Approx. Image Size |
|---|---|---|
| `encode_256` | 256 | 256×256 |
| `encode_576` | 576 | 384×384 |
| `encode_1024` | 1024 | 512×512 |
| `encode_2304` | 2304 | 768×768 |
### Vision Adapter

| Signature | Merged Tokens | From Patches |
|---|---|---|
| `adapt_64` | 64 | 256 |
| `adapt_144` | 144 | 576 |
| `adapt_256` | 256 | 1024 |
| `adapt_576` | 576 | 2304 |
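The two vision tables are linked by simple arithmetic (assuming 16×16 spatial patches, per the Conv3d kernel, and `spatial_merge_size=2`): an H×W image yields (H/16)·(W/16) patches, and the adapter merges each 2×2 group into one token.

```python
# Patch and merged-token counts for each supported image size,
# assuming 16x16 spatial patches and a 2x2 merge (spatial_merge_size=2).
PATCH = 16
MERGE = 2

for side in (256, 384, 512, 768):
    patches = (side // PATCH) ** 2
    merged = patches // (MERGE * MERGE)
    print(f"{side}x{side}: {patches} patches -> {merged} merged tokens")
```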
## Usage

### Python (ai-edge-litert)
```python
import numpy as np
from ai_edge_litert import interpreter as tfl_interpreter

# Load the language model
interp = tfl_interpreter.Interpreter(model_path="qwen35_mm_q8_ekv2048.tflite")
interp.allocate_tensors()

# Initialize KV cache (24 layers, mixed shapes)
kv_cache = {}  # See inference_tflite.py for full initialization

# Prefill: process the prompt in one fixed-length pass
prefill_runner = interp.get_signature_runner("prefill_64")
tokens = np.array([[...]], dtype=np.int32)  # Prompt, padded to 64
input_pos = np.arange(64, dtype=np.int32)
output = prefill_runner(tokens=tokens, input_pos=input_pos, **kv_cache)

# Decode loop: one token at a time
decode_runner = interp.get_signature_runner("decode")
next_token = np.array([[first_token]], dtype=np.int32)  # First token after prefill
pos = prompt_len  # Position of the next token to decode
for step in range(max_tokens):
    output = decode_runner(
        tokens=next_token,
        input_pos=np.array([pos], dtype=np.int32),
        **kv_cache,
    )
    next_token = np.array([[np.argmax(output["logits"][0, -1])]], dtype=np.int32)
    pos += 1
```
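Since prompts must be padded to a fixed prefill length (see Limitations), a small helper can pick the smallest signature that fits. This is a sketch: the helper name and the pad id of 0 are assumptions, not part of the exported model.

```python
# Choose the smallest prefill signature >= prompt length and pad the
# remainder (pad id 0 is an assumption, not read from the tokenizer).
PREFILL_LENGTHS = (64, 128, 256, 512)

def pick_prefill(prompt_tokens, pad_id=0):
    n = len(prompt_tokens)
    for length in PREFILL_LENGTHS:
        if n <= length:
            padded = list(prompt_tokens) + [pad_id] * (length - n)
            return f"prefill_{length}", padded
    raise ValueError(f"Prompt of {n} tokens exceeds the longest prefill signature")
```

For example, a 70-token prompt selects `prefill_128` and is padded out to 128 tokens.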
### Tokenizer

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("g-ntovas/Qwen3.5-0.8B-LiteRT")
```
## Conversion Details

- Source: Qwen/Qwen3.5-0.8B (multimodal model)
- Method: Custom re-authoring using the LiteRT Generative API
- Quantization: Dynamic INT8 (`dynamic_int8`)
- Export: Per-signature tracing with fixed prefill lengths and patch counts
- Vision: Encoder and adapter exported as separate TFLite models, bundled into `.litertlm`
## Limitations
- Video input is not yet supported (the encoder architecture supports it, but the data processor returns `UNIMPLEMENTED` for video)
- Prompts are padded to the nearest prefill signature length, which may introduce minor quality differences for the linear attention layers
- The recurrent GatedDeltaNet implementation may produce slightly different outputs compared to the chunk-based HuggingFace implementation due to floating-point operation ordering
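To illustrate why operation ordering matters, here is a minimal sketch of a plain delta-rule linear-attention recurrence (illustrative math only — the actual GatedDeltaNet kernel adds gating and decay on top of this update):

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One recurrent delta-rule update of the (d_k, d_v) state matrix S."""
    v_pred = S.T @ k                        # current prediction for key k
    S = S + beta * np.outer(k, v - v_pred)  # correct the state toward v
    return S, S.T @ k                       # new state and readout

# With a unit-norm key and beta=1, a single step makes the readout exactly v.
S = np.zeros((4, 4))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.5, -1.0, 2.0, 0.25])
S, out = delta_rule_step(S, k, v, beta=1.0)
```

A chunk-based implementation computes the same recurrence in blocked matrix form, so the two paths accumulate floating-point rounding in a different order and drift slightly apart over long sequences.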
## License
This model inherits the Apache 2.0 license from the original Qwen/Qwen3.5-0.8B model.
## Citation

If you use this model, please cite the original Qwen3.5 paper:

```bibtex
@misc{qwen3.5,
  title={Qwen3.5 Technical Report},
  author={Qwen Team},
  year={2026},
  url={https://huggingface.co/Qwen/Qwen3.5-0.8B}
}
```