# Qwen3.5-0.8B LiteRT (Multimodal)

This repository contains a LiteRT (formerly TFLite) conversion of Qwen/Qwen3.5-0.8B for on-device inference, packaged in the LiteRT-LM `.litertlm` format. It includes the full multimodal pipeline: language model, vision encoder, and vision adapter for image understanding.
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-0.8B |
| Architecture | Hybrid attention (GatedDeltaNet + Full Attention) + ViT vision encoder |
| Parameters | 752M (language) + 675M (vision encoder) + 10M (vision adapter) |
| Quantization | Dynamic INT8 |
| KV Cache Length | 2048 |
| Prefill Signatures | 64, 128, 256, 512 |
| Vision Signatures | 256, 576, 1024, 2304 patches |
| Format | `.litertlm` (LiteRT-LM container) |
## Architecture

### Language Model
Qwen3.5-0.8B uses a hybrid attention architecture that combines:
- 18 GatedDeltaNet layers (linear attention with recurrent delta rule) at positions 0-2, 4-6, 8-10, 12-14, 16-18, 20-22
- 6 Full Attention layers (standard multi-head attention with output gating and partial RoPE) at positions 3, 7, 11, 15, 19, 23
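Assuming the layer positions listed above are fixed, the interleaving pattern (one full-attention layer after every three GatedDeltaNet layers) can be reproduced with a one-liner; `layer_types` is an illustrative name, not a config key from the model:

```python
# Reconstruct the 24-layer attention pattern described above:
# every 4th layer (0-indexed positions 3, 7, 11, 15, 19, 23) is full attention,
# the rest are GatedDeltaNet (linear attention).
layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
    for i in range(24)
]
```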
### Vision Encoder
The vision encoder is a 27-layer Vision Transformer (ViT):
- Patch embedding: Conv3d (3→1152, kernel=[2,16,16]) with learned position embeddings (bilinear interpolation from 48×48 grid)
- 27 VisionBlocks: LayerNorm → Self-Attention (16 heads, head_dim=72, 2D rotary pos emb) → MLP (1152→4304→1152, GELU)
- Patch merger (vision adapter): Groups 4 adjacent patches (spatial_merge_size=2) and projects to language model dimension (4608→1024)
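As a sketch of the patch-merger shape arithmetic (the random projection below stands in for the real adapter weights, and all variable names are illustrative): each 2×2 group of 1152-dim patch features is concatenated into 4608 dims and projected to the 1024-dim language-model space.

```python
import numpy as np

# Illustrative patch merge: spatial_merge_size=2 groups 2x2 adjacent patches.
h = w = 16                      # patch grid for a 256x256 image (256 patches)
vision_dim, lm_dim = 1152, 1024
patches = np.random.randn(h, w, vision_dim).astype(np.float32)

# Concatenate each 2x2 block -> (h/2, w/2, 4*vision_dim) = (8, 8, 4608)
merged = patches.reshape(h // 2, 2, w // 2, 2, vision_dim)
merged = merged.transpose(0, 2, 1, 3, 4).reshape(h // 2, w // 2, 4 * vision_dim)

# Project to the language-model dimension (random weights, for shape only)
proj = np.random.randn(4 * vision_dim, lm_dim).astype(np.float32)
tokens = merged.reshape(-1, 4 * vision_dim) @ proj   # (64, 1024)
```

A 256-patch image thus becomes 64 language-model tokens, matching the `adapt_64` signature below.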
The model was re-authored from scratch using the LiteRT Generative API. The vision encoder and adapter are exported as separate TFLite models bundled alongside the language model.
## Files
| File | Size | Description |
|---|---|---|
| `qwen35_mm_q8_ekv2048.litertlm` | ~1.2 GB | LiteRT-LM bundle (LM + vision encoder + vision adapter + tokenizer) |
| `qwen35_mm_q8_ekv2048.tflite` | ~757 MB | Language model TFLite |
| `qwen35_vision_encoder_q8.tflite` | ~88 MB | Vision encoder TFLite |
| `qwen35_vision_adapter_q8.tflite` | ~12 MB | Vision adapter TFLite |
| `qwen35_embedder_q8.tflite` | ~245 MB | Text embedder TFLite |
| `tokenizer.json` | ~11 MB | HuggingFace tokenizer |
| `tokenizer_config.json` | ~2 KB | Tokenizer configuration |
## Signatures

### Language Model
| Signature | Input Length | Outputs |
|---|---|---|
| `prefill_64` | 64 tokens | Updated KV cache |
| `prefill_128` | 128 tokens | Updated KV cache |
| `prefill_256` | 256 tokens | Updated KV cache |
| `prefill_512` | 512 tokens | Updated KV cache |
| `decode` | 1 token | Logits + updated KV cache |
### Vision Encoder

| Signature | Patches | Approx. Image Size |
|---|---|---|
| `encode_256` | 256 | 256×256 |
| `encode_576` | 576 | 384×384 |
| `encode_1024` | 1024 | 512×512 |
| `encode_2304` | 2304 | 768×768 |
### Vision Adapter

| Signature | Merged Tokens | From Patches |
|---|---|---|
| `adapt_64` | 64 | 256 |
| `adapt_144` | 144 | 576 |
| `adapt_256` | 256 | 1024 |
| `adapt_576` | 576 | 2304 |
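The two vision tables are linked by simple arithmetic (assuming 16×16 spatial patches, per the Conv3d kernel, and `spatial_merge_size=2`): an H×W image yields (H/16)·(W/16) patches, and the adapter merges each 2×2 group into one token.

```python
# Patch and merged-token counts for each supported image size,
# assuming 16x16 spatial patches and a 2x2 merge (spatial_merge_size=2).
PATCH = 16
MERGE = 2

for side in (256, 384, 512, 768):
    patches = (side // PATCH) ** 2
    merged = patches // (MERGE * MERGE)
    print(f"{side}x{side}: {patches} patches -> {merged} merged tokens")
```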
## Usage

### Python (ai-edge-litert)
```python
import numpy as np
from ai_edge_litert import interpreter as tfl_interpreter

# Load the language model
interp = tfl_interpreter.Interpreter(model_path="qwen35_mm_q8_ekv2048.tflite")
interp.allocate_tensors()

# Initialize KV cache (24 layers, mixed shapes)
kv_cache = {}  # See inference_tflite.py for full initialization

# Prefill: process the prompt in one fixed-length pass
prefill_runner = interp.get_signature_runner("prefill_64")
tokens = np.array([[...]], dtype=np.int32)  # Prompt, padded to 64
input_pos = np.arange(64, dtype=np.int32)
output = prefill_runner(tokens=tokens, input_pos=input_pos, **kv_cache)

# Decode loop: one token at a time
decode_runner = interp.get_signature_runner("decode")
next_token = np.array([[first_token]], dtype=np.int32)  # First token after prefill
pos = prompt_len  # Position of the next token to decode
for step in range(max_tokens):
    output = decode_runner(
        tokens=next_token,
        input_pos=np.array([pos], dtype=np.int32),
        **kv_cache,
    )
    next_token = np.array([[np.argmax(output["logits"][0, -1])]], dtype=np.int32)
    pos += 1
```
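Since prompts must be padded to a fixed prefill length (see Limitations), a small helper can pick the smallest signature that fits. This is a sketch: the helper name and the pad id of 0 are assumptions, not part of the exported model.

```python
# Choose the smallest prefill signature >= prompt length and pad the
# remainder (pad id 0 is an assumption, not read from the tokenizer).
PREFILL_LENGTHS = (64, 128, 256, 512)

def pick_prefill(prompt_tokens, pad_id=0):
    n = len(prompt_tokens)
    for length in PREFILL_LENGTHS:
        if n <= length:
            padded = list(prompt_tokens) + [pad_id] * (length - n)
            return f"prefill_{length}", padded
    raise ValueError(f"Prompt of {n} tokens exceeds the longest prefill signature")
```

For example, a 70-token prompt selects `prefill_128` and is padded out to 128 tokens.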
### Tokenizer

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("g-ntovas/Qwen3.5-0.8B-LiteRT")
```
## Conversion Details

- Source: Qwen/Qwen3.5-0.8B (multimodal model)
- Method: Custom re-authoring using the LiteRT Generative API
- Quantization: Dynamic INT8 (`dynamic_int8`)
- Export: Per-signature tracing with fixed prefill lengths and patch counts
- Vision: Encoder and adapter exported as separate TFLite models, bundled into `.litertlm`
## Limitations
- Video input is not yet supported (the encoder architecture supports it, but the data processor returns `UNIMPLEMENTED` for video)
- Prompts are padded to the nearest prefill signature length, which may introduce minor quality differences for the linear attention layers
- The recurrent GatedDeltaNet implementation may produce slightly different outputs compared to the chunk-based HuggingFace implementation due to floating-point operation ordering
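To illustrate why operation ordering matters, here is a minimal sketch of a plain delta-rule linear-attention recurrence (illustrative math only — the actual GatedDeltaNet kernel adds gating and decay on top of this update):

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One recurrent delta-rule update of the (d_k, d_v) state matrix S."""
    v_pred = S.T @ k                        # current prediction for key k
    S = S + beta * np.outer(k, v - v_pred)  # correct the state toward v
    return S, S.T @ k                       # new state and readout

# With a unit-norm key and beta=1, a single step makes the readout exactly v.
S = np.zeros((4, 4))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.5, -1.0, 2.0, 0.25])
S, out = delta_rule_step(S, k, v, beta=1.0)
```

A chunk-based implementation computes the same recurrence in blocked matrix form, so the two paths accumulate floating-point rounding in a different order and drift slightly apart over long sequences.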
## License
This model inherits the Apache 2.0 license from the original Qwen/Qwen3.5-0.8B model.
## Citation

If you use this model, please cite the original Qwen3.5 paper:

```bibtex
@misc{qwen3.5,
  title={Qwen3.5 Technical Report},
  author={Qwen Team},
  year={2026},
  url={https://huggingface.co/Qwen/Qwen3.5-0.8B}
}
```