Qwen3.5-0.8B LiteRT (Multimodal)

This repository contains a LiteRT (formerly TFLite) conversion of Qwen/Qwen3.5-0.8B for on-device inference, packaged in the LiteRT-LM `.litertlm` format. It includes the full multimodal pipeline: language model, vision encoder, and vision adapter for image understanding.

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-0.8B |
| Architecture | Hybrid attention (GatedDeltaNet + Full Attention) + ViT vision encoder |
| Parameters | 752M (language) + 675M (vision encoder) + 10M (vision adapter) |
| Quantization | Dynamic INT8 |
| KV Cache Length | 2048 |
| Prefill Signatures | 64, 128, 256, 512 |
| Vision Signatures | 256, 576, 1024, 2304 patches |
| Format | `.litertlm` (LiteRT-LM container) |

Architecture

Language Model

Qwen3.5-0.8B uses a hybrid attention architecture that combines:

  • 18 GatedDeltaNet layers (linear attention with recurrent delta rule) at positions 0-2, 4-6, 8-10, 12-14, 16-18, 20-22
  • 6 Full Attention layers (standard multi-head attention with output gating and partial RoPE) at positions 3, 7, 11, 15, 19, 23
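The interleaving above follows a simple index rule: every fourth layer is full attention. A minimal sketch of that rule (inferred from the positions listed above, not taken from the conversion code):

```python
def layer_type(idx: int) -> str:
    """Return the attention type for layer `idx` (0-23) of Qwen3.5-0.8B."""
    # Indices 3, 7, 11, 15, 19, 23 use full attention; all others use
    # the GatedDeltaNet linear-attention recurrence.
    return "full_attention" if idx % 4 == 3 else "gated_deltanet"

layout = [layer_type(i) for i in range(24)]
```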

Vision Encoder

The vision encoder is a 27-layer Vision Transformer (ViT):

  • Patch embedding: Conv3d (3→1152, kernel=[2,16,16]) with learned position embeddings (bilinear interpolation from 48×48 grid)
  • 27 VisionBlocks: LayerNorm → Self-Attention (16 heads, head_dim=72, 2D rotary pos emb) → MLP (1152→4304→1152, GELU)
  • Patch merger (vision adapter): Groups 4 adjacent patches (spatial_merge_size=2) and projects to language model dimension (4608→1024)
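The patch arithmetic implied by these components can be sketched as follows (a sketch using the 16-pixel patch size and spatial_merge_size=2 stated above; the helper name is illustrative):

```python
PATCH_SIZE = 16
SPATIAL_MERGE_SIZE = 2

def vision_token_counts(height: int, width: int) -> tuple[int, int]:
    """Return (patches, merged_tokens) for an image of the given size."""
    # The ViT splits the image into 16x16 patches...
    patches = (height // PATCH_SIZE) * (width // PATCH_SIZE)
    # ...and the patch merger groups each 2x2 neighbourhood into one
    # token in the language model's embedding space.
    merged = patches // (SPATIAL_MERGE_SIZE ** 2)
    return patches, merged
```

For example, a 768×768 image yields 2304 patches and 576 merged tokens, matching the largest vision signature below.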

The model was re-authored from scratch using the LiteRT Generative API. The vision encoder and adapter are exported as separate TFLite models bundled alongside the language model.

Files

| File | Size | Description |
|---|---|---|
| qwen35_mm_q8_ekv2048.litertlm | ~1.2 GB | LiteRT-LM bundle (LM + vision encoder + vision adapter + tokenizer) |
| qwen35_mm_q8_ekv2048.tflite | ~757 MB | Language model TFLite |
| qwen35_vision_encoder_q8.tflite | ~88 MB | Vision encoder TFLite |
| qwen35_vision_adapter_q8.tflite | ~12 MB | Vision adapter TFLite |
| qwen35_embedder_q8.tflite | ~245 MB | Text embedder TFLite |
| tokenizer.json | ~11 MB | HuggingFace tokenizer |
| tokenizer_config.json | ~2 KB | Tokenizer configuration |

Signatures

Language Model

| Signature | Input Length | Outputs |
|---|---|---|
| prefill_64 | 64 tokens | Updated KV cache |
| prefill_128 | 128 tokens | Updated KV cache |
| prefill_256 | 256 tokens | Updated KV cache |
| prefill_512 | 512 tokens | Updated KV cache |
| decode | 1 token | Logits + updated KV cache |
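Since prefill lengths are fixed at export time, a caller picks the smallest signature that fits the prompt and pads up to it. A minimal sketch of that selection (the bucket sizes come from the table above; the selection logic itself is an assumption, not this repo's runtime code):

```python
PREFILL_LENGTHS = (64, 128, 256, 512)

def pick_prefill_signature(num_tokens: int) -> str:
    """Pick the smallest prefill signature that fits `num_tokens`."""
    for length in PREFILL_LENGTHS:
        if num_tokens <= length:
            return f"prefill_{length}"
    raise ValueError("prompt longer than the largest prefill signature")
```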

Vision Encoder

| Signature | Patches | Approx. Image Size |
|---|---|---|
| encode_256 | 256 | 256×256 |
| encode_576 | 576 | 384×384 |
| encode_1024 | 1024 | 512×512 |
| encode_2304 | 2304 | 768×768 |

Vision Adapter

| Signature | Merged Tokens | From Patches |
|---|---|---|
| adapt_64 | 64 | 256 |
| adapt_144 | 144 | 576 |
| adapt_256 | 256 | 1024 |
| adapt_576 | 576 | 2304 |
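The two tables pair up directly: each encoder signature feeds the adapter signature with one quarter as many tokens (because of the 2×2 patch merge). A sketch of choosing the pair for a given patch count (an assumption about dispatch, not this repo's runtime code):

```python
VISION_PATCH_BUCKETS = (256, 576, 1024, 2304)

def vision_signatures(patches: int) -> tuple[str, str]:
    """Pick the smallest encoder/adapter signature pair that fits."""
    for bucket in VISION_PATCH_BUCKETS:
        if patches <= bucket:
            # The adapter always receives patches // 4 merged tokens.
            return f"encode_{bucket}", f"adapt_{bucket // 4}"
    raise ValueError("image produces more patches than any signature")
```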

Usage

Python (ai-edge-litert)

```python
import numpy as np
from ai_edge_litert import interpreter as tfl_interpreter

# Load the language model
interp = tfl_interpreter.Interpreter(model_path="qwen35_mm_q8_ekv2048.tflite")
interp.allocate_tensors()

# Initialize KV cache (24 layers, mixed shapes)
kv_cache = {}  # See inference_tflite.py for full initialization

# Prefill
prefill_runner = interp.get_signature_runner("prefill_64")
tokens = np.array([[...]], dtype=np.int32)  # Prompt IDs, padded to 64
input_pos = np.arange(64, dtype=np.int32)
output = prefill_runner(tokens=tokens, input_pos=input_pos, **kv_cache)
# Copy the updated KV cache entries from `output` back into `kv_cache`

# Decode loop (greedy)
decode_runner = interp.get_signature_runner("decode")
next_token = ...  # Start from the last prompt token
pos = ...         # Its position in the sequence
for step in range(max_tokens):
    output = decode_runner(tokens=next_token, input_pos=pos, **kv_cache)
    next_token = np.argmax(output["logits"][0, -1])
    pos += 1
```

Tokenizer

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("g-ntovas/Qwen3.5-0.8B-LiteRT")
```

Conversion Details

  • Source: Qwen/Qwen3.5-0.8B (multimodal model)
  • Method: Custom re-authoring using LiteRT Generative API
  • Quantization: Dynamic INT8 (dynamic_int8)
  • Export: Per-signature tracing with fixed prefill lengths and patch counts
  • Vision: Encoder and adapter exported as separate TFLite models, bundled into .litertlm

Limitations

  • Video input is not yet supported (encoder architecture supports it, but the data processor returns UNIMPLEMENTED for video)
  • Prompts are padded to the nearest prefill signature length, which may introduce minor quality differences for the linear attention layers
  • The recurrent GatedDeltaNet implementation may produce slightly different outputs compared to the chunk-based HuggingFace implementation due to floating-point operation ordering
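For intuition on the last point, the recurrent form processes one token at a time with a gated delta-rule state update. A minimal sketch of the *general* gated delta rule (the standard formulation, not the exact Qwen3.5 parameterization; head dimensions and gate computation are omitted):

```python
import numpy as np

def gated_deltanet_step(S, k, v, beta, alpha):
    """One recurrent gated delta-rule step on state S of shape (d_k, d_v).

    k: key (d_k,), v: value (d_v,), beta: write strength, alpha: decay gate.
    """
    # Erase the old value associated with k, decay the state, then write v.
    S = alpha * (S - beta * np.outer(k, S.T @ k)) + beta * np.outer(k, v)
    return S
```

Chunk-based implementations (like the HuggingFace one) compute the same recurrence in parallel blocks, so summation order differs and small floating-point discrepancies accumulate.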

License

This model inherits the Apache 2.0 license from the original Qwen/Qwen3.5-0.8B model.

Citation

If you use this model, please cite the original Qwen3.5 paper:

```bibtex
@misc{qwen3.5,
  title={Qwen3.5 Technical Report},
  author={Qwen Team},
  year={2026},
  url={https://huggingface.co/Qwen/Qwen3.5-0.8B}
}
```