GLM-OCR ONNX (int8) for Browser WebGPU

Browser-ready ONNX export of zai-org/GLM-OCR (0.9B params). Runs entirely client-side via onnxruntime-web with the WebGPU execution provider; no server needed.

Components

Base Models

| File | Size | Description |
|---|---|---|
| `vision_encoder_int8.onnx` | ~394 MB | CogViT vision encoder (int8) |
| `language_model_int8.onnx` | ~471 MB | GLM-0.5B decoder with 3D spatial RoPE (int8) |
| `text_embeddings.onnx` | ~348 MB | Token embedding layer |
| `tokenizer.json` | ~7 MB | Tokenizer |

KV Cache Models (fast autoregressive decoding)

| File | Size | Description |
|---|---|---|
| `kv/prefill_int8.onnx` | ~471 MB | Full sequence prefill -> logits + KV cache |
| `kv/decode_int8.onnx` | ~471 MB | Single token + KV cache -> logits + updated cache |

Performance

| Mode | Speed | Time for 100 tokens |
|---|---|---|
| Without KV cache | ~0.3 tok/s | ~5 min |
| With KV cache | ~20 tok/s | ~7 s |
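The speedup comes from reusing the KV cache: the prefill model runs once over the full prompt, then each decode step feeds only the newly sampled token plus the cached keys/values. A minimal sketch of that loop, where `runPrefill` and `runDecode` are hypothetical wrappers around `ort.InferenceSession.run` calls on the two KV models (not APIs shipped with this export):

```javascript
// Greedy argmax over a flat logits array.
function argmax(logits) {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

// Autoregressive generation with a KV cache (sketch).
// runPrefill(promptIds) -> { logits, kvCache }   (hypothetical wrapper)
// runDecode(tokenId, kvCache) -> { logits, kvCache }
async function generate(promptIds, runPrefill, runDecode, maxNewTokens, eosId) {
  const out = [];
  let { logits, kvCache } = await runPrefill(promptIds); // full prompt, once
  let next = argmax(logits);
  for (let t = 0; t < maxNewTokens; t++) {
    if (next === eosId) break;
    out.push(next);
    // Each step processes a single token; the cache carries the past.
    ({ logits, kvCache } = await runDecode(next, kvCache));
    next = argmax(logits);
  }
  return out;
}
```

Without the cache, every step would re-run attention over the entire sequence, which is what makes the no-cache path roughly 60x slower.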

3D Spatial Position IDs

The language model accepts spatial position_ids of shape [4, batch, seq_len], one channel per positional axis, giving the decoder full spatial awareness:

  • Channel 0: temporal (0 for images)
  • Channel 1: sequential position
  • Channel 2: row position
  • Channel 3: column position
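As an illustration, the four channels can be filled like this for a prompt of text tokens followed by a grid of vision patches. The exact layout convention (how row/column indices are offset against text positions) is an assumption for the sketch, not taken from the export:

```javascript
// Build position ids [4, seqLen] for `textLen` text tokens followed by an
// image of `rows` x `cols` vision patches. Layout convention is assumed.
function buildPositionIds(textLen, rows, cols) {
  const seqLen = textLen + rows * cols;
  const temporal = new Array(seqLen).fill(0); // channel 0: 0 for images
  const seq = [];                             // channel 1: sequential position
  const row = [];                             // channel 2: row position
  const col = [];                             // channel 3: column position
  for (let i = 0; i < textLen; i++) {
    seq.push(i); row.push(i); col.push(i);    // text: all channels advance together
  }
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      seq.push(textLen + r * cols + c);
      row.push(textLen + r);                  // same row for a whole patch row
      col.push(textLen + c);                  // column resets each row
    }
  }
  return [temporal, seq, row, col]; // flatten + add batch dim before ort.Tensor
}
```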

Export Details

  • Base model: zai-org/GLM-OCR (0.9B params)
  • Quantization: int8 dynamic (onnxruntime)
  • Vision encoder: TorchScript exporter, opset 14
  • Language model: Dynamo exporter, opset 18
  • KV cache: Packed tensor [num_layers*2, batch, kv_heads, seq, head_dim]
  • 3D RoPE: Preserved via explicit position_ids input
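With the packed layout [num_layers*2, batch, kv_heads, seq, head_dim], each layer's keys and values are adjacent slices along the first axis. A small index-math helper, assuming keys at slice 2*L and values at 2*L+1 (that ordering is an assumption for illustration):

```javascript
// Flat-buffer offsets into the packed KV-cache tensor
// [numLayers*2, batch, kvHeads, seq, headDim] for a given layer.
function kvSliceOffsets(numLayers, batch, kvHeads, seq, headDim, layer) {
  const sliceElems = batch * kvHeads * seq * headDim; // elements per slice
  return {
    keyOffset: (2 * layer) * sliceElems,       // keys at even slices (assumed)
    valueOffset: (2 * layer + 1) * sliceElems, // values at odd slices (assumed)
    sliceElems,
  };
}
```

Packing all layers into one tensor keeps the ONNX I/O surface to a single cache input/output instead of 2 x num_layers named tensors.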

License

Apache 2.0 (same as base model)
