brad-agi/glm-ocr-onnx-webgpu

@@ -1,47 +1,72 @@
 # GLM-OCR ONNX (int8) for Browser WebGPU
 Browser-ready ONNX export of [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR) (0.9B params).
 ## Components
 | File | Size | Description |
 |---|---|---|
-| `vision_encoder_int8.onnx` | ~394 MB | CogViT vision encoder (int8 quantized) |
 | `language_model_int8.onnx` | ~471 MB | GLM-0.5B decoder with 3D spatial RoPE (int8) |
 | `text_embeddings.onnx` | ~348 MB | Token embedding layer |
 | `tokenizer.json` | ~7 MB | Tokenizer |
-## Usage with onnxruntime-web
-```javascript
-import * as ort from 'onnxruntime-web/webgpu';
-// Load vision encoder
-const visionSession = await ort.InferenceSession.create('vision_encoder_int8.onnx');
-// Preprocess image to patches: (num_patches, 1176) where 1176 = 3*2*14*14
-const pixelValues = preprocessImage(imageData, 336);
-const gridThw = new ort.Tensor('int64', [1n, 24n, 24n], [1, 3]);
-// Run vision encoder
-const visionOutput = await visionSession.run({
-  pixel_values: pixelValues,
-  grid_thw: gridThw
-});
-```
-## 3D Position IDs (for full spatial quality)
-The language model accepts 3D position_ids with shape `[4, batch, seq_len]`:
 - Channel 0: temporal (0 for images)
 - Channel 1: sequential position
-- Channel 2: row position (from vision grid)
-- Channel 3: column position (from vision grid)
 ## Export Details
-- Quantization: int8 dynamic (onnxruntime quantize_dynamic)
-- Vision encoder: TorchScript exporter, opset 14
-- Language model: Dynamo exporter, opset 18
-- Causal masking: disabled (not needed for autoregressive generation)
-- 3D RoPE: preserved via explicit position_ids input

+---
+base_model: zai-org/GLM-OCR
+library_name: onnxruntime
+tags:
+  - onnx
+  - webgpu
+  - browser
+  - ocr
+  - vision
+  - quantized
+  - int8
+license: apache-2.0
+language:
+  - en
+  - zh
+  - ja
+  - ko
+  - fr
+  - de
+  - es
+  - ru
+pipeline_tag: image-text-to-text
+---
 # GLM-OCR ONNX (int8) for Browser WebGPU
 Browser-ready ONNX export of [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR) (0.9B params).
+Runs entirely client-side via onnxruntime-web with WebGPU. No server needed.
 ## Components
+### Base Models
 | File | Size | Description |
 |---|---|---|
+| `vision_encoder_int8.onnx` | ~394 MB | CogViT vision encoder (int8) |
 | `language_model_int8.onnx` | ~471 MB | GLM-0.5B decoder with 3D spatial RoPE (int8) |
 | `text_embeddings.onnx` | ~348 MB | Token embedding layer |
 | `tokenizer.json` | ~7 MB | Tokenizer |
+### KV Cache Models (fast autoregressive decoding)
+| File | Size | Description |
+|---|---|---|
+| `kv/prefill_int8.onnx` | ~471 MB | Full sequence prefill -> logits + KV cache |
+| `kv/decode_int8.onnx` | ~471 MB | Single token + KV cache -> logits + updated cache |
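The two KV cache models suggest a two-phase loop: run prefill once over the whole prompt, then feed one token at a time to the decode model together with the cache it returned. The sketch below shows only that control flow with greedy (argmax) sampling. `prefill` and `decodeStep` are hypothetical async wrappers you would write around the two ONNX sessions; the actual input and output names of the models are not stated here, so they are deliberately left out.

```javascript
// Greedy decoding sketch: one prefill pass, then single-token decode steps.
// ASSUMPTION: `prefill(promptIds)` and `decodeStep(tokenId, cache)` are
// hypothetical wrappers that each resolve to { logits, cache }, where
// `logits` holds the vocabulary scores for the last position only.

function argmax(logits) {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

async function generate(prefill, decodeStep, promptIds, eosId, maxNewTokens) {
  // Phase 1: process the full prompt once and obtain the initial cache.
  let { logits, cache } = await prefill(promptIds);
  const output = [];
  for (let step = 0; step < maxNewTokens; step++) {
    const next = argmax(logits);
    if (next === eosId) break;
    output.push(next);
    // Phase 2: feed only the new token plus the cache; skip the call
    // when the token budget is already used up.
    if (step + 1 < maxNewTokens) {
      ({ logits, cache } = await decodeStep(next, cache));
    }
  }
  return output;
}
```

Passing the steps in as functions keeps the loop independent of how the sessions are created, so it can be exercised with stubs before any model is downloaded.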
+## Performance
+| Mode | Speed | Time for 100 tokens |
+|---|---|---|
+| Without KV cache | ~0.3 tok/s | ~5 min |
+| **With KV cache** | **~20 tok/s** | **~7 sec** |
+## 3D Spatial Position IDs
+The language model accepts 3D position_ids of shape `[4, batch, seq_len]` for full spatial awareness:
 - Channel 0: temporal (0 for images)
 - Channel 1: sequential position
+- Channel 2: row position
+- Channel 3: column position
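As a rough illustration of the channel layout above, the following sketch fills a flat `[4, 1, seqLen]` buffer for a run of image tokens. It assumes batch size 1, that image tokens are ordered row-major over a `gridH` x `gridW` grid, and that the grid passed in is already the post-merge token grid; none of these are confirmed here. `buildImagePositionIds` is a hypothetical helper, and text tokens are not covered.

```javascript
// Sketch: position_ids for image tokens, laid out as [4, 1, seqLen].
// ASSUMPTIONS: batch size 1, row-major token order, and gridH/gridW
// describing the grid of tokens the language model actually sees.
function buildImagePositionIds(gridH, gridW, startPos = 0) {
  const seqLen = gridH * gridW;
  // BigInt64Array is zero-initialised, so channel 0 (temporal) stays 0.
  const data = new BigInt64Array(4 * seqLen);
  for (let i = 0; i < seqLen; i++) {
    data[1 * seqLen + i] = BigInt(startPos + i);           // channel 1: sequential
    data[2 * seqLen + i] = BigInt(Math.floor(i / gridW));  // channel 2: row
    data[3 * seqLen + i] = BigInt(i % gridW);              // channel 3: column
  }
  return { data, dims: [4, 1, seqLen] };
}
```

With onnxruntime-web, the result could be wrapped as `new ort.Tensor('int64', data, dims)` before being passed to the language model.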
 ## Export Details
+- **Base model**: [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR) (0.9B params)
+- **Quantization**: int8 dynamic (onnxruntime)
+- **Vision encoder**: TorchScript exporter, opset 14
+- **Language model**: Dynamo exporter, opset 18
+- **KV cache**: Packed tensor `[num_layers*2, batch, kv_heads, seq, head_dim]`
+- **3D RoPE**: Preserved via explicit position_ids input
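To make the packed KV cache shape concrete, here is a small sketch that locates one layer's key or value block inside a flat row-major buffer and estimates the total cache size. The ordering along the first axis is an assumption (key at index `2*layer`, value at `2*layer + 1`); the export could equally store all keys first and then all values. The float32 element size is also an assumption, and `kvBlock` and `cacheBytes` are hypothetical helper names.

```javascript
// Sketch: indexing into a cache packed as
// [num_layers*2, batch, kv_heads, seq, head_dim], stored row-major.
// ASSUMPTION: keys and values interleave per layer along axis 0.
function kvBlock(dims, layer, isValue) {
  const [packed, batch, kvHeads, seq, headDim] = dims;
  if (layer < 0 || 2 * layer + 1 >= packed) {
    throw new RangeError('layer out of range');
  }
  const blockSize = batch * kvHeads * seq * headDim;
  const index = 2 * layer + (isValue ? 1 : 0);
  return { offset: index * blockSize, length: blockSize };
}

// Total cache size in bytes; 4 bytes per element assumes float32.
function cacheBytes(dims, bytesPerElement = 4) {
  return dims.reduce((a, b) => a * b, 1) * bytesPerElement;
}
```

Because `seq` grows by one on every decode step, `cacheBytes` gives a quick way to check whether a target sequence length fits in the memory available to the page.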
+## License
+Apache 2.0 (same as base model)