brad-agi committed
Commit 5ea0daf · verified · 1 Parent(s): 7c14ef8

Upload README.md with huggingface_hub

Files changed (1): README.md (+51, -26)
README.md CHANGED
@@ -1,47 +1,72 @@
  # GLM-OCR ONNX (int8) for Browser WebGPU
 
  Browser-ready ONNX export of [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR) (0.9B params).
 
  ## Components
 
  | File | Size | Description |
  |---|---|---|
- | `vision_encoder_int8.onnx` | ~394 MB | CogViT vision encoder (int8 quantized) |
  | `language_model_int8.onnx` | ~471 MB | GLM-0.5B decoder with 3D spatial RoPE (int8) |
  | `text_embeddings.onnx` | ~348 MB | Token embedding layer |
  | `tokenizer.json` | ~7 MB | Tokenizer |
 
- ## Usage with onnxruntime-web
-
- ```javascript
- import * as ort from 'onnxruntime-web/webgpu';
-
- // Load vision encoder
- const visionSession = await ort.InferenceSession.create('vision_encoder_int8.onnx');
-
- // Preprocess image to patches: (num_patches, 1176) where 1176 = 3*2*14*14
- const pixelValues = preprocessImage(imageData, 336);
- const gridThw = new ort.Tensor('int64', [1n, 24n, 24n], [1, 3]);
-
- // Run vision encoder
- const visionOutput = await visionSession.run({
-   pixel_values: pixelValues,
-   grid_thw: gridThw
- });
- ```
-
- ## 3D Position IDs (for full spatial quality)
-
- The language model accepts 3D position_ids with shape `[4, batch, seq_len]`:
  - Channel 0: temporal (0 for images)
  - Channel 1: sequential position
- - Channel 2: row position (from vision grid)
- - Channel 3: column position (from vision grid)
 
  ## Export Details
 
- - Quantization: int8 dynamic (onnxruntime quantize_dynamic)
- - Vision encoder: TorchScript exporter, opset 14
- - Language model: Dynamo exporter, opset 18
- - Causal masking: disabled (not needed for autoregressive generation)
- - 3D RoPE: preserved via explicit position_ids input
+ ---
+ base_model: zai-org/GLM-OCR
+ library_name: onnxruntime
+ tags:
+ - onnx
+ - webgpu
+ - browser
+ - ocr
+ - vision
+ - quantized
+ - int8
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ - ja
+ - ko
+ - fr
+ - de
+ - es
+ - ru
+ pipeline_tag: image-text-to-text
+ ---
+
  # GLM-OCR ONNX (int8) for Browser WebGPU
 
  Browser-ready ONNX export of [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR) (0.9B params).
+ Runs entirely client-side via onnxruntime-web with WebGPU. No server needed.
 
  ## Components
 
+ ### Base Models
  | File | Size | Description |
  |---|---|---|
+ | `vision_encoder_int8.onnx` | ~394 MB | CogViT vision encoder (int8) |
  | `language_model_int8.onnx` | ~471 MB | GLM-0.5B decoder with 3D spatial RoPE (int8) |
  | `text_embeddings.onnx` | ~348 MB | Token embedding layer |
  | `tokenizer.json` | ~7 MB | Tokenizer |
 
+ ### KV Cache Models (fast autoregressive decoding)
+ | File | Size | Description |
+ |---|---|---|
+ | `kv/prefill_int8.onnx` | ~471 MB | Full sequence prefill -> logits + KV cache |
+ | `kv/decode_int8.onnx` | ~471 MB | Single token + KV cache -> logits + updated cache |
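The prefill/decode split above enables the standard two-phase generation loop. A minimal sketch of that control flow, where `prefill` and `decode` stand in for `ort.InferenceSession.run` calls on the two models (their exact input/output names are assumptions to check against the exported graphs):

```javascript
// Two-phase generation: one prefill pass over the whole prompt, then one
// decode pass per new token, reusing the KV cache each step.
function argmax(logits) {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

async function generate(prefill, decode, promptTokens, maxNewTokens, eosId) {
  // Prefill: full prompt in -> logits for the last position + KV cache out.
  let { logits, kvCache } = await prefill(promptTokens);
  const out = [];
  for (let i = 0; i < maxNewTokens; i++) {
    const next = argmax(logits); // greedy decoding, for simplicity
    if (next === eosId) break;
    out.push(next);
    // Decode: single token + cache in -> logits + updated cache out.
    ({ logits, kvCache } = await decode(next, kvCache));
  }
  return out;
}
```

This is why decode is fast: each step attends over cached keys/values instead of re-running the full sequence through `language_model_int8.onnx`.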
 
 
+
+ ## Performance
+
+ | Mode | Speed | 100 tokens |
+ |---|---|---|
+ | Without KV cache | ~0.3 tok/s | ~5 min |
+ | **With KV cache** | **~20 tok/s** | **~7 sec** |
 
 
+
+ ## 3D Spatial Position IDs
+
+ The language model accepts 3D position_ids `[4, batch, seq_len]` for full spatial awareness:
  - Channel 0: temporal (0 for images)
  - Channel 1: sequential position
+ - Channel 2: row position
+ - Channel 3: column position
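As a sketch, the four channels can be filled like this for a single image of `gridH × gridW` patches (batch of 1; how text tokens interleave with image tokens is not covered here):

```javascript
// Build position_ids [4, seq_len] for one image of gridH x gridW patches
// (batch dimension omitted). Channel semantics follow the list above.
function buildImagePositionIds(gridH, gridW) {
  const seqLen = gridH * gridW;
  const ids = [[], [], [], []];
  for (let i = 0; i < seqLen; i++) {
    ids[0].push(0);                     // channel 0: temporal, 0 for images
    ids[1].push(i);                     // channel 1: sequential position
    ids[2].push(Math.floor(i / gridW)); // channel 2: row in the vision grid
    ids[3].push(i % gridW);             // channel 3: column in the vision grid
  }
  return ids;
}
```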
 
  ## Export Details
 
+ - **Base model**: [zai-org/GLM-OCR](https://huggingface.co/zai-org/GLM-OCR) (0.9B params)
+ - **Quantization**: int8 dynamic (onnxruntime)
+ - **Vision encoder**: TorchScript exporter, opset 14
+ - **Language model**: Dynamo exporter, opset 18
+ - **KV cache**: Packed tensor `[num_layers*2, batch, kv_heads, seq, head_dim]`
+ - **3D RoPE**: Preserved via explicit position_ids input
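Given the packed KV layout above, the flat element offsets of a layer's key and value planes can be computed directly. The even-index-K / odd-index-V interleaving below is an assumption to verify against the exported graph:

```javascript
// Flat element offsets of layer `layer`'s K and V planes inside the packed
// KV tensor [num_layers*2, batch, kv_heads, seq, head_dim].
// Assumes plane 2*layer holds keys and plane 2*layer + 1 holds values.
function kvPlaneOffsets({ batch, kvHeads, seq, headDim }, layer) {
  const planeSize = batch * kvHeads * seq * headDim; // elements per K or V plane
  return {
    kOffset: 2 * layer * planeSize,
    vOffset: (2 * layer + 1) * planeSize,
  };
}
```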
+
+ ## License
+
+ Apache 2.0 (same as base model)