# GLM-ASR-Nano-2512 (Core ML)
Core ML conversion of GLM-ASR-Nano-2512 for on-device automatic speech recognition on iOS 17+ (encoder) and iOS 18+ (decoder/prefill). The pipeline runs in three stages: audio encoder + projector, prefill (KV cache), and stateful decoder.
- Source: aoiandroid/GLM-ASR-Nano-2512 (2.26B parameters, GlmAsrForConditionalGeneration)
- Hub: aoiandroid/glm-asr-nano-2512-coreml
## Pipeline

    Audio (1, 160000) [16 kHz mono, up to ~10 s]
        |
        v  Stage A: encoder_with_projector (iOS 17+)
    projected_audio (1, 375, 2048)
        |
        v  Stage B1: prefill (iOS 18+, stateful)
    KV cache filled
        |
        v  Stage B2: decoder (iOS 18+, stateful)
    token-by-token decode until EOS
        |
        v  tokenizer.decode()
    Transcription text
## Contents

| Artifact | Description | Size | iOS |
|---|---|---|---|
| `glm_asr_nano_2512_encoder_with_projector.mlpackage` | Stage A: audio_tower + multi_modal_projector | ~1,330 MB | 17+ |
| `glm_asr_nano_2512_decoder.mlpackage` | Stage B2: stateful decoder (float16 KV cache) | ~2,828 MB | 18+ |
| `glm_asr_nano_2512_decoder_8bit.mlpackage` | Stage B2: 8-bit palettized decoder | ~1,414 MB | 18+ |
| `glm_asr_nano_2512_prefill.mlpackage` | Stage B1: prefill (hidden_states -> KV cache) | ~2,947 MB | 18+ |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer for decoding | - | - |
| `processor_config.json`, `config.json` | Processor and model config | - | - |
| `chat_template.jinja` | Chat template for transcription prompt | - | - |
## Model specs
- Decoder: Llama-based (28 layers, 4 KV heads / 16 Q heads, hidden 2048, head_dim 128)
- vocab_size: 59,264
- Max context: 1,024 (audio prefix 375 + text tokens)
- KV cache: stateful; shared between prefill and decoder on iOS 18+
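As a sanity check on these numbers, the per-state KV-cache footprint follows directly from the spec list. A pure-Python back-of-the-envelope, assuming float16 storage as stated for the decoder:

```python
# KV cache size implied by the specs above: 28 layers, 4 KV heads,
# head_dim 128, max context 1024, float16 (2 bytes), K and V per layer.
layers, kv_heads, head_dim, max_ctx = 28, 4, 128, 1024
bytes_per_elem = 2  # float16

kv_cache_bytes = layers * 2 * kv_heads * head_dim * max_ctx * bytes_per_elem
print(kv_cache_bytes, "bytes =", kv_cache_bytes / 2**20, "MiB")
# -> 58720256 bytes = 56.0 MiB
```

So the stateful KV cache itself is modest (~56 MiB); the multi-GB artifact sizes come from the weights, not the cache.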
## Usage on iOS / macOS

- Load Stage A: `glm_asr_nano_2512_encoder_with_projector.mlpackage` with Core ML. Input: `input_features` with shape `(1, 128, T)` (e.g. 16 kHz mono audio converted to mel features). Output: `projected_audio` with shape `(1, 375, 2048)` for 10 s of audio.
- Build the full `hidden_states`: embed(prefix) + `projected_audio` + embed(suffix), with total length 390 (prefix 3 + audio 375 + suffix 12). Use the same prompt as the source model (e.g. "Please transcribe this audio into text").
- Load prefill: `glm_asr_nano_2512_prefill.mlpackage`. Run it once with `hidden_states` to fill the stateful KV cache.
- Load decoder: `glm_asr_nano_2512_decoder.mlpackage` or `glm_asr_nano_2512_decoder_8bit.mlpackage`. Reuse the same state as prefill (shared state) and decode token by token from position 390 until EOS.
- Decode the token ids with the provided tokenizer to obtain the transcription.
For correct quality, the prefill must receive the full `hidden_states` (prefix + `projected_audio` + suffix); running it on `projected_audio` alone degrades the output.
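The steps above can be sketched end to end. The snippet below is a minimal, runnable outline with the three Core ML predictions replaced by stubs, so the control flow can be checked without the models: on macOS each `run_*` stub would wrap a coremltools `MLModel(...).predict(...)` call on the corresponding `.mlpackage`, and on device the equivalent Swift `MLModel` predictions sharing one `MLState`. All function names and the stub behavior are illustrative, not part of the release.

```python
PREFIX_LEN, AUDIO_LEN, SUFFIX_LEN = 3, 375, 12
PREFILL_LEN = PREFIX_LEN + AUDIO_LEN + SUFFIX_LEN  # 390, per the steps above
VOCAB, EOS_ID, MAX_NEW = 59264, 2, 16              # toy EOS id for the stub

def run_encoder(audio):
    """Stage A stub: (1, 160000) audio -> 375 projected frames."""
    return [[0.0]] * AUDio_LEN if False else [[0.0]] * AUDIO_LEN

def run_prefill(hidden_states, state):
    """Stage B1 stub: consume the FULL hidden_states, fill the KV cache."""
    state["cached_positions"] = len(hidden_states)

def run_decoder(last_token_id, position, state):
    """Stage B2 stub: one step -> logits over the vocab (always picks EOS here)."""
    logits = [0.0] * VOCAB
    logits[EOS_ID] = 1.0
    return logits

def embed(token_ids):
    """Embedding stub: one hidden-state row per token."""
    return [[0.0]] * len(token_ids)

def transcribe(audio, prefix_ids, suffix_ids):
    state = {}  # stands in for the shared Core ML state
    projected = run_encoder(audio)
    # Prefill must see prefix + audio + suffix -- never the audio alone.
    hidden = embed(prefix_ids) + projected + embed(suffix_ids)
    assert len(hidden) == PREFILL_LEN
    run_prefill(hidden, state)
    tokens, last = [], suffix_ids[-1]
    for step in range(MAX_NEW):  # greedy decode starting at position 390
        logits = run_decoder(last, PREFILL_LEN + step, state)
        last = max(range(VOCAB), key=logits.__getitem__)  # argmax
        if last == EOS_ID:
            break
        tokens.append(last)
    return tokens, state

tokens, state = transcribe([0.0] * 160000, [1, 2, 3], list(range(12)))
print(tokens, state["cached_positions"])  # -> [] 390 (stub emits EOS at once)
```

The important invariants the stubs preserve are the ones the card insists on: the prefill input is exactly 390 positions of prefix + audio + suffix, and the decoder resumes from that same state at position 390.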
## Limitations
- Prefill: Requires full hidden_states (text prefix/suffix + projected audio). RoPE positions assume total length 390 for 10 s audio.
- Fixed audio length: Stage A is traced for a fixed audio length (e.g. 10 s); pad or trim to match.
- Platform: The decoder and prefill use the Core ML stateful API, so the full pipeline requires iOS 18+ / macOS 15+. The encoder alone runs on iOS 17+.
- EOS IDs: Resolve from tokenizer/generation_config at runtime; do not hardcode.
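Two of these limitations are mechanical to handle in a preprocessing step: fixing the audio length for Stage A, and resolving EOS ids at runtime. A small sketch, assuming raw 16 kHz float samples and a `generation_config.json`-style document; the JSON literal below is a hypothetical stand-in, not the model's real ids:

```python
import json

TARGET_SAMPLES = 160000  # 10 s at 16 kHz, matching the Stage A trace

def fix_length(samples, target=TARGET_SAMPLES):
    """Pad with silence or trim so Stage A always sees exactly `target` samples."""
    if len(samples) >= target:
        return samples[:target]
    return samples + [0.0] * (target - len(samples))

def resolve_eos_ids(generation_config_json):
    """Read eos_token_id at runtime instead of hardcoding it.
    It may be a single int or a list of ints; normalize to a set."""
    cfg = json.loads(generation_config_json)
    eos = cfg["eos_token_id"]
    return set(eos) if isinstance(eos, list) else {eos}

print(len(fix_length([0.1] * 16000)))   # 1 s of audio padded -> 160000
# Hypothetical contents; read the real generation_config.json instead.
print(resolve_eos_ids('{"eos_token_id": [2, 7]}'))
```

Checking membership in the resolved set after each decode step then terminates generation correctly regardless of how many EOS ids the config declares.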
## License
Apache 2.0 (inherited from the source model).
## Citation

    @software{glm_asr_nano_2512,
      author = {Zhipu AI},
      title  = {GLM-ASR-Nano-2512},
      year   = {2025},
      url    = {https://github.com/zai-org/GLM-ASR}
    }
## References
- Source model: aoiandroid/GLM-ASR-Nano-2512
- GLM-ASR: zai-org/GLM-ASR