GLM-ASR-Nano-2512 (Core ML)

Core ML conversion of GLM-ASR-Nano-2512 for on-device automatic speech recognition on iOS 17+ (encoder) and iOS 18+ (decoder/prefill). The pipeline runs in three stages: audio encoder + projector, prefill (KV cache), and stateful decoder.

Pipeline

Audio (1, 160000)  [16 kHz mono, up to ~10 s]
  |
  v  Stage A: encoder_with_projector (iOS 17+)
projected_audio (1, 375, 2048)
  |
  v  Stage B1: prefill (iOS 18+, stateful)
KV cache filled
  |
  v  Stage B2: decoder (iOS 18+, stateful)
token-by-token decode until EOS
  |
  v  tokenizer.decode()
Transcription text
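The pipeline expects a fixed-length 16 kHz mono waveform of shape (1, 160000). A minimal pre-processing sketch (the function name is illustrative, not part of the released artifacts):

```python
import numpy as np

SAMPLE_RATE = 16_000
MAX_SAMPLES = 160_000  # 10 s at 16 kHz, matching the (1, 160000) input above

def pad_or_trim(audio: np.ndarray) -> np.ndarray:
    """Zero-pad or trim a 1-D waveform to exactly MAX_SAMPLES samples."""
    audio = audio[:MAX_SAMPLES]
    if audio.shape[0] < MAX_SAMPLES:
        audio = np.pad(audio, (0, MAX_SAMPLES - audio.shape[0]))
    return audio[np.newaxis, :]  # -> shape (1, 160000)
```

Shorter clips are padded with silence; longer clips are truncated, per the fixed-length limitation below.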

Contents

Artifact                                             Description                                     Size       iOS
glm_asr_nano_2512_encoder_with_projector.mlpackage   Stage A: audio_tower + multi_modal_projector    ~1,330 MB  17+
glm_asr_nano_2512_decoder.mlpackage                  Stage B2: stateful decoder (float16 KV cache)   ~2,828 MB  18+
glm_asr_nano_2512_decoder_8bit.mlpackage             Stage B2: 8-bit palettized decoder              ~1,414 MB  18+
glm_asr_nano_2512_prefill.mlpackage                  Stage B1: prefill (hidden_states -> KV cache)   ~2,947 MB  18+
tokenizer.json, tokenizer_config.json                Tokenizer for decoding                          -          -
processor_config.json, config.json                   Processor and model config                      -          -
chat_template.jinja                                  Chat template for transcription prompt          -          -

Model specs

  • Decoder: Llama-based (28 layers, 4 KV heads / 16 Q heads, hidden 2048, head_dim 128)
  • vocab_size: 59,264
  • Max context: 1,024 (audio prefix 375 + text tokens)
  • KV cache: stateful; shared between prefill and decoder on iOS 18+
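The specs above fix the KV-cache footprint. A back-of-the-envelope sketch (the layout is the standard layers × KV heads × context × head_dim arrangement; the exact tensor layout inside the mlpackage is an assumption):

```python
# Float16 KV-cache size implied by the decoder specs above.
LAYERS, KV_HEADS, HEAD_DIM, MAX_CTX = 28, 4, 128, 1024
BYTES_FP16 = 2

# K and V each hold LAYERS x KV_HEADS x MAX_CTX x HEAD_DIM fp16 values.
kv_bytes = 2 * LAYERS * KV_HEADS * MAX_CTX * HEAD_DIM * BYTES_FP16
print(kv_bytes / 2**20)  # 56.0 MiB
```

This is small relative to the weights, which is why the full 1,024-token context can stay resident in the stateful buffer.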

Usage on iOS / macOS

  1. Load Stage A: glm_asr_nano_2512_encoder_with_projector.mlpackage with Core ML. Input: input_features of shape (1, 128, T), a 128-bin mel spectrogram computed from 16 kHz mono audio. Output: projected_audio (1, 375, 2048) for 10 s of audio.
  2. Build full hidden_states: embed(prefix) + projected_audio + embed(suffix) with total length 390 (prefix 3 + 375 + suffix 12). Use the same prompt as the source model (e.g. "Please transcribe this audio into text").
  3. Load prefill: glm_asr_nano_2512_prefill.mlpackage. Run once with hidden_states to fill the KV cache (stateful).
  4. Load decoder: glm_asr_nano_2512_decoder.mlpackage or glm_asr_nano_2512_decoder_8bit.mlpackage. Use the same state as prefill (shared state). Decode token-by-token from position 390 until EOS.
  5. Decode token ids with the provided tokenizer to get the transcription.
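Step 4's token-by-token loop can be sketched as follows. Here `step` is an illustrative stand-in for the Core ML decoder call (one token id plus its position in, logits out); greedy argmax sampling is a simplification, and the names are not the model's actual API:

```python
import numpy as np

def greedy_decode(step, bos_id, eos_ids, start_pos=390, max_ctx=1024):
    """Feed one token at a time from `start_pos` until an EOS id appears.

    `step(token_id, position) -> logits` stands in for a stateful
    decoder prediction against the shared KV cache (hypothetical helper).
    """
    tokens = []
    token, pos = bos_id, start_pos
    while pos < max_ctx:
        logits = step(token, pos)
        token = int(np.argmax(logits))
        if token in eos_ids:
            break
        tokens.append(token)
        pos += 1
    return tokens
```

The loop starts at position 390 because the prefill already consumed the 390-token prompt (prefix 3 + audio 375 + suffix 12), and it stops at the 1,024-token context limit even if no EOS is produced.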

For correct output, the prefill must receive the full hidden_states (prefix + projected_audio + suffix); feeding it projected_audio alone degrades transcription quality.
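Assembling the full prefill input is a concatenation along the sequence axis. A sketch, where the prefix/suffix embeddings stand in for the output of the model's token-embedding lookup (illustrative names):

```python
import numpy as np

def build_hidden_states(prefix_emb, projected_audio, suffix_emb):
    """Concatenate (1, 3, 2048) + (1, 375, 2048) + (1, 12, 2048) -> (1, 390, 2048)."""
    hs = np.concatenate([prefix_emb, projected_audio, suffix_emb], axis=1)
    assert hs.shape[1] == 390, "prefill RoPE positions assume total length 390"
    return hs
```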

Limitations

  • Prefill: Requires full hidden_states (text prefix/suffix + projected audio). RoPE positions assume total length 390 for 10 s audio.
  • Fixed audio length: Stage A is traced for a fixed audio length (e.g. 10 s); pad or trim to match.
  • Platform: Decoder and prefill use the Core ML stateful API; iOS 18+ / macOS 15+ is required for the full pipeline. The encoder alone runs on iOS 17+.
  • EOS IDs: Resolve from tokenizer/generation_config at runtime; do not hardcode.
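Resolving EOS ids at runtime can be as simple as reading the generation config shipped with the source model. A hedged sketch (the `eos_token_id` field may be a single int or a list, as in typical Hugging Face generation_config.json files; field layout for this specific model is an assumption):

```python
import json

def resolve_eos_ids(gen_cfg: dict) -> set[int]:
    """Normalize `eos_token_id` (int or list of ints) into a set of ids."""
    eos = gen_cfg.get("eos_token_id")
    if eos is None:
        return set()
    return set(eos) if isinstance(eos, list) else {int(eos)}

# Typical use: resolve_eos_ids(json.load(open("generation_config.json")))
```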

License

Apache 2.0 (inherited from the source model).

Citation

@software{glm_asr_nano_2512,
  author = {Zhipu AI},
  title = {GLM-ASR-Nano-2512},
  year = {2025},
  url = {https://github.com/zai-org/GLM-ASR}
}
