GLM-ASR-Nano-2512 (Core ML)

Core ML conversion of GLM-ASR-Nano-2512 for on-device automatic speech recognition on iOS 17+ (encoder) and iOS 18+ (decoder/prefill). The pipeline runs in three stages: audio encoder + projector, prefill (KV cache), and stateful decoder.

Pipeline

Audio (1, 160000)  [16 kHz mono, up to ~10 s]
  |
  v  Stage A: encoder_with_projector (iOS 17+)
projected_audio (1, 375, 2048)
  |
  v  Stage B1: prefill (iOS 18+, stateful)
KV cache filled
  |
  v  Stage B2: decoder (iOS 18+, stateful)
token-by-token decode until EOS
  |
  v  tokenizer.decode()
Transcription text
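The pipeline expects a fixed-length 16 kHz mono waveform of shape (1, 160000). A minimal pre-processing sketch (the function name is illustrative, not part of the released artifacts):

```python
import numpy as np

SAMPLE_RATE = 16_000
MAX_SAMPLES = 160_000  # 10 s at 16 kHz, matching the (1, 160000) input above

def pad_or_trim(audio: np.ndarray) -> np.ndarray:
    """Zero-pad or trim a 1-D waveform to exactly MAX_SAMPLES samples."""
    audio = audio[:MAX_SAMPLES]
    if audio.shape[0] < MAX_SAMPLES:
        audio = np.pad(audio, (0, MAX_SAMPLES - audio.shape[0]))
    return audio[np.newaxis, :]  # -> shape (1, 160000)
```

Shorter clips are padded with silence; longer clips are truncated, per the fixed-length limitation below.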

Contents

Artifact                                             Description                                     Size       iOS
glm_asr_nano_2512_encoder_with_projector.mlpackage   Stage A: audio_tower + multi_modal_projector    ~1,330 MB  17+
glm_asr_nano_2512_decoder.mlpackage                  Stage B2: stateful decoder (float16 KV cache)   ~2,828 MB  18+
glm_asr_nano_2512_decoder_8bit.mlpackage             Stage B2: 8-bit palettized decoder              ~1,414 MB  18+
glm_asr_nano_2512_prefill.mlpackage                  Stage B1: prefill (hidden_states -> KV cache)   ~2,947 MB  18+
tokenizer.json, tokenizer_config.json                Tokenizer for decoding                          -          -
processor_config.json, config.json                   Processor and model config                      -          -
chat_template.jinja                                  Chat template for transcription prompt          -          -

Model specs

  • Decoder: Llama-based (28 layers, 4 KV heads / 16 Q heads, hidden 2048, head_dim 128)
  • vocab_size: 59,264
  • Max context: 1,024 (audio prefix 375 + text tokens)
  • KV cache: stateful; shared between prefill and decoder on iOS 18+
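The specs above fix the KV-cache footprint. A back-of-the-envelope sketch (the layout is the standard layers × KV heads × context × head_dim arrangement; the exact tensor layout inside the mlpackage is an assumption):

```python
# Float16 KV-cache size implied by the decoder specs above.
LAYERS, KV_HEADS, HEAD_DIM, MAX_CTX = 28, 4, 128, 1024
BYTES_FP16 = 2

# K and V each hold LAYERS x KV_HEADS x MAX_CTX x HEAD_DIM fp16 values.
kv_bytes = 2 * LAYERS * KV_HEADS * MAX_CTX * HEAD_DIM * BYTES_FP16
print(kv_bytes / 2**20)  # 56.0 MiB
```

This is small relative to the weights, which is why the full 1,024-token context can stay resident in the stateful buffer.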

Usage on iOS / macOS

  1. Load Stage A: glm_asr_nano_2512_encoder_with_projector.mlpackage with Core ML. Input: input_features of shape (1, 128, T), a 128-bin mel spectrogram computed from 16 kHz mono audio. Output: projected_audio (1, 375, 2048) for 10 s of audio.
  2. Build full hidden_states: embed(prefix) + projected_audio + embed(suffix) with total length 390 (prefix 3 + 375 + suffix 12). Use the same prompt as the source model (e.g. "Please transcribe this audio into text").
  3. Load prefill: glm_asr_nano_2512_prefill.mlpackage. Run once with hidden_states to fill the KV cache (stateful).
  4. Load decoder: glm_asr_nano_2512_decoder.mlpackage or glm_asr_nano_2512_decoder_8bit.mlpackage. Use the same state as prefill (shared state). Decode token-by-token from position 390 until EOS.
  5. Decode token ids with the provided tokenizer to get the transcription.
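Step 4's token-by-token loop can be sketched as follows. Here `step` is an illustrative stand-in for the Core ML decoder call (one token id plus its position in, logits out); greedy argmax sampling is a simplification, and the names are not the model's actual API:

```python
import numpy as np

def greedy_decode(step, bos_id, eos_ids, start_pos=390, max_ctx=1024):
    """Feed one token at a time from `start_pos` until an EOS id appears.

    `step(token_id, position) -> logits` stands in for a stateful
    decoder prediction against the shared KV cache (hypothetical helper).
    """
    tokens = []
    token, pos = bos_id, start_pos
    while pos < max_ctx:
        logits = step(token, pos)
        token = int(np.argmax(logits))
        if token in eos_ids:
            break
        tokens.append(token)
        pos += 1
    return tokens
```

The loop starts at position 390 because the prefill already consumed the 390-token prompt (prefix 3 + audio 375 + suffix 12), and it stops at the 1,024-token context limit even if no EOS is produced.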

For correct output, the prefill must receive the full hidden_states (prefix + projected_audio + suffix); feeding it projected_audio alone degrades transcription quality.
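Assembling the full prefill input is a concatenation along the sequence axis. A sketch, where the prefix/suffix embeddings stand in for the output of the model's token-embedding lookup (illustrative names):

```python
import numpy as np

def build_hidden_states(prefix_emb, projected_audio, suffix_emb):
    """Concatenate (1, 3, 2048) + (1, 375, 2048) + (1, 12, 2048) -> (1, 390, 2048)."""
    hs = np.concatenate([prefix_emb, projected_audio, suffix_emb], axis=1)
    assert hs.shape[1] == 390, "prefill RoPE positions assume total length 390"
    return hs
```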

Limitations

  • Prefill: Requires full hidden_states (text prefix/suffix + projected audio). RoPE positions assume total length 390 for 10 s audio.
  • Fixed audio length: Stage A is traced for a fixed audio length (e.g. 10 s); pad or trim to match.
  • Platform: Decoder and prefill use the Core ML stateful API; iOS 18+ / macOS 15+ is required for the full pipeline. The encoder alone runs on iOS 17+.
  • EOS IDs: Resolve from tokenizer/generation_config at runtime; do not hardcode.
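Resolving EOS ids at runtime can be as simple as reading the generation config shipped with the source model. A hedged sketch (the `eos_token_id` field may be a single int or a list, as in typical Hugging Face generation_config.json files; field layout for this specific model is an assumption):

```python
import json

def resolve_eos_ids(gen_cfg: dict) -> set[int]:
    """Normalize `eos_token_id` (int or list of ints) into a set of ids."""
    eos = gen_cfg.get("eos_token_id")
    if eos is None:
        return set()
    return set(eos) if isinstance(eos, list) else {int(eos)}

# Typical use: resolve_eos_ids(json.load(open("generation_config.json")))
```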

License

Apache 2.0 (inherited from the source model).

Citation

@software{glm_asr_nano_2512,
  author = {Zhipu AI},
  title = {GLM-ASR-Nano-2512},
  year = {2025},
  url = {https://github.com/zai-org/GLM-ASR}
}
