Stateful Runtime Notes
This document describes the current stateful runtime contract for the zenz-CoreML export and explains how it differs from the earlier broken stateful revision.
Summary
The current Core ML export keeps a single stateful GPT-2 style model and expects the runtime to use it in two phases:
- Prompt prefill: send the full prompt once.
- Incremental decode: send exactly one token per step while reusing the same Core ML state.
The stateful path stores its cache in three Core ML states:
- keyCache
- valueCache
- pastLen
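As a rough mental model of how those three states relate, here is a toy sketch. The shapes, field names, and `append` helper are invented for illustration; the real buffers and their layout are fixed by the Core ML export itself.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class KVState:
    """Toy stand-in for the three Core ML states (shapes are invented)."""
    key_cache: np.ndarray    # (layers, heads, max_len, head_dim)
    value_cache: np.ndarray  # same shape as key_cache
    past_len: int = 0        # pastLen: tokens already written to the cache

    def append(self, k_step: np.ndarray, v_step: np.ndarray) -> None:
        # Write this step's K/V at position pastLen, then advance pastLen.
        n = k_step.shape[2]
        self.key_cache[:, :, self.past_len:self.past_len + n] = k_step
        self.value_cache[:, :, self.past_len:self.past_len + n] = v_step
        self.past_len += n


# Prefill writes the whole prompt at once; decode appends one token per step.
state = KVState(np.zeros((1, 1, 8, 2)), np.zeros((1, 1, 8, 2)))
state.append(np.ones((1, 1, 3, 2)), np.ones((1, 1, 3, 2)))  # 3-token prompt
state.append(np.ones((1, 1, 1, 2)), np.ones((1, 1, 1, 2)))  # 1 decode token
```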
Input Contract
The stateful model takes:
- input_ids
- attention_mask
input_ids is the current token chunk.
attention_mask is the active-sequence mask for the current step. For prompt prefill, its length equals the prompt length. For incremental decode, it must cover the total active sequence length (cached context plus the new token), not just the length of the single decode token.
That distinction matters. The previous broken stateful export and runtime path treated decode-time attention as length 1, which pushed the model into a malformed cache/position regime and produced garbage tokens after the first step.
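The mask-length rule above can be stated as a one-line helper. This is a sketch of the contract only, not shipped runtime code:

```python
def mask_len(past_len: int, step_tokens: int) -> int:
    """Active-sequence mask length for one call to the stateful model.

    Prefill: past_len == 0, step_tokens == prompt length -> prompt length.
    Decode:  past_len == prompt + generated so far, step_tokens == 1
             -> past_len + 1 (cached context plus the one new token).
    The broken path instead used step_tokens alone, i.e. 1, during decode.
    """
    return past_len + step_tokens
```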
What Was Wrong Before
The previous stateful export had three coupled problems.
1. Cache slicing was not trace-safe
The exporter converted pastLen into a Python scalar during tracing. That allowed PyTorch tracing to lock in cache slice behavior instead of preserving a dynamic state-driven cache read path.
In practice this meant the exported Core ML stateful model could stop behaving like a real cached model, even though it still exposed keyCache, valueCache, and pastLen.
2. Decode-time attention shape was too narrow
Incremental decode was effectively driven as if the model only saw the current token, not the current token plus the active cached context length.
This caused output patterns like:
- first generated token looks plausible
- following tokens collapse into invalid byte pieces
- decoded text appears as a run of repeated � replacement characters
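The repeated � symptom is what UTF-8 decoding produces when byte-level BPE pieces come out desynchronized from the cache and form invalid byte sequences. A minimal illustration:

```python
# 0x80 is a continuation byte that cannot start a UTF-8 sequence, so a
# run of them decodes to one U+FFFD replacement character per byte.
bad_bytes = b"\x80\x80\x80"
text = bad_bytes.decode("utf-8", errors="replace")
```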
3. Token semantics were read from stale config metadata
The tokenizer assets identify:
- <s>
- </s>
- [PAD]
- [UNK]
The earlier runtime path could rely on config.json values that did not match the tokenizer package exactly. That made EOS detection and cleanup unreliable.
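One way to treat tokenizer.json as the source of truth is to read the special-token IDs directly from its added_tokens section. The JSON fragment below is a made-up example in that file's shape; the IDs shown are invented, and the real values must come from the bundled asset:

```python
import json

# Invented fragment in the tokenizer.json "added_tokens" shape.
tokenizer_json = json.loads("""
{
  "added_tokens": [
    {"id": 0, "content": "<s>", "special": true},
    {"id": 1, "content": "</s>", "special": true},
    {"id": 2, "content": "[PAD]", "special": true},
    {"id": 3, "content": "[UNK]", "special": true}
  ]
}
""")

# Map each special token's text to its ID, instead of trusting config.json.
special_ids = {t["content"]: t["id"]
               for t in tokenizer_json["added_tokens"] if t.get("special")}
eos_id = special_ids["</s>"]
```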
Current Runtime Guidance
Use the tokenizer assets bundled with the exported model package and treat tokenizer.json as the source of truth for special-token IDs.
For a stateful generation loop:
- Encode the full prompt.
- Run one prefill call with the full prompt and a matching prompt-length attention mask.
- Reuse the same Core ML state.
- For each next step:
  - send one token in input_ids
  - send an attention_mask whose length matches the active sequence length
  - stop on the tokenizer-defined EOS token
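The loop above can be sketched against a stand-in model. Everything here is illustrative: the predict() call, its arguments, and the stub's behavior are assumptions, not the actual coremltools API; the stub only checks that each call's mask covers the full active sequence.

```python
class StubModel:
    """Fake stateful model: verifies the mask contract, emits fixed tokens."""

    def __init__(self, eos_id: int):
        self.eos_id = eos_id
        self.past_len = 0  # stands in for the pastLen state

    def predict(self, input_ids, mask_len):
        # The mask must cover cached context plus the tokens sent this call.
        assert mask_len == self.past_len + len(input_ids)
        self.past_len += len(input_ids)
        # Emit token 100 until 3 tokens have been generated, then EOS.
        return 100 if self.past_len < 6 else self.eos_id


def generate(model, prompt, max_new=8):
    # Prefill: send the whole prompt with a prompt-length mask.
    tok = model.predict(prompt, mask_len=len(prompt))
    out = []
    for _ in range(max_new):
        if tok == model.eos_id:  # tokenizer-defined EOS stops the loop
            break
        out.append(tok)
        # Decode: one token, mask covers the full active sequence.
        tok = model.predict([tok], mask_len=model.past_len + 1)
    return out
```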
Why The Export Still Uses One Stateful Model
This repo intentionally keeps a single stateful model instead of shipping separate prefill and decode model packages.
Reasons:
- simpler artifact management
- simpler downstream app integration
- fewer opportunities for runtime mismatch between two model packages
The optimization target here is not “more packages”. The target is “correct cached incremental decode on one stateful package”.
Benchmark Interpretation
Any benchmark run collected from the earlier broken stateful export should not be used to judge model quality.
Those runs measured a faulty decode path. Their latency numbers may still say something useful about app overhead, but they say nothing about language-generation quality.
Re-run quality and latency checks only after:
- the updated stateful artifacts are pulled from HF
- downstream compiled caches are invalidated
- the consuming app rebuilds against the new artifacts