Stateful Runtime Notes
This document describes the current stateful runtime contract for the zenz-CoreML export and explains how it differs from the earlier broken stateful revision.
Summary
The current Core ML export keeps a single stateful GPT-2 style model and expects the runtime to use it in two phases:
- Prompt prefill: send the full prompt once.
- Incremental decode: send exactly one token per step while reusing the same Core ML state.
The stateful path stores its cache in three Core ML states:
- keyCache
- valueCache
- pastLen
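As a rough mental model of how those three states relate, here is a toy sketch. The shapes, field names, and `append` helper are invented for illustration; the real buffers and their layout are fixed by the Core ML export itself.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class KVState:
    """Toy stand-in for the three Core ML states (shapes are invented)."""
    key_cache: np.ndarray    # (layers, heads, max_len, head_dim)
    value_cache: np.ndarray  # same shape as key_cache
    past_len: int = 0        # pastLen: tokens already written to the cache

    def append(self, k_step: np.ndarray, v_step: np.ndarray) -> None:
        # Write this step's K/V at position pastLen, then advance pastLen.
        n = k_step.shape[2]
        self.key_cache[:, :, self.past_len:self.past_len + n] = k_step
        self.value_cache[:, :, self.past_len:self.past_len + n] = v_step
        self.past_len += n


# Prefill writes the whole prompt at once; decode appends one token per step.
state = KVState(np.zeros((1, 1, 8, 2)), np.zeros((1, 1, 8, 2)))
state.append(np.ones((1, 1, 3, 2)), np.ones((1, 1, 3, 2)))  # 3-token prompt
state.append(np.ones((1, 1, 1, 2)), np.ones((1, 1, 1, 2)))  # 1 decode token
```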
Input Contract
The stateful model takes:
- input_ids
- attention_mask
input_ids is the current token chunk.
attention_mask is the active-sequence mask for the current step. For prompt prefill, its length equals the prompt length. For incremental decode, it must cover the total active sequence length (cached context plus the new token), not just the length of the single decode token.
That distinction matters. The previous broken stateful export and runtime path treated decode-time attention as length 1, which pushed the model into a malformed cache/position regime and produced garbage tokens after the first step.
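The mask-length rule above can be stated as a one-line helper. This is a sketch of the contract only, not shipped runtime code:

```python
def mask_len(past_len: int, step_tokens: int) -> int:
    """Active-sequence mask length for one call to the stateful model.

    Prefill: past_len == 0, step_tokens == prompt length -> prompt length.
    Decode:  past_len == prompt + generated so far, step_tokens == 1
             -> past_len + 1 (cached context plus the one new token).
    The broken path instead used step_tokens alone, i.e. 1, during decode.
    """
    return past_len + step_tokens
```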
What Was Wrong Before
The previous stateful export had three coupled problems.
1. Cache slicing was not trace-safe
The exporter converted pastLen into a Python scalar during tracing. That allowed PyTorch tracing to lock in cache slice behavior instead of preserving a dynamic state-driven cache read path.
In practice this meant the exported Core ML stateful model could stop behaving like a real cached model, even though it still exposed keyCache, valueCache, and pastLen.
2. Decode-time attention shape was too narrow
Incremental decode was effectively driven as if the model only saw the current token, not the current token plus the active cached context length.
This caused output patterns like:
- first generated token looks plausible
- following tokens collapse into invalid byte pieces
- decoded text appears as a run of repeated � replacement characters
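The repeated � symptom is what UTF-8 decoding produces when byte-level BPE pieces come out desynchronized from the cache and form invalid byte sequences. A minimal illustration:

```python
# 0x80 is a continuation byte that cannot start a UTF-8 sequence, so a
# run of them decodes to one U+FFFD replacement character per byte.
bad_bytes = b"\x80\x80\x80"
text = bad_bytes.decode("utf-8", errors="replace")
```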
3. Token semantics were read from stale config metadata
The tokenizer assets identify:
- <s>
- </s>
- [PAD]
- [UNK]
The earlier runtime path could rely on config.json values that did not match the tokenizer package exactly. That made EOS detection and cleanup unreliable.
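One way to treat tokenizer.json as the source of truth is to read the special-token IDs directly from its added_tokens section. The JSON fragment below is a made-up example in that file's shape; the IDs shown are invented, and the real values must come from the bundled asset:

```python
import json

# Invented fragment in the tokenizer.json "added_tokens" shape.
tokenizer_json = json.loads("""
{
  "added_tokens": [
    {"id": 0, "content": "<s>", "special": true},
    {"id": 1, "content": "</s>", "special": true},
    {"id": 2, "content": "[PAD]", "special": true},
    {"id": 3, "content": "[UNK]", "special": true}
  ]
}
""")

# Map each special token's text to its ID, instead of trusting config.json.
special_ids = {t["content"]: t["id"]
               for t in tokenizer_json["added_tokens"] if t.get("special")}
eos_id = special_ids["</s>"]
```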
Current Runtime Guidance
Use the tokenizer assets bundled with the exported model package and treat tokenizer.json as the source of truth for special-token IDs.
For a stateful generation loop:
- Encode the full prompt.
- Run one prefill call with the full prompt and a matching prompt-length attention mask.
- Reuse the same Core ML state.
- For each next step:
  - send one token in input_ids
  - send an attention_mask whose length matches the active sequence length
  - stop on the tokenizer-defined EOS token
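The loop above can be sketched against a stand-in model. Everything here is illustrative: the predict() call, its arguments, and the stub's behavior are assumptions, not the actual coremltools API; the stub only checks that each call's mask covers the full active sequence.

```python
class StubModel:
    """Fake stateful model: verifies the mask contract, emits fixed tokens."""

    def __init__(self, eos_id: int):
        self.eos_id = eos_id
        self.past_len = 0  # stands in for the pastLen state

    def predict(self, input_ids, mask_len):
        # The mask must cover cached context plus the tokens sent this call.
        assert mask_len == self.past_len + len(input_ids)
        self.past_len += len(input_ids)
        # Emit token 100 until 3 tokens have been generated, then EOS.
        return 100 if self.past_len < 6 else self.eos_id


def generate(model, prompt, max_new=8):
    # Prefill: send the whole prompt with a prompt-length mask.
    tok = model.predict(prompt, mask_len=len(prompt))
    out = []
    for _ in range(max_new):
        if tok == model.eos_id:  # tokenizer-defined EOS stops the loop
            break
        out.append(tok)
        # Decode: one token, mask covers the full active sequence.
        tok = model.predict([tok], mask_len=model.past_len + 1)
    return out
```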
Why The Export Still Uses One Stateful Model
This repo intentionally keeps a single stateful model instead of shipping separate prefill and decode model packages.
Reasons:
- simpler artifact management
- simpler downstream app integration
- fewer opportunities for runtime mismatch between two model packages
The optimization target here is not “more packages”. The target is “correct cached incremental decode on one stateful package”.
Benchmark Interpretation
Any benchmark run collected from the earlier broken stateful export should not be used to judge model quality.
Those runs measured a faulty decode path. Their latency numbers may still say something useful about app overhead, but they say nothing about language-generation quality.
Re-run quality and latency checks only after:
- the updated stateful artifacts are pulled from HF
- downstream compiled caches are invalidated
- the consuming app rebuilds against the new artifacts