| # Stateful Runtime Notes |
|
|
| This document describes the current stateful runtime contract for the `zenz-CoreML` export and explains how it differs from the earlier broken stateful revision. |
|
|
| ## Summary |
|
|
| The current Core ML export keeps a single stateful GPT-2 style model and expects the runtime to use it in two phases: |
|
|
| 1. Prompt prefill: send the full prompt once. |
| 2. Incremental decode: send exactly one token per step while reusing the same Core ML state. |
|
|
| The stateful path stores its cache in three Core ML states: |
|
|
| - `keyCache` |
| - `valueCache` |
| - `pastLen` |
|
|
| ## Input Contract |
|
|
| The stateful model takes: |
|
|
| - `input_ids` |
| - `attention_mask` |
|
|
`input_ids` is the current token chunk: the full prompt during prefill, and a single token per incremental decode step.
|
|
`attention_mask` is the active sequence mask for the current step. For prompt prefill, its length equals the prompt length. For incremental decode, it must cover the total active sequence length (cached tokens plus the new token), not just the single decode token.
|
|
| That distinction matters. The previous broken stateful export and runtime path treated decode-time attention as length `1`, which pushed the model into a malformed cache/position regime and produced garbage tokens after the first step. |
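The mask-length rule can be sketched as a tiny helper. The function name and parameters here are illustrative, not part of the actual Core ML I/O contract:

```python
def mask_length(past_len: int, chunk_len: int) -> int:
    """Length the attention mask must cover for the current call.

    past_len:  tokens already held in keyCache/valueCache (0 at prefill)
    chunk_len: tokens in this call's input_ids (prompt length at prefill,
               1 per incremental decode step)
    """
    return past_len + chunk_len

# Prefill: no cache yet, a 7-token prompt -> mask covers 7 positions.
assert mask_length(0, 7) == 7

# First decode step: 7 cached tokens plus 1 new token -> mask covers 8.
# The broken path used 1 here, which is what corrupted decode.
assert mask_length(7, 1) == 8
```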
|
|
| ## What Was Wrong Before |
|
|
| The previous stateful export had three coupled problems. |
|
|
| ### 1. Cache slicing was not trace-safe |
|
|
The exporter converted `pastLen` into a Python scalar during tracing, so PyTorch tracing baked the cache slice bounds into the graph instead of preserving a dynamic, state-driven cache read path.
|
|
| In practice this meant the exported Core ML stateful model could stop behaving like a real cached model, even though it still exposed `keyCache`, `valueCache`, and `pastLen`. |
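A minimal illustration of the failure mode, with plain Python lists standing in for traced tensor ops (the real exporter operates on PyTorch tensors; all names here are hypothetical):

```python
def read_cache_dynamic(key_cache, past_len):
    # Correct: the slice bound remains a function of the pastLen state,
    # so a longer cache at runtime yields a longer read.
    return key_cache[:past_len]

# Simulating the broken exporter: pastLen was converted to a Python
# scalar at trace time, so the bound (here, 3) was frozen into the graph.
TRACED_PAST_LEN = 3

def read_cache_frozen(key_cache, _past_len_state_ignored):
    return key_cache[:TRACED_PAST_LEN]

cache = list(range(10))
assert read_cache_dynamic(cache, 5) == [0, 1, 2, 3, 4]
# The frozen variant ignores the state and always reads 3 entries:
assert read_cache_frozen(cache, 5) == [0, 1, 2]
```

The states `keyCache`, `valueCache`, and `pastLen` were still exposed in the broken export; they just no longer influenced the read path, which is why the model could look cached without behaving cached.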
|
|
| ### 2. Decode-time attention shape was too narrow |
|
|
| Incremental decode was effectively driven as if the model only saw the current token, not the current token plus the active cached context length. |
|
|
| This caused output patterns like: |
|
|
| - first generated token looks plausible |
| - following tokens collapse into invalid byte pieces |
| - decoded text appears as repeated `�` |
|
|
| ### 3. Token semantics were read from stale config metadata |
|
|
| The tokenizer assets identify: |
|
|
| - `<s>` |
| - `</s>` |
| - `[PAD]` |
| - `[UNK]` |
|
|
| The earlier runtime path could rely on `config.json` values that did not match the tokenizer package exactly. That made EOS detection and cleanup unreliable. |
|
|
| ## Current Runtime Guidance |
|
|
| Use the tokenizer assets bundled with the exported model package and treat `tokenizer.json` as the source of truth for special-token IDs. |
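One way to pull the special-token IDs out of `tokenizer.json` instead of `config.json` is sketched below. The JSON layout mirrors the common Hugging Face tokenizers format (`added_tokens` entries with `id` and `content`); the actual package layout may differ, and the sample IDs are illustrative only:

```python
import json

def special_token_ids(tokenizer_json_text: str) -> dict:
    """Map special-token surface forms to IDs from tokenizer.json."""
    data = json.loads(tokenizer_json_text)
    return {t["content"]: t["id"] for t in data.get("added_tokens", [])}

# Illustrative fragment only; real IDs come from the bundled asset.
sample = json.dumps({"added_tokens": [
    {"id": 0, "content": "<s>"},
    {"id": 1, "content": "</s>"},
    {"id": 2, "content": "[PAD]"},
    {"id": 3, "content": "[UNK]"},
]})
ids = special_token_ids(sample)
assert ids["</s>"] == 1  # EOS detection should key off this value
```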
|
|
| For a stateful generation loop: |
|
|
| 1. Encode the full prompt. |
| 2. Run one prefill call with the full prompt and a matching prompt-length attention mask. |
| 3. Reuse the same Core ML state. |
| 4. For each next step: |
| - send one token in `input_ids` |
| - send an `attention_mask` whose length matches the active sequence length |
| - stop on the tokenizer-defined EOS token |
|
|
| ## Why The Export Still Uses One Stateful Model |
|
|
| This repo intentionally keeps a single stateful model instead of shipping separate `prefill` and `decode` model packages. |
|
|
| Reasons: |
|
|
| - simpler artifact management |
| - simpler downstream app integration |
| - fewer opportunities for runtime mismatch between two model packages |
|
|
| The optimization target here is not “more packages”. The target is “correct cached incremental decode on one stateful package”. |
|
|
| ## Benchmark Interpretation |
|
|
| Any benchmark run collected from the earlier broken stateful export should not be used to judge model quality. |
|
|
Those runs measured a faulty decode path. They may still contain meaningful latency numbers for app-side overhead, but they say nothing about language-generation quality or correctness.
|
|
| Re-run quality and latency checks only after: |
|
|
| - the updated stateful artifacts are pulled from HF |
| - downstream compiled caches are invalidated |
| - the consuming app rebuilds against the new artifacts |
|
|