# LLaDA-8B-Instruct – Core ML (seq_len=192, argmax)
This repo hosts a Core ML ML Program export of GSAI-ML/LLaDA-8B-Instruct suitable for macOS/visionOS apps. It is a single forward-pass denoiser export; you must run the diffusion sampling loop in your app.
## What was done to this model
- Source model: `GSAI-ML/LLaDA-8B-Instruct` (HF, `trust_remote_code=True`).
- Core ML format: ML Program (`.mlpackage`).
- Export script: `llada_metal4_kit/python/export_llada_to_mlpackage.py`.
- Sequence length: fixed `seq_len=192`.
- Output mode: `argmax` – returns per-position `token_ids` + `scores` (max logits) instead of full logits.
- Precision: fp16 weights + compute hint.
- Minimum target: iOS 15 (macOS 12 class). This forces fp16 outputs to be cast to fp32 at the tail for compatibility.
- Torch export: attempted `torch.export`, fell back to `torch.jit.trace` due to data-dependent control flow in `attention_mask` handling.
## Inputs / Outputs

Inputs (shapes are fixed):

- `input_ids`: int32 `[1, 192]`
- `attention_mask`: int32 `[1, 192]`

Outputs:

- `token_ids`: int32 `[1, 192]` (argmax at each position)
- `scores`: fp32 `[1, 192]` (max logit per position; fp16 would be used on iOS 16+/macOS 13+)

Output names may appear as `var_XXXX` in Core ML depending on conversion; rely on type/shape if needed.
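A minimal Python sketch of preparing the feature dictionary for prediction. The feature names `input_ids`/`attention_mask` match the export above; with coremltools on macOS you would pass the returned dict to `model.predict(...)` — here we only build and shape-check the arrays:

```python
import numpy as np

SEQ_LEN = 192

def to_features(input_ids, attention_mask):
    """Pack Python lists into the int32 [1, 192] arrays the model expects."""
    ids = np.asarray(input_ids, dtype=np.int32).reshape(1, SEQ_LEN)
    mask = np.asarray(attention_mask, dtype=np.int32).reshape(1, SEQ_LEN)
    assert ids.shape == mask.shape == (1, SEQ_LEN)
    return {"input_ids": ids, "attention_mask": mask}
```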
## Usage notes
- This is not autoregressive. You must implement the diffusion/iterative remasking loop in your app.
- There is no KV-cache; each step runs a full forward pass.
- Tokenization is not included; use the original HF tokenizer for LLaDA.
- The package includes ~16 GB of weights. Expect large storage and memory usage.
## How to run generation in a diffusion loop
This Core ML export only runs one denoiser step. To generate text, run many denoiser passes and progressively unmask target positions.
### Important token IDs

Use the same tokenizer as `GSAI-ML/LLaDA-8B-Instruct`:

- BOS: `<|startoftext|>` = 126080
- EOS/PAD: `<|endoftext|>` = 126081
- Diffusion mask token: `<|mdm_mask|>` = 126336

For diffusion generation, use `<|mdm_mask|>` for masked target positions (not `[gMASK]`).
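The IDs above as Python constants, convenient when writing the sampling loop:

```python
# Special token IDs for GSAI-ML/LLaDA-8B-Instruct, as listed above.
BOS_ID = 126080       # <|startoftext|>
EOS_ID = 126081       # <|endoftext|>, also used for padding
MDM_MASK_ID = 126336  # <|mdm_mask|>, the diffusion mask token
```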
### Required loop structure (seq_len=192)

- Tokenize the prompt (optionally with the chat template).
- Build `input_ids` `[1, 192]`:
  - Prompt tokens at the front.
  - Generation region initialized to `<|mdm_mask|>`.
  - Remaining tail padded with EOS/PAD.
- Build `attention_mask` `[1, 192]`:
  - `1` for prompt + generation region.
  - `0` for the padded tail.
- Repeat for `steps` iterations:
  - Run the model once.
  - Read per-position `token_ids` and `scores`.
  - For currently masked positions only, commit the highest-confidence tokens this step.
  - Keep uncommitted positions masked.
  - Stop when no masked positions remain or the step budget is exhausted.
- Decode the generated region up to the first EOS.
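The input-construction steps above can be sketched in numpy. The function name and signature are illustrative, not part of this package:

```python
import numpy as np

SEQ_LEN = 192
MDM_MASK_ID = 126336  # <|mdm_mask|>
EOS_ID = 126081       # <|endoftext|>, also used as pad

def build_inputs(prompt_ids, max_new_tokens):
    """Build input_ids / attention_mask for one diffusion run."""
    gen_start = len(prompt_ids)
    gen_end = gen_start + max_new_tokens
    assert gen_end <= SEQ_LEN, "prompt + generation region must fit in 192 positions"
    ids = np.full((1, SEQ_LEN), EOS_ID, dtype=np.int32)  # tail padded with EOS/PAD
    ids[0, :gen_start] = prompt_ids                      # prompt tokens at the front
    ids[0, gen_start:gen_end] = MDM_MASK_ID              # masked generation region
    mask = np.zeros((1, SEQ_LEN), dtype=np.int32)
    mask[0, :gen_end] = 1                                # attend prompt + generation region
    return ids, mask
```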
### Minimal pseudocode

```python
# ids: int32 [1, 192], mask: int32 [1, 192]
for step in range(steps):
    pred_ids, conf = model(ids, mask)  # token_ids + scores
    masked_pos = where((ids == MDM_MASK_ID) & (mask == 1))
    if len(masked_pos) == 0:
        break
    # choose the top-k masked positions by confidence for this step
    commit = topk(masked_pos, key=conf, k=schedule(step))
    ids[commit] = pred_ids[commit]
    # leave the rest as MDM_MASK_ID
```
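The commit step of the pseudocode, made concrete in numpy. `commit_step` is an illustrative helper (not part of this package); `k` would come from your unmasking schedule:

```python
import numpy as np

MDM_MASK_ID = 126336  # <|mdm_mask|>

def commit_step(ids, attn, pred_ids, scores, k):
    """Commit the k most confident predictions among still-masked positions.

    Returns (updated ids, True if nothing remains masked afterwards).
    """
    masked = np.where((ids[0] == MDM_MASK_ID) & (attn[0] == 1))[0]
    if masked.size == 0:
        return ids, True
    k = min(k, masked.size)
    # positions with the highest max-logit scores win this round
    winners = masked[np.argsort(scores[0, masked])[-k:]]
    out = ids.copy()
    out[0, winners] = pred_ids[0, winners]
    return out, masked.size == k
```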
### Practical defaults

- `max_new_tokens`: 32 to 128
- `steps`: 24 to 64 (more steps = higher quality, slower)
- Compute units: `.all` (or CPU-only for debugging)
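One simple way to turn the step budget into a per-step commit count is an even linear split, similar in spirit to LLaDA's reference sampler. The function is a sketch, not part of this package:

```python
def linear_schedule(step, steps, total_masked):
    """How many tokens to commit at the given step, split as evenly as possible.

    Earlier steps get the remainder, so every masked position is
    committed exactly once across all steps.
    """
    per, extra = divmod(total_masked, steps)
    return per + (1 if step < extra else 0)
```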
## App scaffolding

If you're using this with the included Swift kit:

- Core ML runner: `LLaDACoreMLRunner`
- Sampler loop: `LLaDASampler`
- UI: `llada_apps` (macOS + visionOS)
## License / Attribution
This is a converted artifact of GSAI-ML/LLaDA-8B-Instruct. Refer to the original model repo for licensing and usage restrictions.