LLaDA-8B-Instruct - Core ML (seq_len=192, argmax)

This repo hosts a Core ML ML Program export of GSAI-ML/LLaDA-8B-Instruct suitable for macOS/visionOS apps. It is a single forward-pass denoiser export; you must run the diffusion sampling loop in your app.

What was done to this model

  • Source model: GSAI-ML/LLaDA-8B-Instruct (HF, trust_remote_code=True).
  • Core ML format: ML Program (.mlpackage).
  • Export script: llada_metal4_kit/python/export_llada_to_mlpackage.py.
  • Sequence length: fixed seq_len=192.
  • Output mode: argmax: returns per-position token_ids + scores (max logits) instead of full logits.
  • Precision: fp16 weights + compute hint.
  • Minimum target: iOS15 (macOS 12 class). This forces fp16 outputs to be cast to fp32 at the tail for compatibility.
  • Torch export: attempted torch.export, fell back to torch.jit.trace due to data-dependent control flow in attention_mask handling.

Inputs / Outputs

Inputs (shapes are fixed):

  • input_ids: int32 [1, 192]
  • attention_mask: int32 [1, 192]

Outputs:

  • token_ids: int32 [1, 192] (argmax at each position)
  • scores: fp32 [1, 192] (max logit per position; fp16 would be used on iOS16+/macOS13+)

Output names may appear as var_XXXX in Core ML depending on conversion; rely on type/shape if needed.

Usage notes

  • This is not autoregressive. You must implement the diffusion/iterative remasking loop in your app.
  • There is no KV-cache; each step runs a full forward pass.
  • Tokenization is not included; use the original HF tokenizer for LLaDA.
  • The package includes ~16 GB of weights. Expect large storage and memory usage.

How to run generation in a diffusion loop

This Core ML export only runs one denoiser step. To generate text, run many denoiser passes and progressively unmask target positions.

Important token IDs

Use the same tokenizer as GSAI-ML/LLaDA-8B-Instruct:

  • BOS: <|startoftext|> = 126080
  • EOS/PAD: <|endoftext|> = 126081
  • Diffusion mask token: <|mdm_mask|> = 126336

For diffusion generation, use <|mdm_mask|> for masked target positions (not [gMASK]).

Required loop structure (seq_len=192)

  1. Tokenize prompt (optionally with chat template).
  2. Build input_ids[1,192]:
    • Prompt tokens at the front.
    • Generation region initialized to <|mdm_mask|>.
    • Remaining tail padded with EOS/PAD.
  3. Build attention_mask[1,192]:
    • 1 for prompt + generation region.
    • 0 for padded tail.
  4. Repeat for up to `steps` iterations:
    • Run model once.
    • Read per-position token_ids and scores.
    • For currently masked positions only, commit highest-confidence tokens this step.
    • Keep uncommitted positions masked.
  5. Stop when no masked positions remain or step budget is exhausted.
  6. Decode generated region up to first EOS.
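Steps 2-3 above can be sketched with NumPy. The prompt contents (20 dummy token IDs) and the 64-token generation length below are hypothetical example values; only SEQ_LEN=192 and the special token IDs come from this card:

```python
import numpy as np

SEQ_LEN = 192
MDM_MASK_ID = 126336   # <|mdm_mask|>
EOS_ID = 126081        # <|endoftext|>, also used as PAD

def build_inputs(prompt_ids, gen_len):
    """Lay out [prompt | masked generation region | EOS padding] plus the matching attention mask."""
    n = len(prompt_ids)
    assert n + gen_len <= SEQ_LEN
    input_ids = np.full((1, SEQ_LEN), EOS_ID, dtype=np.int32)  # padded tail
    input_ids[0, :n] = prompt_ids                              # prompt at the front
    input_ids[0, n:n + gen_len] = MDM_MASK_ID                  # generation region
    attention_mask = np.zeros((1, SEQ_LEN), dtype=np.int32)
    attention_mask[0, :n + gen_len] = 1                        # prompt + generation visible
    return input_ids, attention_mask

ids, mask = build_inputs(prompt_ids=list(range(100, 120)), gen_len=64)
print(ids.shape, int(mask.sum()))  # (1, 192) 84
```

Feed both arrays to the Core ML model as int32; the shapes must match the fixed export exactly.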

Minimal pseudocode

# ids: int32 [1, 192], mask: int32 [1, 192]
for step in range(steps):
    pred_ids, conf = model(ids, mask)  # token_ids + scores from the export
    masked_pos = np.flatnonzero((ids[0] == MDM_MASK_ID) & (mask[0] == 1))
    if masked_pos.size == 0:
        break

    # commit the k highest-confidence masked positions this step
    k = schedule(step)
    commit = masked_pos[np.argsort(conf[0, masked_pos])[-k:]]
    ids[0, commit] = pred_ids[0, commit]
    # all other masked positions stay MDM_MASK_ID
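To sanity-check the loop logic without loading the Core ML package, here is a toy run against a stand-in model. The `fake_model` function, the fixed predicted token (7), and the commit budget of 4 per step are all hypothetical stand-ins for illustration:

```python
import numpy as np

SEQ_LEN, MDM_MASK_ID, EOS_ID = 192, 126336, 126081
rng = np.random.default_rng(0)

def fake_model(ids, mask):
    # Stand-in for the Core ML call: "predict" token 7 everywhere, random confidence.
    pred_ids = np.full_like(ids, 7)
    conf = rng.random(ids.shape).astype(np.float32)
    return pred_ids, conf

def generate(ids, mask, steps=16, per_step=4):
    ids = ids.copy()
    for step in range(steps):
        pred_ids, conf = fake_model(ids, mask)
        masked = np.flatnonzero((ids[0] == MDM_MASK_ID) & (mask[0] == 1))
        if masked.size == 0:
            break
        commit = masked[np.argsort(conf[0, masked])[-per_step:]]  # top-k by confidence
        ids[0, commit] = pred_ids[0, commit]
    return ids

ids = np.full((1, SEQ_LEN), EOS_ID, dtype=np.int32)
ids[0, 20:52] = MDM_MASK_ID                  # 32-token generation region
mask = np.zeros((1, SEQ_LEN), dtype=np.int32)
mask[0, :52] = 1
out = generate(ids, mask)
print(int((out == MDM_MASK_ID).sum()))  # 0: all 32 masks resolved in 32/4 = 8 steps
```

Swapping `fake_model` for the real Core ML prediction call is the only change needed in an app.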

Practical defaults

  • max_new_tokens: 32 to 128
  • steps: 24 to 64 (more steps = higher quality, slower)
  • Compute units: .all (or CPU-only for debugging)
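One simple choice for the `schedule(step)` used in the pseudocode above is to split the generation region evenly across the step budget. This linear schedule is an assumption for illustration, not something prescribed by the model:

```python
def linear_schedule(step, gen_len, steps):
    """Number of tokens to commit at `step` so all gen_len tokens land within `steps` passes."""
    # Differencing cumulative targets avoids rounding drift:
    # the per-step counts always sum to exactly gen_len.
    done = (step * gen_len) // steps
    target = ((step + 1) * gen_len) // steps
    return target - done

# e.g. gen_len=64 over 24 steps commits 2 or 3 tokens per pass
counts = [linear_schedule(s, 64, 24) for s in range(24)]
print(sum(counts))  # 64
```

The original LLaDA sampler also supports confidence-adaptive schedules; a fixed linear split is just the simplest baseline.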

App scaffolding

If you're using this with the included Swift kit:

  • Core ML runner: LLaDACoreMLRunner
  • Sampler loop: LLaDASampler
  • UI: llada_apps (macOS + visionOS)

License / Attribution

This is a converted artifact of GSAI-ML/LLaDA-8B-Instruct. Refer to the original model repo for licensing and usage restrictions.
