LLaDA-8B-Instruct - Core ML (seq_len=192, argmax)

This repo hosts a Core ML ML Program export of GSAI-ML/LLaDA-8B-Instruct suitable for macOS/visionOS apps. It is a single forward-pass denoiser export; you must run the diffusion sampling loop in your app.

What was done to this model

  • Source model: GSAI-ML/LLaDA-8B-Instruct (HF, trust_remote_code=True).
  • Core ML format: ML Program (.mlpackage).
  • Export script: llada_metal4_kit/python/export_llada_to_mlpackage.py.
  • Sequence length: fixed seq_len=192.
  • Output mode: argmax: returns per-position token_ids + scores (max logits) instead of full logits.
  • Precision: fp16 weights + compute hint.
  • Minimum target: iOS15 (macOS 12 class). This forces fp16 outputs to be cast to fp32 at the tail for compatibility.
  • Torch export: attempted torch.export, fell back to torch.jit.trace due to data-dependent control flow in attention_mask handling.

Inputs / Outputs

Inputs (shapes are fixed):

  • input_ids: int32 [1, 192]
  • attention_mask: int32 [1, 192]

Outputs:

  • token_ids: int32 [1, 192] (argmax at each position)
  • scores: fp32 [1, 192] (max logit per position; fp16 would be used on iOS16+/macOS13+)

Output names may appear as var_XXXX in Core ML depending on conversion; rely on type/shape if needed.

Usage notes

  • This is not autoregressive. You must implement the diffusion/iterative remasking loop in your app.
  • There is no KV-cache; each step runs a full forward pass.
  • Tokenization is not included; use the original HF tokenizer for LLaDA.
  • The package includes ~16 GB of weights. Expect large storage and memory usage.

How to run generation in a diffusion loop

This Core ML export only runs one denoiser step. To generate text, run many denoiser passes and progressively unmask target positions.

Important token IDs

Use the same tokenizer as GSAI-ML/LLaDA-8B-Instruct:

  • BOS: <|startoftext|> = 126080
  • EOS/PAD: <|endoftext|> = 126081
  • Diffusion mask token: <|mdm_mask|> = 126336

For diffusion generation, use <|mdm_mask|> for masked target positions (not [gMASK]).

Required loop structure (seq_len=192)

  1. Tokenize prompt (optionally with chat template).
  2. Build input_ids[1,192]:
    • Prompt tokens at the front.
    • Generation region initialized to <|mdm_mask|>.
    • Remaining tail padded with EOS/PAD.
  3. Build attention_mask[1,192]:
    • 1 for prompt + generation region.
    • 0 for padded tail.
  4. Repeat for up to `steps` iterations:
    • Run model once.
    • Read per-position token_ids and scores.
    • For currently masked positions only, commit highest-confidence tokens this step.
    • Keep uncommitted positions masked.
  5. Stop when no masked positions remain or step budget is exhausted.
  6. Decode generated region up to first EOS.
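Steps 2-3 above can be sketched with NumPy. The prompt contents (20 dummy token IDs) and the 64-token generation length below are hypothetical example values; only SEQ_LEN=192 and the special token IDs come from this card:

```python
import numpy as np

SEQ_LEN = 192
MDM_MASK_ID = 126336   # <|mdm_mask|>
EOS_ID = 126081        # <|endoftext|>, also used as PAD

def build_inputs(prompt_ids, gen_len):
    """Lay out [prompt | masked generation region | EOS padding] plus the matching attention mask."""
    n = len(prompt_ids)
    assert n + gen_len <= SEQ_LEN
    input_ids = np.full((1, SEQ_LEN), EOS_ID, dtype=np.int32)  # padded tail
    input_ids[0, :n] = prompt_ids                              # prompt at the front
    input_ids[0, n:n + gen_len] = MDM_MASK_ID                  # generation region
    attention_mask = np.zeros((1, SEQ_LEN), dtype=np.int32)
    attention_mask[0, :n + gen_len] = 1                        # prompt + generation visible
    return input_ids, attention_mask

ids, mask = build_inputs(prompt_ids=list(range(100, 120)), gen_len=64)
print(ids.shape, int(mask.sum()))  # (1, 192) 84
```

Feed both arrays to the Core ML model as int32; the shapes must match the fixed export exactly.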

Minimal pseudocode

# ids: int32 [1, 192], mask: int32 [1, 192]
for step in range(steps):
    pred_ids, conf = model(ids, mask)  # token_ids + scores from the export
    masked_pos = np.flatnonzero((ids[0] == MDM_MASK_ID) & (mask[0] == 1))
    if masked_pos.size == 0:
        break

    # commit the k highest-confidence masked positions this step
    k = schedule(step)
    commit = masked_pos[np.argsort(conf[0, masked_pos])[-k:]]
    ids[0, commit] = pred_ids[0, commit]
    # all other masked positions stay MDM_MASK_ID
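To sanity-check the loop logic without loading the Core ML package, here is a toy run against a stand-in model. The `fake_model` function, the fixed predicted token (7), and the commit budget of 4 per step are all hypothetical stand-ins for illustration:

```python
import numpy as np

SEQ_LEN, MDM_MASK_ID, EOS_ID = 192, 126336, 126081
rng = np.random.default_rng(0)

def fake_model(ids, mask):
    # Stand-in for the Core ML call: "predict" token 7 everywhere, random confidence.
    pred_ids = np.full_like(ids, 7)
    conf = rng.random(ids.shape).astype(np.float32)
    return pred_ids, conf

def generate(ids, mask, steps=16, per_step=4):
    ids = ids.copy()
    for step in range(steps):
        pred_ids, conf = fake_model(ids, mask)
        masked = np.flatnonzero((ids[0] == MDM_MASK_ID) & (mask[0] == 1))
        if masked.size == 0:
            break
        commit = masked[np.argsort(conf[0, masked])[-per_step:]]  # top-k by confidence
        ids[0, commit] = pred_ids[0, commit]
    return ids

ids = np.full((1, SEQ_LEN), EOS_ID, dtype=np.int32)
ids[0, 20:52] = MDM_MASK_ID                  # 32-token generation region
mask = np.zeros((1, SEQ_LEN), dtype=np.int32)
mask[0, :52] = 1
out = generate(ids, mask)
print(int((out == MDM_MASK_ID).sum()))  # 0: all 32 masks resolved in 32/4 = 8 steps
```

Swapping `fake_model` for the real Core ML prediction call is the only change needed in an app.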

Practical defaults

  • max_new_tokens: 32 to 128
  • steps: 24 to 64 (more steps = higher quality, slower)
  • Compute units: .all (or CPU-only for debugging)
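One simple choice for the `schedule(step)` used in the pseudocode above is to split the generation region evenly across the step budget. This linear schedule is an assumption for illustration, not something prescribed by the model:

```python
def linear_schedule(step, gen_len, steps):
    """Number of tokens to commit at `step` so all gen_len tokens land within `steps` passes."""
    # Differencing cumulative targets avoids rounding drift:
    # the per-step counts always sum to exactly gen_len.
    done = (step * gen_len) // steps
    target = ((step + 1) * gen_len) // steps
    return target - done

# e.g. gen_len=64 over 24 steps commits 2 or 3 tokens per pass
counts = [linear_schedule(s, 64, 24) for s in range(24)]
print(sum(counts))  # 64
```

The original LLaDA sampler also supports confidence-adaptive schedules; a fixed linear split is just the simplest baseline.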

App scaffolding

If you're using this with the included Swift kit:

  • Core ML runner: LLaDACoreMLRunner
  • Sampler loop: LLaDASampler
  • UI: llada_apps (macOS + visionOS)

License / Attribution

This is a converted artifact of GSAI-ML/LLaDA-8B-Instruct. Refer to the original model repo for licensing and usage restrictions.
