Llama 3.2 1B SpinQuant — Core ML

Core ML conversion of Llama 3.2 1B (SpinQuant-style variant), converted from meta-llama/Llama-3.2-1B-Instruct on macOS, with optional 8-bit linear quantization and 4-bit palettization. Quantization is weight-only; activations stay in FP16.
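
The two quantized variants correspond to the standard coremltools weight-compression configs. As a sketch (the exact conversion script for this repo is not published, so the configs below are illustrative of the coremltools 8 optimize API, not a copy of what was run):

```python
import coremltools.optimize.coreml as cto

# INT8 linear symmetric weight quantization (weight-only)
int8_cfg = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)

# INT4 k-means palettization (weight-only PTQ)
int4_cfg = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(mode="kmeans", nbits=4)
)

# These configs would be applied to a converted model with
# cto.linear_quantize_weights(mlmodel, int8_cfg) or
# cto.palettize_weights(mlmodel, int4_cfg).
```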

Files

Path   Description
fp16/  Base model (FP16 compute)
int8/  INT8 linear symmetric (weight-only)
int4/  INT4 k-means palettized (PTQ, weight-only)

Requirements

  • macOS 15+ / Xcode 15+
  • coremltools 8.0+

If conversion fails with "No space left on device", set TMPDIR to a path with free space and re-run conversion.
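
For example (the directory below is just an illustration; any path on a volume with enough free space works):

```shell
# Redirect temporary files to a volume with free space, then re-run
# the conversion from this shell.
export TMPDIR="$HOME/coreml_tmp"   # hypothetical path with free space
mkdir -p "$TMPDIR"
```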

Usage (Instruct-style chat)

These Core ML models use the same chat format as the base Llama-3.2-1B-Instruct. You must use that model's tokenizer and apply_chat_template so prompt and decoding stay aligned.

  1. Download this repo (e.g. fp16/, int8/, or int4/) and the tokenizer from the base model:

    • huggingface_hub.snapshot_download(repo_id="...", ...) for this Core ML repo.
    • Load the tokenizer: tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct") (accept the license and log in if required).
  2. Build the prompt with the standard messages format and chat template (same as the official Instruct model):

    messages = [{"role": "user", "content": "Your instruction or question here."}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
  3. Tokenize and pad to the sequence length the model was converted with (e.g. 512). Use the same pad_token_id as the tokenizer (often 0):

    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False, truncation=True, max_length=seq_len - max_new_tokens)
    input_ids = enc["input_ids"].numpy().astype(np.int32)
    # Right-pad to seq_len with pad_token_id; attention_mask is 1 for real tokens, 0 for padding
    n = input_ids.shape[1]
    input_ids = np.concatenate([input_ids, np.full((1, seq_len - n), pad_token_id, dtype=np.int32)], axis=1)
    attention_mask = np.concatenate([np.ones((1, n), dtype=np.int32), np.zeros((1, seq_len - n), dtype=np.int32)], axis=1)
    
  4. Load the Core ML model from the folder you downloaded (fp16, int8, or int4). That folder has the same structure as an .mlpackage (Manifest.json + Data/):

    import coremltools as ct
    mlmodel = ct.models.MLModel("/path/to/downloaded_repo/fp16")  # or int8, int4
    
  5. Run inference: call mlmodel.predict({"input_ids": input_ids, "attention_mask": attention_mask}) to get logits. For autoregressive decoding, take the argmax at the last non-padded position of the logits, append that token to the sequence, and repeat with a sliding window of length seq_len until EOS or max_new_tokens is reached. Decode the generated token IDs with tokenizer.decode(..., skip_special_tokens=True).
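
Putting steps 3–5 together, the decode loop looks roughly like this. Note that predict_fn below is a stand-in for a wrapper around mlmodel.predict that returns the logits array, so the loop logic is self-contained; function and argument names are illustrative, not part of the model's API:

```python
import numpy as np

def greedy_generate(predict_fn, prompt_ids, seq_len, pad_token_id,
                    eos_token_id, max_new_tokens):
    """Greedy decoding against a fixed-length model.

    predict_fn(input_ids, attention_mask) -> logits of shape (1, seq_len, vocab).
    Returns only the newly generated token IDs.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-seq_len:]  # sliding window, at most seq_len tokens
        n = len(window)
        # Right-pad the window and build the matching attention mask
        input_ids = np.full((1, seq_len), pad_token_id, dtype=np.int32)
        input_ids[0, :n] = window
        attention_mask = np.zeros((1, seq_len), dtype=np.int32)
        attention_mask[0, :n] = 1
        logits = predict_fn(input_ids, attention_mask)
        next_id = int(np.argmax(logits[0, n - 1]))  # last non-padded position
        ids.append(next_id)
        if next_id == eos_token_id:
            break
    return ids[len(prompt_ids):]
```

With the real model, predict_fn would call mlmodel.predict({"input_ids": ..., "attention_mask": ...}) and return its logits output; the generated IDs are then decoded with tokenizer.decode(..., skip_special_tokens=True).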

Important: sequence length and pad_token_id must match what was used at conversion time (typically seq_len=512). Using a different length or tokenizer will produce wrong results.

License

Llama 3.2 Community License. See Meta Llama.

Converted: 2026-03-18
