Llama 3.2 1B SpinQuant — Core ML

Core ML conversion of Llama 3.2 1B (SpinQuant-style variant), converted from meta-llama/Llama-3.2-1B-Instruct on macOS, with optional 8-bit linear quantization and 4-bit palettization. Quantization is weight-only; activations stay in FP16.
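
The two quantized variants correspond to the standard coremltools weight-compression configs. As a sketch (the exact conversion script for this repo is not published, so the configs below are illustrative of the coremltools 8 optimize API, not a copy of what was run):

```python
import coremltools.optimize.coreml as cto

# INT8 linear symmetric weight quantization (weight-only)
int8_cfg = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)

# INT4 k-means palettization (weight-only PTQ)
int4_cfg = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(mode="kmeans", nbits=4)
)

# These configs would be applied to a converted model with
# cto.linear_quantize_weights(mlmodel, int8_cfg) or
# cto.palettize_weights(mlmodel, int4_cfg).
```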

Files

Path   Description
fp16/  Base model (FP16 compute)
int8/  INT8 linear symmetric (weight-only)
int4/  INT4 k-means palettized (PTQ, weight-only)

Requirements

  • macOS 15+ / Xcode 15+
  • coremltools 8.0+

If conversion fails with "No space left on device", set TMPDIR to a path with free space and re-run conversion.
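
For example (the directory below is just an illustration; any path on a volume with enough free space works):

```shell
# Redirect temporary files to a volume with free space, then re-run
# the conversion from this shell.
export TMPDIR="$HOME/coreml_tmp"   # hypothetical path with free space
mkdir -p "$TMPDIR"
```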

Usage (Instruct-style chat)

These Core ML models use the same chat format as the base Llama-3.2-1B-Instruct. You must use that model's tokenizer and apply_chat_template so prompt and decoding stay aligned.

  1. Download this repo (e.g. fp16/, int8/, or int4/) and the tokenizer from the base model:

    • huggingface_hub.snapshot_download(repo_id="...", ...) for this Core ML repo.
    • Load the tokenizer: tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct") (accept the license and log in if required).
  2. Build the prompt with the standard messages format and chat template (same as the official Instruct model):

    messages = [{"role": "user", "content": "Your instruction or question here."}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
  3. Tokenize and pad to the sequence length the model was converted with (e.g. 512). Use the same pad_token_id as the tokenizer (often 0):

    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False, truncation=True, max_length=seq_len - max_new_tokens)
    input_ids = enc["input_ids"].numpy().astype(np.int32)
    # Right-pad to seq_len with pad_token_id; attention_mask is 1 for real tokens, 0 for padding
    n = input_ids.shape[1]
    input_ids = np.concatenate([input_ids, np.full((1, seq_len - n), pad_token_id, dtype=np.int32)], axis=1)
    attention_mask = np.concatenate([np.ones((1, n), dtype=np.int32), np.zeros((1, seq_len - n), dtype=np.int32)], axis=1)
    
  4. Load the Core ML model from the folder you downloaded (fp16, int8, or int4). That folder has the same structure as an .mlpackage (Manifest.json + Data/):

    import coremltools as ct
    mlmodel = ct.models.MLModel("/path/to/downloaded_repo/fp16")  # or int8, int4
    
  5. Run inference: call mlmodel.predict({"input_ids": input_ids, "attention_mask": attention_mask}) to get logits. For autoregressive decoding, take the argmax at the last non-padded position of the logits, append that token to the sequence, and repeat with a sliding window of length seq_len until EOS or max_new_tokens is reached. Decode the generated token IDs with tokenizer.decode(..., skip_special_tokens=True).
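
Putting steps 3–5 together, the decode loop looks roughly like this. Note that predict_fn below is a stand-in for a wrapper around mlmodel.predict that returns the logits array, so the loop logic is self-contained; function and argument names are illustrative, not part of the model's API:

```python
import numpy as np

def greedy_generate(predict_fn, prompt_ids, seq_len, pad_token_id,
                    eos_token_id, max_new_tokens):
    """Greedy decoding against a fixed-length model.

    predict_fn(input_ids, attention_mask) -> logits of shape (1, seq_len, vocab).
    Returns only the newly generated token IDs.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-seq_len:]  # sliding window, at most seq_len tokens
        n = len(window)
        # Right-pad the window and build the matching attention mask
        input_ids = np.full((1, seq_len), pad_token_id, dtype=np.int32)
        input_ids[0, :n] = window
        attention_mask = np.zeros((1, seq_len), dtype=np.int32)
        attention_mask[0, :n] = 1
        logits = predict_fn(input_ids, attention_mask)
        next_id = int(np.argmax(logits[0, n - 1]))  # last non-padded position
        ids.append(next_id)
        if next_id == eos_token_id:
            break
    return ids[len(prompt_ids):]
```

With the real model, predict_fn would call mlmodel.predict({"input_ids": ..., "attention_mask": ...}) and return its logits output; the generated IDs are then decoded with tokenizer.decode(..., skip_special_tokens=True).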

Important: sequence length and pad_token_id must match what was used at conversion time (typically seq_len=512). Using a different length or tokenizer will produce wrong results.

License

Llama 3.2 Community License. See Meta Llama.

Converted: 2026-03-18
