# Llama 3.2 1B QLoRA → Core ML
Core ML conversion of Llama 3.2 1B (QLoRA-style variant). Converted from meta-llama/Llama-3.2-1B-Instruct on macOS, with optional 8-bit linear quantization and 4-bit palettization. Quantization is weight-only; activations stay FP16.
## Files
| Path | Description |
|---|---|
| `fp16/` | Base model (FP16 compute) |
| `int8/` | INT8 linear symmetric quantization (weight-only) |
| `int4/` | INT4 k-means palettized (PTQ, weight-only) |
## Requirements
- macOS 15+ / Xcode 15+
- coremltools 8.0+
If conversion fails with `No space left on device`, set `TMPDIR` to a path with free space and re-run the conversion.
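For example (the scratch path below is an assumption; use any volume with enough free space):

```shell
# Create a scratch directory on a volume with free space (example path)
mkdir -p "$HOME/coreml_tmp"
# Point temporary-file allocation there for this shell session
export TMPDIR="$HOME/coreml_tmp"
# ...then re-run the conversion command in the same shell
```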
## Usage (Instruct-style chat)
These Core ML models use the same chat format as the base Llama-3.2-1B-Instruct. You must use that model's tokenizer and `apply_chat_template` so that the prompt format and decoding stay aligned.
1. Download this repo (e.g. `fp16/`, `int8/`, or `int4/`) and the tokenizer from the base model: use `huggingface_hub.snapshot_download(repo_id="...", ...)` for this Core ML repo.

2. Load the tokenizer: `tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")` (accept the license and log in if required).

3. Build the prompt with the standard messages format and chat template (same as the official Instruct model):

   ```python
   messages = [{"role": "user", "content": "Your instruction or question here."}]
   prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
   ```

4. Tokenize and pad to the sequence length the model was converted with (e.g. 512). Use the same `pad_token_id` as the tokenizer (often 0):

   ```python
   enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False,
                   truncation=True, max_length=seq_len - max_new_tokens)
   input_ids = enc["input_ids"].numpy().astype(np.int32)
   # Pad to seq_len with pad_token_id
   # attention_mask: shape (1, seq_len), 1 for real tokens, 0 for padding
   ```

5. Load the Core ML model from the folder you downloaded (`fp16`, `int8`, or `int4`). That folder has the same structure as an `.mlpackage` (Manifest.json + Data/):

   ```python
   import coremltools as ct

   mlmodel = ct.models.MLModel("/path/to/downloaded_repo/fp16")  # or int8, int4
   ```

6. Run inference: call `mlmodel.predict({"input_ids": input_ids, "attention_mask": attention_mask})` to get `logits`. For autoregressive decoding, take `argmax` at the last real-token position of `logits`, append that token to the sequence, and repeat with a sliding window of length `seq_len` until EOS or `max_new_tokens`. Decode the generated token IDs with `tokenizer.decode(..., skip_special_tokens=True)`.
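The pad-to-`seq_len` step above can be sketched as a small helper (the function name and defaults are illustrative, not part of this repo):

```python
import numpy as np

def pad_to_seq_len(token_ids, seq_len, pad_token_id=0):
    """Right-pad token IDs to seq_len and build the matching attention mask.

    Returns (input_ids, attention_mask), both of shape (1, seq_len), int32,
    matching the input layout the converted Core ML model expects.
    """
    ids = list(token_ids)[:seq_len]
    n = len(ids)
    input_ids = np.full((1, seq_len), pad_token_id, dtype=np.int32)
    input_ids[0, :n] = ids
    attention_mask = np.zeros((1, seq_len), dtype=np.int32)
    attention_mask[0, :n] = 1  # 1 for real tokens, 0 for padding
    return input_ids, attention_mask
```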
**Important:** sequence length and `pad_token_id` must match what was used at conversion time (typically `seq_len=512`). Using a different length or tokenizer will produce wrong results.
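The sliding-window greedy decoding described above could be sketched like this, with `predict_fn` standing in as a hypothetical wrapper around `mlmodel.predict` (returning the logits array) rather than an API of this repo:

```python
import numpy as np

def greedy_decode(predict_fn, prompt_ids, seq_len, eos_token_id,
                  max_new_tokens, pad_token_id=0):
    """Greedy decoding over a Core ML-style predict function.

    predict_fn takes (input_ids, attention_mask), both (1, seq_len) int32,
    and returns logits of shape (1, seq_len, vocab_size).
    """
    ids = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        # Keep only the most recent seq_len tokens (sliding window)
        window = ids[-seq_len:]
        n = len(window)
        input_ids = np.full((1, seq_len), pad_token_id, dtype=np.int32)
        input_ids[0, :n] = window
        mask = np.zeros((1, seq_len), dtype=np.int32)
        mask[0, :n] = 1
        logits = predict_fn(input_ids, mask)
        # Argmax at the last real-token position
        next_id = int(np.argmax(logits[0, n - 1]))
        if next_id == eos_token_id:
            break
        ids.append(next_id)
        generated.append(next_id)
    return generated
```

Decode the returned IDs with `tokenizer.decode(generated, skip_special_tokens=True)`.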
## License
Llama 3.2 Community License. See Meta Llama.
Converted: 2026-03-18