---
license: apache-2.0
base_model: meta-llama/Llama-3.2-3B
library_name: mlx
language:
- en
tags:
- quantllm
- mlx
- mlx-lm
- apple-silicon
- 4bit
- transformers
---
|
|
|
|
|
# Llama-3.2-3B-4bit-mlx |
|
|
   |
|
|
|
|
|
|
|
|
## Description |
|
|
|
|
|
This is **meta-llama/Llama-3.2-3B** converted to the MLX format and quantized to 4-bit for efficient inference on Apple Silicon (M1/M2/M3/M4) Macs.
|
|
|
|
|
- **Base Model**: [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)
- **Format**: MLX
- **Quantization**: 4bit
- **Created with**: [QuantLLM](https://github.com/codewithdark-git/QuantLLM)
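
This repository was exported with QuantLLM (see *About QuantLLM* below). For comparison, a roughly equivalent conversion can be produced with mlx-lm's own `convert` helper; the sketch below is illustrative only, and the output path and quantization arguments are assumptions rather than the exact settings used for this repo.

```python
# Illustrative conversion sketch using mlx-lm (not the tool used to build this repo).
# The output directory name and quantization arguments are assumptions.
from mlx_lm import convert

convert(
    "meta-llama/Llama-3.2-3B",          # Hugging Face model to convert
    mlx_path="Llama-3.2-3B-4bit-mlx",   # local output directory (assumed name)
    quantize=True,                      # quantize the weights
    q_bits=4,                           # 4-bit weights
)
```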
|
|
|
|
|
|
|
|
## Usage |
|
|
|
|
|
### Generate text with mlx-lm |
|
|
|
|
|
```python
from mlx_lm import load, generate

# Download (if needed) and load the 4-bit MLX weights and tokenizer
model, tokenizer = load("codewithdark/Llama-3.2-3B-4bit-mlx")

prompt = "Write a story about Einstein"

# Llama-3.2-3B is a base model; only apply a chat template if one is defined
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
|
|
|
|
|
### With streaming |
|
|
|
|
|
```python
from mlx_lm import load, stream_generate

model, tokenizer = load("codewithdark/Llama-3.2-3B-4bit-mlx")

prompt = "Explain quantum computing"

# Llama-3.2-3B is a base model; only apply a chat template if one is defined
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

# Recent mlx-lm releases yield GenerationResponse objects; print their text field
for response in stream_generate(model, tokenizer, prompt=prompt, max_tokens=500):
    print(response.text, end="", flush=True)
```
|
|
|
|
|
### Command Line |
|
|
|
|
|
```bash
# Install mlx-lm
pip install mlx-lm

# Generate text
python -m mlx_lm.generate --model codewithdark/Llama-3.2-3B-4bit-mlx --prompt "Hello!"

# Chat mode
python -m mlx_lm.chat --model codewithdark/Llama-3.2-3B-4bit-mlx
```
|
|
|
|
|
## Requirements |
|
|
|
|
|
- Apple Silicon Mac (M1/M2/M3/M4)
- macOS 13.0 or later
- Python 3.10+
- mlx-lm: `pip install mlx-lm`
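
To confirm the environment is set up correctly, a quick check like the sketch below (assuming only that `mlx` is installed alongside `mlx-lm`) should report the Metal GPU as the default device:

```python
# Minimal environment check: MLX should select the Metal GPU on Apple Silicon
import mlx.core as mx

print(mx.default_device())  # expected to report the GPU, e.g. Device(gpu, 0)
```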
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
| Property | Value |
|----------|-------|
| Base Model | [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) |
| Format | MLX |
| Quantization | 4bit |
| License | apache-2.0 |
| Created | 2025-12-19 |
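
As a rough estimate (assuming MLX's default 4-bit, group-size-64 quantization, i.e. about 4.5 bits per weight once per-group scales and biases are included), the model's roughly 3.2B parameters work out to about 3.2e9 × 4.5 / 8 ≈ 1.8 GB of weights, so it should run comfortably within 8 GB of unified memory.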
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## About QuantLLM |
|
|
|
|
|
This model was converted with [QuantLLM](https://github.com/codewithdark-git/QuantLLM), an ultra-fast LLM quantization and export library.
|
|
|
|
|
```python
from quantllm import turbo

# Load and quantize any model
model = turbo("meta-llama/Llama-3.2-3B")

# Export to any format
model.export("mlx", quantization="4bit")
```
|
|
|
|
|
⭐ Star us on [GitHub](https://github.com/codewithdark-git/QuantLLM)! |