	---
	language:
	- en
	license: other
	license_name: nvidia-open-model-license
	license_link: >-
	  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
	tags:
	- mlx
	- llm
	- nemotron
	- apple-silicon
	base_model: nvidia/Nemotron-Mini-4B-Instruct
	---

	# Nemotron-Mini-4B-Instruct-4bit-mlx

	This model was converted from [nvidia/Nemotron-Mini-4B-Instruct](https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct)
	to [MLX](https://github.com/ml-explore/mlx) format for use on Apple Silicon.

	Quantization: 4-bit affine quantization with MLX defaults (~4.5 bits per weight, counting the per-group scales and biases)
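
The ~4.5 bpw (bits per weight) figure can be reproduced with a little arithmetic. The sketch below assumes the usual MLX affine defaults, which this card does not state explicitly: groups of 64 weights, each group storing one 16-bit scale and one 16-bit bias. The function name and constants are illustrative only.

```python
# Assumed MLX defaults (not stated in this card): 4-bit weights in groups of 64,
# with one 16-bit scale and one 16-bit bias stored per group.
BITS = 4
GROUP_SIZE = 64
SCALE_BITS = 16
BIAS_BITS = 16


def bits_per_weight(bits, group_size, scale_bits, bias_bits):
    """Average storage per weight, including the per-group metadata."""
    overhead = (scale_bits + bias_bits) / group_size
    return bits + overhead


bpw = bits_per_weight(BITS, GROUP_SIZE, SCALE_BITS, BIAS_BITS)
print(bpw)  # 4.5
```

Under these assumptions the per-group metadata adds 32 / 64 = 0.5 bits to every 4-bit weight.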

	## Usage

	```python
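	from mlx_lm import load, generate

	# Download (if needed) and load the quantized weights and the tokenizer.
	model, tokenizer = load("mlx-community/Nemotron-Mini-4B-Instruct-4bit-mlx")

	# Nemotron-Mini prompt template: a system turn, a user turn,
	# then an open assistant turn for the model to complete.
	prompt = (
	    "<extra_id_0>System\n"
	    "You are a helpful, honest AI assistant.\n\n"
	    "<extra_id_1>User\n"
	    "Who are you?\n"
	    "<extra_id_1>Assistant\n"
	)

	# Generate up to 256 new tokens and print the completion.
	print(generate(model, tokenizer, prompt, max_tokens=256))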
	```
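
For more than one question, the prompt template from the usage example can be wrapped in a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, not part of `mlx_lm`, and it only covers the single-turn layout shown above.

```python
DEFAULT_SYSTEM = "You are a helpful, honest AI assistant."


def build_prompt(user_message, system_message=DEFAULT_SYSTEM):
    """Hypothetical helper: format one system turn and one user turn,
    then open the assistant turn, following the template in this card."""
    return (
        f"<extra_id_0>System\n{system_message}\n\n"
        f"<extra_id_1>User\n{user_message}\n"
        "<extra_id_1>Assistant\n"
    )


prompt = build_prompt("Who are you?")
print(prompt)
```

The resulting string can be passed to `generate` in place of the hand-written `prompt`.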

	## Benchmark (Apple Silicon, single prompt, 23 tokens)

	| Variant | tok/s |
	|---|---|
	| bf16 | 2.47 |
	| 4-bit default (this) | 4.37 |
	| mxfp4-q4 | 4.56 |
	| nvfp4-q4 | 9.69 |
	| mixed-3-6 | 9.72 |
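
To read the table as relative speedups, the throughput figures can be divided by the bf16 baseline. The snippet below only does that arithmetic on the numbers reported above. With a single 23-token prompt, the ratios are probably best treated as rough indications.

```python
# Throughput figures copied from the benchmark table (tokens per second).
tok_per_s = {
    "bf16": 2.47,
    "4-bit default": 4.37,
    "mxfp4-q4": 4.56,
    "nvfp4-q4": 9.69,
    "mixed-3-6": 9.72,
}

# Speedup of each variant relative to the bf16 baseline, rounded to 2 decimals.
baseline = tok_per_s["bf16"]
speedup = {name: round(rate / baseline, 2) for name, rate in tok_per_s.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor}x")
```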

	## Original model

	See [nvidia/Nemotron-Mini-4B-Instruct](https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct)
	for the original model card, license, and usage terms.