mlx-community/Nemotron-Mini-4B-Instruct-bf16-mlx

This model was converted from nvidia/Nemotron-Mini-4B-Instruct to MLX format for use on Apple Silicon.

Quantization: None (full bfloat16 weights)
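As a rough guide to what full bfloat16 means in practice: each bf16 weight occupies 16 bits (2 bytes), so weight memory is roughly parameter count × bits per weight ÷ 8. The sketch below assumes a nominal 4 billion parameters, taken from the "4B" in the model name (the exact count differs somewhat). It ignores the KV cache, activations, and the per-group scales that quantized formats add, so treat the numbers as lower bounds.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate raw weight storage in gigabytes (1 GB = 1e9 bytes).

    Only the weights are counted; KV cache, activations and
    quantization scales/biases are deliberately ignored.
    """
    bytes_total = num_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


NOMINAL_PARAMS = 4e9  # assumption: nominal "4B" from the model name

bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)  # this model (bfloat16)
int4_gb = weight_memory_gb(NOMINAL_PARAMS, 4)   # idealized 4-bit variant, no scales

print(f"bf16: ~{bf16_gb:.1f} GB, 4-bit: ~{int4_gb:.1f} GB")
```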

Usage

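from mlx_lm import load, generate

# Load the bf16 weights and tokenizer (downloaded from the Hub on first use)
model, tokenizer = load("mlx-community/Nemotron-Mini-4B-Instruct-bf16-mlx")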

# Nemotron-Mini prompt format: a System turn, a User turn, then an open Assistant turn
prompt = (
    "<extra_id_0>System\n"
    "You are a helpful, honest AI assistant.\n\n"
    "<extra_id_1>User\n"
    "Who are you?\n"
    "<extra_id_1>Assistant\n"
)

print(generate(model, tokenizer, prompt, max_tokens=256))
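To change the system prompt or the question without editing string literals, a small helper can assemble the same single-turn layout used in the usage example. `build_prompt` is a hypothetical convenience function, not part of mlx_lm; it covers only the single-turn format shown in this card and separates segments with real newline characters.

```python
DEFAULT_SYSTEM = "You are a helpful, honest AI assistant."


def build_prompt(user_message: str, system_prompt: str = DEFAULT_SYSTEM) -> str:
    """Hypothetical helper: build a single-turn Nemotron-Mini prompt.

    Layout: a System turn followed by a blank line, a User turn,
    then an open Assistant turn for the model to complete.
    """
    return (
        f"<extra_id_0>System\n{system_prompt}\n\n"
        f"<extra_id_1>User\n{user_message}\n"
        "<extra_id_1>Assistant\n"
    )


prompt = build_prompt("Who are you?")
print(prompt)
```

The resulting string can be passed to `generate(model, tokenizer, prompt, max_tokens=256)` as before. If the bundled tokenizer ships a chat template, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is the usual alternative; compare its output with this layout before relying on it.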

Benchmark (Apple Silicon, single prompt, 23 tokens)

Variant              tok/s
bf16 (this model)     2.47
4-bit default         4.37
mxfp4-q4              4.56
nvfp4-q4              9.69
mixed-3-6             9.72
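The card does not state exactly how these figures were measured. One way to take comparable numbers on your own machine is to time an iterator that yields one item per generated token and divide the count by wall-clock time. `tokens_per_second` and `fake_stream` are hypothetical names. With mlx_lm you could pass `stream_generate(model, tokenizer, prompt, max_tokens=256)` as the stream, on the assumption that it yields once per generated token; verify this against your installed version. Results also depend on hardware, prompt length, and whether a warm-up call is excluded.

```python
import time
from typing import Iterable


def tokens_per_second(token_stream: Iterable[object]) -> float:
    """Count items from token_stream and divide by elapsed wall-clock time.

    Assumes each yielded item corresponds to exactly one generated token.
    Timing starts before the first item is requested, so any prompt
    processing done inside the stream is included in the result.
    """
    start = time.perf_counter()
    count = sum(1 for _ in token_stream)  # drain the stream, counting tokens
    elapsed = time.perf_counter() - start
    if count == 0 or elapsed <= 0.0:
        return 0.0
    return count / elapsed


def fake_stream(n: int, delay_s: float):
    """Stand-in generator for testing: yields n dummy tokens, sleeping before each."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


rate = tokens_per_second(fake_stream(20, 0.01))
print(f"{rate:.1f} tok/s")
```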

Original model

See nvidia/Nemotron-Mini-4B-Instruct for the original model card, license, and usage terms.


