ncorder/llama-embed-nemotron-8b-mlx-2bit

MLX conversion of nvidia/llama-embed-nemotron-8b — the #1 embedding model under 20B parameters on the MTEB leaderboard, outperforming models 3x its size.

Parameters: 7.5B
Quantization: 2-bit affine (group_size=64)
Model size: 2.4 GB
Architecture: Llama-3.1-8B with bidirectional attention
Embedding dimension: 4096
Max sequence length: 32,768 tokens
Converted with: mlx-embeddings
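
The listed size can be sanity-checked with a rough estimate. The sketch below assumes that each group of 64 weights carries one 16-bit scale and one 16-bit bias next to its packed values, and that every parameter is quantized. Neither assumption is stated in this card, and the helper names are illustrative only, so the results are ballpark figures rather than exact file sizes.

```python
# Back-of-envelope size estimate for group-wise affine quantization.
# Assumption (not from this card): each group stores one 16-bit scale and
# one 16-bit bias alongside its packed weights, and all weights are quantized.

def bits_per_weight(bits: int, group_size: int = 64, meta_bits: int = 16) -> float:
    """Effective storage cost per weight, including per-group scale and bias."""
    return bits + 2 * meta_bits / group_size

def estimated_size_gb(n_params: float, bits: int, group_size: int = 64) -> float:
    """Approximate size in GB (10^9 bytes) under the assumptions above."""
    return n_params * bits_per_weight(bits, group_size) / 8 / 1e9

N_PARAMS = 7.5e9  # parameter count listed above

for bits in (2, 4, 8):
    print(f"{bits}-bit: {bits_per_weight(bits):.1f} bits/weight, "
          f"~{estimated_size_gb(N_PARAMS, bits):.2f} GB")
```

Under these assumptions the 2-bit estimate lands close to the listed 2.4 GB. Real checkpoints can differ because some tensors may be stored unquantized.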

Note: 2-bit quantization measurably degrades quality. In the comparison below, the irrelevant-document score rises from 0.0581 (bf16 reference) to 0.2873, and the margin between relevant and irrelevant scores narrows from 0.3190 to 0.1926. This variant suits coarse ranking or memory-constrained environments; the 4-bit variant is recommended for most use cases.
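
To give a feel for why 2 bits is so much lossier than 4 or 8, here is a minimal sketch of textbook min/max affine quantization applied per group of 64 values. It is not necessarily the exact scheme MLX uses, the function names are made up for illustration, and the weights are synthetic Gaussian samples rather than anything taken from the model.

```python
import random

def quantize_group(values, bits):
    """Min/max affine quantization of one group: returns (codes, scale, bias)."""
    levels = (1 << bits) - 1              # e.g. 3 for 2-bit, so codes are 0..3
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + bias for c in codes]

def mean_abs_error(weights, bits, group_size=64):
    """Average reconstruction error after a quantize/dequantize round trip."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, bias = quantize_group(group, bits)
        restored = dequantize_group(codes, scale, bias)
        total += sum(abs(a - b) for a, b in zip(group, restored))
    return total / len(weights)

# Synthetic weights (an assumption for illustration, not real model weights).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64 * 16)]

errors = {bits: mean_abs_error(weights, bits) for bits in (2, 4, 8)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

With only four levels per group, each 2-bit weight can be off by a sizeable fraction of the group's value range, which is consistent with the degradation described in the note.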

All variants

Variant	Size	Relevant ↑	Irrelevant ↓	Margin ↑
fp16	15 GB	0.3763	0.0579	0.3184
8-bit	7.5 GB	0.3780	0.0583	0.3197
4-bit	4.0 GB	0.3826	0.0783	0.3043
2-bit	2.4 GB	0.4799	0.2873	0.1926

Reference (original bf16 PyTorch): relevant=0.3771, irrelevant=0.0581, margin=0.3190
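
The Margin column equals the relevant score minus the irrelevant score, which can be checked against every row. The short script below recomputes the margins from the numbers above and expresses each as a share of the reference margin. The dictionary layout and variable names are just one way to organize the comparison.

```python
# Scores copied from the table above: (relevant, irrelevant).
scores = {
    "reference": (0.3771, 0.0581),
    "fp16": (0.3763, 0.0579),
    "8-bit": (0.3780, 0.0583),
    "4-bit": (0.3826, 0.0783),
    "2-bit": (0.4799, 0.2873),
}

# Margin = relevant - irrelevant, rounded to the table's four decimals.
margins = {name: round(rel - irr, 4) for name, (rel, irr) in scores.items()}
ref_margin = margins["reference"]

for name, margin in margins.items():
    retained = margin / ref_margin
    print(f"{name:>9}: margin={margin:.4f} ({retained:.0%} of reference)")
```

By this measure the 2-bit variant keeps roughly 60% of the reference margin, while the 4-bit variant keeps about 95%.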

Usage

pip install mlx-embeddings

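from mlx_embeddings.utils import load
import mlx.core as mx

# Download (if needed) and load the quantized model and its tokenizer.
model, tokenizer = load("ncorder/llama-embed-nemotron-8b-mlx-2bit")

# Queries carry an instruction prefix; documents are embedded as-is.
query = "Instruct: Given a question, retrieve passages that answer the question\nQuery: How do neural networks learn patterns from examples?"
document = "Deep learning models adjust their weights through backpropagation."

def embed(text):
    """Return the model's text embedding for a single string."""
    # Tokenize to NumPy arrays, then convert them to MLX arrays for the model.
    inputs = tokenizer(text, return_tensors="np", padding=True)
    out = model(
        mx.array(inputs["input_ids"]),
        mx.array(inputs["attention_mask"]),
    )
    return out.text_embeds

q_emb = embed(query)
d_emb = embed(document)
# Dot product of the two embeddings, used here as the similarity score.
score = (q_emb @ d_emb.T).item()
print(f"Similarity: {score:.4f}")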
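
For the coarse-ranking use case, the same similarity score can be used to order several documents against one query. The sketch below is library-free and runs on toy 3-dimensional vectors that stand in for real 4096-dimensional embeddings. The `cosine` and `rank` helpers are hypothetical names, not part of mlx-embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors (illustrative only); real ones would come from the model.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.9, 0.1, 0.0],   # close to the query
    [0.5, 0.5, 0.5],   # partial overlap
]

ranking = rank(query_vec, doc_vecs)
for idx, score in ranking:
    print(f"doc {idx}: {score:.4f}")
```

Because the 2-bit variant inflates scores for irrelevant documents, the ordering is likely to be more trustworthy than the absolute score values.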

Query formatting

This model is instruction-aware. For retrieval, prefix queries with:

Instruct: {task_instruction}
Query: {your_query}

Documents are embedded without any prefix.
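
A small helper keeps the prefix consistent across queries. The function name `format_query` and the `DEFAULT_TASK` constant are illustrative choices, not part of any library. The instruction text is the one used in the usage example above.

```python
def format_query(task_instruction: str, query: str) -> str:
    """Build an instruction-prefixed query string for retrieval."""
    return f"Instruct: {task_instruction}\nQuery: {query}"

# Instruction text taken from the usage example in this card.
DEFAULT_TASK = "Given a question, retrieve passages that answer the question"

formatted = format_query(
    DEFAULT_TASK,
    "How do neural networks learn patterns from examples?",
)
print(formatted)
```

Only queries go through this helper; documents are passed to the model unchanged.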

License

This model inherits the NVIDIA license of the original model, which permits research and non-commercial use only. It is also subject to the Llama 3.1 Community License.

Credits

Original model by NVIDIA: nvidia/llama-embed-nemotron-8b
Conversion via mlx-embeddings by Prince Canuma
Technical report

Paper

Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks (arXiv:2511.07025, published Nov 10, 2025)