ncorder/llama-embed-nemotron-8b-mlx-2bit

MLX conversion of nvidia/llama-embed-nemotron-8b — the #1 embedding model under 20B parameters on the MTEB leaderboard, outperforming models 3x its size.

  • Parameters: 7.5B
  • Quantization: 2-bit affine (group_size=64)
  • Model size: 2.4 GB
  • Architecture: Llama-3.1-8B with bidirectional attention
  • Embedding dimension: 4096
  • Max sequence length: 32,768 tokens
  • Converted with: mlx-embeddings
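
As a rough illustration of what "2-bit affine (group_size=64)" means, the sketch below quantizes weights in groups of 64, storing a 2-bit code per weight plus one scale and one bias per group. It is a simplified NumPy version, not the MLX kernel. The function names, and the assumption that each scale and bias is stored as a 16-bit value, are illustrative. Under that assumption the cost is about 2.5 bits per weight, which for 7.5B parameters lands close to the 2.4 GB listed above.

```python
import numpy as np

def affine_quantize(w, bits=2, group_size=64):
    """Quantize a 1-D float array group by group (illustrative, not MLX's kernel)."""
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)      # per-group bias (minimum value)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2**bits - 1                        # 3 steps, i.e. 4 codes, for 2-bit
    scales = (hi - lo) / levels
    scales = np.where(scales == 0, 1.0, scales) # guard against constant groups
    codes = np.clip(np.round((groups - lo) / scales), 0, levels).astype(np.uint8)
    return codes, scales, lo

def affine_dequantize(codes, scales, biases):
    """Reconstruct approximate weights: w_hat = code * scale + bias."""
    return codes * scales + biases

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
codes, scales, biases = affine_quantize(w)
w_hat = affine_dequantize(codes, scales, biases).reshape(-1)
max_err = float(np.abs(w - w_hat).max())

# Storage estimate, ASSUMING one 16-bit scale and one 16-bit bias per group of 64.
bits_per_weight = 2 + (16 + 16) / 64
approx_gb = 7.5e9 * bits_per_weight / 8 / 1e9
print(f"max reconstruction error: {max_err:.4f}")
print(f"approx. size: {approx_gb:.2f} GB at {bits_per_weight} bits/weight")
```

With only four codes per group, each weight can be off by up to half a quantization step, which is consistent with the quality loss noted for this variant.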

Note: 2-bit quantization measurably degrades quality. The irrelevant-document score rises to 0.2873 (reference: 0.0581), which shrinks the relevant/irrelevant margin from 0.3190 to 0.1926. This variant suits coarse ranking or memory-constrained environments; the 4-bit variant is recommended for most use cases.

All variants

Variant   Size     Relevant ↑   Irrelevant ↓   Margin ↑
fp16      15 GB    0.3763       0.0579         0.3184
8-bit     7.5 GB   0.3780       0.0583         0.3197
4-bit     4.0 GB   0.3826       0.0783         0.3043
2-bit     2.4 GB   0.4799       0.2873         0.1926

Reference (original bf16 PyTorch): relevant=0.3771, irrelevant=0.0581, margin=0.3190
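
The Margin column is the relevant score minus the irrelevant score. The snippet below recomputes it from the figures above and reports how much of the reference margin each variant retains. The dictionary layout and variable names are illustrative only.

```python
# (relevant, irrelevant) similarity scores copied from the table above
scores = {
    "fp16":  (0.3763, 0.0579),
    "8-bit": (0.3780, 0.0583),
    "4-bit": (0.3826, 0.0783),
    "2-bit": (0.4799, 0.2873),
}
reference_margin = 0.3771 - 0.0581  # original bf16 PyTorch model

# Margin: how far apart the relevant and irrelevant documents score
margins = {name: rel - irr for name, (rel, irr) in scores.items()}
# Fraction of the reference margin that survives quantization
retained = {name: m / reference_margin for name, m in margins.items()}

for name in scores:
    print(f"{name:>5}: margin={margins[name]:.4f}  retained={retained[name]:.1%}")
```

By this measure the 4-bit variant keeps roughly 95% of the reference margin, while the 2-bit variant keeps roughly 60%.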

Usage

Install the package:

pip install mlx-embeddings

Then, in Python:

from mlx_embeddings.utils import load
import mlx.core as mx

model, tokenizer = load("ncorder/llama-embed-nemotron-8b-mlx-2bit")

query = "Instruct: Given a question, retrieve passages that answer the question\nQuery: How do neural networks learn patterns from examples?"
document = "Deep learning models adjust their weights through backpropagation."

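def embed(text):
    # Tokenize to NumPy arrays, then convert them to MLX arrays for the model
    inputs = tokenizer(text, return_tensors="np", padding=True)
    out = model(
        mx.array(inputs["input_ids"]),
        mx.array(inputs["attention_mask"])
    )
    # Sentence embedding for the input, shape (1, 4096)
    return out.text_embeds

q_emb = embed(query)
d_emb = embed(document)
# Dot product of the two embeddings; a higher score means a closer match
score = (q_emb @ d_emb.T).item()
print(f"Similarity: {score:.4f}")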
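
To score one query against several documents, the same comparison can be batched. The sketch below uses small stand-in vectors so that it runs without downloading the model. In practice the inputs would be the outputs of embed() above, converted with np.array(...). The helper name rank_documents is hypothetical and not part of mlx-embeddings.

```python
import numpy as np

def rank_documents(query_emb, doc_embs):
    """Return (document indices sorted best-first, cosine scores per document)."""
    # Normalize explicitly so the dot product is a cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)  # highest score first
    return order, scores

# Stand-in 3-dimensional vectors; real embeddings from this model have 4096 dimensions
query_emb = np.array([1.0, 0.0, 0.0])
doc_embs = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.1, 0.0],   # nearly parallel to the query
    [1.0, 1.0, 0.0],   # 45 degrees from the query
])
order, scores = rank_documents(query_emb, doc_embs)
for idx in order:
    print(f"doc {idx}: {scores[idx]:.4f}")
```

Given the inflated irrelevant-document scores of the 2-bit variant, the ordering is likely to be more trustworthy than the absolute score values.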

Query formatting

This model is instruction-aware. For retrieval, prefix queries with:

Instruct: {task_instruction}
Query: {your_query}

Documents are embedded without any prefix.
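
A small helper can keep the prefix consistent across queries. This is a minimal sketch: format_query is a hypothetical name, and the newline between the two parts follows the query string in the usage example.

```python
def format_query(task_instruction: str, query: str) -> str:
    """Build an instruction-prefixed query string; documents are passed through unchanged."""
    return f"Instruct: {task_instruction}\nQuery: {query}"

formatted = format_query(
    "Given a question, retrieve passages that answer the question",
    "How do neural networks learn patterns from examples?",
)
print(formatted)
```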

License

This model inherits the NVIDIA license of the original model, which permits research and non-commercial use only. It is also subject to the Llama 3.1 Community License.
