ncorder/llama-embed-nemotron-8b-mlx-2bit
MLX conversion of nvidia/llama-embed-nemotron-8b — the #1 embedding model under 20B parameters on the MTEB leaderboard, outperforming models 3x its size.
- Parameters: 7.5B
- Quantization: 2-bit affine (group_size=64)
- Model size: 2.4 GB
- Architecture: Llama-3.1-8B with bidirectional attention
- Embedding dimension: 4096
- Max sequence length: 32,768 tokens
- Converted with: mlx-embeddings
Note: 2-bit quantization measurably degrades quality. In the table below, the irrelevant-document score rises from 0.0581 (bf16 reference) to 0.2873, and the margin between relevant and irrelevant scores shrinks from 0.3190 to 0.1926. This variant is suitable for coarse ranking or memory-constrained environments, but the 4-bit variant is recommended for most use cases.
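To give a rough sense of why so few bits cost accuracy, the sketch below quantizes one group of 64 values with a per-group scale and offset, then restores them. It is a generic affine scheme in plain Python, not the MLX implementation. The names `quantize_group` and `dequantize_group` and the random stand-in weights are assumptions made for illustration.

```python
import random

def quantize_group(values, bits=2):
    """Affine-quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1                 # 3 for 2-bit, so codes are 0..3
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo                  # lo acts as the per-group offset

def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float values."""
    return [c * scale + bias for c in codes]

# Hypothetical stand-in for one group of 64 weights (not real model weights)
random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(64)]

for bits in (2, 4, 8):
    codes, scale, bias = quantize_group(group, bits)
    restored = dequantize_group(codes, scale, bias)
    max_err = max(abs(a - b) for a, b in zip(group, restored))
    print(f"{bits}-bit: {len(set(codes))} distinct codes, max abs error {max_err:.6f}")
```

With 2 bits, every weight in a group collapses onto one of at most four values, which is consistent with the larger score drift reported for this variant.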
All variants
| Variant | Size | Relevant ↑ | Irrelevant ↓ | Margin ↑ |
|---|---|---|---|---|
| fp16 | 15 GB | 0.3763 | 0.0579 | 0.3184 |
| 8-bit | 7.5 GB | 0.3780 | 0.0583 | 0.3197 |
| 4-bit | 4.0 GB | 0.3826 | 0.0783 | 0.3043 |
| 2-bit | 2.4 GB | 0.4799 | 0.2873 | 0.1926 |
Reference (original bf16 PyTorch): relevant=0.3771, irrelevant=0.0581, margin=0.3190
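The Margin column equals the relevant score minus the irrelevant score. The short script below recomputes it from the table values and expresses each variant's margin as a share of the bf16 reference margin. The `margin` helper and the dictionary layout are illustrative choices, not part of any library.

```python
# Scores copied from the table above: (relevant, irrelevant)
variants = {
    "fp16": (0.3763, 0.0579),
    "8-bit": (0.3780, 0.0583),
    "4-bit": (0.3826, 0.0783),
    "2-bit": (0.4799, 0.2873),
}
reference = (0.3771, 0.0581)  # original bf16 PyTorch model

def margin(relevant, irrelevant):
    """Gap between the relevant and irrelevant similarity scores."""
    return round(relevant - irrelevant, 4)

ref_margin = margin(*reference)
for name, (rel, irr) in variants.items():
    m = margin(rel, irr)
    print(f"{name:>5}: margin={m:.4f} ({m / ref_margin:.0%} of reference)")
```

A higher relevant score alone is not an improvement: the 2-bit variant scores highest on the relevant document but also inflates the irrelevant one, so its margin is the smallest of the four.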
Usage
pip install mlx-embeddings
from mlx_embeddings.utils import load
import mlx.core as mx

# Load the quantized model and its tokenizer from the Hub
model, tokenizer = load("ncorder/llama-embed-nemotron-8b-mlx-2bit")

# Queries carry an instruction prefix; documents are embedded as-is
query = "Instruct: Given a question, retrieve passages that answer the question\nQuery: How do neural networks learn patterns from examples?"
document = "Deep learning models adjust their weights through backpropagation."

def embed(text):
    # Tokenize to NumPy arrays, then convert them to MLX arrays for the model
    inputs = tokenizer(text, return_tensors="np", padding=True)
    out = model(
        mx.array(inputs["input_ids"]),
        mx.array(inputs["attention_mask"]),
    )
    return out.text_embeds

q_emb = embed(query)
d_emb = embed(document)

# The dot product of the two embeddings is the similarity score
score = (q_emb @ d_emb.T).item()
print(f"Similarity: {score:.4f}")
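For retrieval over several documents, the same score can be computed per document and sorted. The sketch below shows one way to do that with plain Python lists. `cosine` and `rank_documents` are hypothetical helpers, and the toy vectors stand in for real embeddings, which could be obtained from `embed` above and converted with something like `embed(text)[0].tolist()` (an assumption about the output shape).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors used in place of real 4096-dimensional embeddings
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
]

for index, score in rank_documents(query_vec, doc_vecs):
    print(f"doc {index}: {score:.4f}")
```

Given the reduced margin of the 2-bit variant, it may be safer to treat this ranking as a coarse first pass and rescore the top candidates with a higher-precision variant.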
Query formatting
This model is instruction-aware. For retrieval, prefix queries with:
Instruct: {task_instruction}
Query: {your_query}
Documents are embedded without any prefix.
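The template above can be wrapped in a small helper so that queries are always prefixed the same way. `format_query` is a hypothetical name used for illustration and is not part of mlx-embeddings.

```python
def format_query(task_instruction: str, query: str) -> str:
    """Build the instruction-prefixed query string described above."""
    return f"Instruct: {task_instruction}\nQuery: {query}"

# Task instruction and query taken from the usage example in this card
text = format_query(
    "Given a question, retrieve passages that answer the question",
    "How do neural networks learn patterns from examples?",
)
print(text)
```

Only queries go through this helper; document text is passed to the model unchanged.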
License
This model inherits the NVIDIA license of the original model, which permits research and non-commercial use only. It is also subject to the Llama 3.1 Community License.
Credits
- Original model by NVIDIA: nvidia/llama-embed-nemotron-8b
- Conversion via mlx-embeddings by Prince Canuma
- Technical report