Instructions to use nvidia/NV-Embed-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use nvidia/NV-Embed-v1 with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nvidia/NV-Embed-v1", trust_remote_code=True)

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [3, 3]
```
- Notebooks
- Google Colab
- Kaggle
How much VRAM is needed to run this model? Like for the bare minimum length etc?
I have three GPUs: an NVIDIA 4070 Ti (12 GB), an NVIDIA 4060 Ti (16 GB), and an NVIDIA Tesla T4 (16 GB), and I can't get the model to split across them using this:
```python
from transformers import AutoModel
from torch.nn import DataParallel

# trust_remote_code=True is required: NV-Embed ships custom modeling code
embedding_model = AutoModel.from_pretrained("nvidia/NV-Embed-v1", trust_remote_code=True)
for module_key, module in embedding_model._modules.items():
    embedding_model._modules[module_key] = DataParallel(module)
```
and changing the batch size using this:

```python
# get the embeddings with DataLoader (splitting the dataset into mini-batches)
batch_size = 2
query_embeddings = model._do_encode(queries, batch_size=batch_size, instruction=query_prefix, max_length=max_length)
passage_embeddings = model._do_encode(passages, batch_size=batch_size, instruction=passage_prefix, max_length=max_length)
```
and setting the max embedding length to 512, I still get OOM on both the 4070 Ti and the 4060 Ti. So how much VRAM does this model need, and what can I do to run it on my system?
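For scale, a back-of-the-envelope estimate of the weight memory alone (assuming roughly 7.8B parameters for the Mistral-7B-based backbone; the exact count may differ slightly, and activations add more on top):

```python
# Rough VRAM needed just to hold the weights, assuming ~7.8e9 parameters
# (NV-Embed-v1 is built on a Mistral-7B backbone; exact count may differ).
# Activations, the KV cache, and CUDA overhead come on top of this.
def weight_vram_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1024**3

print(f"fp32: {weight_vram_gb(7.8e9, 4):.1f} GB")  # ~29 GB
print(f"fp16: {weight_vram_gb(7.8e9, 2):.1f} GB")  # ~14.5 GB
```

Either figure exceeds any single card in the setup above (12 GB / 16 GB / 16 GB), which is consistent with the OOM on one device.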
If you check, you will see it only loaded on one GPU, not all three; that's why you get the OOM error. DataParallel replicates the full model on each GPU rather than splitting it, so the whole model still has to fit on a single device.
How could we load it across multiple GPUs, or is that not possible with this model?
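One common way to split a model of this size across several GPUs is Accelerate's `device_map="auto"`, which shards the layers across all visible devices instead of replicating them. A minimal sketch (not tested against NV-Embed-v1 specifically; the dtype choice is an assumption):

```python
def load_sharded(model_id: str = "nvidia/NV-Embed-v1"):
    # Imports kept inside the function so the sketch stays self-contained.
    import torch
    from transformers import AutoModel

    # device_map="auto" asks Accelerate to place layers across every visible
    # GPU (spilling to CPU RAM if they still don't fit), instead of loading
    # the whole model onto one device. Requires `pip install accelerate`.
    return AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,     # NV-Embed ships custom modeling code
        torch_dtype=torch.float16,  # fp16 roughly halves the fp32 footprint
        device_map="auto",
    )
```

In fp16 the weights would then be spread over the combined ~44 GB of the three cards rather than needing to fit on one of them.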