Instructions for using leeroy-jankins/nomi with libraries and local apps.

Libraries

How to use leeroy-jankins/nomi with llama-cpp-python:

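# !pip install llama-cpp-python

from llama_cpp import Llama

# Nomi is an embedding model, so load it in embedding mode.
llm = Llama.from_pretrained(
    repo_id="leeroy-jankins/nomi",
    filename="nomnom-embed-text-v1.5.Q4_K_M.gguf",
    embedding=True,
)

# Inputs carry a task instruction prefix such as search_document: or search_query:.
output = llm.create_embedding("search_document: Once upon a time,")
print(output["data"][0]["embedding"])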

Local Apps

llama.cpp

How to use leeroy-jankins/nomi with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf leeroy-jankins/nomi:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf leeroy-jankins/nomi:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf leeroy-jankins/nomi:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf leeroy-jankins/nomi:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf leeroy-jankins/nomi:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf leeroy-jankins/nomi:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf leeroy-jankins/nomi:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf leeroy-jankins/nomi:Q4_K_M

Use Docker

docker model run hf.co/leeroy-jankins/nomi:Q4_K_M

Ollama
How to use leeroy-jankins/nomi with Ollama:
```
ollama run hf.co/leeroy-jankins/nomi:Q4_K_M
```

Unsloth Studio

How to use leeroy-jankins/nomi with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for leeroy-jankins/nomi to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for leeroy-jankins/nomi to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for leeroy-jankins/nomi to start chatting

Docker Model Runner
How to use leeroy-jankins/nomi with Docker Model Runner:
```
docker model run hf.co/leeroy-jankins/nomi:Q4_K_M
```

Lemonade

How to use leeroy-jankins/nomi with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull leeroy-jankins/nomi:Q4_K_M

Run and chat with the model

lemonade run user.nomi-Q4_K_M

List all available models

lemonade list

Overview

Nomi is a long-context local embedding model used in Chonky. It is derived from Nomic's nomic-embed-text-v1.5, a text embedding model designed for retrieval quality over long context windows. It suits RAG and document indexing scenarios where instruction-prefixed embedding inputs are desirable. The sections below refer to the model as Nomnom, matching the GGUF file name nomnom-embed-text-v1.5.Q4_K_M.gguf.

The upstream Nomic model family is built around embedding tasks that use explicit input prefixes such as search_query: and search_document:. That makes it a good candidate for retrieval pipelines where queries and stored passages are encoded with distinct roles.


Use Nomnom when you want:

strong local semantic retrieval
long-context document embedding support
a model designed around explicit task prefixes
a local embedder well suited for RAG-style search workflows

Nomnom is a strong fit for:

semantic indexing of long document chunks
retrieval-augmented generation workflows
embedding queries separately from stored passage content
local experimentation with prefix-aware embedding strategies

Base Model Lineage

Nomnom is derived from:

nomic-ai/nomic-embed-text-v1.5

Key characteristics of the model include:

long-context support in the original transformer implementation
strong retrieval-oriented design
support for task instruction prefixes such as search_query: and other prefixed task modes
support for reduced embedding sizes in the upstream family through Matryoshka-style representation behavior
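The reduced-size behavior mentioned in the last item can be illustrated without any ML dependencies. The following is a minimal sketch, assuming the same three steps used in the Sentence Transformers example later in this card (layer normalization, truncation, L2 normalization). The helper names and the toy 8-dimensional vector are hypothetical, and the `eps` value is an assumed default.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def shrink_embedding(vec, dim):
    """Layer-normalize, keep the first `dim` values, then L2-normalize."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(layer_norm(vec)[:dim])

# Toy 8-dim vector standing in for a real 768-dim embedding.
full = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, -0.5]
small = shrink_embedding(full, 4)
print(len(small))
```

Whatever dimension is chosen, queries and stored documents should be shrunk to the same size so that their similarity scores remain comparable.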

Important Prefix Behavior

The upstream Nomic model family expects task instruction prefixes at the beginning of text strings for best results. In practice, that means inputs may need prefixes such as:

search_query: for user queries
search_document: for stored documents and passages
task-specific prefixes for other workflows depending on how you use the model

from sentence_transformers import SentenceTransformer

# Sentence Transformers loads the upstream checkpoint, not the GGUF file.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
sentences = ['search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten']
embeddings = model.encode(sentences)
print(embeddings)

GGUF-Specific Context Note

The GGUF release is suitable for local llama.cpp-style inference, but the long-context behavior available in the original transformer implementation may require additional context extension settings in llama.cpp-based runtimes to fully match upstream context length expectations.
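As a rough illustration of what "additional context extension settings" might look like with llama-cpp-python, the sketch below collects constructor arguments in one place. The helper name is hypothetical, and the default values (an 8192-token window and a RoPE frequency scale of 0.75) are assumptions for illustration rather than settings confirmed by this card. A runtime may also need a specific RoPE scaling type, so check its documentation before relying on these numbers.

```python
def long_context_kwargs(n_ctx=8192, rope_freq_scale=0.75):
    """Build keyword arguments for llama_cpp.Llama in embedding mode.

    The defaults are assumed, illustrative values, not settings from this card.
    """
    return {
        "embedding": True,                   # return embeddings instead of generated text
        "n_ctx": n_ctx,                      # requested context window, in tokens
        "n_batch": n_ctx,                    # let a whole chunk fit in one batch
        "rope_freq_scale": rope_freq_scale,  # assumed RoPE frequency scaling
    }

kwargs = long_context_kwargs()
print(kwargs)

# Usage (requires llama-cpp-python and a downloaded GGUF file):
# from llama_cpp import Llama
# llm = Llama(model_path="nomnom-embed-text-v1.5.Q4_K_M.gguf", **kwargs)
```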

Local File Layout Expected by Chonky

Chonky expects Nomnom at:

chonks/nomi/nomnom-embed-text-v1.5.Q4_K_M.gguf
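A quick way to confirm the file is where Chonky looks for it is a small path check. This is a minimal sketch: the helper names are hypothetical, and it assumes the path above is relative to a root directory that you pass in (the current directory by default).

```python
from pathlib import Path

# Relative location stated in this card.
NOMNOM_RELATIVE_PATH = Path("chonks") / "nomi" / "nomnom-embed-text-v1.5.Q4_K_M.gguf"

def nomnom_model_path(root="."):
    """Return the expected model path under `root` (hypothetical helper)."""
    return Path(root) / NOMNOM_RELATIVE_PATH

def is_model_installed(root="."):
    """True when the GGUF file exists at the expected location."""
    return nomnom_model_path(root).is_file()

path = nomnom_model_path()
print(path, "found" if is_model_installed() else "missing")
```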

Features

local GGUF-based embedding model
no cloud API dependency
retrieval-oriented embedding behavior
especially attractive for query/document search pipelines
supports local embedding generation for chunked corpora
aligns well with Chonky's semantic search and vector storage workflows

Recommended Chonky Usage

Nomnom is recommended when:

you want a local retrieval model with strong search-oriented lineage
you plan to distinguish query strings from indexed corpus text
you want a local embedding path aligned to modern RAG conventions
you value long-context model lineage for document-heavy tasks

Usage

Important: the text prompt must include a task instruction prefix, instructing the model which task is being performed.

For example, if you are implementing a RAG application, you embed your documents as search_document: <text here> and embed your user queries as search_query: <text here>.
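The prefixing step can be kept in one small helper so that documents and queries are never embedded without a role. This is a minimal sketch: `with_prefix` is a hypothetical helper, and the allowed prefixes are the four described in this card.

```python
TASK_PREFIXES = ("search_document", "search_query", "clustering", "classification")

def with_prefix(task, text):
    """Prepend a task instruction prefix, e.g. 'search_query: <text>'."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task prefix: {task!r}")
    return f"{task}: {text.strip()}"

documents = ["TSNE is a dimensionality reduction algorithm."]
doc_inputs = [with_prefix("search_document", d) for d in documents]
query_input = with_prefix("search_query", "What is TSNE?")

print(doc_inputs)
print(query_input)
```

The resulting strings are what gets passed to the embedding call, whichever runtime produces the vectors.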

Task instruction prefixes

`search_document`

Purpose: embed texts as documents from a dataset

This prefix is used for embedding texts as documents, for example as documents for a RAG index.

from sentence_transformers import SentenceTransformer

# Sentence Transformers loads the upstream checkpoint, not the GGUF file.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
sentences = ['search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten']
embeddings = model.encode(sentences)
print(embeddings)

`search_query`

Purpose: embed texts as questions to answer

This prefix is used for embedding texts as questions that documents from a dataset could resolve, for example as queries to be answered by a RAG application.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
sentences = ['search_query: Who is Laurens van Der Maaten?']
embeddings = model.encode(sentences)
print(embeddings)
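Once a query and a set of documents have been embedded, retrieval typically comes down to ranking documents by similarity to the query. The sketch below assumes cosine similarity as the scoring function and uses toy 3-dimensional vectors in place of real embeddings. The function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted by descending cosine similarity."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for search_document: embeddings.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
# Toy vector standing in for a search_query: embedding.
query_vec = [0.9, 0.1, 0.0]

ranking = rank_documents(query_vec, doc_vecs)
print([index for index, _ in ranking])
```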

`clustering`

Purpose: embed texts to group them into clusters

This prefix is used for embedding texts in order to group them into clusters, discover common topics, or remove semantic duplicates.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
sentences = ['clustering: the quick brown fox']
embeddings = model.encode(sentences)
print(embeddings)
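Removing semantic duplicates, one of the uses named above, can be sketched as a greedy pass over the embedded texts. The code assumes cosine similarity and an illustrative threshold of 0.95, and neither choice comes from this card. The vectors are toy values and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def drop_near_duplicates(vectors, threshold=0.95):
    """Return indices of vectors kept after greedy near-duplicate removal."""
    kept = []
    for i, vec in enumerate(vectors):
        # Keep a vector only if it is not too similar to anything already kept.
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy vectors standing in for clustering: embeddings; the second nearly repeats the first.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(drop_near_duplicates(vectors))
```

The pass is quadratic in the number of kept vectors, which is fine for small corpora. Larger collections usually call for a vector index instead.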

`classification`

Purpose: embed texts to classify them

This prefix is used for embedding texts into vectors that will be used as features for a classification model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
sentences = ['classification: the quick brown fox']
embeddings = model.encode(sentences)
print(embeddings)
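To show how such vectors might serve as classifier features, here is a minimal nearest-centroid sketch. It is one simple choice among many and is not something this card prescribes. The labels, toy 2-dimensional vectors, and function names are all hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_centroids(vectors, labels):
    """Group training vectors by label and compute one centroid per label."""
    groups = {}
    for vec, label in zip(vectors, labels):
        groups.setdefault(label, []).append(vec)
    return {label: centroid(members) for label, members in groups.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest to `vec`."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], vec))

# Toy vectors standing in for classification: embeddings.
train_vecs = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["animal", "animal", "vehicle", "vehicle"]

centroids = fit_centroids(train_vecs, train_labels)
print(predict(centroids, [0.8, 0.2]))
```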

Sentence Transformers

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
matryoshka_dim = 512
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
embeddings = model.encode(sentences, convert_to_tensor=True)
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :matryoshka_dim]
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings)

Transformers

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Transformers loads the upstream checkpoint, not the GGUF file.
model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True, safe_serialization=True)
model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
matryoshka_dim = 512

with torch.no_grad():
    model_output = model(**encoded_input)

# Pool, layer-normalize, truncate to the Matryoshka dimension, then L2-normalize.
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :matryoshka_dim]
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings)

The model natively supports scaling of the sequence length past 2048 tokens. To do so, change the tokenizer and model loading lines as follows:

- tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
+ tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)
- model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)
+ model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True, rotary_scaling_factor=2)

Transformers.js

import { pipeline, layer_norm } from '@huggingface/transformers';
// Create a feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'nomic-ai/nomic-embed-text-v1.5');
// Define sentences
const texts = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?'];
// Compute sentence embeddings
let embeddings = await extractor(texts, { pooling: 'mean' });
console.log(embeddings); // Tensor of shape [2, 768]
const matryoshka_dim = 512;
embeddings = layer_norm(embeddings, [embeddings.dims[1]])
    .slice(null, [0, matryoshka_dim])
    .normalize(2, -1);
console.log(embeddings.tolist());

Nomic API

The easiest way to use Nomic Embed is through the Nomic Embedding API.

Generating embeddings with the nomic Python client is as easy as

from nomic import embed

output = embed.text(
    texts=['Nomic Embedding API', '#keepAIOpen'],
    model='nomic-embed-text-v1.5',
    task_type='search_document',
    dimensionality=256,
)
print(output)


GGUF

Model size: 0.1B params
Architecture: nomic-bert
Quantization: 4-bit (Q4_K_M)


Model tree for leeroy-jankins/nomi

Base model: nomic-ai/nomic-embed-text-v1.5
Quantized: nomic-ai/nomic-embed-text-v1.5-GGUF