Buckets:
| pipeline_tag: feature-extraction | |
| tags: | |
| - embedding | |
| - jina-embeddings-v5 | |
| - feature-extraction | |
| - sentence-transformers | |
| - multimodal | |
| - vision | |
| - audio | |
| - vllm | |
| - video | |
| - image-feature-extraction | |
| - audio-feature-extraction | |
| - video-feature-extraction | |
| - sentence-similarity | |
| language: | |
| - multilingual | |
| inference: false | |
| license: cc-by-nc-4.0 | |
| library_name: transformers | |
| <br><br> | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px"> | |
| </p> | |
| ### **jina-embeddings-v5-omni-nano**: Multi-Task Omni Embedding Base (Nano) | |
| [ArXiv](https://arxiv.org/abs/2605.08384) | [Blog](https://www.elastic.co/search-labs/blog/jina-embeddings-v5-omni-all-media-one-index) | |
| <p align="center"> | |
| <img src="omni_frontier.png" alt="Average score vs. parameter count for open-weight omni embedding models" width="520px"> | |
| </p> | |
| *Average score vs. parameter count across image (MIEB-Lite), video (MMEB-V), and audio (MAEB) benchmarks — `jina-v5-omni-nano` and `jina-v5-omni-small` define the open-weight frontier (Table 1 in the [ArXiv report](https://arxiv.org/abs/2605.08384)).* | |
|  | |
| ### Model Overview | |
| `jina-embeddings-v5-omni-nano` is a multimodal embedding model that accepts **text, images, video, and audio** and produces embeddings in a shared vector space aligned with the text-only [`jinaai/jina-embeddings-v5-text-nano`](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano) — so you can index with text and query with any modality, no reindexing. For higher performance at a larger size, see [`jinaai/jina-embeddings-v5-omni-small`](https://huggingface.co/jinaai/jina-embeddings-v5-omni-small). | |
| This is the **base** repository — it holds all task adapters (retrieval, classification, clustering, text-matching). For a single task, pre-merged task-specific variants are also available: | |
| - [`jinaai/jina-embeddings-v5-omni-nano-retrieval`](https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano-retrieval) — query–document semantic search and RAG (raw-transformers users prepend `Query: ` / `Document: ` to text; sentence-transformers users call `encode_query()` / `encode_document()`). | |
| - [`jinaai/jina-embeddings-v5-omni-nano-classification`](https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano-classification) — assigning labels via embedding similarity — zero-shot and few-shot classification across modalities. | |
| - [`jinaai/jina-embeddings-v5-omni-nano-clustering`](https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano-clustering) — grouping semantically similar items — topic discovery, deduplication, exploratory analysis. | |
| - [`jinaai/jina-embeddings-v5-omni-nano-text-matching`](https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano-text-matching) — symmetric pairwise similarity scoring — STS, paraphrase and near-duplicate detection. | |
| | Feature | Value | | |
| | --- | --- | | |
| | Parameters | ~1.04B | | |
| | Embedding Dimension | 768 | | |
| | Supported Tasks | `retrieval`, `classification`, `clustering`, `text-matching` | | |
| | Max Sequence Length | 8192 | | |
| | Pooling Strategy | Last-token | | |
| | Supported Inputs | text, image, video, audio | | |
| | Supported File Types | images: `.jpg`, `.jpeg`, `.png`, `.gif`, `.webp`, `.bmp`, `.tif`, `.tiff`, `.avif`, `.heic`, `.svg`; video: `.mp4`, `.avi`, `.mov`, `.mkv`, `.webm`, `.flv`, `.wmv`; audio: `.wav`, `.mp3`, `.flac`, `.ogg`, `.m4a`, `.opus`; documents: `.pdf` | | |
| ### Via Elastic Inference Service | |
| The fastest way to use v5-omni in production. [Elastic Inference Service (EIS)](https://www.elastic.co/docs/explore-analyze/elastic-inference/eis) provides managed embedding inference with built-in scaling, so you can generate embeddings directly within your Elastic deployment. | |
| ```bash | |
| # Retrieve the configuration of the preconfigured omni-nano inference endpoint | |
| GET /_inference/embedding/.jina-embeddings-v5-omni-nano | |
| # Generate an embedding for a single piece of text using the predefined endpoint | |
| POST _inference/embedding/.jina-embeddings-v5-omni-nano | |
| { | |
| "input": [ | |
| "This is a test" | |
| ] | |
| } | |
| # Fuse a text description and an image into a single embedding via a multimodal content block | |
| POST _inference/embedding/.jina-embeddings-v5-omni-nano | |
| { | |
| "input": [ | |
| { | |
| "content": [ | |
| { "type": "text", "value": "A small blue square" }, | |
| { "type": "image", "format": "base64", "value": "<BASE64_IMAGE_DATA>" } | |
| ] | |
| } | |
| ] | |
| } | |
| # Create a custom endpoint that truncates omni-nano embeddings to 32 dimensions | |
| PUT _inference/embedding/jina-omni-nano-32d | |
| { | |
| "service": "elastic", | |
| "service_settings": { | |
| "model_id": "jina-embeddings-v5-omni-nano", | |
| "dimensions": 32 | |
| } | |
| } | |
| ``` | |
| See the [Elastic Inference Service documentation](https://www.elastic.co/docs/explore-analyze/elastic-inference/eis) for setup details. | |
| ### Install | |
| ```bash | |
| # core | |
| pip install transformers torch pillow numpy | |
| # Optional — install only the extras for the modalities you actually use: | |
| pip install librosa soundfile # audio decoding | |
| pip install av imageio # video decoding (pure-Python, no codec daemon) | |
| pip install pdf2image pypdfium2 # PDF rendering | |
| pip install cairosvg pillow # SVG rendering | |
| pip install "vllm==0.20.1" # high-throughput serving (validated) | |
| pip install sentence-transformers # one-call multimodal API | |
| ``` | |
| For minimum versions see the Requirements section below (transformers >= 4.57, torch >= 2.5; vLLM path validated with vllm == 0.20.1). | |
| ### Quickstart | |
| ```python | |
| from PIL import Image | |
| import librosa, torch | |
| from transformers import AutoModel, AutoProcessor, WhisperFeatureExtractor | |
| repo = "jinaai/jina-embeddings-v5-omni-nano" | |
| model = AutoModel.from_pretrained(repo, trust_remote_code=True, default_task="retrieval").eval() | |
| proc = AutoProcessor.from_pretrained(repo, trust_remote_code=True) | |
| # model.embed(**inputs) returns L2-normalized last-token embeddings. | |
| t_vec = model.embed(**proc(text="Query: Which planet is known as the Red Planet?", return_tensors="pt").to(model.device)) | |
| i_vec = model.embed(**proc(images=Image.open("photo.jpg"), text="<image>", return_tensors="pt").to(model.device)) | |
| v_vec = model.embed(**proc(videos="clip.mp4", text="<image>", return_tensors="pt").to(model.device)) | |
| # Audio has no string placeholder — build token ids from config. | |
| audio, _ = librosa.load("speech.wav", sr=16000) | |
| feat = WhisperFeatureExtractor(feature_size=128)(audio, sampling_rate=16000, return_tensors="pt")["input_features"] | |
| cfg = model.config | |
| n = feat.shape[-1] // 4 | |
| ids = torch.tensor([[cfg.audio_start_token_id, *[cfg.audio_token_id]*n, cfg.audio_end_token_id]]) | |
| a_vec = model.embed( | |
| input_ids=ids.to(model.device), | |
| attention_mask=torch.ones_like(ids).to(model.device), | |
| input_features=feat.to(model.device, dtype=next(model.parameters()).dtype), | |
| ) | |
| ``` | |
| For retrieval, use `encode_query()` for query-side embeddings and `encode_document()` for document-side embeddings. A bare `encode(text)` call does not know which retrieval side you intended. | |
| For non-retrieval tasks (classification / clustering / text-matching), load with `default_task="classification"` (or the matching task) and prepend `"Document: "` to text inputs on the raw `model.embed(...)` path — e.g. `proc(text="Document: A cute cat sitting on a mat.", return_tensors="pt")`. These tasks have no query/document distinction; the `Document: ` prefix is the only one used. | |
| No `dtype`, `device`, `min_pixels`, or custom pooling code needed — sensible defaults live in the model config (bf16 weights, 256–1280 vision tokens). | |
| <details> | |
| <summary>Requirements</summary> | |
| - `transformers>=4.57` (recommend >=5.1 for the small variants) | |
| - `torch>=2.5` | |
| Optional: | |
| - `sentence-transformers` — one-call API for all 4 modalities | |
| - `librosa` — audio decoding | |
| - `av` — video decoding (`pip install av`) | |
| - `vllm==0.20.1` — high-throughput serving; H100 deployments may also need DeepGEMM installed for vLLM FP8 kernels | |
| </details> | |
| ### Selective Modality Loading | |
| By default all components (vision + audio towers + text encoder) are loaded. | |
| To save memory, pick a subset — the unused towers are skipped at load time: | |
| ```python | |
| from transformers import AutoModel | |
| AutoModel.from_pretrained("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True, modality="omni") # all (default) | |
| AutoModel.from_pretrained("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True, modality="vision") # vision + text | |
| AutoModel.from_pretrained("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True, modality="audio") # audio + text | |
| AutoModel.from_pretrained("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True, modality="text") # text only | |
| ``` | |
| Same parameter works via `sentence-transformers`: | |
| ```python | |
| SentenceTransformer("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True, model_kwargs={"modality": "vision"}) | |
| ``` | |
| ### Via sentence-transformers | |
| ```python | |
| from sentence_transformers import SentenceTransformer | |
| # Base repo holds all 4 task adapters — pick one at load time. | |
| model = SentenceTransformer( | |
| "jinaai/jina-embeddings-v5-omni-nano", | |
| trust_remote_code=True, | |
| model_kwargs={"default_task": "retrieval"}, | |
| ) | |
| # URLs, local paths (with or without extension), PIL.Image, np.ndarray, | |
| # torch.Tensor, bytes, and BytesIO are all accepted directly. | |
| q_vec = model.encode_query("Which planet is known as the Red Planet?") | |
| d_vec = model.encode_document("Mars is often referred to as the Red Planet due to its reddish appearance.") | |
| i_vec = model.encode("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg") | |
| v_vec = model.encode("https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4") # needs `pip install av` | |
| a_vec = model.encode("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") # needs `pip install librosa soundfile` | |
| # Fused multimodal — a tuple becomes ONE embedding in a single forward pass: | |
| emb = model.encode(("Winter boots, waterproof leather upper", | |
| "https://.../boot.jpg", | |
| "https://.../boot.mp4")) | |
| ``` | |
| For non-retrieval tasks (classification / clustering / text-matching), reload | |
| with the corresponding `default_task` and use `encode_document(...)` (or | |
| `encode(text, prompt_name="document")`) — a bare `encode(text)` does not | |
| apply the `"Document: "` prefix and is off-distribution. | |
| No `dtype`, `device`, `min_pixels`, or custom pooling code needed — sensible defaults live in the model config (bf16 weights, 256–1280 vision tokens). | |
| <!-- VIDEO_INPUT_TYPES_DETAILS --> | |
| <details><summary>Accepted video inputs</summary> | |
| Path (`.mp4 .avi .mov .mkv .webm .flv .wmv`, or extensionless — content-sniffed), HTTP(S) URL, `bytes`/`io.BytesIO`, and in-memory `np.ndarray` / `torch.Tensor` of shape `(T, H, W, 3|4)` with dtype `uint8`. In-memory arrays are encoded to MP4 on the fly (needs `pip install imageio imageio-ffmpeg`). | |
| ```python | |
| import numpy as np | |
| # (T, H, W, 3) uint8 — e.g. from decord, imageio, or an rgb frame buffer | |
| frames = np.zeros((16, 224, 224, 3), dtype=np.uint8) | |
| v_vec = model.encode(frames) | |
| ``` | |
| </details> | |
| ### Via vLLM | |
| The base repo holds all 4 task adapters. Pick **one task per vLLM instance** via `hf_overrides`: | |
| ```python | |
| from vllm import LLM | |
| llm = LLM( | |
| model="jinaai/jina-embeddings-v5-omni-nano", | |
| runner="pooling", | |
| trust_remote_code=True, | |
| hf_overrides={"task": "retrieval"}, # or: classification / clustering / text-matching | |
| ) | |
| # Retrieval: prepend "Query: " for queries, "Document: " for documents. | |
| # Non-retrieval (classification / clustering / text-matching): prepend "Document: " to every text input. | |
| outs = llm.embed([{"prompt": "Query: Which planet is known as the Red Planet?"}]) | |
| ``` | |
| Or via CLI: | |
| ```bash | |
| vllm serve jinaai/jina-embeddings-v5-omni-nano \ | |
| --trust-remote-code \ | |
| --hf-overrides '{"task": "retrieval"}' | |
| ``` | |
| Alternatively set `JINA_V5_TASK=retrieval` in the environment. Output is bit-exact | |
| with the corresponding pre-merged `-retrieval` / `-classification` / `-clustering` / | |
| `-text-matching` variant. | |
| ### Matryoshka (truncating embeddings) | |
| All three backends support truncating the full embedding to a shorter dimension | |
| with L2 re-normalization, so the result stays unit-norm: | |
| ```python | |
| # transformers | |
| vec = model.embed(truncate_dim=256, **proc(text="hello", return_tensors="pt")) | |
| # or | |
| vec = model.encode(["hello"], task="retrieval", truncate_dim=256) | |
| # sentence-transformers | |
| vec = model.encode("hello", truncate_dim=256) | |
| # vLLM — ask the pooler for a smaller embedding | |
| from vllm import PoolingParams | |
| outs = llm.embed(prompts, pooling_params=PoolingParams(dimensions=256)) | |
| # or truncate + renormalize the full-dim output yourself: | |
| import numpy as np | |
| full = np.asarray(outs[0].outputs.embedding) | |
| vec = full[:256] / np.linalg.norm(full[:256]) | |
| ``` | |
| <!-- BATCHING_SECTION_START --> | |
| ### Batching | |
| Pass a list to encode many inputs in one call. | |
| ```python | |
| # sentence-transformers — any modality | |
| t_vecs = model.encode(["query 1", "query 2"]) | |
| i_vecs = model.encode([Image.open("a.jpg"), Image.open("b.jpg")]) | |
| v_vecs = model.encode(["clip1.mp4", "clip2.mp4"]) | |
| a_vecs = model.encode(["speech1.wav", "speech2.wav"]) | |
| # raw transformers — text (native padded batch) | |
| inputs = proc(text=["query 1", "query 2"], padding=True, truncation=True, return_tensors="pt").to(model.device) | |
| vecs = model.embed(**inputs) # (2, dim) | |
| # vLLM — list of request dicts, any modality | |
| outs = llm.embed([ | |
| {"prompt": "query 1"}, | |
| {"prompt": "query 2"}, | |
| ]) | |
| ``` | |
| For `sentence-transformers`, images / video / audio are forwarded per-sample (one forward pass each). Text is truly batched. For large-scale multimodal throughput, prefer `vLLM`. | |
| <!-- BATCHING_SECTION_END --> | |
| ### Compatibility | |
| Embeddings produced by this model share a vector space with: | |
| - [`jinaai/jina-embeddings-v5-text-nano`](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano) — text-only | |
| - `jinaai/jina-embeddings-v5-text-nano` (via matching adapter) | |
| You can index text with the `v5-text-nano` model and query it with image, | |
| video, or audio embeddings from `jina-embeddings-v5-omni-nano` — no reindexing. | |
| ### License | |
| CC BY-NC 4.0. For commercial use, [contact us](mailto:sales@jina.ai). | |
Xet Storage Details
- Size:
- 14.3 kB
- Xet hash:
- fc7deea4e1709bd052474c6c06f73fc11b5287f3f4cdfa053618d0bcc4fa2293
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.