---
license: mit
pipeline_tag: feature-extraction
tags:
- feature-extraction
- sentence-similarity
- conteb
- contextual-embeddings
language:
- multilingual
---


# pplx-embed-v1: Diffusion-Pretrained Dense and Contextual Embeddings

`pplx-embed-v1` and `pplx-embed-context-v1` are state-of-the-art text embedding models optimized for real-world, web-scale retrieval tasks.

- Use **`pplx-embed-v1`** for independent text embedding (queries, documents, semantic search)
- Use **`pplx-embed-context-v1`** for document chunks in RAG systems where surrounding context matters

> [!IMPORTANT]
> `pplx-embed-v1` and `pplx-embed-context-v1` natively produce *unnormalized* int8-quantized embeddings. Make sure to compare them via *cosine similarity*.

![diag.png](assets/diag.png)

## Models

| Model | Dimensions | Context | MRL | Quantization | Instruction | Pooling |
|:-----:|:----------:|:-------:|:---:|:------------:|:-----------:|:-------:|
| `pplx-embed-v1-0.6B` | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| `pplx-embed-v1-4B` | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
| `pplx-embed-context-v1-0.6B` | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| `pplx-embed-context-v1-4B` | 2560 | 32K | Yes | INT8/BINARY | No | Mean |

All models are built at Perplexity AI on Qwen3 backbones with diffusion continued pre-training.

Many modern embedding models rely on instruction tuning, where users prepend an instruction string to the text being embedded. This can yield a 2%-3% lift on benchmarks, but it also introduces prompt-selection overhead and can make indexing pipelines brittle (small instruction changes can shift the embedding space). We deliberately **avoid** this requirement: you can embed the text you want to index directly, without having to choose or maintain an instruction prefix.

## Usage
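Because the models emit unnormalized, int8-quantized vectors (see the note above), score matches with cosine similarity rather than raw dot products. A minimal sketch of such a comparison, applicable to embeddings obtained from any of the methods below (the helper name and the variable names are illustrative, not part of the API):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two (possibly int8-quantized) embeddings."""
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: score a query embedding against document-chunk embeddings.
# `query_emb` and `chunk_embs` would come from any of the snippets below.
# scores = [cosine_similarity(query_emb, c) for c in chunk_embs]
```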
### Via API (Contextualized Embeddings)

```bash
curl -X POST https://api.perplexity.ai/v1/contextualizedembeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      [
        "Curiosity begins in childhood with endless questions about the world.",
        "As we grow, curiosity drives us to explore new ideas and challenge assumptions.",
        "Scientific breakthroughs often start with a simple curious question."
      ],
      [
        "The curiosity rover explores Mars, searching for signs of ancient life.",
        "Each discovery on Mars sparks new questions about our place in the universe."
      ]
    ],
    "model": "pplx-embed-context-v1-4b"
  }'
```
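The same request can be issued from Python. This sketch only mirrors the curl call above; the `requests` dependency, the `PERPLEXITY_API_KEY` environment variable, and the bare `response.json()` handling are assumptions, since the response schema is not documented here:

```python
import os
import requests

# Sketch of the contextualized-embeddings request shown above.
payload = {
    "input": [
        [
            "Curiosity begins in childhood with endless questions about the world.",
            "As we grow, curiosity drives us to explore new ideas and challenge assumptions.",
            "Scientific breakthroughs often start with a simple curious question.",
        ],
        [
            "The curiosity rover explores Mars, searching for signs of ancient life.",
            "Each discovery on Mars sparks new questions about our place in the universe.",
        ],
    ],
    "model": "pplx-embed-context-v1-4b",
}

response = requests.post(
    "https://api.perplexity.ai/v1/contextualizedembeddings",
    headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())  # inspect the returned structure; schema not assumed here
```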
### Using Transformers

```python
from transformers import AutoModel

model_ctx = AutoModel.from_pretrained(
    "perplexity-ai/pplx-embed-context-v1-4B",
    trust_remote_code=True
)

doc_chunks = [
    [
        "Curiosity begins in childhood with endless questions about the world.",
        "As we grow, curiosity drives us to explore new ideas.",
        "Scientific breakthroughs often start with a curious question."
    ],
    [
        "The curiosity rover explores Mars searching for ancient life.",
        "Each discovery on Mars sparks new questions about the universe."
    ]
]

# Returns a list of numpy arrays (one per document)
# embeddings[0].shape = (3, 2560), embeddings[1].shape = (2, 2560)
embeddings = model_ctx.encode(doc_chunks)
```
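The models table lists MRL support, so shorter vectors can presumably be obtained by keeping only the leading dimensions, as is standard for Matryoshka-trained embeddings. The 512-dimension cut below is an illustrative choice, not a documented default:

```python
import numpy as np

# Illustrative Matryoshka-style truncation (assumes the leading dimensions
# carry the coarse representation, the usual MRL convention).
full = embeddings[0]          # e.g. shape (3, 2560) from the snippet above
truncated = full[:, :512]     # keep only the first 512 dimensions

# Truncated vectors are compared with cosine similarity, same as full-size ones.
a, b = truncated[0].astype(np.float32), truncated[1].astype(np.float32)
score = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(score)
```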
### Using ONNX models

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np
import torch


def quantize_int8_tanh(x):
    normalized = torch.tanh(x)
    rounded = torch.round(normalized * 127)
    clamped = torch.clamp(rounded, -128, 127)
    return clamped


def quantize_binary(x):
    return torch.where(x >= 0, 1.0, -1.0)


def mean_pooling(
    token_embeddings: torch.Tensor, attention_mask: torch.Tensor
) -> torch.Tensor:
    """Apply mean pooling to token embeddings."""
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


def extract_chunks_from_concatenated(
    input_ids: torch.Tensor,
    token_embeddings: torch.Tensor,
    attention_mask: torch.Tensor,
    tokenizer,
) -> list[list[torch.Tensor]]:
    """
    Extract individual chunk embeddings from a concatenated sequence using late chunking.

    This splits concatenated sequences like "[chunk1][SEP][chunk2][SEP]..."
    back into individual chunk embeddings by finding SEP token positions.

    Args:
        input_ids: Token IDs (batch_size, seq_len)
        token_embeddings: Token embeddings (batch_size, seq_len, hidden_dim)
        attention_mask: Attention mask (batch_size, seq_len)
        tokenizer: Tokenizer providing the SEP token ID

    Returns:
        list[list[torch.Tensor]]: List of documents, each containing a list of chunk embeddings

    Note:
        The sep_token_id is retrieved from tokenizer.sep_token_id.
        Common values: pplx-embed-1=151643, BERT=102, varies by tokenizer.
    """
    sep_token_id = tokenizer.sep_token_id
    batch_size = input_ids.shape[0]
    all_doc_chunks = []

    for batch_idx in range(batch_size):
        # Positions of non-pad SEP tokens
        valid_positions = attention_mask[batch_idx].bool()
        sep_positions = (
            (input_ids[batch_idx] == sep_token_id) & valid_positions
        ).nonzero(as_tuple=True)[0]

        chunk_embeddings = []
        start_pos = 0
        for sep_pos in sep_positions:
            chunk_tokens = token_embeddings[batch_idx, start_pos:sep_pos]
            chunk_mask = attention_mask[batch_idx, start_pos:sep_pos]
            chunk_emb = mean_pooling(
                chunk_tokens.unsqueeze(0), chunk_mask.unsqueeze(0)
            ).squeeze(0)
            chunk_embeddings.append(chunk_emb)
            start_pos = sep_pos + 1

        # Handle the last chunk (after the last SEP token)
        last_valid_pos = attention_mask[batch_idx].sum().item()
        chunk_tokens = token_embeddings[batch_idx, start_pos:last_valid_pos]
        chunk_mask = attention_mask[batch_idx, start_pos:last_valid_pos]
        if chunk_mask.sum() > 0:
            chunk_emb = mean_pooling(
                chunk_tokens.unsqueeze(0), chunk_mask.unsqueeze(0)
            ).squeeze(0)
        else:
            # Empty chunk - create zero embedding
            chunk_emb = torch.zeros(
                token_embeddings.shape[-1],
                device=token_embeddings.device,
                dtype=token_embeddings.dtype,
            )
        chunk_embeddings.append(chunk_emb)

        all_doc_chunks.append(chunk_embeddings)

    return all_doc_chunks


hf_path = "perplexity-ai/pplx-embed-context-v1-4B"
onnx_path = "onnx/model.onnx"

tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
session = ort.InferenceSession(onnx_path)

texts = [
    [
        "Curiosity begins in childhood with endless questions about the world.",
        "As we grow, curiosity drives us to explore new ideas.",
        "Scientific breakthroughs often start with a curious question."
    ],
    [
        "The curiosity rover explores Mars searching for ancient life.",
        "Each discovery on Mars sparks new questions about the universe."
    ]
]

# Join each document's chunks with the SEP token so they share one context window
doc_strings = [tokenizer.sep_token.join(chunks) for chunks in texts]

tokenized = tokenizer(
    doc_strings,
    padding=True,
    truncation=True,
    return_tensors="np",
)
onnx_inputs = {
    "input_ids": tokenized["input_ids"].astype(np.int64),
    "attention_mask": tokenized["attention_mask"].astype(np.int64),
}

# Run inference
onnx_outputs = session.run([out.name for out in session.get_outputs()], onnx_inputs)

# onnx_outputs is a list with one element: [last_hidden_state]
last_hidden_state = onnx_outputs[0]

batch_chunk_embeddings = extract_chunks_from_concatenated(
    input_ids=torch.tensor(onnx_inputs["input_ids"]),
    token_embeddings=torch.tensor(last_hidden_state),
    attention_mask=torch.tensor(onnx_inputs["attention_mask"]),
    tokenizer=tokenizer,
)
batch_chunk_embeddings = [
    torch.stack(doc_chunks, dim=0) for doc_chunks in batch_chunk_embeddings
]

int8_embeddings = [quantize_int8_tanh(x) for x in batch_chunk_embeddings]
binary_embeddings = [quantize_binary(x) for x in batch_chunk_embeddings]

# Pack the binary embeddings into bits for compact storage
bits = [doc.numpy() >= 0 for doc in binary_embeddings]
packed_embeddings = [np.packbits(b, axis=-1) for b in bits]
```
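The packed binary embeddings produced at the end of the snippet can be scored with Hamming distance. The retrieval loop below is an illustrative sketch, not part of the released tooling:

```python
import numpy as np

def hamming_distance(packed_a: np.ndarray, packed_b: np.ndarray) -> int:
    """Number of differing bits between two np.packbits-encoded vectors."""
    xor = np.bitwise_xor(packed_a, packed_b)
    return int(np.unpackbits(xor).sum())

# Example: rank the chunks of document 1 against the first chunk of document 0.
query = packed_embeddings[0][0]
candidates = packed_embeddings[1]
distances = [hamming_distance(query, c) for c in candidates]
ranking = np.argsort(distances)  # smaller Hamming distance = more similar
print(distances, ranking)
```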
## Technical Details

For comprehensive technical details and evaluation results, see our paper on arXiv: https://arxiv.org/abs/2602.11151.