pplx-embed-v1: Diffusion-LM for Dense and Contextual Retrieval
pplx-embed-v1 and pplx-embed-context-v1 are state-of-the-art text embedding models optimized for real-world, web-scale retrieval tasks.
- Use `pplx-embed-v1` for independent text embedding (queries, documents, semantic search)
- Use `pplx-embed-context-v1` for document chunks in RAG systems where surrounding context matters
`pplx-embed-v1` and `pplx-embed-context-v1` natively produce unnormalized int8-quantized embeddings. Ensure that you compare them via cosine similarity.
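For example, a minimal cosine-similarity check between two int8 embedding vectors might look like the sketch below (`query_emb` and `doc_emb` are placeholders for vectors returned by the model or API):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    # Cast the int8 vectors to float32 before the dot product to avoid overflow.
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# score = cosine_similarity(query_emb, doc_emb)
```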
Models
| Model | Dimensions | Context | MRL | Quantization | Instruction | Pooling |
|---|---|---|---|---|---|---|
| `pplx-embed-v1-0.6B` | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| `pplx-embed-v1-4B` | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
| `pplx-embed-context-v1-0.6B` | 1024 | 32K | Yes | INT8/BINARY/UBINARY | No | Mean |
| `pplx-embed-context-v1-4B` | 2560 | 32K | Yes | INT8/BINARY/UBINARY | No | Mean |
All models are built at Perplexity AI on Qwen3 backbones with diffusion continued pre-training.
Many modern embedding models rely on instruction tuning, where users prepend an instruction string to the text being embedded. This can yield a 2%-3% lift on benchmarks, but it also introduces prompt-selection overhead and can make indexing pipelines brittle (small instruction changes can shift the embedding space). We deliberately avoid this requirement: you can embed the text you want to index directly, without having to choose or maintain an instruction prefix.
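The MRL column in the table above means the embeddings are Matryoshka-trained, so they can be truncated to a smaller dimension before indexing. A minimal sketch, assuming the usual MRL convention that the leading dimensions carry most of the retrieval signal (the target dimension of 512 here is only illustrative):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int = 512) -> np.ndarray:
    # Keep only the first `dim` components; with Matryoshka-trained models the
    # leading dimensions preserve most of the retrieval quality.
    return embedding[..., :dim]

# Example: shrink a 2560-dim pplx-embed-v1-4B vector to 512 dims before indexing,
# then compare the truncated vectors with cosine similarity as usual.
# small_vec = truncate_embedding(full_vec, dim=512)
```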
Usage
Via API (Contextualized Embeddings)
curl -X POST https://api.perplexity.ai/v1/contextualizedembeddings \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"input": [
[
"Curiosity begins in childhood with endless questions about the world.",
"As we grow, curiosity drives us to explore new ideas and challenge assumptions.",
"Scientific breakthroughs often start with a simple curious question."
],
[
"The curiosity rover explores Mars, searching for signs of ancient life.",
"Each discovery on Mars sparks new questions about our place in the universe."
]
],
"model": "pplx-embed-context-v1-4b"
}'
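The same request can be sent from Python. This is a sketch using the `requests` library with the payload from the curl example above; the exact shape of the JSON response is not documented here, so the final line only shows where to look:

```python
import requests

response = requests.post(
    "https://api.perplexity.ai/v1/contextualizedembeddings",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "input": [
            [
                "Curiosity begins in childhood with endless questions about the world.",
                "As we grow, curiosity drives us to explore new ideas and challenge assumptions.",
                "Scientific breakthroughs often start with a simple curious question.",
            ],
            [
                "The curiosity rover explores Mars, searching for signs of ancient life.",
                "Each discovery on Mars sparks new questions about our place in the universe.",
            ],
        ],
        "model": "pplx-embed-context-v1-4b",
    },
)
response.raise_for_status()
data = response.json()  # inspect `data` for the per-chunk embeddings
```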
Using Transformers
from transformers import AutoModel
model_ctx = AutoModel.from_pretrained(
"perplexity-ai/pplx-embed-context-v1-4B",
trust_remote_code=True
)
doc_chunks = [
[
"Curiosity begins in childhood with endless questions about the world.",
"As we grow, curiosity drives us to explore new ideas.",
"Scientific breakthroughs often start with a curious question."
],
[
"The curiosity rover explores Mars searching for ancient life.",
"Each discovery on Mars sparks new questions about the universe."
]
]
# Returns a list of numpy arrays (one per document)
# embeddings[0].shape = (3, 2560), embeddings[1].shape = (2, 2560)
embeddings = model_ctx.encode(doc_chunks)
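To score a query against these chunk embeddings, the query side would typically use the independent model. The sketch below assumes that `perplexity-ai/pplx-embed-v1-4B` is the matching query/document checkpoint and that it exposes the same `encode()` interface; check that repository's model card before relying on either assumption:

```python
import numpy as np
from transformers import AutoModel

# Assumption: the independent model is loaded the same way and exposes encode().
model_query = AutoModel.from_pretrained(
    "perplexity-ai/pplx-embed-v1-4B",
    trust_remote_code=True,
)
query_emb = model_query.encode(["Why does curiosity matter for science?"])[0]

# Cast the int8 vectors to float32 before computing cosine similarity.
q = np.asarray(query_emb, dtype=np.float32)
chunks = np.asarray(embeddings[0], dtype=np.float32)
scores = chunks @ q / (np.linalg.norm(chunks, axis=1) * np.linalg.norm(q))
print(scores)  # one similarity score per chunk of the first document
```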
Using ONNX models
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np
import torch
def quantize_int8_tanh(x):
    # Squash values into [-1, 1] with tanh, scale to the int8 range, and round.
    normalized = torch.tanh(x)
    rounded = torch.round(normalized * 127)
    clamped = torch.clamp(rounded, -128, 127)
    return clamped
def quantize_binary(x):
    # Map each dimension to +1.0 / -1.0 by sign.
    return torch.where(x >= 0, 1.0, -1.0)
def mean_pooling(
token_embeddings: torch.Tensor, attention_mask: torch.Tensor
) -> torch.Tensor:
"""Apply mean pooling to token embeddings."""
input_mask_expanded = (
attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
)
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
input_mask_expanded.sum(1), min=1e-9
)
def extract_chunks_from_concatenated(
input_ids: torch.Tensor,
token_embeddings: torch.Tensor,
attention_mask: torch.Tensor,
tokenizer,
) -> list[list[torch.Tensor]]:
"""
Extract individual chunk embeddings from concatenated sequence using late chunking.
This method splits concatenated sequences like "[chunk1][SEP][chunk2][SEP]..."
back into individual chunk embeddings by finding SEP token positions.
Args:
input_ids: Token IDs (batch_size, seq_len)
token_embeddings: Token embeddings (batch_size, seq_len, hidden_dim)
attention_mask: Attention mask (batch_size, seq_len)
Returns:
list[list[torch.Tensor]]: List of documents, each containing list of chunk embeddings
Note:
        The sep_token_id is retrieved from tokenizer.sep_token_id.
        Common values: pplx-embed-v1 uses 151643, BERT uses 102; it varies by tokenizer.
"""
sep_token_id = tokenizer.sep_token_id
batch_size = input_ids.shape[0]
all_doc_chunks = []
for batch_idx in range(batch_size):
# non-pad sep tokens
valid_positions = attention_mask[batch_idx].bool()
sep_positions = (
(input_ids[batch_idx] == sep_token_id) & valid_positions
).nonzero(as_tuple=True)[0]
chunk_embeddings = []
start_pos = 0
for sep_pos in sep_positions:
chunk_tokens = token_embeddings[batch_idx, start_pos:sep_pos]
chunk_mask = attention_mask[batch_idx, start_pos:sep_pos]
chunk_emb = mean_pooling(
chunk_tokens.unsqueeze(0), chunk_mask.unsqueeze(0)
).squeeze(0)
chunk_embeddings.append(chunk_emb)
start_pos = sep_pos + 1
# Handle the last chunk (after the last SEP token)
last_valid_pos = attention_mask[batch_idx].sum().item()
chunk_tokens = token_embeddings[batch_idx, start_pos:last_valid_pos]
chunk_mask = attention_mask[batch_idx, start_pos:last_valid_pos]
if chunk_mask.sum() > 0:
chunk_emb = mean_pooling(
chunk_tokens.unsqueeze(0), chunk_mask.unsqueeze(0)
).squeeze(0)
else:
# Empty chunk - create zero embedding
chunk_emb = torch.zeros(
token_embeddings.shape[-1],
device=token_embeddings.device,
dtype=token_embeddings.dtype,
)
chunk_embeddings.append(chunk_emb)
all_doc_chunks.append(chunk_embeddings)
return all_doc_chunks
hf_path = "perplexity-ai/pplx-embed-context-v1-4b"
onnx_path = "onnx/model.onnx"
tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
session = ort.InferenceSession(onnx_path)
texts = [
[
"Curiosity begins in childhood with endless questions about the world.",
"As we grow, curiosity drives us to explore new ideas.",
"Scientific breakthroughs often start with a curious question."
],
[
"The curiosity rover explores Mars searching for ancient life.",
"Each discovery on Mars sparks new questions about the universe."
]
]
doc_strings = [
tokenizer.sep_token.join(chunks) for chunks in texts
]
tokenized = tokenizer(
doc_strings,
padding=True,
truncation=True,
return_tensors="np",
)
onnx_inputs = {
"input_ids": tokenized["input_ids"].astype(np.int64),
"attention_mask": tokenized["attention_mask"].astype(np.int64),
}
# Run inference
onnx_outputs = session.run([out.name for out in session.get_outputs()], onnx_inputs)
# onnx_outputs is a list with one element: [last_hidden_state]
last_hidden_state = onnx_outputs[0]
batch_chunk_embeddings = extract_chunks_from_concatenated(
input_ids=torch.tensor(onnx_inputs["input_ids"]),
token_embeddings=torch.tensor(last_hidden_state),
attention_mask=torch.tensor(onnx_inputs["attention_mask"]),
tokenizer=tokenizer,
)
batch_chunk_embeddings = [
    torch.stack(doc_chunks, dim=0)
    for doc_chunks in batch_chunk_embeddings
]
int8_embeddings = [quantize_int8_tanh(x) for x in batch_chunk_embeddings]
binary_embeddings = [quantize_binary(x) for x in batch_chunk_embeddings]
bits = [np.where(doc.numpy() >= 0, True, False) for doc in binary_embeddings]
packed_embeddings = [np.packbits(b, axis=-1) for b in bits]
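Packed binary embeddings can be compared with Hamming distance (fewer differing bits means more similar). A minimal sketch that continues from the variables above, assuming query vectors are sign-quantized and packed the same way:

```python
def hamming_distance(packed_a: np.ndarray, packed_b: np.ndarray) -> np.ndarray:
    # XOR the packed bytes and count the differing bits per vector.
    return np.unpackbits(np.bitwise_xor(packed_a, packed_b), axis=-1).sum(axis=-1)

# Example: distances from the first chunk of document 0 to every chunk of document 1.
query_bits = packed_embeddings[0][0]
distances = hamming_distance(packed_embeddings[1], query_bits)
```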
Technical Details
For comprehensive technical details and evaluation results, see our paper on arXiv: https://arxiv.org/abs/2602.11151.