---
license: mit
pipeline_tag: feature-extraction
tags:
- feature-extraction
- sentence-similarity
- conteb
- contextual-embeddings
language:
- multilingual
---

<p align="center">
  <img src="assets/logo.svg" alt="Perplexity Logo" width="400">
</p>

<p align="center">pplx-embed-v1: Diffusion-Pretrained Dense and Contextual Embeddings</p>

`pplx-embed-v1` and `pplx-embed-context-v1` are state-of-the-art text embedding models optimized for real-world, web-scale retrieval tasks.

- Use **`pplx-embed-v1`** for independent text embedding (queries, documents, semantic search)
- Use **`pplx-embed-context-v1`** for document chunks in RAG systems where surrounding context matters

> [!IMPORTANT]
> `pplx-embed-v1` and `pplx-embed-context-v1` natively produce *unnormalized* int8-quantized embeddings. Ensure that you compare them via *cosine similarity*.
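
For example, here is a minimal sketch of cosine similarity over two int8 embedding vectors using NumPy; the vectors below are illustrative placeholders, not real model outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cast int8 vectors to float before the dot product to avoid integer overflow
    a, b = a.astype(np.float32), b.astype(np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative int8 vectors standing in for a query and a document embedding
query_emb = np.array([12, -7, 88, -128], dtype=np.int8)
doc_emb = np.array([10, -3, 90, -120], dtype=np.int8)
print(cosine_similarity(query_emb, doc_emb))
```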

## Models

| Model | Dimensions | Context | MRL | Quantization | Instruction | Pooling |
|:-----:|:----------:|:-------:|:---:|:------------:|:-----------:|:-------:|
| `pplx-embed-v1-0.6B` | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| `pplx-embed-v1-4B` | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
| `pplx-embed-context-v1-0.6B` | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| `pplx-embed-context-v1-4B` | 2560 | 32K | Yes | INT8/BINARY | No | Mean |

<sub>All models are built on Qwen3 with diffusion continued pre-training at Perplexity AI.</sub>

<sub>Many modern embedding models rely on instruction tuning, where users prepend an instruction string to the text being embedded. This can yield a 2%-3% lift on benchmarks, but it also introduces prompt-selection overhead and can make indexing pipelines brittle (small instruction changes can shift the embedding space). We deliberately **avoid** this requirement: you can embed the text you want to index directly, without having to choose or maintain an instruction prefix.</sub>
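
Because the models support MRL (Matryoshka Representation Learning, per the table above), embeddings can in principle be truncated to a shorter prefix of dimensions to trade retrieval quality for index size. The sketch below only assumes the usual Matryoshka convention of keeping the leading dimensions; `256` is illustrative rather than a documented setting, and truncated vectors should still be compared with cosine similarity:

```python
import numpy as np

def truncate_mrl(embedding: np.ndarray, dim: int = 256) -> np.ndarray:
    # Matryoshka-style truncation: keep only the first `dim` dimensions.
    # dim=256 is an illustrative choice, not a documented configuration.
    return embedding[..., :dim]
```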

## Usage

<details>
<summary>Via API (Contextualized Embeddings)</summary>

```bash
curl -X POST https://api.perplexity.ai/v1/contextualizedembeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      [
        "Curiosity begins in childhood with endless questions about the world.",
        "As we grow, curiosity drives us to explore new ideas and challenge assumptions.",
        "Scientific breakthroughs often start with a simple curious question."
      ],
      [
        "The curiosity rover explores Mars, searching for signs of ancient life.",
        "Each discovery on Mars sparks new questions about our place in the universe."
      ]
    ],
    "model": "pplx-embed-context-v1-4b"
  }'
```
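
The same request can also be issued from Python. This is a minimal sketch assuming the standard `requests` library and the exact payload shown above; the response schema is not documented here, so the example simply prints the raw JSON:

```python
import requests

response = requests.post(
    "https://api.perplexity.ai/v1/contextualizedembeddings",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "input": [
            [
                "Curiosity begins in childhood with endless questions about the world.",
                "As we grow, curiosity drives us to explore new ideas and challenge assumptions.",
                "Scientific breakthroughs often start with a simple curious question.",
            ],
            [
                "The curiosity rover explores Mars, searching for signs of ancient life.",
                "Each discovery on Mars sparks new questions about our place in the universe.",
            ],
        ],
        "model": "pplx-embed-context-v1-4b",
    },
)
print(response.json())  # response schema is not specified in this card
```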

</details>

<details>
<summary>Using Transformers</summary>

```python
from transformers import AutoModel

model_ctx = AutoModel.from_pretrained(
    "perplexity-ai/pplx-embed-context-v1-4B",
    trust_remote_code=True
)

doc_chunks = [
    [
        "Curiosity begins in childhood with endless questions about the world.",
        "As we grow, curiosity drives us to explore new ideas.",
        "Scientific breakthroughs often start with a curious question."
    ],
    [
        "The curiosity rover explores Mars searching for ancient life.",
        "Each discovery on Mars sparks new questions about the universe."
    ]
]

# Returns a list of numpy arrays (one per document), one row per chunk:
# embeddings[0].shape = (3, 2560), embeddings[1].shape = (2, 2560)
embeddings = model_ctx.encode(doc_chunks)
```
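
As a follow-up sketch, the returned chunk embeddings can be compared across documents with cosine similarity; the `cos_sim` helper below is illustrative and not part of the model API:

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.astype(np.float32), b.astype(np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the first chunk of document 0 against each chunk of document 1
for i, chunk in enumerate(embeddings[1]):
    print(i, cos_sim(embeddings[0][0], chunk))
```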

</details>

<details>
<summary>Using ONNX models</summary>

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np
import torch


def quantize_int8_tanh(x):
    # Squash activations with tanh, then map to the int8 range [-128, 127]
    normalized = torch.tanh(x)
    rounded = torch.round(normalized * 127)
    clamped = torch.clamp(rounded, -128, 127)
    return clamped


def quantize_binary(x):
    # Sign-based binarization: +1 for non-negative values, -1 otherwise
    return torch.where(x >= 0, 1.0, -1.0)


def mean_pooling(
    token_embeddings: torch.Tensor, attention_mask: torch.Tensor
) -> torch.Tensor:
    """Apply mean pooling to token embeddings."""
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


def extract_chunks_from_concatenated(
    input_ids: torch.Tensor,
    token_embeddings: torch.Tensor,
    attention_mask: torch.Tensor,
    tokenizer,
) -> list[list[torch.Tensor]]:
    """
    Extract individual chunk embeddings from a concatenated sequence using late chunking.

    This method splits concatenated sequences like "[chunk1][SEP][chunk2][SEP]..."
    back into individual chunk embeddings by finding SEP token positions.

    Args:
        input_ids: Token IDs (batch_size, seq_len)
        token_embeddings: Token embeddings (batch_size, seq_len, hidden_dim)
        attention_mask: Attention mask (batch_size, seq_len)
        tokenizer: Tokenizer that provides the SEP token ID

    Returns:
        list[list[torch.Tensor]]: List of documents, each containing a list of chunk embeddings

    Note:
        The sep_token_id is retrieved from tokenizer.sep_token_id.
        Common values: pplx-embed-v1=151643, BERT=102; varies by tokenizer.
    """
    sep_token_id = tokenizer.sep_token_id
    batch_size = input_ids.shape[0]

    all_doc_chunks = []

    for batch_idx in range(batch_size):
        # Positions of non-padding SEP tokens
        valid_positions = attention_mask[batch_idx].bool()
        sep_positions = (
            (input_ids[batch_idx] == sep_token_id) & valid_positions
        ).nonzero(as_tuple=True)[0]

        chunk_embeddings = []
        start_pos = 0

        for sep_pos in sep_positions:
            chunk_tokens = token_embeddings[batch_idx, start_pos:sep_pos]
            chunk_mask = attention_mask[batch_idx, start_pos:sep_pos]

            chunk_emb = mean_pooling(
                chunk_tokens.unsqueeze(0), chunk_mask.unsqueeze(0)
            ).squeeze(0)

            chunk_embeddings.append(chunk_emb)

            start_pos = sep_pos + 1

        # Handle the last chunk (after the last SEP token)
        last_valid_pos = attention_mask[batch_idx].sum().item()

        chunk_tokens = token_embeddings[batch_idx, start_pos:last_valid_pos]
        chunk_mask = attention_mask[batch_idx, start_pos:last_valid_pos]

        if chunk_mask.sum() > 0:
            chunk_emb = mean_pooling(
                chunk_tokens.unsqueeze(0), chunk_mask.unsqueeze(0)
            ).squeeze(0)
        else:
            # Empty chunk - create a zero embedding
            chunk_emb = torch.zeros(
                token_embeddings.shape[-1],
                device=token_embeddings.device,
                dtype=token_embeddings.dtype,
            )

        chunk_embeddings.append(chunk_emb)

        all_doc_chunks.append(chunk_embeddings)

    return all_doc_chunks


hf_path = "perplexity-ai/pplx-embed-context-v1-4B"
onnx_path = "onnx/model.onnx"

tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
session = ort.InferenceSession(onnx_path)

texts = [
    [
        "Curiosity begins in childhood with endless questions about the world.",
        "As we grow, curiosity drives us to explore new ideas.",
        "Scientific breakthroughs often start with a curious question."
    ],
    [
        "The curiosity rover explores Mars searching for ancient life.",
        "Each discovery on Mars sparks new questions about the universe."
    ]
]
# Concatenate each document's chunks with the SEP token for late chunking
doc_strings = [
    tokenizer.sep_token.join(chunks) for chunks in texts
]

tokenized = tokenizer(
    doc_strings,
    padding=True,
    truncation=True,
    return_tensors="np",
)
onnx_inputs = {
    "input_ids": tokenized["input_ids"].astype(np.int64),
    "attention_mask": tokenized["attention_mask"].astype(np.int64),
}

# Run inference
onnx_outputs = session.run([out.name for out in session.get_outputs()], onnx_inputs)
# onnx_outputs is a list with one element: [last_hidden_state]
last_hidden_state = onnx_outputs[0]

batch_chunk_embeddings = extract_chunks_from_concatenated(
    input_ids=torch.tensor(onnx_inputs["input_ids"]),
    token_embeddings=torch.tensor(last_hidden_state),
    attention_mask=torch.tensor(onnx_inputs["attention_mask"]),
    tokenizer=tokenizer,
)

# Stack each document's chunk embeddings into a (num_chunks, hidden_dim) tensor
batch_chunk_embeddings = [
    torch.stack([chunk for chunk in doc_chunks], dim=0)
    for doc_chunks in batch_chunk_embeddings
]

int8_embeddings = [quantize_int8_tanh(x) for x in batch_chunk_embeddings]
binary_embeddings = [quantize_binary(x) for x in batch_chunk_embeddings]

# Pack the binary embeddings into bits for compact storage
bits = [np.where(doc.numpy() >= 0, True, False) for doc in binary_embeddings]
packed_embeddings = [np.packbits(b, axis=-1) for b in bits]
```
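
As a usage sketch for the packed binary embeddings produced above (this snippet assumes the `packed_embeddings` variable from the previous block), chunks can be ranked by Hamming distance using simple bit counting:

```python
# Rank document-0 chunks against one document-1 chunk used here as an illustrative query,
# using Hamming distance over the packed binary codes (lower = more similar).
query_packed = packed_embeddings[1][0]   # packed bits of a single chunk
doc_packed = packed_embeddings[0]        # shape: (num_chunks, n_bytes)

xor = np.bitwise_xor(doc_packed, query_packed)
hamming = np.unpackbits(xor, axis=-1).sum(axis=-1)
print(hamming.argsort())  # chunk indices ordered by increasing Hamming distance
```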

</details>

## Technical Details

For comprehensive technical details and evaluation results, see our paper on arXiv: https://arxiv.org/abs/2602.11151.