---
license: mit
pipeline_tag: feature-extraction
tags:
- feature-extraction
- sentence-similarity
- mteb
- sentence-transformers
language:
- multilingual
---


<p align="center">
  <img src="assets/logo.svg" alt="Perplexity Logo" width="400">
</p>

<p align="center">pplx-embed-v1: Diffusion-Pretrained Dense and Contextual Embeddings</p>

`pplx-embed-v1` and `pplx-embed-context-v1` are state-of-the-art text embedding models optimized for real-world, web-scale retrieval tasks.

- Use **`pplx-embed-v1`** for independent text embedding (queries, documents, semantic search)
- Use **`pplx-embed-context-v1`** for document chunks in RAG systems where surrounding context matters

> [!IMPORTANT]
> `pplx-embed-v1` and `pplx-embed-context-v1` natively produce *unnormalized* int8-quantized embeddings. Ensure that you compare them via *cosine similarity*.
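
Because the embeddings are unnormalized int8 vectors, a raw dot product is not a valid similarity score. A minimal cosine-similarity sketch (synthetic vectors stand in for model output; upcast to float before the dot product to avoid int8 overflow):

```python
import numpy as np

# Illustrative unnormalized int8 embeddings (real ones come from the model).
a = np.array([12, -3, 45, 7], dtype=np.int8)
b = np.array([10, -1, 40, 9], dtype=np.int8)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Upcast before multiplying: int8 products overflow easily.
    u = u.astype(np.float32)
    v = v.astype(np.float32)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

score = cosine_similarity(a, b)
```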


![diag.png](assets/diag.png)

## Models

| Model | Dimensions | Context | MRL | Quantization | Instruction | Pooling |
|:-----:|:----------:|:-------:|:---:|:------------:|:-----------:|:-------:|
| `pplx-embed-v1-0.6B` | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| `pplx-embed-v1-4B` | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
| `pplx-embed-context-v1-0.6B` | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| `pplx-embed-context-v1-4B` | 2560 | 32K | Yes | INT8/BINARY | No | Mean |

<sub>All models are built at Perplexity AI on Qwen3 with diffusion continued pre-training.</sub>

<sub>Many modern embedding models rely on instruction tuning, where users prepend an instruction string to the text being embedded. This can yield a 2%-3% lift on benchmarks, but it also introduces prompt-selection overhead and can make indexing pipelines brittle (small instruction changes can shift embedding space). We deliberately **avoid** this requirement: you can embed the text you want to index directly, without having to choose or maintain an instruction prefix.</sub>
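The table above lists MRL (Matryoshka Representation Learning) support, which conventionally means the leading dimensions of an embedding form a usable lower-dimensional embedding on their own. A sketch of truncating to 512 dimensions, then comparing with cosine similarity as usual (512 is an arbitrary choice, synthetic vectors stand in for model output, and standard Matryoshka truncation behavior is assumed):

```python
import numpy as np

# Hypothetical full-dimension int8 embeddings (e.g. 2560-d from pplx-embed-v1-4B).
rng = np.random.default_rng(0)
embeddings = rng.integers(-128, 128, size=(5, 2560)).astype(np.int8)

# Keep only the leading 512 dimensions, then L2-normalize for cosine similarity.
truncated = embeddings[:, :512].astype(np.float32)
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)
similarity = truncated @ truncated.T  # (5, 5) cosine-similarity matrix
```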

## Usage

<details>
<summary>Via API</summary>

```bash
curl -X POST https://api.perplexity.ai/v1/embeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Scientists explore the universe driven by curiosity.",
      "Children learn through curious exploration.",
      "Historical discoveries began with curious questions.",
      "Animals use curiosity to adapt and survive.",
      "Philosophy examines the nature of curiosity."
    ],
    "model": "pplx-embed-v1-4b"
  }'
```

</details>


<details>
<summary>Using SentenceTransformers</summary>

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "perplexity-ai/pplx-embed-v1-4B",
    trust_remote_code=True
)

texts = [
    "Scientists explore the universe driven by curiosity.",
    "Children learn through curious exploration.",
    "Historical discoveries began with curious questions.",
    "Animals use curiosity to adapt and survive.",
    "Philosophy examines the nature of curiosity.",
]

embeddings = model.encode(texts) # Shape: (5, 2560), quantized to int8
embeddings = model.encode(texts, quantization="binary") # Shape: (5, 2560), quantized to binary
```

</details>

<details>
<summary>Using ONNX models</summary>

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("perplexity-ai/pplx-embed-v1-4B", trust_remote_code=True)
session = ort.InferenceSession("onnx/model.onnx")

texts = [
    "Scientists explore the universe driven by curiosity.",
    "Children learn through curious exploration.",
    "Historical discoveries began with curious questions.",
    "Animals use curiosity to adapt and survive.",
    "Philosophy examines the nature of curiosity.",
]

tokenized = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="np"
)

onnx_inputs = {
    "input_ids": tokenized["input_ids"].astype(np.int64),
    "attention_mask": tokenized["attention_mask"].astype(np.int64),
}

# Run inference
onnx_embeddings = session.run([out.name for out in session.get_outputs()], onnx_inputs)

# The ONNX model outputs both int8 and binary embeddings:
int8_embeddings = onnx_embeddings[2]
binary_embeddings = onnx_embeddings[3]
# Binary values are in {-1, +1}; pack them into bits for compact storage.
packed_embeddings = np.packbits(binary_embeddings != -1, axis=-1)
```
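
Packed binary embeddings are typically compared with Hamming distance (smaller is more similar). A sketch using synthetic bits, mirroring the `np.packbits` call above:

```python
import numpy as np

# Illustrative bit vectors (real ones come from packing the binary embeddings).
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(5, 2560)).astype(np.uint8)
packed = np.packbits(bits, axis=-1)  # shape (5, 320), one byte per 8 bits

def hamming(u: np.ndarray, v: np.ndarray) -> int:
    # XOR the packed bytes, then count the set bits.
    return int(np.unpackbits(np.bitwise_xor(u, v)).sum())

d = hamming(packed[0], packed[1])
```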

</details>

<details>
<summary>Using Text Embeddings Inference (TEI)</summary>

> [!NOTE]
> Text Embeddings Inference v1.9.2+ is required.

> [!IMPORTANT]
> Currently, only int8-quantized embeddings are available via TEI. Remember to use cosine similarity with unnormalized int8 embeddings.

- CPU w/ Candle:

```bash
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 --model-id perplexity-ai/pplx-embed-v1-4B --dtype float32
```

- CPU w/ ORT (ONNX Runtime):

```bash
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 --model-id onnx-community/pplx-embed-v1-4B --dtype float32
```

- GPU w/ CUDA:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id perplexity-ai/pplx-embed-v1-4B --dtype float32
```

> If you hit OOM during warmup, lower `--max-batch-tokens` and `--max-client-batch-size`. Set `--max-batch-tokens` to `max_sequence_length × batch_size` (e.g., 2048 tokens × 8 sequences = 16384).

> Alternatively, on CUDA you can use an architecture-specific (compute-capability-specific) container
> instead of `cuda-1.9`, which bundles binaries for Turing, Ampere, Hopper, and Blackwell; a dedicated
> container such as `ampere-1.9` is lighter.

You can then send requests to the `/embed` endpoint via cURL:

```bash
curl http://0.0.0.0:8080/embed \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      "Scientists explore the universe driven by curiosity.",
      "Children learn through curious exploration.",
      "Historical discoveries began with curious questions.",
      "Animals use curiosity to adapt and survive.",
      "Philosophy examines the nature of curiosity."
    ],
    "normalize": false
  }'
```

</details>

## Technical Details

For comprehensive technical details and evaluation results, see our paper on arXiv: https://arxiv.org/abs/2602.11151.