bowang0911 committed on
Commit 964a2aa · 1 Parent(s): f862230

Update README.md (#9)

- Update README.md (6dfa7e22117fbfd67b53b889676962385fa8339f)

Files changed (1):
  1. README.md +52 -0
README.md CHANGED
@@ -134,6 +134,58 @@ packed_embeddings = np.packbits(binary_embeddings != -1, axis=-1)
 
  </details>
 
+ <details>
+ <summary>Using Text Embeddings Inference (TEI)</summary>
+
+ > [!NOTE]
+ > Text Embeddings Inference v1.9.2+ is required.
+
+ > [!IMPORTANT]
+ > Currently, only int8-quantized embeddings are available via TEI. Remember to use cosine similarity with the unnormalized int8 embeddings.
+
+ - CPU w/ Candle:
+
+ ```bash
+ docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 --model-id perplexity-ai/pplx-embed-v1-4B --dtype float32
+ ```
+
+ - CPU w/ ORT (ONNX Runtime):
+
+ ```bash
+ docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 --model-id onnx-community/pplx-embed-v1-4B --dtype float32
+ ```
+
+ - GPU w/ CUDA:
+
+ ```bash
+ docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id perplexity-ai/pplx-embed-v1-4B --dtype float32
+ ```
+
+ > If you hit OOM during warmup, lower `--max-batch-tokens` and `--max-client-batch-size`. Set `--max-batch-tokens` to max_sequence_length × batch_size (e.g., 2048 tokens × 8 sequences = 16384).
+
+ > Alternatively, when running on CUDA you can use an architecture-specific (compute-capability-specific) container instead of `cuda-1.9`, which bundles binaries for Turing, Ampere, Hopper, and Blackwell; a dedicated container such as `ampere-1.9` is lighter.
+
+ You can then send requests to the `/embed` endpoint via cURL:
+
+ ```bash
+ curl http://0.0.0.0:8080/embed \
+ -H "Content-Type: application/json" \
+ -d '{
+ "inputs": [
+ "Scientists explore the universe driven by curiosity.",
+ "Children learn through curious exploration.",
+ "Historical discoveries began with curious questions.",
+ "Animals use curiosity to adapt and survive.",
+ "Philosophy examines the nature of curiosity."
+ ],
+ "normalize": false
+ }'
+ ```
+
+ </details>
+
 
  ## Technical Details
 
  For comprehensive technical details and evaluation results, see our paper on arXiv: https://arxiv.org/abs/2602.11151.
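
The diff above notes that TEI currently serves unnormalized int8 embeddings and that cosine similarity should be used with them. A minimal sketch of that scoring step (the toy int8 vectors are illustrative stand-ins, not actual model output):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cast to float32 before the dot product so int8 values do not overflow,
    # then normalize explicitly since the embeddings arrive unnormalized.
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy int8 vectors standing in for two embeddings returned by /embed
# with "normalize": false.
emb_a = np.array([12, -7, 33, 90], dtype=np.int8)
emb_b = np.array([10, -5, 30, 88], dtype=np.int8)

print(cosine_similarity(emb_a, emb_b))
```

The float32 cast is the important detail: a raw dot product of int8 arrays can overflow, silently corrupting the similarity score.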