Commit eec040e · Parent(s): 206ab7e

Usage section for TEI (#4)

- docs: added usage section for TEI (48b327cf88cf893c73dcebd372043472c3ff1014)
- docs: update snippet to not use normalization (9238a3b6917121615a94512ec8a84a3d51cf0107)
- fix: use /embed endpoint (6f75acd873b3c46c9e34ed45b061030b24e4e199)
- fix: remove mention of OpenAI API (df3b2f9a723081dc4606908bed889141f2cb5be3)
README.md CHANGED

@@ -86,6 +86,56 @@ embeddings = model.encode(texts, quantization="binary") # Shape: (5, 2560), quan
</details>

<details>
<summary>Using Text Embeddings Inference (TEI)</summary>

> [!NOTE]
> Text Embeddings Inference v1.9.0 will be released as stable soon; in the meantime,
> feel free to use the `latest` containers, or pin the image via its SHA ``.

> [!IMPORTANT]
> Currently, only int8-quantized embeddings are available via TEI. These embeddings are not normalized, so remember to use cosine similarity (rather than a plain dot product) when comparing them.

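For instance, a minimal sketch (hypothetical vectors, assuming NumPy) of how such unnormalized embeddings would be compared:

```python
import numpy as np

# Hypothetical unnormalized int8-quantized embedding vectors returned by TEI
a = np.array([12.0, -3.0, 27.0, 5.0])
b = np.array([10.0, -1.0, 30.0, 2.0])

# Cosine similarity: divide by the norms, since the vectors are not unit-norm
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)
```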
- CPU w/ Candle:

```bash
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-latest --model-id perplexity-ai/pplx-embed-1-4B --auto-truncate
```

- CPU w/ ORT (ONNX Runtime):

```bash
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-latest --model-id onnx-community/pplx-embed-1-4B --auto-truncate
```

- GPU w/ CUDA:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-latest --model-id perplexity-ai/pplx-embed-1-4B --auto-truncate
```

> Alternatively, when running on CUDA you can use an architecture / compute-capability specific
> container instead of `cuda-latest`, which bundles the binaries for Turing, Ampere, and Hopper;
> a dedicated container such as `ampere-latest` will therefore be lighter.

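For example, a sketch of the same GPU launch pinned to the Ampere-specific tag mentioned above (assuming an Ampere-generation GPU):

```bash
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:ampere-latest --model-id perplexity-ai/pplx-embed-1-4B --auto-truncate
```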
And then you can send requests to it via cURL to `/embed`:

```bash
curl http://0.0.0.0:8080/embed \
    -H "Content-Type: application/json" \
    -d '{
        "inputs": [
            "Scientists explore the universe driven by curiosity.",
            "Children learn through curious exploration.",
            "Historical discoveries began with curious questions.",
            "Animals use curiosity to adapt and survive.",
            "Philosophy examines the nature of curiosity."
        ],
        "normalize": false
    }'
```
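The same request can also be sent from Python. A minimal sketch, assuming the `requests` library is installed, the server above is listening on port 8080, and `/embed` returns one embedding vector per input as a JSON list of lists:

```python
import numpy as np
import requests

payload = {
    "inputs": [
        "Scientists explore the universe driven by curiosity.",
        "Children learn through curious exploration.",
    ],
    # Keep the unnormalized int8-quantized embeddings, as in the cURL example
    "normalize": False,
}

response = requests.post("http://0.0.0.0:8080/embed", json=payload)
response.raise_for_status()

embeddings = np.array(response.json())  # shape: (num_inputs, embedding_dim)

# Pairwise cosine similarities: normalize the rows first, since the
# embeddings are not unit-norm (see the note above)
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
print(normalized @ normalized.T)
```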

</details>

## Technical Details
