	# CLI arguments

	To see all options to serve your models, run the following:

	```console
	$ text-embeddings-router --help
	Text Embedding Webserver

	Usage: text-embeddings-router [OPTIONS] --model-id

	Options:
	--model-id
	The Hugging Face model ID. It can be any model listed on the Hugging Face Hub with the `text-embeddings-inference` tag (meaning it's compatible with Text Embeddings Inference).

	Alternatively, the specified ID can also be a path to a local directory containing the necessary model files saved by the `save_pretrained(...)` methods of either Transformers or Sentence Transformers.

	[env: MODEL_ID=]

	--revision
	The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`

	[env: REVISION=]

	--tokenization-workers
	Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation. Defaults to the number of CPU cores on the machine

	[env: TOKENIZATION_WORKERS=]

	--dtype
	The dtype to be forced upon the model

	[env: DTYPE=]
	[possible values: float16, float32]

	--served-model-name
	The name of the model that is being served. If not specified, defaults to `--model-id`. It is only used for the OpenAI-compatible endpoints via HTTP

	[env: SERVED_MODEL_NAME=]

	--pooling
	Optionally control the pooling method for embedding models.

	If `pooling` is not set, the pooling configuration will be parsed from the model `1_Pooling/config.json` configuration.

	If `pooling` is set, it will override the model pooling configuration

	[env: POOLING=]

	Possible values:
	- cls: Select the CLS token as embedding
	- mean: Apply Mean pooling to the model embeddings
	- splade: Apply SPLADE (Sparse Lexical and Expansion) to the model embeddings. This option is only available if the loaded model is a `ForMaskedLM` Transformer model
	- last-token: Select the last token as embedding

	--max-concurrent-requests
	The maximum number of concurrent requests for this particular deployment. A low limit refuses client requests instead of making them wait too long, which is usually a good way to handle backpressure correctly

	[env: MAX_CONCURRENT_REQUESTS=]
	[default: 512]

	--max-batch-tokens
	IMPORTANT: This is a critical control for making maximum use of the available hardware.

	This represents the total number of tokens that can fit within a batch.

	For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.

	Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.

	[env: MAX_BATCH_TOKENS=]
	[default: 16384]

	--max-batch-requests
	Optionally control the maximum number of individual requests in a batch

	[env: MAX_BATCH_REQUESTS=]

	--max-client-batch-size
	Control the maximum number of inputs that a client can send in a single request

	[env: MAX_CLIENT_BATCH_SIZE=]
	[default: 32]

	--auto-truncate
	Control automatic truncation of inputs that exceed the model's maximum supported size. Defaults to `true` (truncation enabled). Set to `false` to disable truncation; when disabled and the model's maximum input length exceeds `--max-batch-tokens`, the server will refuse to start with an error instead of silently truncating sequences.

	Unused for gRPC servers

	[env: AUTO_TRUNCATE=]

	--default-prompt-name
	The name of the prompt that should be used by default for encoding. If not set, no prompt will be applied.

	Must be a key in the `sentence-transformers` configuration `prompts` dictionary.

	For example if ``default_prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode.

	The argument '--default-prompt-name' cannot be used with '--default-prompt'

	[env: DEFAULT_PROMPT_NAME=]

	--default-prompt
	The prompt that should be used by default for encoding. If not set, no prompt will be applied.

	For example if ``default_prompt`` is "query: " then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode.

	The argument '--default-prompt' cannot be used with '--default-prompt-name'

	[env: DEFAULT_PROMPT=]

	--dense-path
	Optionally, define the path to the Dense module required for some embedding models.

	Some embedding models require an extra `Dense` module which contains a single Linear layer and an activation function. By default, those `Dense` modules are stored under the `2_Dense` directory, but there might be cases where different `Dense` modules are provided, to convert the pooled embeddings into different dimensions, available as `2_Dense_` e.g. https://huggingface.co/NovaSearch/stella_en_400M_v5.

	Note that this argument is optional. It only needs to be set if there is no `modules.json` file or when you want to override a single Dense module path, and it only applies when running with the `candle` backend.

	[env: DENSE_PATH=]

	--hf-token
	Your Hugging Face Hub token. If neither `--hf-token` nor `HF_TOKEN` is set, the token will be read from the `$HF_HOME/token` path, if it exists. This ensures access to private or gated models, and allows for more permissive rate limiting

	[env: HF_TOKEN=]

	--hostname
	The IP address to listen on

	[env: HOSTNAME=]
	[default: 0.0.0.0]

	-p, --port
	The port to listen on

	[env: PORT=]
	[default: 3000]

	--uds-path
	The name of the unix socket some text-embeddings-inference backends will use as they communicate internally with gRPC

	[env: UDS_PATH=]
	[default: /tmp/text-embeddings-inference-server]

	--huggingface-hub-cache
	The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance

	[env: HUGGINGFACE_HUB_CACHE=]

	--payload-limit
	Payload size limit in bytes

	Default is 2MB

	[env: PAYLOAD_LIMIT=]
	[default: 2000000]

	--api-key
	Set an api key for request authorization.

	By default the server responds to every request. With an api key set, the requests must have the Authorization header set with the api key as Bearer token.

	[env: API_KEY=]

	--json-output
	Outputs the logs in JSON format (useful for telemetry)

	[env: JSON_OUTPUT=]

	--disable-spans
	Whether or not to include the log trace through spans

	[env: DISABLE_SPANS=]

	--otlp-endpoint
	The gRPC endpoint for OpenTelemetry. Telemetry is sent to this endpoint as OTLP over gRPC. e.g. `http://localhost:4317`

	[env: OTLP_ENDPOINT=]

	--otlp-service-name
	The service name for opentelemetry. e.g. `text-embeddings-inference.server`

	[env: OTLP_SERVICE_NAME=]
	[default: text-embeddings-inference.server]

	--prometheus-port
	The Prometheus port to listen on

	[env: PROMETHEUS_PORT=]
	[default: 9000]

	--cors-allow-origin
	Unused for gRPC servers

	[env: CORS_ALLOW_ORIGIN=]

	-h, --help
	Print help (see a summary with '-h')

	-V, --version
	Print version
	```
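
The `--max-batch-tokens` and `--max-batch-requests` descriptions above can be easier to reason about with a small worked example. The sketch below is only an illustration of the token budget arithmetic, not the server's actual batching code: the `pack_batches` function, its greedy arrival-order strategy, and its choice to reject oversized requests are all assumptions made for this example.

```python
from typing import List, Optional


def pack_batches(
    token_counts: List[int],
    max_batch_tokens: int = 16384,
    max_batch_requests: Optional[int] = None,
) -> List[List[int]]:
    """Greedily group requests, in arrival order, under a token budget.

    Hypothetical helper: each entry of `token_counts` is the token count of
    one request; each returned batch is a list of those counts.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0

    for tokens in token_counts:
        if tokens > max_batch_tokens:
            # Simplification for this sketch: reject oversized requests.
            raise ValueError(
                f"request of {tokens} tokens exceeds the budget of {max_batch_tokens}"
            )

        over_tokens = current_tokens + tokens > max_batch_tokens
        over_requests = (
            max_batch_requests is not None and len(current) >= max_batch_requests
        )
        if current and (over_tokens or over_requests):
            # The current batch is full: close it and start a new one.
            batches.append(current)
            current, current_tokens = [], 0

        current.append(tokens)
        current_tokens += tokens

    if current:
        batches.append(current)
    return batches


# The two cases from the help text, with max_batch_tokens=1000.
ten_short = pack_batches([100] * 10, max_batch_tokens=1000)
one_long = pack_batches([1000], max_batch_tokens=1000)
print(len(ten_short), len(one_long))
```

Both cases fit in a single batch, which matches the help text: ten queries of 100 tokens or one query of 1000 tokens exactly fill a budget of 1000. Passing `max_batch_requests=2` to the same function would instead split the ten short queries into five batches, since the request cap is reached before the token budget.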
