Update vLLM usage docs: remove config_vllm.json overwrite, relax version pin, and clarify minimal required flags
#18
by nvidia-oliver-holworthy - opened
This PR updates the vLLM Usage section in README.md to reflect current behavior for nvidia/llama-nemotron-embed-1b-v2.
What changed
- Updated version guidance from `vllm==0.16.0` to `vllm>=0.14.0`.
- Removed the outdated step to overwrite `config.json` with `config_vllm.json`.
- Simplified the serving command to the minimal required invocation using the HF repo ID: `vllm serve nvidia/llama-nemotron-embed-1b-v2 --trust-remote-code`.
- Clarified that a local path can also be used instead of the HF repo ID.
- Removed `--runner pooling` and `--pooler-config` from the recommended flags.
- Kept only operational optional flags (`--dtype`, `--data-parallel-size`, `--port`).
- Added a clarification that mean pooling is already configured in `config.json` (`"pooling": "avg"`), so overriding the pooler config is generally unnecessary and not recommended for retrieval quality.
- Updated the OpenAI SDK example to include `api_key="EMPTY"` (required by the OpenAI Python client even for local vLLM).
- Added an offline vLLM Python API example (`LLM(...).embed(...)`) to clearly distinguish online serving from offline inference.
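The online/offline distinction above can be sketched as follows. This is an illustrative sketch, not code from the PR itself: the server URL/port and the example input text are assumptions, and the online half requires a server already started with `vllm serve nvidia/llama-nemotron-embed-1b-v2 --trust-remote-code`.

```python
# Online serving: query the OpenAI-compatible embeddings endpoint of a
# locally running vLLM server (default port 8000 assumed here).
from openai import OpenAI

# api_key="EMPTY" satisfies the OpenAI Python client; a local vLLM
# server does not check the key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="nvidia/llama-nemotron-embed-1b-v2",
    input=["What is the capital of France?"],  # illustrative input
)
embedding = resp.data[0].embedding

# Offline inference: load the model in-process instead of talking to a server.
from vllm import LLM

llm = LLM(model="nvidia/llama-nemotron-embed-1b-v2", trust_remote_code=True)
outputs = llm.embed(["What is the capital of France?"])
vector = outputs[0].outputs.embedding
```

Both paths return one embedding vector per input string; the offline path avoids HTTP entirely, which is convenient for batch jobs.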
Why
The model now works with modern vLLM using its default `config.json`, so the previous config replacement workflow is no longer needed. The updated docs reduce setup friction and align examples with current vLLM usage patterns.
Validation
- Confirmed startup works with the minimal command and no config replacement on vLLM 0.14.0.
- Confirmed `config.json` already encodes the expected pooling default.
- Confirmed the OpenAI client requires an API key field and works with `api_key="EMPTY"` for local vLLM.
- Confirmed output embeddings match the reference PyTorch/Transformers implementation.
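The last validation step (embeddings matching the reference implementation) is typically checked with a similarity or element-wise comparison. A minimal stdlib-only sketch, using toy stand-in vectors rather than real model outputs:

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length float vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy stand-ins for one vLLM embedding and the matching Transformers
# reference embedding; real vectors would come from the two pipelines.
vllm_vec = [0.1, 0.2, 0.3]
ref_vec = [0.1000001, 0.2, 0.2999999]

# Near-identical vectors should have cosine similarity very close to 1.
print(cosine_similarity(vllm_vec, ref_vec) > 0.9999)
```

In practice one would run the same inputs through both pipelines and assert the similarity stays above a tight threshold for every test string.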
nvidia-oliver-holworthy changed pull request status to open
ybabakhin changed pull request status to merged