ybabakhin, radekosmulski-nvidia committed
Commit ac8c77b · verified · 1 parent: f3cb564

Add vllm config and information (#11)

- Add vllm config and information (2941eb750cac5adce958ef5a56878b3be7c83539)


Co-authored-by: Radek Osmulski <radekosmulski-nvidia@users.noreply.huggingface.co>

Files changed (2):
  1. README.md +33 -0
  2. config_vllm.json +38 -0
README.md CHANGED
@@ -137,6 +137,39 @@ print(scores.tolist())
 
 ```
 
+#### vLLM Usage
+
+1. Ensure you are using `vllm==0.11.0`.
+2. Clone [this model's repository](https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2/tree/main).
+3. Overwrite `config.json` with `config_vllm.json`.
+4. Start the vLLM server with the following command (replace `<path_to_the_cloned_repository>` and `<num_gpus_to_use>` with your values):
+```
+vllm serve \
+  <path_to_the_cloned_repository> \
+  --trust-remote-code \
+  --runner pooling \
+  --model-impl vllm \
+  --override-pooler-config '{"pooling_type": "MEAN"}' \
+  --data-parallel-size <num_gpus_to_use> \
+  --dtype float32 \
+  --port 8000
+```
+
+You can now access the model using the OpenAI SDK, for instance:
+
+```
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8000/v1")
+models = client.models.list()
+model_name = models.data[0].id
+
+response = client.embeddings.create(
+    input=['query: summit define'],
+    model=model_name
+)
+print(response.data[0].embedding)
+```
+
 ### **Software Integration**
 
 **Runtime Engine:** Llama Nemotron embedding NIM
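For retrieval, the embeddings returned by the server are typically compared with cosine similarity. A minimal, self-contained sketch (the vectors below are stand-ins; in practice you would pass `response.data[i].embedding` values obtained from the server):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors; real embeddings from this model are 2048-dimensional.
query_emb = [0.1, 0.3, 0.5]
passage_emb = [0.5, 0.3, 0.1]

score = cosine_similarity(query_emb, passage_emb)
print(round(score, 4))
```

Scores close to 1.0 indicate a near match; ranking passages by this score against a query embedding gives a simple retrieval pipeline.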
config_vllm.json ADDED
@@ -0,0 +1,38 @@
+{
+  "_name_or_path": "nvidia/llama-3.2-nv-embedqa-1b-v2",
+  "architectures": [
+    "LlamaModel"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 128000,
+  "eos_token_id": 128001,
+  "head_dim": 64,
+  "hidden_act": "silu",
+  "hidden_size": 2048,
+  "initializer_range": 0.02,
+  "intermediate_size": 8192,
+  "max_position_embeddings": 131072,
+  "is_causal": false,
+  "mlp_bias": false,
+  "model_type": "llama",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 16,
+  "num_key_value_heads": 8,
+  "pooling": "avg",
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": {
+    "factor": 32.0,
+    "high_freq_factor": 4.0,
+    "low_freq_factor": 1.0,
+    "original_max_position_embeddings": 8192,
+    "rope_type": "llama3"
+  },
+  "rope_theta": 500000.0,
+  "tie_word_embeddings": true,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.44.2",
+  "use_cache": true,
+  "vocab_size": 128256
+}
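Step 3 of the vLLM instructions (overwriting the cloned repository's `config.json` with `config_vllm.json`) can be scripted. A minimal sketch, assuming the repository has been cloned locally; the temporary directory here simulates the cloned checkout, and `repo_dir` would be the real path in practice:

```python
import json
import shutil
import tempfile
from pathlib import Path

# Stand-in for the cloned repository path.
repo_dir = Path(tempfile.mkdtemp())

# Simulate the two files that would exist after cloning.
(repo_dir / "config.json").write_text(json.dumps({"is_causal": True}))
(repo_dir / "config_vllm.json").write_text(
    json.dumps({"is_causal": False, "pooling": "avg"})
)

# Step 3: overwrite config.json with the vLLM-specific config.
shutil.copyfile(repo_dir / "config_vllm.json", repo_dir / "config.json")

config = json.loads((repo_dir / "config.json").read_text())
print(config)
```

After this, `config.json` carries the vLLM-specific fields (e.g. `"is_causal": false` and `"pooling": "avg"`), so `vllm serve` picks them up directly from the repository path.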