alexmarques committed
Commit 2722d19 · verified · 1 Parent(s): d0db297

Update README.md

Files changed (1)
  1. README.md +25 -16

README.md CHANGED
@@ -69,32 +69,41 @@ This optimization reduces the number of bits per parameter from 16 to 4, reducin
  ### Use with vLLM

- This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

  ```python
- from transformers import AutoTokenizer
- from vllm import LLM, SamplingParams

- max_model_len, tp_size = 4096, 1
- model_name = "neuralmagic/Mistral-Small-24B-Instruct-2501-quantized.w4a16"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
- sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

- messages_list = [
-     [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
- ]

- prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

- outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

- generated_text = [output.outputs[0].text for output in outputs]
  print(generated_text)
  ```

- vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
-
  <details>
  <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
 
  ### Use with vLLM

+ 1. Initialize vLLM server:
+ ```
+ vllm serve RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w4a16 --tensor_parallel_size 1 --tokenizer_mode mistral
+ ```
+
+ 2. Send requests to the server:

  ```python
+ from openai import OpenAI

+ # Modify OpenAI's API key and API base to use vLLM's API server.
+ openai_api_key = "EMPTY"
+ openai_api_base = "http://<your-server-host>:8000/v1"

+ client = OpenAI(
+     api_key=openai_api_key,
+     base_url=openai_api_base,
+ )

+ model = "RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w4a16"

+ messages = [
+     {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
+ ]
+
+ outputs = client.chat.completions.create(
+     model=model,
+     messages=messages,
+ )
+
+ generated_text = outputs.choices[0].message.content
  print(generated_text)
  ```

  <details>
  <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
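The updated README sends requests through the `openai` client, which under the hood POSTs JSON to vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. As a minimal sketch of what that request looks like on the wire (the host is the same placeholder used in the README, and the actual POST is left commented out so the snippet runs without a live server):

```python
import json

# Endpoint and model name taken from the README above; the host is a placeholder.
base_url = "http://<your-server-host>:8000/v1"
model = "RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w4a16"

# Body of an OpenAI-compatible chat completion request.
payload = {
    "model": model,
    "messages": [
        {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
    ],
}

body = json.dumps(payload)

# With a running server, this body would be sent as:
# requests.post(f"{base_url}/chat/completions",
#               headers={"Authorization": "Bearer EMPTY",
#                        "Content-Type": "application/json"},
#               data=body)
print(body)
```

This is what any OpenAI-compatible client constructs for you; tools like `curl` can send the same body directly.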