Commit 90fed38 · Parent: 0ae4abd

Update README.md

README.md CHANGED

@@ -80,7 +80,10 @@ there were some limitations on its performance on longer context. Motivated by i
 - **Model License:** Apache 2.0
 - **Contact:** [GitHub issues](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/issues)
 
-## How to Use MistralLite from Python Code ##
+## How to Use MistralLite from Python Code (HuggingFace transformers) ##
+
+**Important** - For an end-to-end example Jupyter notebook, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/huggingface-transformers/example_usage.ipynb).
+
 ### Install the necessary packages
 
 Requires: [transformers](https://pypi.org/project/transformers/) 4.34.0 or later, [flash-attn](https://pypi.org/project/flash-attn/) 2.3.1.post1 or later,
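
For reference, a minimal install sketch consistent with the minimum versions named in the hunk above; the exact pins and the `--no-build-isolation` flag are assumptions rather than part of this commit (flash-attn typically needs torch and a CUDA toolchain available when it builds):

```shell
# Hypothetical install commands matching the stated minimum versions.
# flash-attn compiles against the torch already in the environment,
# hence --no-build-isolation.
pip install "transformers>=4.34.0"
pip install "flash-attn>=2.3.1.post1" --no-build-isolation
```
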
@@ -128,7 +131,78 @@ for seq in sequences:
 <|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>
 ```
 
+## How to Serve MistralLite on TGI ##
+**Important:**
+- For an end-to-end example Jupyter notebook using the native TGI container, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/tgi/example_usage.ipynb).
+- If the **input context length is greater than 12K tokens**, it is recommended to use a custom TGI container; please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/tgi-custom/example_usage.ipynb).
+
+### Start TGI server ###
+Use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`
+
+Example Docker parameters:
+
+```shell
+docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
+      --model-id amazon/MistralLite \
+      --max-input-length 16000 \
+      --max-total-tokens 16384 \
+      --max-batch-prefill-tokens 16384 \
+      --trust-remote-code
+```
+
+### Perform Inference ###
+Example Python code for inference with TGI (requires `text_generation` 0.6.1 or later):
+
+```shell
+pip install text_generation==0.6.1
+```
+
+```python
+from text_generation import Client
+
+SERVER_PORT = 443
+SERVER_HOST = "localhost"
+SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
+tgi_client = Client(f"http://{SERVER_URL}", timeout=60)
+
+def invoke_tgi(prompt,
+               random_seed=1,
+               max_new_tokens=400,
+               print_stream=True,
+               assist_role=True):
+    if (assist_role):
+        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
+    output = ""
+    for response in tgi_client.generate_stream(
+        prompt,
+        do_sample=False,
+        max_new_tokens=max_new_tokens,
+        return_full_text=False,
+        #temperature=None,
+        #truncate=None,
+        #seed=random_seed,
+        #typical_p=0.2,
+    ):
+        if hasattr(response, "token"):
+            if not response.token.special:
+                snippet = response.token.text
+                output += snippet
+                if (print_stream):
+                    print(snippet, end='', flush=True)
+    return output
+
+prompt = "What are the main challenges to support a long context for LLM?"
+result = invoke_tgi(prompt)
+```
+
+**Important** - When using MistralLite for inference for the first time, it may require a brief 'warm-up' period that can take tens of seconds; subsequent inferences should return results more quickly. This warm-up is normal and does not affect overall performance once the initialisation period has completed.
+
+
 ## How to Deploy MistralLite on Amazon SageMaker ##
+**Important:**
+- For an end-to-end example Jupyter notebook using the SageMaker built-in container, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/sagemaker-tgi/example_usage.ipynb).
+- If the **input context length is greater than 12K tokens**, it is recommended to use a custom Docker container; please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/sagemaker-tgi-custom/example_usage.ipynb).
+
 ### Install the necessary packages
 
 Requires: [sagemaker](https://pypi.org/project/sagemaker/) 2.192.1 or later.
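
For reference, a minimal sketch of the install step named at the end of the hunk above; the pin is an assumption derived from the stated minimum version and is not part of this commit:

```shell
# Hypothetical install command for the SageMaker Python SDK minimum named above.
pip install "sagemaker>=2.192.1"
# Optional sanity check of the installed version.
python -c "import sagemaker; print(sagemaker.__version__)"
```
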
@@ -231,72 +305,12 @@ result = call_endpoint(client, prompt, endpoint_name, parameters)
 print(result)
 ```
 
-## How to Serve MistralLite on TGI ##
-
-### Start TGI server ###
-Use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`
-
-Example Docker parameters:
-
-```shell
-docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
-      --model-id amazon/MistralLite \
-      --max-input-length 16000 \
-      --max-total-tokens 16384 \
-      --max-batch-prefill-tokens 16384 \
-      --trust-remote-code
-```
-
-### Perform Inference ###
-Example Python code for inference with TGI (requires `text_generation` 0.6.1 or later):
-
-```shell
-pip install text_generation==0.6.1
-```
-
-```python
-from text_generation import Client
-
-SERVER_PORT = 443
-SERVER_HOST = "localhost"
-SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
-tgi_client = Client(f"http://{SERVER_URL}", timeout=60)
-
-def invoke_tgi(prompt,
-               random_seed=1,
-               max_new_tokens=400,
-               print_stream=True,
-               assist_role=True):
-    if (assist_role):
-        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
-    output = ""
-    for response in tgi_client.generate_stream(
-        prompt,
-        do_sample=False,
-        max_new_tokens=max_new_tokens,
-        return_full_text=False,
-        #temperature=None,
-        #truncate=None,
-        #seed=random_seed,
-        #typical_p=0.2,
-    ):
-        if hasattr(response, "token"):
-            if not response.token.special:
-                snippet = response.token.text
-                output += snippet
-                if (print_stream):
-                    print(snippet, end='', flush=True)
-    return output
-
-prompt = "What are the main challenges to support a long context for LLM?"
-result = invoke_tgi(prompt)
-```
-
-**Important** - When using MistralLite for inference for the first time, it may require a brief 'warm-up' period that can take tens of seconds; subsequent inferences should return results more quickly. This warm-up is normal and does not affect overall performance once the initialisation period has completed.
 
 ## How to Serve MistralLite on vLLM ##
 Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
 
+**Important** - For an end-to-end example Jupyter notebook, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/vllm/example_usage.ipynb).
+
 ### Using vLLM as a server ###
 When using vLLM as a server, pass the --model amazon/MistralLite parameter, for example:
 ```shell