Commit d1f0846
Parent(s): 9f347ce

Update README.md

README.md CHANGED
## How to Use MistralLite from Python Code ##
### Install the necessary packages

Requires: [transformers](https://pypi.org/project/transformers/) 4.34.0 or later, [flash-attn](https://pypi.org/project/flash-attn/) 2.3.1.post1 or later, and [accelerate](https://pypi.org/project/accelerate/) 0.23.0 or later.

```shell
pip install transformers==4.34.0
pip install flash-attn==2.3.1.post1 --no-build-isolation
pip install accelerate==0.23.0
```
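As a quick sanity check (an addition, not part of the README), you can confirm the pinned versions imported cleanly; this assumes all three packages expose `__version__`:

```shell
python -c "import transformers, flash_attn, accelerate; print(transformers.__version__, flash_attn.__version__, accelerate.__version__)"
```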
### You can then try the following example code
The generation call from the example (an excerpt from the diff; the full setup is sketched below):

```python
sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    # ... remaining arguments and the closing parenthesis are outside this excerpt
```
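For readers trying this end to end, here is one minimal, self-contained sketch of how the pieces above plausibly fit together. It is not taken verbatim from the README: the `torch_dtype`, `use_flash_attention_2`, `device_map`, and `eos_token_id` choices are assumptions inferred from the packages installed above, and the prompt template is borrowed from the TGI example later in this file.

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amazon/MistralLite"  # assumed: the same model served via TGI below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,       # assumption: reduced precision for long contexts
    use_flash_attention_2=True,       # assumption: this is why flash-attn is installed
    device_map="auto",                # assumption: this is why accelerate is installed
)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Prompt template taken from the TGI example further below.
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"

sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,  # assumption: stop generation at </s>
)
for seq in sequences:
    print(seq["generated_text"])
```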
Example Docker parameters:

```shell
docker run -d --gpus all --shm-size 1g -p 443:80 ghcr.io/huggingface/text-generation-inference:1.1.0 \
    --model-id amazon/MistralLite \
    --max-input-length 8192 \
    --max-total-tokens 16384 \
    --max-batch-prefill-tokens 16384
```
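Once the container is running, a quick smoke test of the endpoint can be useful. This is an addition, not part of the README; it assumes the `443:80` port mapping above and uses TGI's standard `/generate` route:

```shell
curl http://localhost:443/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "<|prompter|>Hello</s><|assistant|>", "parameters": {"max_new_tokens": 64}}'
```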
### Perform Inference ###
Example Python code for inference with TGI (requires `text_generation` 0.6.1 or later):

```shell
pip install text_generation==0.6.1
```

```python
from text_generation import Client

SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"

# Client for the TGI container started above (note the 443:80 port mapping).
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)

def invoke_falconlite(prompt,
                      random_seed=1,
                      max_new_tokens=250,
                      print_stream=True,
                      assist_role=True):
    # Wrap the raw prompt in MistralLite's prompt template.
    if assist_role:
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
    output = ""
    # Stream tokens as they are generated, skipping special tokens.
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        typical_p=0.2,
        temperature=None,
        truncate=None,
        seed=random_seed,
    ):
        if hasattr(response, "token"):
            if not response.token.special:
                snippet = response.token.text
                output += snippet
                if print_stream:
                    print(snippet, end='', flush=True)
    return output

prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_falconlite(prompt)
```
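When streaming is not needed, the same client also offers a blocking call. A minimal sketch (an addition, reusing the `tgi_client` defined above; `generate` and its `generated_text` field are part of the `text_generation` client API):

```python
# Non-streaming variant: block until the full completion is returned.
response = tgi_client.generate(
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
    max_new_tokens=250,
    do_sample=False,
)
print(response.generated_text)
```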

**Important** - When using MistralLite for inference for the first time, there may be a brief 'warm-up' period that can take tens of seconds. Subsequent inferences should be faster and return results in a more timely manner. This warm-up is normal and does not affect the overall performance of the system once initialisation has completed.