Commit d1f0846
Parent(s): 9f347ce

Update README.md

README.md CHANGED
## How to Use MistralLite from Python Code ##
### Install the necessary packages

Requires: [transformers](https://pypi.org/project/transformers/) 4.34.0 or later, [flash-attn](https://pypi.org/project/flash-attn/) 2.3.1.post1 or later, and [accelerate](https://pypi.org/project/accelerate/) 0.23.0 or later.

```shell
pip install transformers==4.34.0
pip install flash-attn==2.3.1.post1 --no-build-isolation
pip install accelerate==0.23.0
```
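As a quick sanity check (an addition, not part of the README), you can confirm the pinned versions imported cleanly; this assumes all three packages expose `__version__`:

```shell
python -c "import transformers, flash_attn, accelerate; print(transformers.__version__, flash_attn.__version__, accelerate.__version__)"
```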
### You can then try the following example code
The generation call from the example (an excerpt from the diff; the full setup is sketched below):

```python
sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    # ... remaining arguments and the closing parenthesis are outside this excerpt
```
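For readers trying this end to end, here is one minimal, self-contained sketch of how the pieces above plausibly fit together. It is not taken verbatim from the README: the `torch_dtype`, `use_flash_attention_2`, `device_map`, and `eos_token_id` choices are assumptions inferred from the packages installed above, and the prompt template is borrowed from the TGI example later in this file.

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amazon/MistralLite"  # assumed: the same model served via TGI below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,       # assumption: reduced precision for long contexts
    use_flash_attention_2=True,       # assumption: this is why flash-attn is installed
    device_map="auto",                # assumption: this is why accelerate is installed
)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Prompt template taken from the TGI example further below.
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"

sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,  # assumption: stop generation at </s>
)
for seq in sequences:
    print(seq["generated_text"])
```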
Example Docker parameters:

```shell
docker run -d --gpus all --shm-size 1g -p 443:80 ghcr.io/huggingface/text-generation-inference:1.1.0 \
    --model-id amazon/MistralLite \
    --max-input-length 8192 \
    --max-total-tokens 16384 \
    --max-batch-prefill-tokens 16384
```
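Once the container is running, a quick smoke test of the endpoint can be useful. This is an addition, not part of the README; it assumes the `443:80` port mapping above and uses TGI's standard `/generate` route:

```shell
curl http://localhost:443/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "<|prompter|>Hello</s><|assistant|>", "parameters": {"max_new_tokens": 64}}'
```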
### Perform Inference ###
Example Python code for inference with TGI (requires `text_generation` 0.6.1 or later):

```shell
pip install text_generation==0.6.1
```

```python
from text_generation import Client

SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"

# Client for the TGI container started above (note the 443:80 port mapping).
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)

def invoke_falconlite(prompt,
                      random_seed=1,
                      max_new_tokens=250,
                      print_stream=True,
                      assist_role=True):
    # Wrap the raw prompt in MistralLite's prompt template.
    if assist_role:
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
    output = ""
    # Stream tokens as they are generated, skipping special tokens.
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        typical_p=0.2,
        temperature=None,
        truncate=None,
        seed=random_seed,
    ):
        if hasattr(response, "token"):
            if not response.token.special:
                snippet = response.token.text
                output += snippet
                if print_stream:
                    print(snippet, end='', flush=True)
    return output

prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_falconlite(prompt)
```
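When streaming is not needed, the same client also offers a blocking call. A minimal sketch (an addition, reusing the `tgi_client` defined above; `generate` and its `generated_text` field are part of the `text_generation` client API):

```python
# Non-streaming variant: block until the full completion is returned.
response = tgi_client.generate(
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
    max_new_tokens=250,
    do_sample=False,
)
print(response.generated_text)
```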

**Important** - When using MistralLite for inference for the first time, there may be a brief 'warm-up' period that can take tens of seconds. Subsequent inferences should be faster and return results in a more timely manner. This warm-up is normal and does not affect the overall performance of the system once initialisation has completed.