Commit 90fed38 · Parent: 0ae4abd

Update README.md

README.md CHANGED

@@ -80,7 +80,10 @@ there were some limitations on its performance on longer context. Motivated by i
 - **Model License:** Apache 2.0
 - **Contact:** [GitHub issues](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/issues)
 
-## How to Use MistralLite from Python Code ##
+## How to Use MistralLite from Python Code (HuggingFace transformers) ##
+
+**Important** - For an end-to-end example Jupyter notebook, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/huggingface-transformers/example_usage.ipynb).
+
 ### Install the necessary packages
 
 Requires: [transformers](https://pypi.org/project/transformers/) 4.34.0 or later, [flash-attn](https://pypi.org/project/flash-attn/) 2.3.1.post1 or later,
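
For reference, a minimal install sketch consistent with the minimum versions named in the hunk above; the exact pins and the `--no-build-isolation` flag are assumptions rather than part of this commit (flash-attn typically needs torch and a CUDA toolchain available when it builds):

```shell
# Hypothetical install commands matching the stated minimum versions.
# flash-attn compiles against the torch already in the environment,
# hence --no-build-isolation.
pip install "transformers>=4.34.0"
pip install "flash-attn>=2.3.1.post1" --no-build-isolation
```
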
@@ -128,7 +131,78 @@ for seq in sequences:
 <|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>
 ```
 
+## How to Serve MistralLite on TGI ##
+**Important:**
+- For an end-to-end example Jupyter notebook using the native TGI container, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/tgi/example_usage.ipynb).
+- If the **input context length is greater than 12K tokens**, it is recommended to use a custom TGI container; please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/tgi-custom/example_usage.ipynb).
+
+### Start TGI server ###
+Use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`
+
+Example Docker parameters:
+
+```shell
+docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
+      --model-id amazon/MistralLite \
+      --max-input-length 16000 \
+      --max-total-tokens 16384 \
+      --max-batch-prefill-tokens 16384 \
+      --trust-remote-code
+```
+
+### Perform Inference ###
+Example Python code for inference with TGI (requires `text_generation` 0.6.1 or later):
+
+```shell
+pip install text_generation==0.6.1
+```
+
+```python
+from text_generation import Client
+
+SERVER_PORT = 443
+SERVER_HOST = "localhost"
+SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
+tgi_client = Client(f"http://{SERVER_URL}", timeout=60)
+
+def invoke_tgi(prompt,
+               random_seed=1,
+               max_new_tokens=400,
+               print_stream=True,
+               assist_role=True):
+    if (assist_role):
+        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
+    output = ""
+    for response in tgi_client.generate_stream(
+        prompt,
+        do_sample=False,
+        max_new_tokens=max_new_tokens,
+        return_full_text=False,
+        #temperature=None,
+        #truncate=None,
+        #seed=random_seed,
+        #typical_p=0.2,
+    ):
+        if hasattr(response, "token"):
+            if not response.token.special:
+                snippet = response.token.text
+                output += snippet
+                if (print_stream):
+                    print(snippet, end='', flush=True)
+    return output
+
+prompt = "What are the main challenges to support a long context for LLM?"
+result = invoke_tgi(prompt)
+```
+
+**Important** - When using MistralLite for inference for the first time, it may require a brief 'warm-up' period that can take tens of seconds; subsequent inferences should return results more quickly. This warm-up is normal and does not affect overall performance once the initialisation period has completed.
+
+
 ## How to Deploy MistralLite on Amazon SageMaker ##
+**Important:**
+- For an end-to-end example Jupyter notebook using the SageMaker built-in container, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/sagemaker-tgi/example_usage.ipynb).
+- If the **input context length is greater than 12K tokens**, it is recommended to use a custom Docker container; please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/sagemaker-tgi-custom/example_usage.ipynb).
+
 ### Install the necessary packages
 
 Requires: [sagemaker](https://pypi.org/project/sagemaker/) 2.192.1 or later.
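
For reference, a minimal sketch of the install step named at the end of the hunk above; the pin is an assumption derived from the stated minimum version and is not part of this commit:

```shell
# Hypothetical install command for the SageMaker Python SDK minimum named above.
pip install "sagemaker>=2.192.1"
# Optional sanity check of the installed version.
python -c "import sagemaker; print(sagemaker.__version__)"
```
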
@@ -231,72 +305,12 @@ result = call_endpoint(client, prompt, endpoint_name, parameters)
 print(result)
 ```
 
-## How to Serve MistralLite on TGI ##
-
-### Start TGI server ###
-Use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`
-
-Example Docker parameters:
-
-```shell
-docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
-      --model-id amazon/MistralLite \
-      --max-input-length 16000 \
-      --max-total-tokens 16384 \
-      --max-batch-prefill-tokens 16384 \
-      --trust-remote-code
-```
-
-### Perform Inference ###
-Example Python code for inference with TGI (requires `text_generation` 0.6.1 or later):
-
-```shell
-pip install text_generation==0.6.1
-```
-
-```python
-from text_generation import Client
-
-SERVER_PORT = 443
-SERVER_HOST = "localhost"
-SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
-tgi_client = Client(f"http://{SERVER_URL}", timeout=60)
-
-def invoke_tgi(prompt,
-               random_seed=1,
-               max_new_tokens=400,
-               print_stream=True,
-               assist_role=True):
-    if (assist_role):
-        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
-    output = ""
-    for response in tgi_client.generate_stream(
-        prompt,
-        do_sample=False,
-        max_new_tokens=max_new_tokens,
-        return_full_text=False,
-        #temperature=None,
-        #truncate=None,
-        #seed=random_seed,
-        #typical_p=0.2,
-    ):
-        if hasattr(response, "token"):
-            if not response.token.special:
-                snippet = response.token.text
-                output += snippet
-                if (print_stream):
-                    print(snippet, end='', flush=True)
-    return output
-
-prompt = "What are the main challenges to support a long context for LLM?"
-result = invoke_tgi(prompt)
-```
-
-**Important** - When using MistralLite for inference for the first time, it may require a brief 'warm-up' period that can take tens of seconds; subsequent inferences should return results more quickly. This warm-up is normal and does not affect overall performance once the initialisation period has completed.
 
 ## How to Serve MistralLite on vLLM ##
 Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
 
+**Important** - For an end-to-end example Jupyter notebook, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/vllm/example_usage.ipynb).
+
 ### Using vLLM as a server ###
 When using vLLM as a server, pass the --model amazon/MistralLite parameter, for example:
 ```shell