Update README.md

#18
by hinairo - opened

Files changed (1): README.md (+273 −58)

README.md CHANGED
@@ -3,7 +3,7 @@ license: apache-2.0
  base_model:
  - mistralai/Mistral-7B-Instruct-v0.3
  base_model_relation: quantized
- pipeline_tag: text2text-generation
  language:
  - zho
  - eng
@@ -20,38 +20,59 @@ language:
  - ara
  ---

- # Elastic model: Mistral-7B-Instruct-v0.3. Fastest and most flexible models for self-serving.

- Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

- * __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

- * __L__: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.

- * __M__: Faster model, with accuracy degradation less than 1.5%.

- * __S__: The fastest model, with accuracy degradation less than 2%.

- __Goals of elastic models:__

- * Provide flexibility in cost vs quality selection for inference
- * Provide clear quality and latency benchmarks
- * Provide interface of HF libraries: transformers and diffusers with a single line of code
- * Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
- * Provide the best models and service for self-hosting.

- > It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.

- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/V8hpZ-cA9vE5Ijyodp6Ih.png)

- -----

- ## Inference

- To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:

  ```python
  import torch
@@ -70,7 +91,7 @@ tokenizer = AutoTokenizer.from_pretrained(
  model_name, token=hf_token
  )
  model = AutoModelForCausalLM.from_pretrained(
- model_name,
  token=hf_token,
  torch_dtype=torch.bfloat16,
  attn_implementation="sdpa",
@@ -105,7 +126,7 @@ input_len = inputs['input_ids'].shape[1]
  generate_ids = generate_ids[:, input_len:]
  output = tokenizer.batch_decode(
  generate_ids,
- skip_special_tokens=True,
  clean_up_tokenization_spaces=False
  )[0]
@@ -114,66 +135,260 @@ print(f"# Q:\n{prompt}\n")
  print(f"# A:\n{output}\n")
  ```

- __System requirements:__
- * GPUs: H100, L40s
- * CPU: AMD, Intel
- * Python: 3.10-3.12

- To work with our models just run these lines in your terminal:

- ```shell
- pip install thestage
- pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
- pip install flash_attn==2.7.3 --no-build-isolation
- pip uninstall apex
  ```
- Then go to [app.thestage.ai](https://app.thestage.ai), login and generate API token from your profile page. Set up API token as follows:

- ```shell
- thestage config set --api-token <YOUR_API_TOKEN>
  ```
- Congrats, now you can use accelerated models!

- ----

- ## Benchmarks

- Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8 column` indicates that we applied W8A8 quantization with int8 data type to all linear layers and used the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!

- ### Quality benchmarks

- <!-- For quality evaluation we have used: #TODO link to github -->

- | Metric/Model | S | M | L | XL | Original | W8A8, int8 |
- |---------------|---|---|---|----|----------|------------|
- | MMLU | 59.7 | 60.1 | 60.8 | 61.4 | 61.4 | 28 |
- | PIQA | 80.8 | 82 | 81.7 | 81.5 | 81.5 | 65.3 |
- | Arc Challenge | 56.6 | 55.1 | 56.8 | 57.4 | 57.4 | 33.2 |
- | Winogrande | 73.2 | 72.3 | 73.2 | 74.1 | 74.1 | 57 |

- * **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows model's ability to handle diverse academic topics.
- * **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows model's understanding of real-world physics concepts.
- * **Arc Challenge**: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows model's ability to solve complex reasoning tasks.
- * **Winogrande**: Evaluates commonsense reasoning through sentence completion tasks. Shows model's capability to understand context and resolve ambiguity.

- ### Latency benchmarks

- __100 input/300 output; tok/s:__

- | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
- |-----------|-----|---|---|----|----------|------------|
- | H100 | 186 | 180 | 168 | 136 | 48 | 192 |
- | L40s | 79 | 68 | 59 | 47 | 38 | 82 |
  ## Links

- * __Platform__: [app.thestage.ai](app.thestage.ai)
- <!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
  * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
- * __Contact email__: contact@thestage.ai
  base_model:
  - mistralai/Mistral-7B-Instruct-v0.3
  base_model_relation: quantized
+ pipeline_tag: text-generation
  language:
  - zho
  - eng

  - ara
  ---

+ # Elastic model: Mistral-7B-Instruct-v0.3

+ ## Overview

+ ---

+ ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each base model, we produce a series of optimized models:

+ - **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
+ - **L**: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
+ - **M**: Faster model, with accuracy degradation less than 1.5%.
+ - **S**: The fastest model, with accuracy degradation less than 2%.
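The size tiers above trade speed for a bounded accuracy loss. As a purely illustrative sketch (this helper is hypothetical, not part of the SDK), picking the fastest tier within a degradation budget could look like:

```python
# Hypothetical helper: choose the fastest Elastic size whose worst-case
# accuracy degradation (per the tier list above, in percent) fits a budget.
DEGRADATION = {"XL": 0.0, "L": 1.0, "M": 1.5, "S": 2.0}

def pick_size(max_degradation_pct: float) -> str:
    # Sizes ordered fastest -> slowest; return the first one within budget.
    for size in ["S", "M", "L", "XL"]:
        if DEGRADATION[size] <= max_degradation_pct:
            return size
    return "XL"  # XL is mathematically equivalent, so it always qualifies

print(pick_size(1.2))  # → L
```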

+ Models can be accessed via the TheStage AI Python SDK (ElasticModels) or deployed as Docker containers with REST API endpoints (see the Deploy section).

+ ## Installation

+ ---

+ ### System Requirements

+ | **Property** | **Value** |
+ | --- | --- |
+ | **GPU** | H100, L40s, B200, RTX 5090 |
+ | **Python Version** | 3.10-3.12 |
+ | **CPU** | Intel/AMD x86_64 |
+ | **CUDA Version** | 12.9+ |

+ ### TheStage AI Access token setup

+ Install the TheStage AI CLI and set up your API token:

+ ```bash
+ pip install thestage
+ thestage config set --access-token <YOUR_ACCESS_TOKEN>
+ ```

+ ### ElasticModels installation

+ Install the TheStage Elastic Models package:

+ ```bash
+ pip install 'thestage-elastic-models[nvidia,cudnn]' \
+   --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
+ pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
+ ```

+ ## Usage example

+ ---

+ Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Mistral-7B-Instruct-v0.3 model:

  ```python
  import torch

  model_name, token=hf_token
  )
  model = AutoModelForCausalLM.from_pretrained(
+ model_name,
  token=hf_token,
  torch_dtype=torch.bfloat16,
  attn_implementation="sdpa",

  generate_ids = generate_ids[:, input_len:]
  output = tokenizer.batch_decode(
  generate_ids,
+ skip_special_tokens=True,
  clean_up_tokenization_spaces=False
  )[0]

  print(f"# A:\n{output}\n")
  ```

+ ## Quality Benchmarks

+ ---

+ We used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we ran the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.

+ ![Quality Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422657-7bb353b4-5d79-4bbf-aacb-654b7d7a7bcb/Elastic_Mistral_7B_Instruct_v0.3_MMLU.png)

+ ### Quality Benchmark Results

+ | **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** |
+ | --- | --- | --- | --- | --- | --- |
+ | **MMLU** | 59.2 | 59.6 | 59.6 | 59.8 | 59.8 |
+ | **PIQA** | 81.3 | 81.3 | 81.9 | 81.9 | 82.0 |
+ | **Arc Challenge** | 59.6 | 60.4 | 59.5 | 60.3 | 59.7 |
+ | **Winogrande** | 75.2 | 76.1 | 75.3 | 74.8 | 74.8 |

+ ## Datasets

+ ---

+ - **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
+ - **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
+ - **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
+ - **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.

+ ## Metrics

+ ---

+ - **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.

+ ## Latency Benchmarks

+ ---

+ We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.

+ ![Latency Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422671-1ddedb17-7bc7-45e2-b285-4d3ef4212af0/Elastic_Mistral_7B_Instruct_v0.3_latency.png)

+ ### Latency Benchmark Results

+ Tokens per second for different model sizes on various GPUs.

+ | **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** |
+ | --- | --- | --- | --- | --- | --- |
+ | **H100** | 203 | 188 | 173 | 144 | 60 |
+ | **L40s** | 77 | 68 | 60 | 48 | 39 |
+ | **B200** | 268 | 263 | 235 | 219 | 104 |
+ | **GeForce RTX 5090** | 155 | N/A | N/A | N/A | 74 |

+ ## Benchmarking Methodology

+ ---

+ The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.

+ > **Algorithm summary:**
+ > 1. Load the Mistral-7B-Instruct-v0.3 model with the specified size (S, M, L, XL, original).
+ > 2. Move the model to the GPU.
+ > 3. Prepare a sample prompt for text generation.
+ > 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
+ >    - Synchronize the GPU to flush any previous operations.
+ >    - Record the start time.
+ >    - Generate the text using the model.
+ >    - Synchronize the GPU again.
+ >    - Record the end time and calculate the TTFT and TPS for that iteration.
+ > 5. Calculate the average TTFT and TPS over all iterations.
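The aggregation in steps 4-5 reduces to simple arithmetic; a minimal sketch (the per-iteration timings here are made-up placeholders; on real hardware they would come from timing `model.generate()` between GPU synchronizations as described above):

```python
def tokens_per_second(new_tokens: int, elapsed_s: float) -> float:
    # TPS for one iteration: generated tokens / wall-clock seconds.
    return new_tokens / elapsed_s

def average_tps(per_iteration: list[float]) -> float:
    # Step 5: average TPS over all benchmark iterations.
    return sum(per_iteration) / len(per_iteration)

# Hypothetical wall-clock times (seconds) for 10 iterations of 300 output tokens.
timings = [2.0, 1.9, 2.1, 2.0, 2.0, 1.9, 2.1, 2.0, 2.0, 2.0]
tps = [tokens_per_second(300, t) for t in timings]
print(round(average_tps(tps), 1))  # → 150.2
```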

+ ## Serving with Docker Image

+ ---

+ For serving on NVIDIA GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
+ Using our containers, you can set up an inference endpoint on any cloud or serverless provider as well as on on-premise servers.
+ You can also use this container to run inference through the TheStage AI platform.

+ ### Prebuilt image from ECR

+ | **GPU** | **Docker image name** |
+ | --- | --- |
+ | H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
+ | B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |

+ Pull the Docker image for your NVIDIA GPU and start the inference container:

+ ```bash
+ docker pull <IMAGE_NAME>
+ ```
+ ```bash
+ docker run --rm -ti \
+   --name serving_thestage_model \
+   -p 8000:80 \
+   -e AUTH_TOKEN=<AUTH_TOKEN> \
+   -e MODEL_REPO=mistralai/Mistral-7B-Instruct-v0.3 \
+   -e MODEL_SIZE=<MODEL_SIZE> \
+   -e MODEL_BATCH=<MAX_BATCH_SIZE> \
+   -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
+   -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
+   -v /mnt/hf_cache:/root/.cache/huggingface \
+   <IMAGE_NAME>
  ```

+ | **Parameter** | **Description** |
+ | --- | --- |
+ | `<MODEL_SIZE>` | Available: S, M, L, XL. |
+ | `<MAX_BATCH_SIZE>` | Maximum batch size to process in parallel. |
+ | `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
+ | `<THESTAGE_ACCESS_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
+ | `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
+ | `<IMAGE_NAME>` | The image name you pulled for your GPU. |

+ ## Invocation

+ ---

+ You can invoke the endpoint using curl as follows:

+ ```bash
+ curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
+   -H 'Authorization: Bearer 123' \
+   -H 'Content-Type: application/json' \
+   -H "X-Model-Name: mistral-7b-instruct-v0-3-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged" \
+   -d '{
+     "messages":[{"role":"user","content":"Define AI"}]
+   }'
+ ```

+ Or using the OpenAI Python client:

+ ```python
+ from openai import OpenAI
+
+ BASE_URL = "http://<your_ip>/v1"
+ API_KEY = "123"
+ MODEL = "mistral-7b-instruct-v0-3-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged"
+
+ client = OpenAI(
+     api_key=API_KEY,
+     base_url=BASE_URL,
+     default_headers={"X-Model-Name": MODEL}
+ )
+
+ response = client.chat.completions.create(
+     model=MODEL,
+     messages=[
+         {"role": "user", "content": "Define AI"}
+     ]
+ )
+
+ print(response.choices[0].message.content)
  ```

+ ## Endpoint Parameters

+ ---

+ ### Method

+ > **POST** `/v1/chat/completions`

+ ### Header Parameters

+ > `Authorization`: `string`
+ >
+ > Bearer token for authentication. Must match the `AUTH_TOKEN` set during container startup.

+ > `Content-Type`: `string`
+ >
+ > Must be set to `application/json`.

+ > `X-Model-Name`: `string`
+ >
+ > Specifies the model to use for generation. Format: `mistral-7b-instruct-v0-3-<size>-bs<batch_size>-paged`, where `<size>` is one of `S`, `M`, `L`, `XL`, `original` and `<batch_size>` is the maximum batch size configured during container startup.
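The header value can be assembled directly from the container settings; a tiny sketch (the size and batch values here are just example placeholders, matching the curl invocation above):

```python
size = "S"   # one of: S, M, L, XL, original (the MODEL_SIZE set at startup)
batch = 4    # the MAX_BATCH_SIZE set at startup

# Assemble the X-Model-Name header value.
model_name = f"mistral-7b-instruct-v0-3-{size}-bs{batch}-paged"
print(model_name)  # → mistral-7b-instruct-v0-3-S-bs4-paged
```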

+ ### Input Body

+ > `messages`: `array`
+ >
+ > The list of chat messages, each an object with `role` and `content` fields, as shown in the invocation examples.

+ ## Deploy on Modal

+ ---

+ For more details, please see the tutorial: [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html)

+ ### Clone modal serving code

+ ```shell
+ git clone https://github.com/TheStageAI/ElasticModels.git
+ cd ElasticModels/examples/modal
+ ```

+ ### Configuration of environment variables

+ Set your environment variables in `modal_serving.py`:

+ ```python
+ # modal_serving.py
+
+ ENVS = {
+     "MODEL_REPO": "mistralai/Mistral-7B-Instruct-v0.3",
+     "MODEL_BATCH": "4",
+     "THESTAGE_AUTH_TOKEN": "",
+     "HUGGINGFACE_ACCESS_TOKEN": "",
+     "PORT": "80",
+     "PORT_HEALTH": "80",
+     "HF_HOME": "/cache/huggingface",
+ }
+ ```

+ ### Configuration of GPUs

+ Set your desired GPU type and autoscaling variables in `modal_serving.py`:

+ ```python
+ # modal_serving.py
+
+ @app.function(
+     image=image,
+     gpu="B200",
+     min_containers=8,
+     max_containers=8,
+     timeout=10000,
+     ephemeral_disk=600 * 1024,
+     volumes={"/opt/project/.cache": HF_CACHE},
+     startup_timeout=60 * 20,
+ )
+ @modal.web_server(
+     80,
+     label="mistralai/Mistral-7B-Instruct-v0.3-test",
+     startup_timeout=60 * 20,
+ )
+ def serve():
+     pass
+ ```

+ ### Run serving

+ ```shell
+ modal serve modal_serving.py
+ ```

  ## Links

+ * __Platform__: [app.thestage.ai](https://app.thestage.ai)
  * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
+ * __Contact email__: contact@thestage.ai