README.md CHANGED (+272 -62)
@@ -20,35 +20,58 @@ language:
  - ara
  ---

- # Elastic model: Qwen2.5-7B-Instruct. Fastest and most flexible models for self-serving.

- Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

- * __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
- * __L__: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
- * __M__: Faster model, with accuracy degradation less than 1.5%.
- * __S__: The fastest model, with accuracy degradation less than 2%.

- __Goals of elastic models:__

- * Provide flexibility in cost vs quality selection for inference
- * Provide clear quality and latency benchmarks
- * Provide interface of HF libraries: transformers and diffusers with a single line of code
- * Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
- * Provide the best models and service for self-hosting.

- > It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.

- ![Performance Graph](images/performance_graph.png)
- -----

- ## Inference

- To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:

  ```python
  import torch
@@ -57,7 +80,7 @@ from elastic_models.transformers import AutoModelForCausalLM

  # Currently we require to have your HF token
  # as we use original weights for part of layers and
- # model confugaration as well
  model_name = "Qwen/Qwen2.5-7B-Instruct"
  hf_token = ''
  device = torch.device("cuda")
@@ -111,74 +134,261 @@ print(f"# Q:\n{prompt}\n")
  print(f"# A:\n{output}\n")
  ```

- __System requirements:__
- * GPUs: H100, L40s, 4090, 5090
- * CPU: AMD, Intel
- * Python: 3.10-3.12

- To work with our models just run these lines in your terminal:

- ```shell
- pip install thestage
- pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
- pip install flash_attn==2.7.3 --no-build-isolation

- # or for blackwell support
- pip install 'thestage-elastic-models[blackwell]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
- pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
- # please download the appropriate version of Wheels for your system from https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
- mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
- pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

- pip uninstall apex
  ```

- Then go to [app.thestage.ai](https://app.thestage.ai), login and generate API token from your profile page. Set up API token as follows:

- ```shell
- thestage config set --api-token <YOUR_API_TOKEN>
  ```

- Congrats, now you can use accelerated models!

- ----

- ## Benchmarks

- Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column indicates that we applied W8A8 quantization with int8 data type to all linear layers and used the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!

- ### Quality benchmarks

- | Metric/Model | S | M | L | XL | Original | W8A8, int8 |
- |---------------|---|---|---|----|----------|------------|
- | arc_challenge | 49.10 | 50.10 | 53.20 | 52.60 | 52.60 | 41.70 |
- | mmlu | 71.70 | 73.00 | 74.10 | 73.50 | 73.50 | 64.60 |
- | piqa | 77.00 | 78.20 | 78.80 | 79.50 | 79.50 | 67.10 |
- | winogrande | 66.20 | 69.10 | 71.50 | 70.60 | 70.60 | 53.10 |

- * **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows model's ability to handle diverse academic topics.
- * **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows model's understanding of real-world physics concepts.
- * **Arc Challenge**: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows model's ability to solve complex reasoning tasks.
- * **Winogrande**: Evaluates commonsense reasoning through sentence completion tasks. Shows model's capability to understand context and resolve ambiguity.

- ### Latency benchmarks

- __100 input/300 output; tok/s:__

- | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
- |-----------|-----|---|---|----|----------|------------|
- | H100 | 201 | 173 | 162 | 135 | 62 | 201 |
- | L40S | 76 | 67 | 61 | 47 | 43 | 78 |
- | 5090 | 149 | - | - | - | - | - |
- | 4090 | 98 | - | - | - | - | - |

  ## Links

- * __Platform__: [app.thestage.ai](app.thestage.ai)
  * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
- <!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
  * __Contact email__: contact@thestage.ai
  - ara
  ---

+ # Elastic model: Qwen2.5-7B-Instruct

+ ## Overview

+ Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we produce a series of optimized variants:

+ - **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
+ - **L**: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.
+ - **M**: Faster model, with accuracy degradation of less than 1.5%.
+ - **S**: The fastest model, with accuracy degradation of less than 2%.

+ Models can be accessed via the TheStage AI Python SDK (ElasticModels) or deployed as Docker containers with REST API endpoints (see the Deploy section).

+ ---

+ ## Installation

+ ### System Requirements

+ | **Property** | **Value** |
+ | --- | --- |
+ | **GPU** | L40s, RTX 5090, H100, RTX 4090 |
+ | **Python Version** | 3.10-3.12 |
+ | **CPU** | Intel/AMD x86_64 |
+ | **CUDA Version** | 12.8+ |

+ ### TheStage AI access token setup

+ Install the TheStage AI CLI and set up your API token:

+ ```bash
+ pip install thestage
+ thestage config set --access-token <YOUR_ACCESS_TOKEN>
+ ```

+ ### ElasticModels installation

+ Install the TheStage Elastic Models package:

+ ```bash
+ pip install 'thestage-elastic-models[nvidia,cudnn]' \
+   --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
+ pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
+ ```

+ ---

+ ## Usage example

+ Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Qwen2.5-7B-Instruct model:

  ```python
  import torch

  # Currently we require to have your HF token
  # as we use original weights for part of layers and
+ # model configuration as well
  model_name = "Qwen/Qwen2.5-7B-Instruct"
  hf_token = ''
  device = torch.device("cuda")

  print(f"# A:\n{output}\n")
  ```

+ ---

+ ## Quality Benchmarks

+ We used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we ran the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.

+ ![Quality Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422559-0c9621c5-9e7f-4c81-8698-70f6d6872cb5/Elastic_Qwen2.5_7B_Instruct_MMLU.png)

+ ### Quality Benchmark Results

+ | **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
+ | --- | --- | --- | --- | --- | --- | --- |
+ | **Arc Challenge** | 54.2 | 55.2 | 55.3 | 54.9 | 54.7 | 41.7 |
+ | **MMLU** | 71.5 | 71.6 | 71.9 | 71.9 | 71.8 | 64.6 |
+ | **PIQA** | 78.3 | 79.9 | 79.5 | 79.5 | 79.6 | 67.1 |
+ | **Winogrande** | 70.4 | 70.3 | 71.5 | 70.4 | 71.0 | 53.1 |
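The per-task degradation of the S model can be checked directly against the "Original" column; a minimal sketch, with the values copied from the table above:

```python
# Accuracy points lost by the S model relative to the original,
# using the values from the quality table above.
original = {"Arc Challenge": 54.7, "MMLU": 71.8, "PIQA": 79.6, "Winogrande": 71.0}
s_model = {"Arc Challenge": 54.2, "MMLU": 71.5, "PIQA": 78.3, "Winogrande": 70.4}

for task, base in original.items():
    drop = base - s_model[task]
    print(f"{task}: -{drop:.1f} points for S")
```

Every drop stays within the 2% budget quoted for the S tier above.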

+ ---

+ ## Datasets

+ - **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
+ - **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
+ - **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
+ - **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.

+ ---

+ ## Metrics

+ - **Accuracy**: The proportion of model predictions that exactly match the correct answers across the evaluation tasks.
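As a concrete illustration (this is a generic sketch, not the `lm_eval` implementation), exact-match accuracy reduces to:

```python
def exact_match_accuracy(predictions, references):
    """Share of predictions that exactly match the reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

# Three of four multiple-choice answers correct:
print(exact_match_accuracy(["B", "C", "A", "D"], ["B", "C", "A", "A"]))  # → 0.75
```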

+ ---

+ ## Latency Benchmarks

+ We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.

+ ![Latency Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422585-3065316c-5c07-4430-befb-61daac95f712/Elastic_Qwen2.5_7B_Instruct_latency.png)

+ ### Latency Benchmark Results

+ Tokens per second for different model sizes on various GPUs.

+ | **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8_int8** |
+ | --- | --- | --- | --- | --- | --- | --- |
+ | **H100** | 184 | 177 | 157 | 138 | 62 | 201 |
+ | **L40s** | 72 | 67 | 57 | 48 | 42 | 78 |
+ | **B200** | 239 | 232 | 216 | 199 | 114 | N/A |
+ | **GeForce RTX 5090** | 141 | N/A | N/A | N/A | 66 | N/A |
+ | **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 45 | N/A |
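Reading the table as S vs. Original gives the end-to-end speedup per GPU; a small sketch using the numbers above:

```python
# (S-model TPS, original TPS) per GPU, copied from the latency table above.
tps = {
    "H100": (184, 62),
    "L40s": (72, 42),
    "B200": (239, 114),
    "GeForce RTX 5090": (141, 66),
    "GeForce RTX 4090": (95, 45),
}
for gpu, (s, orig) in tps.items():
    print(f"{gpu}: {s / orig:.2f}x faster than the original")
```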

+ ---

+ ## Benchmarking Methodology

+ The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.

+ > **Algorithm summary:**
+ > 1. Load the Qwen2.5-7B-Instruct model with the specified size (S, M, L, XL, original).
+ > 2. Move the model to the GPU.
+ > 3. Prepare a sample prompt for text generation.
+ > 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
+ >    - Synchronize the GPU to flush any previous operations.
+ >    - Record the start time.
+ >    - Generate the text using the model.
+ >    - Synchronize the GPU again.
+ >    - Record the end time and calculate the TTFT and TPS for that iteration.
+ > 5. Calculate the average TTFT and TPS over all iterations.
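The loop above can be sketched in plain Python. This is an illustrative harness, not the exact benchmark code; `generate` stands in for a real `model.generate` call:

```python
import time

def measure_tps(generate, n_iters=10, output_tokens=300):
    """Average tokens per second over n_iters timed generations."""
    samples = []
    for _ in range(n_iters):
        # With a real model: torch.cuda.synchronize() here to flush pending GPU work.
        start = time.perf_counter()
        generate()  # e.g. model.generate(**inputs, max_new_tokens=output_tokens)
        # ...and torch.cuda.synchronize() again before reading the clock.
        elapsed = time.perf_counter() - start
        samples.append(output_tokens / elapsed)
    return sum(samples) / len(samples)

# Stand-in workload so the sketch runs without a GPU.
print(f"average TPS: {measure_tps(lambda: sum(range(100_000))):.0f}")
```

TTFT can be measured with the same pattern by timing generation of only the first output token.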

+ ---

+ ## Serving with Docker Image

+ For serving on Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
+ Using our containers, you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers.
+ You can also use this container to run inference through the TheStage AI platform.

+ ### Prebuilt image from ECR

+ | **GPU** | **Docker image name** |
+ | --- | --- |
+ | H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
+ | B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |

+ Pull the Docker image for your Nvidia GPU and start the inference container:

+ ```bash
+ docker pull <IMAGE_NAME>
+ ```
+ ```bash
+ docker run --rm -ti \
+   --name serving_thestage_model \
+   -p 8000:80 \
+   -e AUTH_TOKEN=<AUTH_TOKEN> \
+   -e MODEL_REPO=Qwen/Qwen2.5-7B-Instruct \
+   -e MODEL_SIZE=<MODEL_SIZE> \
+   -e MODEL_BATCH=<MAX_BATCH_SIZE> \
+   -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
+   -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
+   -v /mnt/hf_cache:/root/.cache/huggingface \
+   <IMAGE_NAME>
  ```

+ | **Parameter** | **Description** |
+ | --- | --- |
+ | `<MODEL_SIZE>` | Model size to serve. Available: S, M, L, XL. |
+ | `<MAX_BATCH_SIZE>` | Maximum batch size to process in parallel. |
+ | `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
+ | `<THESTAGE_ACCESS_TOKEN>` | TheStage access token generated on the platform (Profile -> Access tokens). |
+ | `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
+ | `<IMAGE_NAME>` | The image name you pulled for your GPU. |

+ ---

+ ## Invocation

+ You can invoke the endpoint using cURL as follows:

+ ```bash
+ curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
+   -H 'Authorization: Bearer 123' \
+   -H 'Content-Type: application/json' \
+   -H "X-Model-Name: qwen-2-5-7b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged" \
+   -d '{
+     "messages": [{"role": "user", "content": "Define AI"}]
+   }'
+ ```

+ Or using the OpenAI Python client:

+ ```python
+ from openai import OpenAI
+
+ BASE_URL = "http://<your_ip>/v1"
+ API_KEY = "123"
+ MODEL = "qwen-2-5-7b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged"
+
+ client = OpenAI(
+     api_key=API_KEY,
+     base_url=BASE_URL,
+     default_headers={"X-Model-Name": MODEL},
+ )
+
+ response = client.chat.completions.create(
+     model=MODEL,
+     messages=[
+         {"role": "user", "content": "Define AI"}
+     ],
+ )
+
+ print(response.choices[0].message.content)
  ```

+ ---

+ ## Endpoint Parameters

+ ### Method

+ > **POST** `/v1/chat/completions`

+ ### Header Parameters

+ > `Authorization`: `string`
+ >
+ > Bearer token for authentication. Should match the `AUTH_TOKEN` set during container startup.

+ > `Content-Type`: `string`
+ >
+ > Must be set to `application/json`.

+ > `X-Model-Name`: `string`
+ >
+ > Specifies the model to use for generation. Format: `qwen-2-5-7b-instruct-<size>-bs<batch_size>-paged`, as in the invocation examples above, where `<size>` is one of `S`, `M`, `L`, `XL`, `original` and `<batch_size>` is the maximum batch size configured during container startup.

+ ### Input Body

+ > `messages`: `array`
+ >
+ > A list of chat messages, each an object with `role` and `content` fields, following the OpenAI Chat Completions schema.
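The header value can be assembled programmatically; a hypothetical helper (`model_header_name` is ours, not part of the SDK) that mirrors the invocation examples above:

```python
def model_header_name(size: str, batch_size: int) -> str:
    """Build the X-Model-Name value for the serving container."""
    # Hypothetical helper: the name mirrors the curl/OpenAI-client examples above.
    return f"qwen-2-5-7b-instruct-{size}-bs{batch_size}-paged"

print(model_header_name("S", 4))  # → qwen-2-5-7b-instruct-S-bs4-paged
```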
 

+ ---

+ ## Deploy on Modal

+ For more details, see the tutorial [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html).

+ ### Clone Modal serving code

+ ```shell
+ git clone https://github.com/TheStageAI/ElasticModels.git
+ cd ElasticModels/examples/modal
+ ```

+ ### Configuration of environment variables

+ Set your environment variables in `modal_serving.py`:

+ ```python
+ # modal_serving.py
+
+ ENVS = {
+     "MODEL_REPO": "Qwen/Qwen2.5-7B-Instruct",
+     "MODEL_BATCH": "4",
+     "THESTAGE_AUTH_TOKEN": "",
+     "HUGGINGFACE_ACCESS_TOKEN": "",
+     "PORT": "80",
+     "PORT_HEALTH": "80",
+     "HF_HOME": "/cache/huggingface",
+ }
+ ```

+ ### Configuration of GPUs

+ Set your desired GPU type and autoscaling variables in `modal_serving.py`:

+ ```python
+ # modal_serving.py
+
+ @app.function(
+     image=image,
+     gpu="B200",
+     min_containers=8,
+     max_containers=8,
+     timeout=10000,
+     ephemeral_disk=600 * 1024,
+     volumes={"/opt/project/.cache": HF_CACHE},
+     startup_timeout=60 * 20,
+ )
+ @modal.web_server(
+     80,
+     label="Qwen/Qwen2.5-7B-Instruct-test",
+     startup_timeout=60 * 20,
+ )
+ def serve():
+     pass
+ ```

+ ### Run serving

+ ```shell
+ modal serve modal_serving.py
+ ```

  ## Links

+ * __Platform__: [app.thestage.ai](https://app.thestage.ai)
  * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
  * __Contact email__: contact@thestage.ai