neuroeng committed on
Commit 3cbc9ea · verified · 1 Parent(s): f45f361

Update README.md

Files changed (1):
  1. README.md (+54 −19)

README.md CHANGED
@@ -25,6 +25,8 @@ language:
 
 ## Overview
 
 ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we have produced a series of optimized models:
 
 - **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
@@ -34,12 +36,15 @@ ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Netw
 
 Models can be accessed via TheStage AI Python SDK: ElasticModels, or deployed as Docker containers with REST API endpoints (see Deploy section).
 
 ---
 
-## Installation
 
 ### System Requirements
 
 | **Property** | **Value** |
 | --- | --- |
 | **GPU** | L40s, RTX 5090, H100, RTX 4090 |
@@ -50,6 +55,8 @@ Models can be accessed via TheStage AI Python SDK: ElasticModels, or deployed as
 
 ### TheStage AI Access token setup
 
 Install the TheStage AI CLI and set up your API token:
 
 ```bash
@@ -59,6 +66,8 @@ thestage config set --access-token <YOUR_ACCESS_TOKEN>
 
 ### ElasticModels installation
 
 Install the TheStage Elastic Models package:
 
 ```bash
@@ -67,9 +76,10 @@ pip install 'thestage-elastic-models[nvidia,cudnn]' \
 pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
 ```
 
 ---
 
-## Usage example
 
 Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Qwen2.5-7B-Instruct model:
 
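The full Qwen2.5-7B-Instruct example is elided from this diff, so the sketch below is a runnable stand-in: a stub model mimics the transformers-style `generate()` call flow that ElasticModels mirrors. Nothing here is the real `elastic_models` API; all names are placeholders.

```python
# Minimal sketch of the transformers-style flow described above.
# A stub stands in for the real model so the control flow runs anywhere;
# the actual package loads real weights via a from_pretrained()-style call.

def chat_to_prompt(messages):
    """Toy stand-in for tokenizer.apply_chat_template."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

class StubCausalLM:
    """Placeholder for a causal LM; only the call shape matters here."""

    def generate(self, prompt: str, max_new_tokens: int = 64) -> str:
        return prompt + "\nassistant: <generated text>"

messages = [{"role": "user", "content": "What is ANNA?"}]
prompt = chat_to_prompt(messages)
output = StubCausalLM().generate(prompt)
print(f"# A:\n{output}\n")
```

The point is only the shape of the interface (chat messages → prompt → `generate`); the real example in the README fills in the actual model classes.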
@@ -135,9 +145,10 @@ print(f"# A:\n{output}\n")
 ```
 
 ---
 
-## Quality Benchmarks
 
 We used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we ran the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.
 
@@ -145,6 +156,8 @@ We have used the `lm_eval` library to validate the models. For each model size (
 
 ### Quality Benchmark Results
 
 | **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
 | --- | --- | --- | --- | --- | --- | --- |
 | **Arc Challenge** | 54.2 | 55.2 | 55.3 | 54.9 | 54.7 | 41.7 |
@@ -153,25 +166,28 @@ We have used the `lm_eval` library to validate the models. For each model size (
 | **Winogrande** | 70.4 | 70.3 | 71.5 | 70.4 | 71.0 | 53.1 |
 
 ---
 
-## Datasets
 
 - **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
 - **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
 - **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
 - **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.
 
 ---
 
-## Metrics
 
 - **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.
 
 ---
 
-## Latency Benchmarks
 
 We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.
 
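The accuracy metric in the quality tables reduces to exact-match counting. A minimal generic sketch is below; note that `lm_eval`'s per-task scoring adds details (e.g. ranking answer choices by log-likelihood) that this simplification omits.

```python
# "Accuracy" as exact-match counting: the fraction of predictions equal to
# the reference answers. The example answers below are invented.

def accuracy(predictions, references):
    """Fraction of predictions that exactly match the references."""
    if len(predictions) != len(references):
        raise ValueError("prediction/reference length mismatch")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# 3 of 4 hypothetical multiple-choice answers correct -> 75.0, on the same
# percentage scale as the benchmark tables.
print(accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]) * 100)
```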
@@ -179,6 +195,8 @@ We measured TPS (tokens per second) for each model size using 100 input tokens a
 
 ### Latency Benchmark Results
 
 Tokens per second for different model sizes on various GPUs.
 
 | **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8_int8** |
@@ -190,9 +208,10 @@ Tokens per second for different model sizes on various GPUs.
 | **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 45 | N/A |
 
 ---
 
-## Benchmarking Methodology
 
 Benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.
 
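The per-iteration averaging described above can be sketched as follows. One common definition is used here, TPS measured over the decode phase excluding TTFT; the README does not spell out the exact formula, and the timing numbers below are invented for illustration.

```python
# Sketch of averaging TTFT and TPS over benchmark iterations.
# Assumption: TPS = output_tokens / (total_time - ttft), i.e. decode-phase
# throughput. Timings are made up; a real run would collect 10 iterations.

def tokens_per_second(ttft_s, total_s, output_tokens):
    """Decode-phase throughput, excluding time to first token."""
    return output_tokens / (total_s - ttft_s)

OUTPUT_TOKENS = 300  # matches the 300-output-token benchmark setup
runs = [(0.05, 3.05), (0.04, 3.04), (0.06, 3.06)]  # (ttft_s, total_s) samples

avg_ttft = sum(t for t, _ in runs) / len(runs)
avg_tps = sum(tokens_per_second(t, tot, OUTPUT_TOKENS) for t, tot in runs) / len(runs)
print(f"avg TTFT: {avg_ttft:.3f}s, avg TPS: {avg_tps:.0f}")
```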
@@ -209,9 +228,10 @@ The benchmarking was performed on a single GPU with a batch size of 1. Each mode
 > 5. Calculate the average TTFT and TPS over all iterations.
 
 ---
 
-## Serving with Docker Image
 
 For serving on Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
 Using our containers, you can set up an inference endpoint with any cloud or serverless provider, as well as on on-premise servers.
@@ -219,15 +239,12 @@ You can also use this container to run inference through TheStage AI platform.
 
 ### Prebuilt image from ECR
 
-| **GPU** | **Docker image name** |
-| --- | --- |
-| H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
-| B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |
 
-Pull docker image for your Nvidia GPU and start inference container:
 
 ```bash
-docker pull <IMAGE_NAME>
 ```
 ```bash
 docker run --rm -ti \
@@ -240,7 +257,7 @@ docker run --rm -ti \
   -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
   -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
   -v /mnt/hf_cache:/root/.cache/huggingface \
-  <IMAGE_NAME_DEPENDING_ON_YOUR_GPU>
 ```
 
 | **Parameter** | **Description** |
@@ -250,11 +267,11 @@ docker run --rm -ti \
 | `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
 | `<THESTAGE_ACCESS_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
 | `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
-| `<IMAGE_NAME>` | Image name which you have pulled. |
 
 ---
 
-## Invocation
 
 You can invoke the endpoint using cURL as follows:
 
@@ -294,16 +311,21 @@ response = client.chat.completions.create(
 print(response.choices[0].message.content)
 ```
 
 ---
 
-## Endpoint Parameters
 
 ### Method
 
 > **POST** `/v1/chat/completions`
 
 ### Header Parameters
 
 > `Authorization`: `string`
 >
 > Bearer token for authentication. Should match the `AUTH_TOKEN` set during container startup.
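Putting the method and header parameters together, the sketch below builds the request by hand so the pieces above are concrete. The model name and endpoint URL are placeholders, not values taken from this README.

```python
# Hedged sketch of a request to the OpenAI-compatible endpoint.
# <MODEL_NAME> is a placeholder; the served model name depends on the container.
import json

def build_chat_request(auth_token: str, user_text: str,
                       model: str = "<MODEL_NAME>"):
    headers = {
        # Must match the AUTH_TOKEN passed at container startup.
        "Authorization": f"Bearer {auth_token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    })
    return headers, body

headers, body = build_chat_request("my-secret-token", "Hello!")
print(headers["Authorization"])
```

The same headers and body work for the cURL invocation and for the OpenAI Python client, which assembles them internally.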
@@ -318,19 +340,24 @@ print(response.choices[0].message.content)
 
 ### Input Body
 
 > `messages`: `string`
 >
 > The input text prompt.
 
 ---
 
-## Deploy on Modal
 
 For more details, please see the tutorial [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html).
 
 ### Clone modal serving code
 
 ```shell
 git clone https://github.com/TheStageAI/ElasticModels.git
 cd ElasticModels/examples/modal
@@ -338,6 +365,8 @@ cd ElasticModels/examples/modal
 
 ### Configuration of environment variables
 
 Set your environment variables in `modal_serving.py`:
 
 ```python
@@ -356,6 +385,8 @@ ENVS = {
 
 ### Configuration of GPUs
 
 Set your desired GPU type and autoscaling variables in `modal_serving.py`:
 
 ```python
@@ -382,6 +413,8 @@ def serve():
 
 ### Run serving
 
 ```shell
 modal serve modal_serving.py
 ```
@@ -389,6 +422,8 @@ modal serve modal_serving.py
 
 ## Links
 
 * __Platform__: [app.thestage.ai](https://app.thestage.ai)
 * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
 * __Contact email__: contact@thestage.ai
 
 
 ## Overview
 
+---
+
 ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we have produced a series of optimized models:
 
 - **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
 
 
 Models can be accessed via TheStage AI Python SDK: ElasticModels, or deployed as Docker containers with REST API endpoints (see Deploy section).
 
+## Installation
+
 ---
 
 ### System Requirements
 
+---
+
 | **Property** | **Value** |
 | --- | --- |
 | **GPU** | L40s, RTX 5090, H100, RTX 4090 |
 
 
 ### TheStage AI Access token setup
 
+---
+
 Install the TheStage AI CLI and set up your API token:
 
 ```bash
 
 
 ### ElasticModels installation
 
+---
+
 Install the TheStage Elastic Models package:
 
 ```bash
 
 pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
 ```
 
+## Usage example
+
 ---
 
 Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Qwen2.5-7B-Instruct model:
 
 ```
 
+## Quality Benchmarks
+
 ---
 
 We used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we ran the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.
 
 
 ### Quality Benchmark Results
 
+---
+
 | **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
 | --- | --- | --- | --- | --- | --- | --- |
 | **Arc Challenge** | 54.2 | 55.2 | 55.3 | 54.9 | 54.7 | 41.7 |
 
 | **Winogrande** | 70.4 | 70.3 | 71.5 | 70.4 | 71.0 | 53.1 |
 
+## Datasets
+
 ---
 
 - **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
 - **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
 - **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
 - **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.
 
+## Metrics
+
 ---
 
 - **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.
 
+## Latency Benchmarks
+
 ---
 
 We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.
 
 
 
 ### Latency Benchmark Results
 
+---
+
 Tokens per second for different model sizes on various GPUs.
 
 | **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8_int8** |
 
 | **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 45 | N/A |
 
+## Benchmarking Methodology
+
 ---
 
 Benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.
 
 
 > 5. Calculate the average TTFT and TPS over all iterations.
 
+## Serving with Docker Image
+
 ---
 
 For serving on Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
 Using our containers, you can set up an inference endpoint with any cloud or serverless provider, as well as on on-premise servers.
 
 
 ### Prebuilt image from ECR
 
+---
 
+Pull the docker image and start the inference container:
 
 ```bash
+docker pull public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-llm-24.09c
 ```
 ```bash
 docker run --rm -ti \
 
   -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
   -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
   -v /mnt/hf_cache:/root/.cache/huggingface \
+  public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-llm-24.09c
 ```
 
 | **Parameter** | **Description** |
 
 | `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
 | `<THESTAGE_ACCESS_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
 | `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
+
+## Invocation
 
 ---
 
 You can invoke the endpoint using cURL as follows:
 
 
 print(response.choices[0].message.content)
 ```
 
+## Endpoint Parameters
+
 ---
 
 ### Method
 
+---
+
 > **POST** `/v1/chat/completions`
 
 ### Header Parameters
 
+---
+
 > `Authorization`: `string`
 >
 > Bearer token for authentication. Should match the `AUTH_TOKEN` set during container startup.
 
 
 ### Input Body
 
+---
+
 > `messages`: `string`
 >
 > The input text prompt.
 
+## Deploy on Modal
+
 ---
 
 For more details, please see the tutorial [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html).
 
 ### Clone modal serving code
 
+---
+
 ```shell
 git clone https://github.com/TheStageAI/ElasticModels.git
 cd ElasticModels/examples/modal
 
 
 ### Configuration of environment variables
 
+---
+
 Set your environment variables in `modal_serving.py`:
 
 ```python
 
 
 ### Configuration of GPUs
 
+---
+
 Set your desired GPU type and autoscaling variables in `modal_serving.py`:
 
 ```python
 
 
 ### Run serving
 
+---
+
 ```shell
 modal serve modal_serving.py
 ```
 
 
 ## Links
 
+---
+
 * __Platform__: [app.thestage.ai](https://app.thestage.ai)
 * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
 * __Contact email__: contact@thestage.ai