## Overview

ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we have produced a series of optimized models:

- **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.

Models can be accessed via TheStage AI Python SDK: ElasticModels, or deployed as Docker containers with REST API endpoints (see the Deploy section).

---

## Installation

### System Requirements

| **Property** | **Value** |
| --- | --- |
| **GPU** | L40s, RTX 5090, H100, RTX 4090 |

### TheStage AI Access token setup

Install the TheStage AI CLI and set up your API token:

```bash
thestage config set --access-token <YOUR_ACCESS_TOKEN>
```
### ElasticModels installation

Install the TheStage Elastic Models package:

```bash
pip install 'thestage-elastic-models[nvidia,cudnn]' \
pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
```

---

## Usage example

Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Qwen2.5-7B-Instruct model:

```python
...
print(f"# A:\n{output}\n")
```

---

## Quality Benchmarks

We have used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we have run the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.

### Quality Benchmark Results

| **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
| --- | --- | --- | --- | --- | --- | --- |
| **Arc Challenge** | 54.2 | 55.2 | 55.3 | 54.9 | 54.7 | 41.7 |
| **Winogrande** | 70.4 | 70.3 | 71.5 | 70.4 | 71.0 | 53.1 |

---

## Datasets

- **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
- **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
- **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
- **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.

---

## Metrics

- **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.
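
As a quick sketch of this metric, exact-match accuracy over a list of answers can be computed like so (the answer lists below are illustrative, not taken from the benchmarks above):

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answers."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

# Illustrative toy data: 2 of 3 predictions match exactly.
print(accuracy(["B", "A", "C"], ["B", "A", "D"]))  # prints 0.6666666666666666
```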

---

## Latency Benchmarks

We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.

### Latency Benchmark Results

Tokens per second for different model sizes on various GPUs.

| **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8_int8** |
| --- | --- | --- | --- | --- | --- | --- |
| **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 45 | N/A |

---

## Benchmarking Methodology

The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.

> 5. Calculate the average TTFT and TPS over all iterations.
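
The averaging in the final step can be sketched as follows; the per-iteration timings and token counts below are illustrative placeholders, not measured values:

```python
# Each entry: (time_to_first_token_s, generation_time_s, output_tokens).
# These numbers are placeholders for illustration only.
iterations = [
    (0.05, 3.1, 300),
    (0.04, 3.0, 300),
    (0.06, 3.2, 300),
]

# Average TTFT over iterations.
avg_ttft = sum(ttft for ttft, _, _ in iterations) / len(iterations)
# Average TPS: tokens generated divided by generation time, per iteration.
avg_tps = sum(tokens / gen_time for _, gen_time, tokens in iterations) / len(iterations)

print(f"average TTFT: {avg_ttft:.3f} s")
print(f"average TPS: {avg_tps:.1f} tokens/s")
```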

---

## Serving with Docker Image

For serving with Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
Using our containers, you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers.
You can also use this container to run inference through TheStage AI platform.
### Prebuilt image from ECR

Pull the docker image and start the inference container:

```bash
docker pull public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-llm-24.09c
```

```bash
docker run --rm -ti \
  -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
  -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
  -v /mnt/hf_cache:/root/.cache/huggingface \
  public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-llm-24.09c
```

| **Parameter** | **Description** |
| --- | --- |
| `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
| `<THESTAGE_ACCESS_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
| `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |

---

## Invocation

You can invoke the endpoint using cURL or any OpenAI-compatible client. A sketch with the OpenAI Python client follows; `<ENDPOINT_URL>` and `<MODEL_NAME>` are placeholders for your deployment:

```python
from openai import OpenAI

# Placeholders: point the client at your container and pass the AUTH_TOKEN
# you set at container startup as the API key.
client = OpenAI(base_url="<ENDPOINT_URL>/v1", api_key="<AUTH_TOKEN>")

response = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

---

## Endpoint Parameters

### Method

> **POST** `/v1/chat/completions`

### Header Parameters

> `Authorization`: `string`
>
> Bearer token for authentication. Should match the `AUTH_TOKEN` set during container startup.

### Input Body

> `messages` : `array`
>
> The list of chat messages (objects with `role` and `content`) forming the prompt.
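
As a sketch, a minimal request body for this method can be built as follows (`<MODEL_NAME>` is a placeholder, not a value from this repository):

```python
import json

# Minimal body for POST /v1/chat/completions; "<MODEL_NAME>" is a placeholder.
body = {
    "model": "<MODEL_NAME>",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
    ],
}

payload = json.dumps(body)
print(payload)
```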

---

## Deploy on Modal

For more details, please see the tutorial [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html).

### Clone modal serving code

```shell
git clone https://github.com/TheStageAI/ElasticModels.git
cd ElasticModels/examples/modal
```

### Configuration of environment variables

Set your environment variables in `modal_serving.py`:

```python
ENVS = {
    ...
}
```
### Configuration of GPUs

Set your desired GPU type and autoscaling variables in `modal_serving.py`:

```python
def serve():
    ...
```
### Run serving

```shell
modal serve modal_serving.py
```

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai