neuroeng committed on
Commit 3cbc9ea · verified · 1 Parent(s): f45f361

Update README.md

Files changed (1):
  1. README.md (+54 −19)

README.md CHANGED
@@ -25,6 +25,8 @@ language:
 
 ## Overview
 
 ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we have produced a series of optimized models:
 
 - **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
@@ -34,12 +36,15 @@ ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Netw
 
 Models can be accessed via TheStage AI Python SDK: ElasticModels, or deployed as Docker containers with REST API endpoints (see Deploy section).
 
 ---
 
-## Installation
 
 ### System Requirements
 
 | **Property** | **Value** |
 | --- | --- |
 | **GPU** | L40s, RTX 5090, H100, RTX 4090 |
@@ -50,6 +55,8 @@ Models can be accessed via TheStage AI Python SDK: ElasticModels, or deployed as
 
 ### TheStage AI Access token setup
 
 Install the TheStage AI CLI and set up your API token:
 
 ```bash
@@ -59,6 +66,8 @@ thestage config set --access-token <YOUR_ACCESS_TOKEN>
 
 ### ElasticModels installation
 
 Install the TheStage Elastic Models package:
 
 ```bash
@@ -67,9 +76,10 @@ pip install 'thestage-elastic-models[nvidia,cudnn]' \
 pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
 ```
 
 ---
 
-## Usage example
 
 Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Qwen2.5-7B-Instruct model:
 
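The full Qwen2.5-7B-Instruct example is elided from this diff, so the sketch below is a runnable stand-in: a stub model mimics the transformers-style `generate()` call flow that ElasticModels mirrors. Nothing here is the real `elastic_models` API; all names are placeholders.

```python
# Minimal sketch of the transformers-style flow described above.
# A stub stands in for the real model so the control flow runs anywhere;
# the actual package loads real weights via a from_pretrained()-style call.

def chat_to_prompt(messages):
    """Toy stand-in for tokenizer.apply_chat_template."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

class StubCausalLM:
    """Placeholder for a causal LM; only the call shape matters here."""

    def generate(self, prompt: str, max_new_tokens: int = 64) -> str:
        return prompt + "\nassistant: <generated text>"

messages = [{"role": "user", "content": "What is ANNA?"}]
prompt = chat_to_prompt(messages)
output = StubCausalLM().generate(prompt)
print(f"# A:\n{output}\n")
```

The point is only the shape of the interface (chat messages → prompt → `generate`); the real example in the README fills in the actual model classes.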
@@ -135,9 +145,10 @@ print(f"# A:\n{output}\n")
 ```
 
 ---
 
-## Quality Benchmarks
 
 We used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we ran the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.
 
@@ -145,6 +156,8 @@ We have used the `lm_eval` library to validate the models. For each model size (
 
 ### Quality Benchmark Results
 
 | **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
 | --- | --- | --- | --- | --- | --- | --- |
 | **Arc Challenge** | 54.2 | 55.2 | 55.3 | 54.9 | 54.7 | 41.7 |
@@ -153,25 +166,28 @@ We have used the `lm_eval` library to validate the models. For each model size (
 | **Winogrande** | 70.4 | 70.3 | 71.5 | 70.4 | 71.0 | 53.1 |
 
 ---
 
-## Datasets
 
 - **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
 - **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
 - **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
 - **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.
 
 ---
 
-## Metrics
 
 - **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.
 
 ---
 
-## Latency Benchmarks
 
 We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.
 
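The accuracy metric in the quality tables reduces to exact-match counting. A minimal generic sketch is below; note that `lm_eval`'s per-task scoring adds details (e.g. ranking answer choices by log-likelihood) that this simplification omits.

```python
# "Accuracy" as exact-match counting: the fraction of predictions equal to
# the reference answers. The example answers below are invented.

def accuracy(predictions, references):
    """Fraction of predictions that exactly match the references."""
    if len(predictions) != len(references):
        raise ValueError("prediction/reference length mismatch")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# 3 of 4 hypothetical multiple-choice answers correct -> 75.0, on the same
# percentage scale as the benchmark tables.
print(accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]) * 100)
```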
@@ -179,6 +195,8 @@ We measured TPS (tokens per second) for each model size using 100 input tokens a
 
 ### Latency Benchmark Results
 
 Tokens per second for different model sizes on various GPUs.
 
 | **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8_int8** |
@@ -190,9 +208,10 @@ Tokens per second for different model sizes on various GPUs.
 | **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 45 | N/A |
 
 ---
 
-## Benchmarking Methodology
 
 Benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.
 
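The per-iteration averaging described above can be sketched as follows. One common definition is used here, TPS measured over the decode phase excluding TTFT; the README does not spell out the exact formula, and the timing numbers below are invented for illustration.

```python
# Sketch of averaging TTFT and TPS over benchmark iterations.
# Assumption: TPS = output_tokens / (total_time - ttft), i.e. decode-phase
# throughput. Timings are made up; a real run would collect 10 iterations.

def tokens_per_second(ttft_s, total_s, output_tokens):
    """Decode-phase throughput, excluding time to first token."""
    return output_tokens / (total_s - ttft_s)

OUTPUT_TOKENS = 300  # matches the 300-output-token benchmark setup
runs = [(0.05, 3.05), (0.04, 3.04), (0.06, 3.06)]  # (ttft_s, total_s) samples

avg_ttft = sum(t for t, _ in runs) / len(runs)
avg_tps = sum(tokens_per_second(t, tot, OUTPUT_TOKENS) for t, tot in runs) / len(runs)
print(f"avg TTFT: {avg_ttft:.3f}s, avg TPS: {avg_tps:.0f}")
```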
@@ -209,9 +228,10 @@ The benchmarking was performed on a single GPU with a batch size of 1. Each mode
 > 5. Calculate the average TTFT and TPS over all iterations.
 
 ---
 
-## Serving with Docker Image
 
 For serving on Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
 Using our containers, you can set up an inference endpoint with any cloud or serverless provider, as well as on on-premise servers.
@@ -219,15 +239,12 @@ You can also use this container to run inference through TheStage AI platform.
 
 ### Prebuilt image from ECR
 
-| **GPU** | **Docker image name** |
-| --- | --- |
-| H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
-| B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |
 
-Pull docker image for your Nvidia GPU and start inference container:
 
 ```bash
-docker pull <IMAGE_NAME>
 ```
 ```bash
 docker run --rm -ti \
@@ -240,7 +257,7 @@ docker run --rm -ti \
   -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
   -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
   -v /mnt/hf_cache:/root/.cache/huggingface \
-  <IMAGE_NAME_DEPENDING_ON_YOUR_GPU>
 ```
 
 | **Parameter** | **Description** |
@@ -250,11 +267,11 @@ docker run --rm -ti \
 | `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
 | `<THESTAGE_ACCESS_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
 | `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
-| `<IMAGE_NAME>` | Image name which you have pulled. |
 
 ---
 
-## Invocation
 
 You can invoke the endpoint using cURL as follows:
 
@@ -294,16 +311,21 @@ response = client.chat.completions.create(
 print(response.choices[0].message.content)
 ```
 
 ---
 
-## Endpoint Parameters
 
 ### Method
 
 > **POST** `/v1/chat/completions`
 
 ### Header Parameters
 
 > `Authorization`: `string`
 >
 > Bearer token for authentication. Should match the `AUTH_TOKEN` set during container startup.
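Putting the method and header parameters together, the sketch below builds the request by hand so the pieces above are concrete. The model name and endpoint URL are placeholders, not values taken from this README.

```python
# Hedged sketch of a request to the OpenAI-compatible endpoint.
# <MODEL_NAME> is a placeholder; the served model name depends on the container.
import json

def build_chat_request(auth_token: str, user_text: str,
                       model: str = "<MODEL_NAME>"):
    headers = {
        # Must match the AUTH_TOKEN passed at container startup.
        "Authorization": f"Bearer {auth_token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    })
    return headers, body

headers, body = build_chat_request("my-secret-token", "Hello!")
print(headers["Authorization"])
```

The same headers and body work for the cURL invocation and for the OpenAI Python client, which assembles them internally.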
@@ -318,19 +340,24 @@ print(response.choices[0].message.content)
 
 ### Input Body
 
 > `messages`: `string`
 >
 > The input text prompt.
 
 ---
 
-## Deploy on Modal
 
 For more details, please see the tutorial [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html).
 
 ### Clone modal serving code
 
 ```shell
 git clone https://github.com/TheStageAI/ElasticModels.git
 cd ElasticModels/examples/modal
@@ -338,6 +365,8 @@ cd ElasticModels/examples/modal
 
 ### Configuration of environment variables
 
 Set your environment variables in `modal_serving.py`:
 
 ```python
@@ -356,6 +385,8 @@ ENVS = {
 
 ### Configuration of GPUs
 
 Set your desired GPU type and autoscaling variables in `modal_serving.py`:
 
 ```python
@@ -382,6 +413,8 @@ def serve():
 
 ### Run serving
 
 ```shell
 modal serve modal_serving.py
 ```
@@ -389,6 +422,8 @@ modal serve modal_serving.py
 
 ## Links
 
 * __Platform__: [app.thestage.ai](https://app.thestage.ai)
 * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
 * __Contact email__: contact@thestage.ai
 
 
 ## Overview
 
+---
+
 ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we have produced a series of optimized models:
 
 - **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
 
 
 Models can be accessed via TheStage AI Python SDK: ElasticModels, or deployed as Docker containers with REST API endpoints (see Deploy section).
 
+## Installation
+
 ---
 
 ### System Requirements
 
+---
+
 | **Property** | **Value** |
 | --- | --- |
 | **GPU** | L40s, RTX 5090, H100, RTX 4090 |
 
 
 ### TheStage AI Access token setup
 
+---
+
 Install the TheStage AI CLI and set up your API token:
 
 ```bash
 
 
 ### ElasticModels installation
 
+---
+
 Install the TheStage Elastic Models package:
 
 ```bash
 
 pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
 ```
 
+## Usage example
+
 ---
 
 Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Qwen2.5-7B-Instruct model:
 
 ```
 
+## Quality Benchmarks
+
 ---
 
 We used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we ran the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.
 
 
 ### Quality Benchmark Results
 
+---
+
 | **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
 | --- | --- | --- | --- | --- | --- | --- |
 | **Arc Challenge** | 54.2 | 55.2 | 55.3 | 54.9 | 54.7 | 41.7 |
 
 | **Winogrande** | 70.4 | 70.3 | 71.5 | 70.4 | 71.0 | 53.1 |
 
+## Datasets
+
 ---
 
 - **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
 - **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
 - **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
 - **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.
 
+## Metrics
+
 ---
 
 - **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.
 
+## Latency Benchmarks
+
 ---
 
 We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.
 
 
 
 ### Latency Benchmark Results
 
+---
+
 Tokens per second for different model sizes on various GPUs.
 
 | **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8_int8** |
 
 | **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 45 | N/A |
 
+## Benchmarking Methodology
+
 ---
 
 Benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.
 
 
 > 5. Calculate the average TTFT and TPS over all iterations.
 
+## Serving with Docker Image
+
 ---
 
 For serving on Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
 Using our containers, you can set up an inference endpoint with any cloud or serverless provider, as well as on on-premise servers.
 
 
 ### Prebuilt image from ECR
 
+---
 
+Pull the docker image and start the inference container:
 
 ```bash
+docker pull public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-llm-24.09c
 ```
 ```bash
 docker run --rm -ti \
 
   -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
   -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
   -v /mnt/hf_cache:/root/.cache/huggingface \
+  public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-llm-24.09c
 ```
 
 | **Parameter** | **Description** |
 
 | `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
 | `<THESTAGE_ACCESS_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
 | `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
+
+## Invocation
 
 ---
 
 You can invoke the endpoint using cURL as follows:
 
 
 print(response.choices[0].message.content)
 ```
 
+## Endpoint Parameters
+
 ---
 
 ### Method
 
+---
+
 > **POST** `/v1/chat/completions`
 
 ### Header Parameters
 
+---
+
 > `Authorization`: `string`
 >
 > Bearer token for authentication. Should match the `AUTH_TOKEN` set during container startup.
 
 
 ### Input Body
 
+---
+
 > `messages`: `string`
 >
 > The input text prompt.
 
+## Deploy on Modal
+
 ---
 
 For more details, please see the tutorial [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html).
 
 ### Clone modal serving code
 
+---
+
 ```shell
 git clone https://github.com/TheStageAI/ElasticModels.git
 cd ElasticModels/examples/modal
 
 
 ### Configuration of environment variables
 
+---
+
 Set your environment variables in `modal_serving.py`:
 
 ```python
 
 
 ### Configuration of GPUs
 
+---
+
 Set your desired GPU type and autoscaling variables in `modal_serving.py`:
 
 ```python
 
 
 ### Run serving
 
+---
+
 ```shell
 modal serve modal_serving.py
 ```
 
 
 ## Links
 
+---
+
 * __Platform__: [app.thestage.ai](https://app.thestage.ai)
 * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
 * __Contact email__: contact@thestage.ai