README.md CHANGED (+273 -62)
@@ -20,35 +20,59 @@ language:
  - ara
  ---

- # Elastic model: Qwen2.5-7B-Instruct. Fastest and most flexible models for self-serving.

- Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

- * __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
- * __L__: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
- * __M__: Faster model, with accuracy degradation less than 1.5%.
- * __S__: The fastest model, with accuracy degradation less than 2%.

- __Goals of elastic models:__

- * Provide flexibility in cost vs quality selection for inference
- * Provide clear quality and latency benchmarks
- * Provide interface of HF libraries: transformers and diffusers with a single line of code
- * Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
- * Provide the best models and service for self-hosting.

- > It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.

- ![Performance Graph](images/performance_graph.png)
- -----

- ## Inference

- To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:

  ```python
  import torch
@@ -57,7 +81,7 @@ from elastic_models.transformers import AutoModelForCausalLM

  # Currently we require to have your HF token
  # as we use original weights for part of layers and
- # model confugaration as well
  model_name = "Qwen/Qwen2.5-7B-Instruct"
  hf_token = ''
  device = torch.device("cuda")
@@ -111,74 +135,261 @@ print(f"# Q:\n{prompt}\n")
  print(f"# A:\n{output}\n")
  ```

- __System requirements:__
- * GPUs: H100, L40s, 4090, 5090
- * CPU: AMD, Intel
- * Python: 3.10-3.12

- To work with our models just run these lines in your terminal:

- ```shell
- pip install thestage
- pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
- pip install flash_attn==2.7.3 --no-build-isolation
-
- # or for blackwell support
- pip install 'thestage-elastic-models[blackwell]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
- pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
- # please download the appropriate version of Wheels for your system from https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
- mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
- pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
-
- pip uninstall apex
  ```

- Then go to [app.thestage.ai](https://app.thestage.ai), login and generate API token from your profile page. Set up API token as follows:

- ```shell
- thestage config set --api-token <YOUR_API_TOKEN>
  ```

- Congrats, now you can use accelerated models!

- ----

- ## Benchmarks

- Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column indicates that we applied W8A8 quantization with the int8 data type to all linear layers and used the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!

- ### Quality benchmarks

- | Metric/Model | S | M | L | XL | Original | W8A8, int8 |
- |---------------|---|---|---|----|----------|------------|
- | arc_challenge | 49.10 | 50.10 | 53.20 | 52.60 | 52.60 | 41.70 |
- | mmlu | 71.70 | 73.00 | 74.10 | 73.50 | 73.50 | 64.60 |
- | piqa | 77.00 | 78.20 | 78.80 | 79.50 | 79.50 | 67.10 |
- | winogrande | 66.20 | 69.10 | 71.50 | 70.60 | 70.60 | 53.10 |

- * **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows model's ability to handle diverse academic topics.
- * **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows model's understanding of real-world physics concepts.
- * **Arc Challenge**: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows model's ability to solve complex reasoning tasks.
- * **Winogrande**: Evaluates commonsense reasoning through sentence completion tasks. Shows model's capability to understand context and resolve ambiguity.

- ### Latency benchmarks

- __100 input/300 output; tok/s:__

- | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
- |-----------|-----|---|---|----|----------|------------|
- | H100 | 201 | 173 | 162 | 135 | 62 | 201 |
- | L40S | 76 | 67 | 61 | 47 | 43 | 78 |
- | 5090 | 149 | - | - | - | - | - |
- | 4090 | 98 | - | - | - | - | - |

  ## Links

- * __Platform__: [app.thestage.ai](app.thestage.ai)
  * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
- <!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
  * __Contact email__: contact@thestage.ai

  - ara
  ---

+ # Elastic model: Qwen2.5-7B-Instruct
+
+ ## Overview
+
+ ----
+
+ ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we have produced a series of optimized models:
+
+ - **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
+ - **L**: Near lossless model, with less than 1% degradation on the corresponding benchmarks.
+ - **M**: Faster model, with accuracy degradation of less than 1.5%.
+ - **S**: The fastest model, with accuracy degradation of less than 2%.
+
+ Models can be accessed via the TheStage AI Python SDK (ElasticModels) or deployed as Docker containers with REST API endpoints (see the Serving and Deploy sections below).
+
+ ## Installation
+
+ ---
+
+ ### System Requirements
+
+ | **Property** | **Value** |
+ | --- | --- |
+ | **GPU** | L40s, RTX 5090, H100, RTX 4090 |
+ | **Python Version** | 3.10-3.12 |
+ | **CPU** | Intel/AMD x86_64 |
+ | **CUDA Version** | 12.8+ |
+
+ ### TheStage AI Access token setup
+
+ Install the TheStage AI CLI and set up your API token:
+
+ ```bash
+ pip install thestage
+ thestage config set --access-token <YOUR_ACCESS_TOKEN>
+ ```
+
+ ### ElasticModels installation
+
+ Install the TheStage Elastic Models package:
+
+ ```bash
+ pip install 'thestage-elastic-models[nvidia,cudnn]' \
+   --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
+ pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
+ ```
+
+ ## Usage example
+
+ ----
+
+ Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Qwen2.5-7B-Instruct model:

  ```python
  import torch

  # Currently we require to have your HF token
  # as we use original weights for part of layers and
+ # model configuration as well
  model_name = "Qwen/Qwen2.5-7B-Instruct"
  hf_token = ''
  device = torch.device("cuda")

  print(f"# A:\n{output}\n")
  ```

+ ## Quality Benchmarks
+
+ ------------
+
+ We have used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we have run the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.
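For reference, an evaluation like this can be reproduced with the `lm-evaluation-harness` CLI. This is an illustrative sketch only: it evaluates the original Hugging Face checkpoint (the Elastic S/M/L/XL variants load through `elastic_models` instead), and the exact flags, dtype, and few-shot settings used for this card are not stated, so treat every option below as an assumption:

```shell
# Hypothetical reproduction sketch; task names follow lm-evaluation-harness.
pip install lm_eval
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,dtype=bfloat16 \
  --tasks mmlu,piqa,arc_challenge,winogrande \
  --batch_size 1 \
  --device cuda:0
```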
+
+ ![Quality Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422559-0c9621c5-9e7f-4c81-8698-70f6d6872cb5/Elastic_Qwen2.5_7B_Instruct_MMLU.png)
+
+ ### Quality Benchmark Results
+
+ | **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
+ | --- | --- | --- | --- | --- | --- | --- |
+ | **Arc Challenge** | 54.2 | 55.2 | 55.3 | 54.9 | 54.7 | 41.7 |
+ | **MMLU** | 71.5 | 71.6 | 71.9 | 71.9 | 71.8 | 64.6 |
+ | **PIQA** | 78.3 | 79.9 | 79.5 | 79.5 | 79.6 | 67.1 |
+ | **Winogrande** | 70.4 | 70.3 | 71.5 | 70.4 | 71.0 | 53.1 |
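The size tiers promise bounded degradation (less than 1% for L, 1.5% for M, 2% for S). A quick sanity check against the table's own numbers, transcribed by hand, so the values are only as accurate as the table above:

```python
# Accuracy drop of each Elastic variant relative to the Original model,
# using the scores from the quality table above (in accuracy points).
scores = {
    #                 S     M     L     XL    Original
    "arc_challenge": (54.2, 55.2, 55.3, 54.9, 54.7),
    "mmlu":          (71.5, 71.6, 71.9, 71.9, 71.8),
    "piqa":          (78.3, 79.9, 79.5, 79.5, 79.6),
    "winogrande":    (70.4, 70.3, 71.5, 70.4, 71.0),
}

for task, (s, m, l, xl, orig) in scores.items():
    # Positive values mean the compressed model lost accuracy.
    print(f"{task:14s} S: {orig - s:+.1f}  M: {orig - m:+.1f}  "
          f"L: {orig - l:+.1f}  XL: {orig - xl:+.1f}")
```

On these four tasks the S model stays within about 1.3 points of the original, comfortably inside its budget, and some variants even score above the original (e.g. M and L on Arc Challenge).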
+
+ ## Datasets
+
+ -------
+
+ - **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
+ - **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
+ - **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
+ - **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.
+
+ ## Metrics
+
+ ----------
+
+ - **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.
+
+ ## Latency Benchmarks
+
+ -----
+
+ We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.
+
+ ![Latency Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422585-3065316c-5c07-4430-befb-61daac95f712/Elastic_Qwen2.5_7B_Instruct_latency.png)
+
+ ### Latency Benchmark Results
+
+ Tokens per second for different model sizes on various GPUs.
+
+ | **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
+ | --- | --- | --- | --- | --- | --- | --- |
+ | **H100** | 184 | 177 | 157 | 138 | 62 | 201 |
+ | **L40s** | 72 | 67 | 57 | 48 | 42 | 78 |
+ | **B200** | 239 | 232 | 216 | 199 | 114 | N/A |
+ | **GeForce RTX 5090** | 141 | N/A | N/A | N/A | 66 | N/A |
+ | **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 45 | N/A |
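To make the speedups explicit, divide each variant's throughput by the original model's on the same GPU; a small sketch using the H100 and L40s rows from the table above:

```python
# Throughput (tokens/s) from the latency table above, per GPU.
tps = {
    "H100": {"S": 184, "M": 177, "L": 157, "XL": 138, "Original": 62},
    "L40s": {"S": 72, "M": 67, "L": 57, "XL": 48, "Original": 42},
}

for gpu, row in tps.items():
    base = row["Original"]
    speedups = ", ".join(
        f"{size}: {row[size] / base:.2f}x" for size in ("S", "M", "L", "XL")
    )
    print(f"{gpu}: {speedups} vs original")
```

On H100 even the lossless XL variant comes out over 2x faster than the original and S approaches 3x; on the L40s the gains range from roughly 1.1x (XL) to 1.7x (S).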
+
+ ## Benchmarking Methodology
+
+ ----
+
+ The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.
+
+ > **Algorithm summary:**
+ > 1. Load the Qwen2.5-7B-Instruct model with the specified size (S, M, L, XL, original).
+ > 2. Move the model to the GPU.
+ > 3. Prepare a sample text prompt.
+ > 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
+ >    - Synchronize the GPU to flush any previous operations.
+ >    - Record the start time.
+ >    - Generate the text using the model.
+ >    - Synchronize the GPU again.
+ >    - Record the end time and calculate the TTFT and TPS for that iteration.
+ > 5. Calculate the average TTFT and TPS over all iterations.
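The measurement loop described above can be sketched in code. This is an illustrative harness, not TheStage's benchmark implementation: `generate` is a stand-in for the real `model.generate(...)` call, and on a real GPU you would uncomment the `torch.cuda.synchronize()` lines so timestamps are not taken while kernels are still in flight:

```python
import time
from statistics import mean

OUTPUT_TOKENS = 300   # matches the 100-input/300-output setup above
ITERATIONS = 10

def generate(prompt: str) -> str:
    # Stand-in for model.generate(...); replace with the real call.
    time.sleep(0.01)
    return "token " * OUTPUT_TOKENS

def benchmark(prompt: str = "Define AI") -> float:
    """Average tokens-per-second over ITERATIONS runs."""
    tps_samples = []
    for _ in range(ITERATIONS):
        # torch.cuda.synchronize()  # flush pending GPU work before timing
        start = time.perf_counter()
        generate(prompt)
        # torch.cuda.synchronize()  # wait until generation has really finished
        elapsed = time.perf_counter() - start
        tps_samples.append(OUTPUT_TOKENS / elapsed)
    return mean(tps_samples)

print(f"average TPS: {benchmark():.1f}")
```

Measuring TTFT additionally requires a hook on the first generated token (for example a streaming callback); with a blocking `generate` call, only total latency and TPS are observable.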
+
+ ## Serving with Docker Image
+
+ ------------
+
+ For serving on Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
+ Using our containers you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers.
+ You can also use this container to run inference through the TheStage AI platform.
+
+ ### Prebuilt image from ECR
+
+ | **GPU** | **Docker image name** |
+ | --- | --- |
+ | H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
+ | B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |
+
+ Pull the Docker image for your Nvidia GPU and start the inference container:
+
+ ```bash
+ docker pull <IMAGE_NAME>
+ ```
+ ```bash
+ docker run --rm -ti \
+   --name serving_thestage_model \
+   -p 8000:80 \
+   -e AUTH_TOKEN=<AUTH_TOKEN> \
+   -e MODEL_REPO=Qwen/Qwen2.5-7B-Instruct \
+   -e MODEL_SIZE=<MODEL_SIZE> \
+   -e MODEL_BATCH=<MAX_BATCH_SIZE> \
+   -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
+   -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
+   -v /mnt/hf_cache:/root/.cache/huggingface \
+   <IMAGE_NAME>
+ ```
+
+ | **Parameter** | **Description** |
+ | --- | --- |
+ | `<MODEL_SIZE>` | Available: S, M, L, XL. |
+ | `<MAX_BATCH_SIZE>` | Maximum batch size to process in parallel. |
+ | `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
+ | `<THESTAGE_ACCESS_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
+ | `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
+ | `<IMAGE_NAME>` | Image name which you have pulled (depends on your GPU; see the table above). |
+
+ ## Invocation
+
+ ------
+
+ You can invoke the endpoint using cURL as follows (here `AUTH_TOKEN` was set to `123`):
+
+ ```bash
+ curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
+   -H 'Authorization: Bearer 123' \
+   -H 'Content-Type: application/json' \
+   -H "X-Model-Name: qwen-2-5-7b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged" \
+   -d '{
+     "messages":[{"role":"user","content":"Define AI"}]
+   }'
+ ```
+
+ Or using the OpenAI Python client:
+
+ ```python
+ from openai import OpenAI
+
+ BASE_URL = "http://<your_ip>:8000/v1"
+ API_KEY = "123"  # must match the container's AUTH_TOKEN
+ MODEL = "qwen-2-5-7b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged"
+
+ client = OpenAI(
+     api_key=API_KEY,
+     base_url=BASE_URL,
+     default_headers={"X-Model-Name": MODEL}
+ )
+
+ response = client.chat.completions.create(
+     model=MODEL,
+     messages=[
+         {"role": "user", "content": "Define AI"}
+     ]
+ )
+
+ print(response.choices[0].message.content)
+ ```
+
+ ## Endpoint Parameters
+
+ -------------
+
+ ### Method
+
+ > **POST** `/v1/chat/completions`
+
+ ### Header Parameters
+
+ > `Authorization`: `string`
+ >
+ > Bearer token for authentication. Should match the `AUTH_TOKEN` set during container startup.
+
+ > `Content-Type`: `string`
+ >
+ > Must be set to `application/json`.
+
+ > `X-Model-Name`: `string`
+ >
+ > Specifies the model to use for generation. Format: `qwen-2-5-7b-instruct-<size>-bs<batch_size>-paged`, where `<size>` is one of `S`, `M`, `L`, `XL`, `original` and `<batch_size>` is the maximum batch size configured during container startup.
+
+ ### Input Body
+
+ > `messages` : `array`
+ >
+ > A list of chat messages, each an object with `role` and `content` fields.
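The header and body from the examples above can also be assembled programmatically. A minimal sketch: the `-paged` suffix mirrors the invocation examples in this card, and the size/batch values shown are placeholders you must match to your container's startup settings:

```python
import json

def model_name(size: str, batch: int) -> str:
    # Mirrors the naming used in the invocation examples above.
    return f"qwen-2-5-7b-instruct-{size}-bs{batch}-paged"

def chat_request(user_text: str, size: str = "S", batch: int = 4) -> tuple[dict, str]:
    """Return (headers, JSON body) for POST /v1/chat/completions."""
    headers = {
        "Authorization": "Bearer <AUTH_TOKEN>",  # must match the container's AUTH_TOKEN
        "Content-Type": "application/json",
        "X-Model-Name": model_name(size, batch),
    }
    body = json.dumps({"messages": [{"role": "user", "content": user_text}]})
    return headers, body

headers, body = chat_request("Define AI")
print(headers["X-Model-Name"])   # qwen-2-5-7b-instruct-S-bs4-paged
print(body)
```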
+
+ ## Deploy on Modal
+
+ -----------------------
+
+ For more details, please see the tutorial: [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html).
+
+ ### Clone modal serving code
+
+ ```shell
+ git clone https://github.com/TheStageAI/ElasticModels.git
+ cd ElasticModels/examples/modal
+ ```
+
+ ### Configuration of environment variables
+
+ Set your environment variables in `modal_serving.py`:
+
+ ```python
+ # modal_serving.py
+
+ ENVS = {
+     "MODEL_REPO": "Qwen/Qwen2.5-7B-Instruct",
+     "MODEL_BATCH": "4",
+     "THESTAGE_AUTH_TOKEN": "",
+     "HUGGINGFACE_ACCESS_TOKEN": "",
+     "PORT": "80",
+     "PORT_HEALTH": "80",
+     "HF_HOME": "/cache/huggingface",
+ }
+ ```
+
+ ### Configuration of GPUs
+
+ Set your desired GPU type and autoscaling variables in `modal_serving.py`:
+
+ ```python
+ # modal_serving.py
+
+ @app.function(
+     image=image,
+     gpu="B200",
+     min_containers=8,
+     max_containers=8,
+     timeout=10000,
+     ephemeral_disk=600 * 1024,
+     volumes={"/opt/project/.cache": HF_CACHE},
+     startup_timeout=60*20
+ )
+ @modal.web_server(
+     80,
+     label="Qwen/Qwen2.5-7B-Instruct-test",
+     startup_timeout=60*20
+ )
+ def serve():
+     pass
+ ```
+
+ ### Run serving
+
+ ```shell
+ modal serve modal_serving.py
+ ```

  ## Links

+ * __Platform__: [app.thestage.ai](https://app.thestage.ai)
  * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
  * __Contact email__: contact@thestage.ai