README.md CHANGED (+272 -62)
@@ -20,35 +20,58 @@ language:
  - ara
  ---

- # Elastic model: Qwen2.5-7B-Instruct. Fastest and most flexible models for self-serving.

- Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

- * __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
- * __L__: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
- * __M__: Faster model, with accuracy degradation less than 1.5%.
- * __S__: The fastest model, with accuracy degradation less than 2%.

- __Goals of elastic models:__

- * Provide flexibility in cost vs quality selection for inference
- * Provide clear quality and latency benchmarks
- * Provide interface of HF libraries: transformers and diffusers with a single line of code
- * Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
- * Provide the best models and service for self-hosting.

- > It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.

- ![Performance Graph](images/performance_graph.png)
- -----

- ## Inference

- To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:

  ```python
  import torch
@@ -57,7 +80,7 @@ from elastic_models.transformers import AutoModelForCausalLM

  # Currently we require to have your HF token
  # as we use original weights for part of layers and
- # model confugaration as well
  model_name = "Qwen/Qwen2.5-7B-Instruct"
  hf_token = ''
  device = torch.device("cuda")
@@ -111,74 +134,261 @@ print(f"# Q:\n{prompt}\n")
  print(f"# A:\n{output}\n")
  ```

- __System requirements:__
- * GPUs: H100, L40s, 4090, 5090
- * CPU: AMD, Intel
- * Python: 3.10-3.12

- To work with our models just run these lines in your terminal:

- ```shell
- pip install thestage
- pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
- pip install flash_attn==2.7.3 --no-build-isolation

- # or for blackwell support
- pip install 'thestage-elastic-models[blackwell]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
- pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
- # please download the appropriate version of Wheels for your system from https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
- mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
- pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

- pip uninstall apex
  ```

- Then go to [app.thestage.ai](https://app.thestage.ai), login and generate API token from your profile page. Set up API token as follows:

- ```shell
- thestage config set --api-token <YOUR_API_TOKEN>
  ```

- Congrats, now you can use accelerated models!

- ----

- ## Benchmarks

- Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column indicates that we applied W8A8 quantization with int8 data type to all linear layers and used the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!

- ### Quality benchmarks

- | Metric/Model | S | M | L | XL | Original | W8A8, int8 |
- |---------------|---|---|---|----|----------|------------|
- | arc_challenge | 49.10 | 50.10 | 53.20 | 52.60 | 52.60 | 41.70 |
- | mmlu | 71.70 | 73.00 | 74.10 | 73.50 | 73.50 | 64.60 |
- | piqa | 77.00 | 78.20 | 78.80 | 79.50 | 79.50 | 67.10 |
- | winogrande | 66.20 | 69.10 | 71.50 | 70.60 | 70.60 | 53.10 |

- * **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows model's ability to handle diverse academic topics.
- * **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows model's understanding of real-world physics concepts.
- * **Arc Challenge**: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows model's ability to solve complex reasoning tasks.
- * **Winogrande**: Evaluates commonsense reasoning through sentence completion tasks. Shows model's capability to understand context and resolve ambiguity.

- ### Latency benchmarks

- __100 input/300 output; tok/s:__

- | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
- |-----------|-----|---|---|----|----------|------------|
- | H100 | 201 | 173 | 162 | 135 | 62 | 201 |
- | L40S | 76 | 67 | 61 | 47 | 43 | 78 |
- | 5090 | 149 | - | - | - | - | - |
- | 4090 | 98 | - | - | - | - | - |

  ## Links

- * __Platform__: [app.thestage.ai](app.thestage.ai)
  * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
- <!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
  * __Contact email__: contact@thestage.ai
  - ara
  ---

+ # Elastic model: Qwen2.5-7B-Instruct

+ ## Overview

+ Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we produce a series of optimized variants:

+ - **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
+ - **L**: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.
+ - **M**: Faster model, with accuracy degradation of less than 1.5%.
+ - **S**: The fastest model, with accuracy degradation of less than 2%.

+ Models can be accessed via the TheStage AI Python SDK (ElasticModels) or deployed as Docker containers with REST API endpoints (see the Deploy section).

+ ---

+ ## Installation

+ ### System Requirements

+ | **Property** | **Value** |
+ | --- | --- |
+ | **GPU** | L40s, RTX 5090, H100, RTX 4090 |
+ | **Python Version** | 3.10-3.12 |
+ | **CPU** | Intel/AMD x86_64 |
+ | **CUDA Version** | 12.8+ |

+ ### TheStage AI access token setup

+ Install the TheStage AI CLI and set up your API token:

+ ```bash
+ pip install thestage
+ thestage config set --access-token <YOUR_ACCESS_TOKEN>
+ ```

+ ### ElasticModels installation

+ Install the TheStage Elastic Models package:

+ ```bash
+ pip install 'thestage-elastic-models[nvidia,cudnn]' \
+   --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
+ pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
+ ```

+ ---

+ ## Usage example

+ Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Qwen2.5-7B-Instruct model:

  ```python
  import torch

  # Currently we require to have your HF token
  # as we use original weights for part of layers and
+ # model configuration as well
  model_name = "Qwen/Qwen2.5-7B-Instruct"
  hf_token = ''
  device = torch.device("cuda")

  print(f"# A:\n{output}\n")
  ```

+ ---

+ ## Quality Benchmarks

+ We used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we ran the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.

+ ![Quality Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422559-0c9621c5-9e7f-4c81-8698-70f6d6872cb5/Elastic_Qwen2.5_7B_Instruct_MMLU.png)

+ ### Quality Benchmark Results

+ | **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
+ | --- | --- | --- | --- | --- | --- | --- |
+ | **Arc Challenge** | 54.2 | 55.2 | 55.3 | 54.9 | 54.7 | 41.7 |
+ | **MMLU** | 71.5 | 71.6 | 71.9 | 71.9 | 71.8 | 64.6 |
+ | **PIQA** | 78.3 | 79.9 | 79.5 | 79.5 | 79.6 | 67.1 |
+ | **Winogrande** | 70.4 | 70.3 | 71.5 | 70.4 | 71.0 | 53.1 |
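The per-task degradation of the S model can be checked directly against the "Original" column; a minimal sketch, with the values copied from the table above:

```python
# Accuracy points lost by the S model relative to the original,
# using the values from the quality table above.
original = {"Arc Challenge": 54.7, "MMLU": 71.8, "PIQA": 79.6, "Winogrande": 71.0}
s_model = {"Arc Challenge": 54.2, "MMLU": 71.5, "PIQA": 78.3, "Winogrande": 70.4}

for task, base in original.items():
    drop = base - s_model[task]
    print(f"{task}: -{drop:.1f} points for S")
```

Every drop stays within the 2% budget quoted for the S tier above.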

+ ---

+ ## Datasets

+ - **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
+ - **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
+ - **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
+ - **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.

+ ---

+ ## Metrics

+ - **Accuracy**: The proportion of model predictions that exactly match the correct answers across the evaluation tasks.
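As a concrete illustration (this is a generic sketch, not the `lm_eval` implementation), exact-match accuracy reduces to:

```python
def exact_match_accuracy(predictions, references):
    """Share of predictions that exactly match the reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

# Three of four multiple-choice answers correct:
print(exact_match_accuracy(["B", "C", "A", "D"], ["B", "C", "A", "A"]))  # → 0.75
```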

+ ---

+ ## Latency Benchmarks

+ We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.

+ ![Latency Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422585-3065316c-5c07-4430-befb-61daac95f712/Elastic_Qwen2.5_7B_Instruct_latency.png)

+ ### Latency Benchmark Results

+ Tokens per second for different model sizes on various GPUs.

+ | **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8_int8** |
+ | --- | --- | --- | --- | --- | --- | --- |
+ | **H100** | 184 | 177 | 157 | 138 | 62 | 201 |
+ | **L40s** | 72 | 67 | 57 | 48 | 42 | 78 |
+ | **B200** | 239 | 232 | 216 | 199 | 114 | N/A |
+ | **GeForce RTX 5090** | 141 | N/A | N/A | N/A | 66 | N/A |
+ | **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 45 | N/A |
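Reading the table as S vs. Original gives the end-to-end speedup per GPU; a small sketch using the numbers above:

```python
# (S-model TPS, original TPS) per GPU, copied from the latency table above.
tps = {
    "H100": (184, 62),
    "L40s": (72, 42),
    "B200": (239, 114),
    "GeForce RTX 5090": (141, 66),
    "GeForce RTX 4090": (95, 45),
}
for gpu, (s, orig) in tps.items():
    print(f"{gpu}: {s / orig:.2f}x faster than the original")
```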

+ ---

+ ## Benchmarking Methodology

+ The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.

+ > **Algorithm summary:**
+ > 1. Load the Qwen2.5-7B-Instruct model with the specified size (S, M, L, XL, original).
+ > 2. Move the model to the GPU.
+ > 3. Prepare a sample prompt for text generation.
+ > 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
+ >    - Synchronize the GPU to flush any previous operations.
+ >    - Record the start time.
+ >    - Generate the text using the model.
+ >    - Synchronize the GPU again.
+ >    - Record the end time and calculate the TTFT and TPS for that iteration.
+ > 5. Calculate the average TTFT and TPS over all iterations.
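The loop above can be sketched in plain Python. This is an illustrative harness, not the exact benchmark code; `generate` stands in for a real `model.generate` call:

```python
import time

def measure_tps(generate, n_iters=10, output_tokens=300):
    """Average tokens per second over n_iters timed generations."""
    samples = []
    for _ in range(n_iters):
        # With a real model: torch.cuda.synchronize() here to flush pending GPU work.
        start = time.perf_counter()
        generate()  # e.g. model.generate(**inputs, max_new_tokens=output_tokens)
        # ...and torch.cuda.synchronize() again before reading the clock.
        elapsed = time.perf_counter() - start
        samples.append(output_tokens / elapsed)
    return sum(samples) / len(samples)

# Stand-in workload so the sketch runs without a GPU.
print(f"average TPS: {measure_tps(lambda: sum(range(100_000))):.0f}")
```

TTFT can be measured with the same pattern by timing generation of only the first output token.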

+ ---

+ ## Serving with Docker Image

+ For serving on Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
+ Using our containers, you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers.
+ You can also use this container to run inference through the TheStage AI platform.

+ ### Prebuilt image from ECR

+ | **GPU** | **Docker image name** |
+ | --- | --- |
+ | H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
+ | B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |

+ Pull the Docker image for your Nvidia GPU and start the inference container:

+ ```bash
+ docker pull <IMAGE_NAME>
+ ```
+ ```bash
+ docker run --rm -ti \
+   --name serving_thestage_model \
+   -p 8000:80 \
+   -e AUTH_TOKEN=<AUTH_TOKEN> \
+   -e MODEL_REPO=Qwen/Qwen2.5-7B-Instruct \
+   -e MODEL_SIZE=<MODEL_SIZE> \
+   -e MODEL_BATCH=<MAX_BATCH_SIZE> \
+   -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
+   -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
+   -v /mnt/hf_cache:/root/.cache/huggingface \
+   <IMAGE_NAME>
  ```

+ | **Parameter** | **Description** |
+ | --- | --- |
+ | `<MODEL_SIZE>` | Model size to serve. Available: S, M, L, XL. |
+ | `<MAX_BATCH_SIZE>` | Maximum batch size to process in parallel. |
+ | `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
+ | `<THESTAGE_ACCESS_TOKEN>` | TheStage access token generated on the platform (Profile -> Access tokens). |
+ | `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
+ | `<IMAGE_NAME>` | The image name you pulled for your GPU. |

+ ---

+ ## Invocation

+ You can invoke the endpoint using cURL as follows:

+ ```bash
+ curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
+   -H 'Authorization: Bearer 123' \
+   -H 'Content-Type: application/json' \
+   -H "X-Model-Name: qwen-2-5-7b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged" \
+   -d '{
+     "messages": [{"role": "user", "content": "Define AI"}]
+   }'
+ ```

+ Or using the OpenAI Python client:

+ ```python
+ from openai import OpenAI
+
+ BASE_URL = "http://<your_ip>/v1"
+ API_KEY = "123"
+ MODEL = "qwen-2-5-7b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged"
+
+ client = OpenAI(
+     api_key=API_KEY,
+     base_url=BASE_URL,
+     default_headers={"X-Model-Name": MODEL},
+ )
+
+ response = client.chat.completions.create(
+     model=MODEL,
+     messages=[
+         {"role": "user", "content": "Define AI"}
+     ],
+ )
+
+ print(response.choices[0].message.content)
  ```

+ ---

+ ## Endpoint Parameters

+ ### Method

+ > **POST** `/v1/chat/completions`

+ ### Header Parameters

+ > `Authorization`: `string`
+ >
+ > Bearer token for authentication. Should match the `AUTH_TOKEN` set during container startup.

+ > `Content-Type`: `string`
+ >
+ > Must be set to `application/json`.

+ > `X-Model-Name`: `string`
+ >
+ > Specifies the model to use for generation. Format: `qwen-2-5-7b-instruct-<size>-bs<batch_size>-paged`, as in the invocation examples above, where `<size>` is one of `S`, `M`, `L`, `XL`, `original` and `<batch_size>` is the maximum batch size configured during container startup.

+ ### Input Body

+ > `messages`: `array`
+ >
+ > A list of chat messages, each an object with `role` and `content` fields, following the OpenAI Chat Completions schema.
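The header value can be assembled programmatically; a hypothetical helper (`model_header_name` is ours, not part of the SDK) that mirrors the invocation examples above:

```python
def model_header_name(size: str, batch_size: int) -> str:
    """Build the X-Model-Name value for the serving container."""
    # Hypothetical helper: the name mirrors the curl/OpenAI-client examples above.
    return f"qwen-2-5-7b-instruct-{size}-bs{batch_size}-paged"

print(model_header_name("S", 4))  # → qwen-2-5-7b-instruct-S-bs4-paged
```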
 

+ ---

+ ## Deploy on Modal

+ For more details, see the tutorial [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html).

+ ### Clone Modal serving code

+ ```shell
+ git clone https://github.com/TheStageAI/ElasticModels.git
+ cd ElasticModels/examples/modal
+ ```

+ ### Configuration of environment variables

+ Set your environment variables in `modal_serving.py`:

+ ```python
+ # modal_serving.py
+
+ ENVS = {
+     "MODEL_REPO": "Qwen/Qwen2.5-7B-Instruct",
+     "MODEL_BATCH": "4",
+     "THESTAGE_AUTH_TOKEN": "",
+     "HUGGINGFACE_ACCESS_TOKEN": "",
+     "PORT": "80",
+     "PORT_HEALTH": "80",
+     "HF_HOME": "/cache/huggingface",
+ }
+ ```

+ ### Configuration of GPUs

+ Set your desired GPU type and autoscaling variables in `modal_serving.py`:

+ ```python
+ # modal_serving.py
+
+ @app.function(
+     image=image,
+     gpu="B200",
+     min_containers=8,
+     max_containers=8,
+     timeout=10000,
+     ephemeral_disk=600 * 1024,
+     volumes={"/opt/project/.cache": HF_CACHE},
+     startup_timeout=60 * 20,
+ )
+ @modal.web_server(
+     80,
+     label="Qwen/Qwen2.5-7B-Instruct-test",
+     startup_timeout=60 * 20,
+ )
+ def serve():
+     pass
+ ```

+ ### Run serving

+ ```shell
+ modal serve modal_serving.py
+ ```

  ## Links

+ * __Platform__: [app.thestage.ai](https://app.thestage.ai)
  * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
  * __Contact email__: contact@thestage.ai