---
license: llama3.1
base_model:
- meta-llama/Llama-3.1-8B-Instruct
base_model_relation: quantized
pipeline_tag: text-generation
language:
- ara
---

# Elastic model: Llama-3.1-8B-Instruct

## Overview

---

ElasticModels are models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency, and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we produce a series of optimized variants:

- **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
- **L**: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.
- **M**: Faster model, with accuracy degradation below 1.5%.
- **S**: The fastest model, with accuracy degradation below 2%.

Models can be accessed via the TheStage AI Python SDK (ElasticModels) or deployed as Docker containers with OpenAI-compatible REST API endpoints (see the serving and deployment sections below).

## Installation

---

### System Requirements

| **Property** | **Value** |
| --- | --- |
| **GPU** | L40s, RTX 5090, H100, B200 |
| **Python Version** | 3.10-3.12 |
| **CPU** | Intel/AMD x86_64 |
| **CUDA Version** | 12.9+ |

### TheStage AI access token setup

Install the TheStage AI CLI and set up your API token:

```bash
pip install thestage
thestage config set --access-token <YOUR_ACCESS_TOKEN>
```

### ElasticModels installation

Install the TheStage Elastic Models package:

```bash
pip install 'thestage-elastic-models[nvidia,cudnn]' \
  --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
```

## Usage example

---

Elastic Models provides the same interface as Hugging Face Transformers; you only replace the `transformers` model import with `elastic_models.transformers`. Below is an example for Llama-3.1-8B-Instruct (the prompt and generation settings are illustrative; the `mode` argument selects the elastic size):

```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# A Hugging Face token is required to download the original weights and config.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
hf_token = "<YOUR_HF_TOKEN>"
device = torch.device("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode="S",  # elastic size: "S", "M", "L", or "XL"
).to(device)

# Illustrative chat-style prompt.
prompt = "Describe the basics of DNN quantization."
messages = [{"role": "user", "content": prompt}]
chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
inputs = tokenizer(chat_prompt, return_tensors="pt").to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=300)

# Keep only the newly generated tokens.
input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```

## Quality Benchmarks

---

We have used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we have run the following tasks: MMLU, PIQA, Arc Challenge, and Winogrande.

![Quality Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422713-7d51617f-e70a-41db-95f9-abd0d9ff338f/Elastic_Llama_3.1_8B_Instruct_MMLU.png)

### Quality Benchmark Results

| **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
| --- | --- | --- | --- | --- | --- | --- |
| **MMLU** | 67.4 | 68.1 | 68.3 | 68.5 | 68.4 | 24.3 |
| **PIQA** | 79.8 | 80.2 | 80.1 | 79.9 | 80.0 | 64.6 |
| **Arc Challenge** | 55.1 | 54.6 | 54.7 | 55.6 | 55.5 | 29.6 |
| **Winogrande** | 73.7 | 73.6 | 73.7 | 74.0 | 74.0 | 62.8 |
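As a sanity check on the table, here is a throwaway sketch (scores copied from the table above, not part of the SDK) computing the worst-case drop of the S model relative to the original:

```python
# Scores copied from the quality benchmark table above.
original = {"MMLU": 68.4, "PIQA": 80.0, "Arc Challenge": 55.5, "Winogrande": 74.0}
s_model = {"MMLU": 67.4, "PIQA": 79.8, "Arc Challenge": 55.1, "Winogrande": 73.7}

# Largest absolute degradation of the S model across tasks, in accuracy points.
max_drop = round(max(original[k] - s_model[k] for k in original), 2)
print(max_drop)  # 1.0 (on MMLU), within the stated <2% budget for S
```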

## Datasets

---

- **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
- **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
- **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
- **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.

## Metrics

---

- **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.

## Latency Benchmarks

---

We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.

![Latency Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422728-414fdb22-5c04-44a6-8686-0602a4293e88/Elastic_Llama_3.1_8B_Instruct_latency.png)

### Latency Benchmark Results

Tokens per second for different model sizes on various GPUs.

| **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
| --- | --- | --- | --- | --- | --- | --- |
| **H100** | 189 | 168 | 156 | 134 | 60 | 191 |
| **L40s** | 72 | 63 | 56 | 45 | 37 | 77 |
| **B200** | 239 | 236 | 207 | 199 | 100 | N/A |
| **GeForce RTX 5090** | 143 | N/A | N/A | N/A | 60 | N/A |
| **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 41 | N/A |
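For context, the H100 row works out to the following speedups over the original model (a quick calculation from the table above, not part of the SDK):

```python
# H100 tokens-per-second figures from the latency table above.
h100 = {"S": 189, "M": 168, "L": 156, "XL": 134}
original_tps = 60

# Speedup of each elastic size relative to the original checkpoint.
speedups = {size: round(tps / original_tps, 2) for size, tps in h100.items()}
print(speedups)  # {'S': 3.15, 'M': 2.8, 'L': 2.6, 'XL': 2.23}
```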

## Benchmarking Methodology

---

The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.

> **Algorithm summary:**
> 1. Load the Llama-3.1-8B-Instruct model with the specified size (S, M, L, XL, or original).
> 2. Move the model to the GPU.
> 3. Prepare a sample prompt for text generation.
> 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
>    - Synchronize the GPU to flush any previous operations.
>    - Record the start time.
>    - Generate the text using the model.
>    - Synchronize the GPU again.
>    - Record the end time and calculate the TTFT and TPS for that iteration.
> 5. Calculate the average TTFT and TPS over all iterations.
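The loop above can be sketched as a small timing harness. `generate_fn` and `sync` are placeholders for the model's generate call and `torch.cuda.synchronize`; the names are illustrative, not part of the shipped benchmark code:

```python
import time
from statistics import mean

def benchmark_tps(generate_fn, num_output_tokens, iterations=10, sync=lambda: None):
    """Average tokens-per-second over several timed generations."""
    tps = []
    for _ in range(iterations):
        sync()                        # flush any pending GPU work
        start = time.perf_counter()
        generate_fn()                 # e.g. model.generate(**inputs, max_new_tokens=300)
        sync()                        # wait for generation to finish
        tps.append(num_output_tokens / (time.perf_counter() - start))
    return mean(tps)
```

On a real GPU run you would pass `sync=torch.cuda.synchronize`; measuring TTFT additionally requires timing the first token separately (for example via a streamer).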
 

## Serving with Docker Image

---

For serving on Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints. Using our containers, you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers. You can also use these containers to run inference through the TheStage AI platform.

### Prebuilt image from ECR

| **GPU** | **Docker image name** |
| --- | --- |
| H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
| B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |

Pull the Docker image for your Nvidia GPU and start the inference container:

```bash
docker pull <IMAGE_NAME>
```

```bash
docker run --rm -ti \
  --name serving_thestage_model \
  -p 8000:80 \
  -e AUTH_TOKEN=<AUTH_TOKEN> \
  -e MODEL_REPO=meta-llama/Llama-3.1-8B-Instruct \
  -e MODEL_SIZE=<MODEL_SIZE> \
  -e MODEL_BATCH=<MAX_BATCH_SIZE> \
  -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
  -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
  -v /mnt/hf_cache:/root/.cache/huggingface \
  <IMAGE_NAME>
```

| **Parameter** | **Description** |
| --- | --- |
| `<MODEL_SIZE>` | Available: S, M, L, XL. |
| `<MAX_BATCH_SIZE>` | Maximum batch size to process in parallel. |
| `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
| `<THESTAGE_ACCESS_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
| `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
| `<IMAGE_NAME>` | The image name you pulled, matching your GPU. |

## Invocation

---

You can invoke the endpoint using curl as follows; the bearer token must match the `AUTH_TOKEN` set at container startup:

```bash
curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
  -H 'Authorization: Bearer <AUTH_TOKEN>' \
  -H 'Content-Type: application/json' \
  -H 'X-Model-Name: llama-3-1-8b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged' \
  -d '{
    "messages": [{"role": "user", "content": "Define AI"}]
  }'
```

Or using the OpenAI Python client:

```python
from openai import OpenAI

BASE_URL = "http://<your_ip>:8000/v1"
API_KEY = "<AUTH_TOKEN>"
MODEL = "llama-3-1-8b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged"

client = OpenAI(
    api_key=API_KEY,
    base_url=BASE_URL,
    default_headers={"X-Model-Name": MODEL}
)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "Define AI"}
    ]
)

print(response.choices[0].message.content)
```

## Endpoint Parameters

---

### Method

> **POST** `/v1/chat/completions`

### Header Parameters

> `Authorization`: `string`
>
> Bearer token for authentication. Must match the `AUTH_TOKEN` set during container startup.

> `Content-Type`: `string`
>
> Must be set to `application/json`.

> `X-Model-Name`: `string`
>
> Specifies the model to use for generation. Format: `llama-3-1-8b-instruct-<size>-bs<batch_size>-paged`, where `<size>` is one of `S`, `M`, `L`, `XL`, or `original`, and `<batch_size>` is the maximum batch size configured during container startup.

### Input Body

> `messages`: `array`
>
> The list of chat messages; each entry is an object with `role` and `content` fields.
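Putting the pieces together, a request can be assembled like this (the token, size, and batch values are placeholders for your deployment settings):

```python
# Placeholder deployment values; substitute your own AUTH_TOKEN, size, and batch.
AUTH_TOKEN = "my-secret-token"
size, batch = "S", 4

# Headers and body follow the endpoint parameters described above.
headers = {
    "Authorization": f"Bearer {AUTH_TOKEN}",
    "Content-Type": "application/json",
    "X-Model-Name": f"llama-3-1-8b-instruct-{size}-bs{batch}-paged",
}
body = {"messages": [{"role": "user", "content": "Define AI"}]}
print(headers["X-Model-Name"])  # llama-3-1-8b-instruct-S-bs4-paged
```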
325
 
 
 
 
 

## Deploy on Modal

---

For more details, see the tutorial [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html).

### Clone the Modal serving code

```shell
git clone https://github.com/TheStageAI/ElasticModels.git
cd ElasticModels/examples/modal
```

### Configure environment variables

Set your environment variables in `modal_serving.py`:

```python
# modal_serving.py

ENVS = {
    "MODEL_REPO": "meta-llama/Llama-3.1-8B-Instruct",
    "MODEL_BATCH": "4",
    "THESTAGE_AUTH_TOKEN": "",
    "HUGGINGFACE_ACCESS_TOKEN": "",
    "PORT": "80",
    "PORT_HEALTH": "80",
    "HF_HOME": "/cache/huggingface",
}
```

### Configure GPUs

Set your desired GPU type and autoscaling parameters in `modal_serving.py`:

```python
# modal_serving.py

@app.function(
    image=image,
    gpu="B200",
    min_containers=8,
    max_containers=8,
    timeout=10000,
    ephemeral_disk=600 * 1024,
    volumes={"/opt/project/.cache": HF_CACHE},
    startup_timeout=60 * 20,
)
@modal.web_server(
    80,
    label="meta-llama/Llama-3.1-8B-Instruct-test",
    startup_timeout=60 * 20,
)
def serve():
    pass
```

### Run serving

```shell
modal serve modal_serving.py
```

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai