hinairo committed
Commit 069c1d3 · verified · 1 Parent(s): e1adbbf

Update README.md

Files changed (1): README.md +272 -80
README.md CHANGED
@@ -1,56 +1,64 @@
---
- license: apache-2.0
base_model:
- - meta-llama/Meta-Llama-3.1-8B-Instruct
- base_model_relation: quantized
pipeline_tag: text-generation
language:
- - zho
- - eng
- - fra
- - spa
- - por
- - deu
- - ita
- - rus
- - jpn
- - kor
- - vie
- - tha
- - ara
---

- # Elastic model: Meta-Llama-3.1-8B-Instruct. Fastest and most flexible models for self-serving.

- Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

- * __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
- * __L__: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
- * __M__: Faster model, with accuracy degradation less than 1.5%.
- * __S__: The fastest model, with accuracy degradation less than 2%.

- __Goals of elastic models:__

- * Provide flexibility in cost vs quality selection for inference
- * Provide clear quality and latency benchmarks
- * Provide interface of HF libraries: transformers and diffusers with a single line of code
- * Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
- * Provide the best models and service for self-hosting.

- > It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.

- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/pKc4jGGKTrp7ecawPbZq-.png)

- -----

- ## Inference

- To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:

```python
import torch
@@ -69,7 +77,7 @@ tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
-     model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
@@ -104,7 +112,7 @@ input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
-     skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

@@ -113,77 +121,261 @@ print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```

- __System requirements:__
- * GPUs: H100, L40s, 5090, 4090
- * CPU: AMD, Intel
- * Python: 3.10-3.12

- To work with our models just run these lines in your terminal:

- ```shell
- pip install thestage
- pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
- pip install flash_attn==2.7.3 --no-build-isolation

- # or for blackwell support
- pip install 'thestage-elastic-models[blackwell]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
- pip install torch==2.7.0+cu128 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
- # please download the appropriate version of wheels for your system from https://github.com/Zarrac/flashattention-blackwell-wheels-whl-ONLY-5090-5080-5070-5060-flash-attention-/releases/tag/FlashAttention
- mv flash_attn-2.7.4.post1-rtx5090-torch2.7.0cu128cxx11abiTRUE-cp311-linux_x86_64.whl flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
- pip install flash_attn-2.7.4.post1-0rtx5090torch270cu128cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

- pip uninstall apex
```

- Then go to [app.thestage.ai](https://app.thestage.ai), log in and generate an API token from your profile page. Set up the API token as follows:

- ```shell
- thestage config set --api-token <YOUR_API_TOKEN>
```

- Congrats, now you can use accelerated models!

- ----

- ## Benchmarks

- Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column indicates that we applied W8A8 quantization with the int8 data type to all linear layers and used the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!

- ### Quality benchmarks

- <!-- For quality evaluation we have used: #TODO link to github -->

- | Metric/Model | S | M | L | XL | Original | W8A8, int8 |
- |---------------|---|---|---|----|----------|------------|
- | MMLU | 65.8 | 66.8 | 67.5 | 68.2 | 68.2 | 24.3 |
- | PIQA | 77.6 | 79.3 | 79.8 | 79.8 | 79.8 | 64.6 |
- | Arc Challenge | 50.7 | 50.3 | 52.3 | 51.7 | 51.7 | 29.6 |
- | Winogrande | 72.5 | 72 | 73.3 | 73.9 | 73.9 | 62.8 |

- * **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows model's ability to handle diverse academic topics.
- * **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows model's understanding of real-world physics concepts.
- * **Arc Challenge**: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows model's ability to solve complex reasoning tasks.
- * **Winogrande**: Evaluates commonsense reasoning through sentence completion tasks. Shows model's capability to understand context and resolve ambiguity.

- ### Latency benchmarks

- __100 input / 300 output; tok/s:__

- | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
- |-----------|-----|---|---|----|----------|------------|
- | H100 | 189 | 175 | 159 | 132 | 60 | 191 |
- | L40s | 73 | 64 | 57 | 45 | 40 | 77 |
- | 5090 | 145 | - | - | - | - | - |
- | 4090 | 95 | - | - | - | - | - |

  ## Links

- * __Platform__: [app.thestage.ai](app.thestage.ai)
- <!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai

---
base_model:
+ - meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
language:
+ - en
---

+ # Elastic model: Llama-3.1-8B-Instruct

+ ## Overview

+ ----

+ ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we have produced a series of optimized models:

+ - **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
+ - **L**: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.
+ - **M**: Faster model, with accuracy degradation of less than 1.5%.
+ - **S**: The fastest model, with accuracy degradation of less than 2%.

+ Models can be accessed via the TheStage AI Python SDK, ElasticModels, or deployed as Docker containers with REST API endpoints (see the serving and deployment sections below).

+ ## Installation

+ ---

+ ### System Requirements

+ | **Property** | **Value** |
+ | --- | --- |
+ | **GPU** | L40s, RTX 5090, H100, B200 |
+ | **Python Version** | 3.10-3.12 |
+ | **CPU** | Intel/AMD x86_64 |
+ | **CUDA Version** | 12.9+ |

+ ### TheStage AI Access token setup

+ Install the TheStage AI CLI and set up your access token:

+ ```bash
+ pip install thestage
+ thestage config set --access-token <YOUR_ACCESS_TOKEN>
+ ```

+ ### ElasticModels installation

+ Install the TheStage Elastic Models package:

+ ```bash
+ pip install 'thestage-elastic-models[nvidia,cudnn]' \
+   --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
+ pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
+ ```
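+
+ As a quick sanity check after installation, you can try importing the package. This is a hedged sketch: the `elastic_models` import name comes from the usage example below, and the `__version__` attribute is an assumption.
+
+ ```python
+ # Hypothetical post-install check; assumes the package exposes __version__.
+ import elastic_models
+ print(getattr(elastic_models, "__version__", "installed"))
+ ```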

+ ## Usage example

+ ----

+ Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Llama-3.1-8B-Instruct model:

```python
import torch

    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
+     model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",

generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
+     skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(f"# A:\n{output}\n")
```
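+
+ For reference, a complete, minimal version of this example is sketched below. It is a sketch, assuming `elastic_models.transformers` mirrors the `transformers` API and selects the model size via a `mode` argument, as in the fragments above; `hf_token` is a placeholder for your Hugging Face token.
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer
+ from elastic_models.transformers import AutoModelForCausalLM
+
+ # Placeholders: supply your own token.
+ hf_token = "<HF_TOKEN>"
+ model_name = "meta-llama/Llama-3.1-8B-Instruct"
+ device = torch.device("cuda")
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     token=hf_token,
+     torch_dtype=torch.bfloat16,
+     attn_implementation="sdpa",
+     mode="S",  # assumed size selector: S, M, L, or XL
+ ).to(device)
+
+ prompt = "Describe the basics of DNN quantization."
+ messages = [{"role": "user", "content": prompt}]
+ inputs = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
+ ).to(device)
+
+ with torch.inference_mode():
+     generate_ids = model.generate(**inputs, max_new_tokens=300)
+
+ input_len = inputs["input_ids"].shape[1]
+ generate_ids = generate_ids[:, input_len:]
+ output = tokenizer.batch_decode(
+     generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )[0]
+
+ print(f"# Q:\n{prompt}\n")
+ print(f"# A:\n{output}\n")
+ ```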

+ ## Quality Benchmarks

+ ------------

+ We have used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we have run the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.
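+
+ A minimal sketch of driving the same tasks with `lm_eval` from Python is shown below; the `model_args` string and batch size are illustrative assumptions, not the exact configuration used for the numbers reported here.
+
+ ```python
+ # Illustrative lm-evaluation-harness run over the four tasks listed above.
+ from lm_eval import simple_evaluate
+
+ results = simple_evaluate(
+     model="hf",
+     model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",
+     tasks=["mmlu", "piqa", "arc_challenge", "winogrande"],
+     batch_size=8,  # assumed; adjust to your GPU
+ )
+ print(results["results"])
+ ```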

+ ![Quality Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422713-7d51617f-e70a-41db-95f9-abd0d9ff338f/Elastic_Llama_3.1_8B_Instruct_MMLU.png)

+ ### Quality Benchmark Results

+ | **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** |
+ | --- | --- | --- | --- | --- | --- |
+ | **MMLU** | 67.4 | 68.1 | 68.3 | 68.5 | 68.4 |
+ | **PIQA** | 79.8 | 80.2 | 80.1 | 79.9 | 80.0 |
+ | **Arc Challenge** | 55.1 | 54.6 | 54.7 | 55.6 | 55.5 |
+ | **Winogrande** | 73.7 | 73.6 | 73.7 | 74.0 | 74.0 |

+ ## Datasets

+ -------

+ - **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
+ - **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
+ - **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
+ - **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.

+ ## Metrics

+ ----------

+ - **Accuracy**: The proportion of model predictions that exactly match the correct answers across the evaluation tasks.
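+
+ As a concrete reading of this metric, a minimal sketch over toy data:
+
+ ```python
+ # Accuracy = exact matches / total examples (toy data for illustration).
+ predictions = ["A", "C", "B", "D"]
+ references = ["A", "C", "D", "D"]
+ accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
+ print(f"accuracy = {accuracy:.2f}")  # 0.75
+ ```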

+ ## Latency Benchmarks

+ -----

+ We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.

+ ![Latency Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422728-414fdb22-5c04-44a6-8686-0602a4293e88/Elastic_Llama_3.1_8B_Instruct_latency.png)

+ ### Latency Benchmark Results

+ Tokens per second for different model sizes on various GPUs.

+ | **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** |
+ | --- | --- | --- | --- | --- | --- |
+ | **H100** | 189 | 168 | 156 | 134 | 60 |
+ | **L40s** | 72 | 63 | 56 | 45 | 37 |
+ | **B200** | 239 | 236 | 207 | 199 | 100 |
+ | **GeForce RTX 5090** | 143 | N/A | N/A | N/A | 60 |
+ | **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 41 |

+ ## Benchmarking Methodology

+ ----

+ The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated. A minimal timing sketch follows the summary below.

+ > **Algorithm summary:**
+ > 1. Load the Llama-3.1-8B-Instruct model with the specified size (S, M, L, XL, original).
+ > 2. Move the model to the GPU.
+ > 3. Prepare a sample prompt for text generation.
+ > 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
+ >    - Synchronize the GPU to flush any previous operations.
+ >    - Record the start time.
+ >    - Generate the text using the model.
+ >    - Synchronize the GPU again.
+ >    - Record the end time and calculate the TTFT and TPS for that iteration.
+ > 5. Calculate the average TTFT and TPS over all iterations.
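+
+ This sketch assumes `model` and `inputs` come from the usage example above; it measures TPS only, computed from the number of newly generated tokens.
+
+ ```python
+ import time
+ import torch
+
+ def measure_tps(model, inputs, n_iters=10, max_new_tokens=300):
+     """Average tokens/second over n_iters generation runs (sketch)."""
+     tps = []
+     for _ in range(n_iters):
+         torch.cuda.synchronize()  # flush any previous GPU work
+         start = time.perf_counter()
+         out = model.generate(**inputs, max_new_tokens=max_new_tokens)
+         torch.cuda.synchronize()  # wait for generation to finish
+         elapsed = time.perf_counter() - start
+         new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
+         tps.append(new_tokens / elapsed)
+     return sum(tps) / len(tps)
+ ```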

+ ## Serving with Docker Image

+ ------------

+ For serving on Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
+ Using our containers, you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers.
+ You can also use this container to run inference through the TheStage AI platform.

+ ### Prebuilt image from ECR

+ | **GPU** | **Docker image name** |
+ | --- | --- |
+ | H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
+ | B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |

+ Pull the Docker image for your Nvidia GPU and start the inference container:

+ ```bash
+ docker pull <IMAGE_NAME>
+ ```
+ ```bash
+ docker run --rm -ti \
+   --name serving_thestage_model \
+   -p 8000:80 \
+   -e AUTH_TOKEN=<AUTH_TOKEN> \
+   -e MODEL_REPO=meta-llama/Llama-3.1-8B-Instruct \
+   -e MODEL_SIZE=<MODEL_SIZE> \
+   -e MODEL_BATCH=<MAX_BATCH_SIZE> \
+   -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
+   -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
+   -v /mnt/hf_cache:/root/.cache/huggingface \
+   <IMAGE_NAME>
```

+ | **Parameter** | **Description** |
+ | --- | --- |
+ | `<MODEL_SIZE>` | Available sizes: S, M, L, XL. |
+ | `<MAX_BATCH_SIZE>` | Maximum batch size to process in parallel. |
+ | `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
+ | `<THESTAGE_ACCESS_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
+ | `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
+ | `<IMAGE_NAME>` | The image name you pulled for your GPU. |

+ ## Invocation

+ ------

+ You can invoke the endpoint using `curl` as follows (here `123` is the example `AUTH_TOKEN`):

+ ```bash
+ curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
+   -H 'Authorization: Bearer 123' \
+   -H 'Content-Type: application/json' \
+   -H "X-Model-Name: llama-3-1-8b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged" \
+   -d '{
+     "messages":[{"role":"user","content":"Define AI"}]
+   }'
+ ```

+ Or using the OpenAI Python client:

+ ```python
+ from openai import OpenAI

+ BASE_URL = "http://<your_ip>:8000/v1"
+ API_KEY = "123"  # must match the AUTH_TOKEN set at container startup
+ MODEL = "llama-3-1-8b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged"

+ client = OpenAI(
+     api_key=API_KEY,
+     base_url=BASE_URL,
+     default_headers={"X-Model-Name": MODEL},
+ )

+ response = client.chat.completions.create(
+     model=MODEL,
+     messages=[
+         {"role": "user", "content": "Define AI"}
+     ],
+ )

+ print(response.choices[0].message.content)
```

+ ## Endpoint Parameters

+ -------------

+ ### Method

+ > **POST** `/v1/chat/completions`

+ ### Header Parameters

+ > `Authorization`: `string`
+ >
+ > Bearer token for authentication. Must match the `AUTH_TOKEN` set during container startup.

+ > `Content-Type`: `string`
+ >
+ > Must be set to `application/json`.

+ > `X-Model-Name`: `string`
+ >
+ > Specifies the model to use for generation. Format: `llama-3-1-8b-instruct-<size>-bs<batch_size>-paged`, where `<size>` is one of `S`, `M`, `L`, `XL`, `original` and `<batch_size>` is the maximum batch size configured during container startup. For example, with size `S` and batch size 4: `llama-3-1-8b-instruct-S-bs4-paged`.


+ ### Input Body

+ > `messages`: `array`
+ >
+ > The list of chat messages; each entry is an object with a `role` and `content`, as in the examples above.

+ ## Deploy on Modal

+ -----------------------

+ For more details, please see the tutorial: [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html).

+ ### Clone modal serving code

+ ```shell
+ git clone https://github.com/TheStageAI/ElasticModels.git
+ cd ElasticModels/examples/modal
+ ```

+ ### Configuration of environment variables

+ Set your environment variables in `modal_serving.py`:

+ ```python
+ # modal_serving.py

+ ENVS = {
+     "MODEL_REPO": "meta-llama/Llama-3.1-8B-Instruct",
+     "MODEL_BATCH": "4",
+     "THESTAGE_AUTH_TOKEN": "",
+     "HUGGINGFACE_ACCESS_TOKEN": "",
+     "PORT": "80",
+     "PORT_HEALTH": "80",
+     "HF_HOME": "/cache/huggingface",
+ }
+ ```

+ ### Configuration of GPUs

+ Set your desired GPU type and autoscaling variables in `modal_serving.py`:

+ ```python
+ # modal_serving.py

+ @app.function(
+     image=image,
+     gpu="B200",
+     min_containers=8,
+     max_containers=8,
+     timeout=10000,
+     ephemeral_disk=600 * 1024,
+     volumes={"/opt/project/.cache": HF_CACHE},
+     startup_timeout=60 * 20,
+ )
+ @modal.web_server(
+     80,
+     label="meta-llama/Llama-3.1-8B-Instruct-test",
+     startup_timeout=60 * 20,
+ )
+ def serve():
+     pass
+ ```

+ ### Run serving

+ ```shell
+ modal serve modal_serving.py
+ ```

## Links

+ * __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai