Update README.md

#18
by hinairo - opened

Files changed (1): README.md (+273 −58)

README.md CHANGED
@@ -3,7 +3,7 @@ license: apache-2.0
  base_model:
  - mistralai/Mistral-7B-Instruct-v0.3
  base_model_relation: quantized
- pipeline_tag: text2text-generation
  language:
  - zho
  - eng
@@ -20,38 +20,59 @@ language:
  - ara
  ---

- # Elastic model: Mistral-7B-Instruct-v0.3. Fastest and most flexible models for self-serving.

- Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

- * __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

- * __L__: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.

- * __M__: Faster model, with accuracy degradation less than 1.5%.

- * __S__: The fastest model, with accuracy degradation less than 2%.

- __Goals of elastic models:__

- * Provide flexibility in cost vs quality selection for inference
- * Provide clear quality and latency benchmarks
- * Provide interface of HF libraries: transformers and diffusers with a single line of code
- * Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
- * Provide the best models and service for self-hosting.

- > It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.

- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/V8hpZ-cA9vE5Ijyodp6Ih.png)

- -----

- ## Inference

- To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:

  ```python
  import torch
@@ -70,7 +91,7 @@ tokenizer = AutoTokenizer.from_pretrained(
  model_name, token=hf_token
  )
  model = AutoModelForCausalLM.from_pretrained(
- model_name,
  token=hf_token,
  torch_dtype=torch.bfloat16,
  attn_implementation="sdpa",
@@ -105,7 +126,7 @@ input_len = inputs['input_ids'].shape[1]
  generate_ids = generate_ids[:, input_len:]
  output = tokenizer.batch_decode(
  generate_ids,
- skip_special_tokens=True,
  clean_up_tokenization_spaces=False
  )[0]
@@ -114,66 +135,260 @@ print(f"# Q:\n{prompt}\n")
  print(f"# A:\n{output}\n")
  ```

- __System requirements:__
- * GPUs: H100, L40s
- * CPU: AMD, Intel
- * Python: 3.10-3.12

- To work with our models just run these lines in your terminal:

- ```shell
- pip install thestage
- pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
- pip install flash_attn==2.7.3 --no-build-isolation
- pip uninstall apex
  ```
- Then go to [app.thestage.ai](https://app.thestage.ai), login and generate API token from your profile page. Set up API token as follows:

- ```shell
- thestage config set --api-token <YOUR_API_TOKEN>
  ```
- Congrats, now you can use accelerated models!

- ----

- ## Benchmarks

- Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8 column` indicates that we applied W8A8 quantization with int8 data type to all linear layers and used the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!

- ### Quality benchmarks

- <!-- For quality evaluation we have used: #TODO link to github -->

- | Metric/Model | S | M | L | XL | Original | W8A8, int8 |
- |---------------|---|---|---|----|----------|------------|
- | MMLU | 59.7 | 60.1 | 60.8 | 61.4 | 61.4 | 28 |
- | PIQA | 80.8 | 82 | 81.7 | 81.5 | 81.5 | 65.3 |
- | Arc Challenge | 56.6 | 55.1 | 56.8 | 57.4 | 57.4 | 33.2 |
- | Winogrande | 73.2 | 72.3 | 73.2 | 74.1 | 74.1 | 57 |

- * **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows model's ability to handle diverse academic topics.
- * **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows model's understanding of real-world physics concepts.
- * **Arc Challenge**: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows model's ability to solve complex reasoning tasks.
- * **Winogrande**: Evaluates commonsense reasoning through sentence completion tasks. Shows model's capability to understand context and resolve ambiguity.

- ### Latency benchmarks

- __100 input/300 output; tok/s:__

- | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
- |-----------|-----|---|---|----|----------|------------|
- | H100 | 186 | 180 | 168 | 136 | 48 | 192 |
- | L40s | 79 | 68 | 59 | 47 | 38 | 82 |
  ## Links

- * __Platform__: [app.thestage.ai](app.thestage.ai)
- <!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
  * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
- * __Contact email__: contact@thestage.ai
  base_model:
  - mistralai/Mistral-7B-Instruct-v0.3
  base_model_relation: quantized
+ pipeline_tag: text-generation
  language:
  - zho
  - eng

  - ara
  ---

+ # Elastic model: Mistral-7B-Instruct-v0.3

+ ## Overview

+ ---

+ ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each base model, we produce a series of optimized models:

+ - **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
+ - **L**: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
+ - **M**: Faster model, with accuracy degradation less than 1.5%.
+ - **S**: The fastest model, with accuracy degradation less than 2%.
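The size tiers above trade speed for a bounded accuracy loss. As a purely illustrative sketch (this helper is hypothetical, not part of the SDK), picking the fastest tier within a degradation budget could look like:

```python
# Hypothetical helper: choose the fastest Elastic size whose worst-case
# accuracy degradation (per the tier list above, in percent) fits a budget.
DEGRADATION = {"XL": 0.0, "L": 1.0, "M": 1.5, "S": 2.0}

def pick_size(max_degradation_pct: float) -> str:
    # Sizes ordered fastest -> slowest; return the first one within budget.
    for size in ["S", "M", "L", "XL"]:
        if DEGRADATION[size] <= max_degradation_pct:
            return size
    return "XL"  # XL is mathematically equivalent, so it always qualifies

print(pick_size(1.2))  # → L
```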

+ Models can be accessed via the TheStage AI Python SDK (ElasticModels) or deployed as Docker containers with REST API endpoints (see the Deploy section).

+ ## Installation

+ ---

+ ### System Requirements

+ | **Property** | **Value** |
+ | --- | --- |
+ | **GPU** | H100, L40s, B200, RTX 5090 |
+ | **Python Version** | 3.10-3.12 |
+ | **CPU** | Intel/AMD x86_64 |
+ | **CUDA Version** | 12.9+ |

+ ### TheStage AI Access token setup

+ Install the TheStage AI CLI and set up your API token:

+ ```bash
+ pip install thestage
+ thestage config set --access-token <YOUR_ACCESS_TOKEN>
+ ```

+ ### ElasticModels installation

+ Install the TheStage Elastic Models package:

+ ```bash
+ pip install 'thestage-elastic-models[nvidia,cudnn]' \
+   --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
+ pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
+ ```

+ ## Usage example

+ ---

+ Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Mistral-7B-Instruct-v0.3 model:

  ```python
  import torch

  model_name, token=hf_token
  )
  model = AutoModelForCausalLM.from_pretrained(
+ model_name,
  token=hf_token,
  torch_dtype=torch.bfloat16,
  attn_implementation="sdpa",

  generate_ids = generate_ids[:, input_len:]
  output = tokenizer.batch_decode(
  generate_ids,
+ skip_special_tokens=True,
  clean_up_tokenization_spaces=False
  )[0]

  print(f"# A:\n{output}\n")
  ```

+ ## Quality Benchmarks

+ ---

+ We used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we ran the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.

+ ![Quality Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422657-7bb353b4-5d79-4bbf-aacb-654b7d7a7bcb/Elastic_Mistral_7B_Instruct_v0.3_MMLU.png)

+ ### Quality Benchmark Results

+ | **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** |
+ | --- | --- | --- | --- | --- | --- |
+ | **MMLU** | 59.2 | 59.6 | 59.6 | 59.8 | 59.8 |
+ | **PIQA** | 81.3 | 81.3 | 81.9 | 81.9 | 82.0 |
+ | **Arc Challenge** | 59.6 | 60.4 | 59.5 | 60.3 | 59.7 |
+ | **Winogrande** | 75.2 | 76.1 | 75.3 | 74.8 | 74.8 |

+ ## Datasets

+ ---

+ - **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
+ - **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
+ - **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
+ - **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.

+ ## Metrics

+ ---

+ - **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.

+ ## Latency Benchmarks

+ ---

+ We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.

+ ![Latency Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422671-1ddedb17-7bc7-45e2-b285-4d3ef4212af0/Elastic_Mistral_7B_Instruct_v0.3_latency.png)

+ ### Latency Benchmark Results

+ Tokens per second for different model sizes on various GPUs.

+ | **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** |
+ | --- | --- | --- | --- | --- | --- |
+ | **H100** | 203 | 188 | 173 | 144 | 60 |
+ | **L40s** | 77 | 68 | 60 | 48 | 39 |
+ | **B200** | 268 | 263 | 235 | 219 | 104 |
+ | **GeForce RTX 5090** | 155 | N/A | N/A | N/A | 74 |

+ ## Benchmarking Methodology

+ ---

+ The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.

+ > **Algorithm summary:**
+ > 1. Load the Mistral-7B-Instruct-v0.3 model with the specified size (S, M, L, XL, original).
+ > 2. Move the model to the GPU.
+ > 3. Prepare a sample prompt for text generation.
+ > 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
+ >    - Synchronize the GPU to flush any previous operations.
+ >    - Record the start time.
+ >    - Generate the text using the model.
+ >    - Synchronize the GPU again.
+ >    - Record the end time and calculate the TTFT and TPS for that iteration.
+ > 5. Calculate the average TTFT and TPS over all iterations.
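The aggregation in steps 4-5 reduces to simple arithmetic; a minimal sketch (the per-iteration timings here are made-up placeholders; on real hardware they would come from timing `model.generate()` between GPU synchronizations as described above):

```python
def tokens_per_second(new_tokens: int, elapsed_s: float) -> float:
    # TPS for one iteration: generated tokens / wall-clock seconds.
    return new_tokens / elapsed_s

def average_tps(per_iteration: list[float]) -> float:
    # Step 5: average TPS over all benchmark iterations.
    return sum(per_iteration) / len(per_iteration)

# Hypothetical wall-clock times (seconds) for 10 iterations of 300 output tokens.
timings = [2.0, 1.9, 2.1, 2.0, 2.0, 1.9, 2.1, 2.0, 2.0, 2.0]
tps = [tokens_per_second(300, t) for t in timings]
print(round(average_tps(tps), 1))  # → 150.2
```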

+ ## Serving with Docker Image

+ ---

+ For serving on NVIDIA GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
+ Using our containers, you can set up an inference endpoint on any cloud or serverless provider as well as on on-premise servers.
+ You can also use this container to run inference through the TheStage AI platform.

+ ### Prebuilt image from ECR

+ | **GPU** | **Docker image name** |
+ | --- | --- |
+ | H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
+ | B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |

+ Pull the Docker image for your NVIDIA GPU and start the inference container:

+ ```bash
+ docker pull <IMAGE_NAME>
+ ```
+ ```bash
+ docker run --rm -ti \
+   --name serving_thestage_model \
+   -p 8000:80 \
+   -e AUTH_TOKEN=<AUTH_TOKEN> \
+   -e MODEL_REPO=mistralai/Mistral-7B-Instruct-v0.3 \
+   -e MODEL_SIZE=<MODEL_SIZE> \
+   -e MODEL_BATCH=<MAX_BATCH_SIZE> \
+   -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
+   -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
+   -v /mnt/hf_cache:/root/.cache/huggingface \
+   <IMAGE_NAME>
  ```

+ | **Parameter** | **Description** |
+ | --- | --- |
+ | `<MODEL_SIZE>` | Available: S, M, L, XL. |
+ | `<MAX_BATCH_SIZE>` | Maximum batch size to process in parallel. |
+ | `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
+ | `<THESTAGE_ACCESS_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
+ | `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
+ | `<IMAGE_NAME>` | The image name you pulled for your GPU. |

+ ## Invocation

+ ---

+ You can invoke the endpoint using curl as follows:

+ ```bash
+ curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
+   -H 'Authorization: Bearer 123' \
+   -H 'Content-Type: application/json' \
+   -H "X-Model-Name: mistral-7b-instruct-v0-3-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged" \
+   -d '{
+     "messages":[{"role":"user","content":"Define AI"}]
+   }'
+ ```

+ Or using the OpenAI Python client:

+ ```python
+ from openai import OpenAI
+
+ BASE_URL = "http://<your_ip>/v1"
+ API_KEY = "123"
+ MODEL = "mistral-7b-instruct-v0-3-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged"
+
+ client = OpenAI(
+     api_key=API_KEY,
+     base_url=BASE_URL,
+     default_headers={"X-Model-Name": MODEL}
+ )
+
+ response = client.chat.completions.create(
+     model=MODEL,
+     messages=[
+         {"role": "user", "content": "Define AI"}
+     ]
+ )
+
+ print(response.choices[0].message.content)
  ```

+ ## Endpoint Parameters

+ ---

+ ### Method

+ > **POST** `/v1/chat/completions`

+ ### Header Parameters

+ > `Authorization`: `string`
+ >
+ > Bearer token for authentication. Must match the `AUTH_TOKEN` set during container startup.

+ > `Content-Type`: `string`
+ >
+ > Must be set to `application/json`.

+ > `X-Model-Name`: `string`
+ >
+ > Specifies the model to use for generation. Format: `mistral-7b-instruct-v0-3-<size>-bs<batch_size>-paged`, where `<size>` is one of `S`, `M`, `L`, `XL`, `original` and `<batch_size>` is the maximum batch size configured during container startup.
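The header value can be assembled directly from the container settings; a tiny sketch (the size and batch values here are just example placeholders, matching the curl invocation above):

```python
size = "S"   # one of: S, M, L, XL, original (the MODEL_SIZE set at startup)
batch = 4    # the MAX_BATCH_SIZE set at startup

# Assemble the X-Model-Name header value.
model_name = f"mistral-7b-instruct-v0-3-{size}-bs{batch}-paged"
print(model_name)  # → mistral-7b-instruct-v0-3-S-bs4-paged
```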

+ ### Input Body

+ > `messages`: `array`
+ >
+ > The list of chat messages, each an object with `role` and `content` fields, as shown in the invocation examples.

+ ## Deploy on Modal

+ ---

+ For more details, please see the tutorial: [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html)

+ ### Clone modal serving code

+ ```shell
+ git clone https://github.com/TheStageAI/ElasticModels.git
+ cd ElasticModels/examples/modal
+ ```

+ ### Configuration of environment variables

+ Set your environment variables in `modal_serving.py`:

+ ```python
+ # modal_serving.py
+
+ ENVS = {
+     "MODEL_REPO": "mistralai/Mistral-7B-Instruct-v0.3",
+     "MODEL_BATCH": "4",
+     "THESTAGE_AUTH_TOKEN": "",
+     "HUGGINGFACE_ACCESS_TOKEN": "",
+     "PORT": "80",
+     "PORT_HEALTH": "80",
+     "HF_HOME": "/cache/huggingface",
+ }
+ ```

+ ### Configuration of GPUs

+ Set your desired GPU type and autoscaling variables in `modal_serving.py`:

+ ```python
+ # modal_serving.py
+
+ @app.function(
+     image=image,
+     gpu="B200",
+     min_containers=8,
+     max_containers=8,
+     timeout=10000,
+     ephemeral_disk=600 * 1024,
+     volumes={"/opt/project/.cache": HF_CACHE},
+     startup_timeout=60 * 20,
+ )
+ @modal.web_server(
+     80,
+     label="mistralai/Mistral-7B-Instruct-v0.3-test",
+     startup_timeout=60 * 20,
+ )
+ def serve():
+     pass
+ ```

+ ### Run serving

+ ```shell
+ modal serve modal_serving.py
+ ```

  ## Links

+ * __Platform__: [app.thestage.ai](https://app.thestage.ai)
  * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
+ * __Contact email__: contact@thestage.ai