Update README.md (#8) by hinairo - opened

README.md (CHANGED)
- ara
---

# Elastic model: Qwen2.5-7B-Instruct

## Overview

ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency, and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we have produced a series of optimized variants:

- **XL**: A mathematically equivalent neural network, optimized with our DNN compiler.
- **L**: A near-lossless model, with less than 1% degradation on the corresponding benchmarks.
- **M**: A faster model, with accuracy degradation of less than 1.5%.
- **S**: The fastest model, with accuracy degradation of less than 2%.

Models can be accessed via the TheStage AI Python SDK, ElasticModels, or deployed as Docker containers with REST API endpoints (see the Serving with Docker Image section below).

---

## Installation

### System Requirements

| **Property** | **Value** |
| --- | --- |
| **GPU** | L40s, RTX 5090, H100, RTX 4090 |
| **Python Version** | 3.10-3.12 |
| **CPU** | Intel/AMD x86_64 |
| **CUDA Version** | 12.8+ |

### TheStage AI access token setup

Install the TheStage AI CLI and set up your API token:

```bash
pip install thestage
thestage config set --access-token <YOUR_ACCESS_TOKEN>
```

### ElasticModels installation

Install the TheStage Elastic Models package:

```bash
pip install 'thestage-elastic-models[nvidia,cudnn]' \
    --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
```

---

## Usage example

Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Qwen2.5-7B-Instruct model:
```python
import torch
from elastic_models.transformers import AutoModelForCausalLM

# Currently we require your HF token,
# as we use the original weights for part of the layers
# and the model configuration as well.
model_name = "Qwen/Qwen2.5-7B-Instruct"
hf_token = ''
device = torch.device("cuda")

# ...

print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```
---

## Quality Benchmarks

We used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we ran the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.



### Quality Benchmark Results

| **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8, int8** |
| --- | --- | --- | --- | --- | --- | --- |
| **Arc Challenge** | 54.2 | 55.2 | 55.3 | 54.9 | 54.7 | 41.7 |
| **MMLU** | 71.5 | 71.6 | 71.9 | 71.9 | 71.8 | 64.6 |
| **PIQA** | 78.3 | 79.9 | 79.5 | 79.5 | 79.6 | 67.1 |
| **Winogrande** | 70.4 | 70.3 | 71.5 | 70.4 | 71.0 | 53.1 |
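As a quick sanity check on the size guarantees stated in the Overview, the relative degradation of each compressed size versus the Original column can be recomputed directly from the table values (a small illustrative script, not part of the benchmark harness):

```python
# Relative accuracy degradation (%) of each Elastic size vs. the Original
# model, computed from the Quality Benchmark Results table above.
scores = {
    "Arc Challenge": {"S": 54.2, "M": 55.2, "L": 55.3, "XL": 54.9, "Original": 54.7},
    "MMLU":          {"S": 71.5, "M": 71.6, "L": 71.9, "XL": 71.9, "Original": 71.8},
    "PIQA":          {"S": 78.3, "M": 79.9, "L": 79.5, "XL": 79.5, "Original": 79.6},
    "Winogrande":    {"S": 70.4, "M": 70.3, "L": 71.5, "XL": 70.4, "Original": 71.0},
}

def degradation(task: str, size: str) -> float:
    """Percentage drop relative to the Original score (negative = improvement)."""
    orig = scores[task]["Original"]
    return (orig - scores[task][size]) / orig * 100

for size in ("S", "M", "L", "XL"):
    worst = max(degradation(task, size) for task in scores)
    print(f"{size}: worst-case degradation {worst:.2f}%")
```

For the S model, for example, the worst case is PIQA at roughly 1.6%, within the stated 2% budget.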
---

## Datasets

- **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
- **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
- **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
- **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.

---

## Metrics

- **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.

---

## Latency Benchmarks

We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.


### Latency Benchmark Results

Tokens per second for different model sizes on various GPUs.

| **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8_int8** |
| --- | --- | --- | --- | --- | --- | --- |
| **H100** | 184 | 177 | 157 | 138 | 62 | 201 |
| **L40s** | 72 | 67 | 57 | 48 | 42 | 78 |
| **B200** | 239 | 232 | 216 | 199 | 114 | N/A |
| **GeForce RTX 5090** | 141 | N/A | N/A | N/A | 66 | N/A |
| **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 45 | N/A |
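To put the table in perspective, the throughput gain of each size over the Original baseline is just a ratio of the columns. A short illustrative calculation over the GPUs with complete rows:

```python
# Speedup of each Elastic size over the Original model, taken from the
# Latency Benchmark Results table above (values are tokens/second).
# Only GPUs with a full row of measurements are included here.
tps = {
    "H100": {"S": 184, "M": 177, "L": 157, "XL": 138, "Original": 62},
    "L40s": {"S": 72,  "M": 67,  "L": 57,  "XL": 48,  "Original": 42},
    "B200": {"S": 239, "M": 232, "L": 216, "XL": 199, "Original": 114},
}

for gpu, row in tps.items():
    speedups = {s: round(row[s] / row["Original"], 2) for s in ("S", "M", "L", "XL")}
    print(gpu, speedups)
```

On an H100, for instance, the S model delivers roughly a 3x throughput gain over the unoptimized baseline.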
---

## Benchmarking Methodology

Benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.

> **Algorithm summary:**
> 1. Load the Qwen2.5-7B-Instruct model with the specified size (S, M, L, XL, or original).
> 2. Move the model to the GPU.
> 3. Prepare a sample prompt for text generation.
> 4. Run the model for a number of iterations (e.g., 10), measuring the time taken for each. On each iteration:
>    - Synchronize the GPU to flush any previous operations.
>    - Record the start time.
>    - Generate text with the model.
>    - Synchronize the GPU again.
>    - Record the end time and compute the TTFT and TPS for that iteration.
> 5. Average the TTFT and TPS over all iterations.
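The timing loop above can be sketched framework-agnostically. This is a minimal illustration, not the actual harness: `generate_fn` and `sync_fn` are placeholder hooks (on a GPU you would pass `model.generate` and `torch.cuda.synchronize`), and TTFT measurement is omitted for brevity:

```python
import time

def benchmark_tps(generate_fn, sync_fn=lambda: None, iterations=10, new_tokens=300):
    """Average tokens-per-second over several timed generations.

    generate_fn: callable producing `new_tokens` tokens (e.g. model.generate).
    sync_fn: device synchronization hook (e.g. torch.cuda.synchronize).
    """
    tps_samples = []
    for _ in range(iterations):
        sync_fn()                      # flush any pending device work
        start = time.perf_counter()
        generate_fn()                  # the timed generation call
        sync_fn()                      # wait for the device to finish
        elapsed = time.perf_counter() - start
        tps_samples.append(new_tokens / elapsed)
    return sum(tps_samples) / len(tps_samples)
```

Synchronizing both before and after the timed call is what makes the wall-clock measurement meaningful on an asynchronous device.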
---
## Serving with Docker Image

For serving on Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints. Using our containers, you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers. You can also use this container to run inference through the TheStage AI platform.

### Prebuilt image from ECR

| **GPU** | **Docker image name** |
| --- | --- |
| H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
| B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |

Pull the Docker image for your Nvidia GPU and start the inference container:

```bash
docker pull <IMAGE_NAME>
```
```bash
docker run --rm -ti \
  --name serving_thestage_model \
  -p 8000:80 \
  -e AUTH_TOKEN=<AUTH_TOKEN> \
  -e MODEL_REPO=Qwen/Qwen2.5-7B-Instruct \
  -e MODEL_SIZE=<MODEL_SIZE> \
  -e MODEL_BATCH=<MAX_BATCH_SIZE> \
  -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
  -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
  -v /mnt/hf_cache:/root/.cache/huggingface \
  <IMAGE_NAME>
```

| **Parameter** | **Description** |
| --- | --- |
| `<MODEL_SIZE>` | Available: S, M, L, XL. |
| `<MAX_BATCH_SIZE>` | Maximum batch size to process in parallel. |
| `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
| `<THESTAGE_ACCESS_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
| `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
| `<IMAGE_NAME>` | The image name you pulled for your GPU (see the table above). |
---

## Invocation

You can invoke the endpoint using curl as follows (the Bearer token `123` here must match the `AUTH_TOKEN` set at container startup):

```bash
curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
  -H 'Authorization: Bearer 123' \
  -H 'Content-Type: application/json' \
  -H "X-Model-Name: qwen-2-5-7b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged" \
  -d '{
    "messages": [{"role": "user", "content": "Define AI"}]
  }'
```

Or using the OpenAI Python client:

```python
from openai import OpenAI

BASE_URL = "http://<your_ip>/v1"
API_KEY = "123"
MODEL = "qwen-2-5-7b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged"

client = OpenAI(
    api_key=API_KEY,
    base_url=BASE_URL,
    default_headers={"X-Model-Name": MODEL},
)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "Define AI"}
    ],
)

print(response.choices[0].message.content)
```
---

## Endpoint Parameters

### Method

> **POST** `/v1/chat/completions`

### Header Parameters

> `Authorization`: `string`
>
> Bearer token for authentication. Must match the `AUTH_TOKEN` set during container startup.

> `Content-Type`: `string`
>
> Must be set to `application/json`.

> `X-Model-Name`: `string`
>
> Specifies the model to use for generation. Format: `qwen-2-5-7b-instruct-<size>-bs<batch_size>-paged`, where `<size>` is one of `S`, `M`, `L`, `XL`, `original` and `<batch_size>` is the maximum batch size configured during container startup.
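The header value can also be assembled programmatically. The helper below is a hypothetical convenience, not part of any SDK; the `-paged` suffix follows the invocation examples earlier, and the exact casing expected for `<size>` may be deployment-specific:

```python
def model_header_name(size: str, batch_size: int, paged: bool = True) -> str:
    """Assemble the X-Model-Name header value.

    Mirrors the documented format qwen-2-5-7b-instruct-<size>-bs<batch_size>;
    the invocation examples in this README append a "-paged" suffix.
    """
    name = f"qwen-2-5-7b-instruct-{size}-bs{batch_size}"
    return f"{name}-paged" if paged else name

print(model_header_name("S", 4))  # qwen-2-5-7b-instruct-S-bs4-paged
```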
### Input Body

> `messages`: `array`
>
> The list of chat messages; each message is an object with `role` and `content` fields, as in the invocation examples above. The user message carries the input text prompt.
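A minimal request body matching this schema and the curl example above can be serialized like so (illustrative only):

```python
import json

# Minimal chat-completions request body for this endpoint:
# a list of role/content message objects.
body = {"messages": [{"role": "user", "content": "Define AI"}]}
payload = json.dumps(body)
print(payload)
```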
---
## Deploy on Modal

For more details, please see the [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html) tutorial.

### Clone the Modal serving code

```shell
git clone https://github.com/TheStageAI/ElasticModels.git
cd ElasticModels/examples/modal
```

### Configuration of environment variables

Set your environment variables in `modal_serving.py`:

```python
# modal_serving.py

ENVS = {
    "MODEL_REPO": "Qwen/Qwen2.5-7B-Instruct",
    "MODEL_BATCH": "4",
    "THESTAGE_AUTH_TOKEN": "",
    "HUGGINGFACE_ACCESS_TOKEN": "",
    "PORT": "80",
    "PORT_HEALTH": "80",
    "HF_HOME": "/cache/huggingface",
}
```

### Configuration of GPUs

Set your desired GPU type and autoscaling variables in `modal_serving.py`:

```python
# modal_serving.py

@app.function(
    image=image,
    gpu="B200",
    min_containers=8,
    max_containers=8,
    timeout=10000,
    ephemeral_disk=600 * 1024,
    volumes={"/opt/project/.cache": HF_CACHE},
    startup_timeout=60 * 20,
)
@modal.web_server(
    80,
    label="Qwen/Qwen2.5-7B-Instruct-test",
    startup_timeout=60 * 20,
)
def serve():
    pass
```

### Run serving

```shell
modal serve modal_serving.py
```
## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai