Update README.md #2
by hypothetical

README.md CHANGED
@@ -7,8 +7,7 @@ pipeline_tag: text2text-generation

 # Elastic models

-Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator.
-ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

 * __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

@@ -25,10 +24,10 @@ __Goals of elastic models:__
 * Provide clear quality and latency benchmarks
 * Provide interface of HF libraries: transformers and diffusers with a single line of code
 * Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.

 > It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.
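The note above quotes degradation as a percentage. A minimal sketch of how such a figure could be computed, assuming it means the relative drop of a benchmark score against the original model (the README does not define it). `relative_degradation` and the example scores are hypothetical and are not measured results.

```python
def relative_degradation(original_score: float, optimized_score: float) -> float:
    """Return the relative quality drop in percent (positive means worse)."""
    if original_score <= 0:
        raise ValueError("original_score must be positive")
    return (original_score - optimized_score) / original_score * 100.0


# Hypothetical scores, for illustration only (not taken from any benchmark).
baseline_score = 0.800
s_model_score = 0.796
print(f"{relative_degradation(baseline_score, s_model_score):.2f}%")
```

Under this reading, a 0.5% degradation means the optimized model keeps 99.5% of the original score on that benchmark.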

-
 ## Inference

 To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:
@@ -38,21 +37,28 @@ import torch
 from transformers import AutoTokenizer
 from elastic_models.transformers import AutoModelForCausalLM

 model_name = "mistralai/Mistral-7B-Instruct-v0.3"
-
-
 device = torch.device("cuda")
-tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)

 model = AutoModelForCausalLM.from_pretrained(
     model_name,
-    token=
-    cache_dir=
     torch_dtype=torch.bfloat16,
     attn_implementation="sdpa"
 ).to(device)
 model.generation_config.pad_token_id = tokenizer.eos_token_id

 prompt = "Describe basics of DNNs quantization."
 inputs = tokenizer(prompt, return_tensors="pt")
 inputs.to(device)
@@ -65,12 +71,13 @@ output = tokenizer.batch_decode(
     skip_special_tokens=True,
     clean_up_tokenization_spaces=False
 )[0]
 print(f"# Q:\n{prompt}\n")
 print(f"# A:\n{output}\n")
 ```

-
-### System requirements

 __GPUs__: H100, L40s

@@ -78,17 +85,14 @@ __OS__: Linux #TODO

 __Python__: 3.10-3.12

-
----
-### Installation

 ```shell
 pip install thestage
 pip install elastic_models
 ```

-Then go to app.thestage.ai, login and generate API token from your profile page.
-Set up API token as follows:

 ```shell
 thestage config set --api-token <YOUR_API_TOKEN>
@@ -96,6 +100,7 @@ thestage config set --api-token <YOUR_API_TOKEN>

 Congrats, now you can use accelerated models!


 ## Benchmarks

@@ -113,7 +118,7 @@ For quality evaluation we have used: #TODO link to github
 | Winogrande | 0 | 0 | 0 | 0 | 0 | 0 |


-> __MMLU__: Evaluates/shows

 > __MMLU__: Evaluates/shows ...

@@ -121,28 +126,34 @@ For quality evaluation we have used: #TODO link to github

 > __PIQA__: Evaluates/shows ...

-
 ### Latency benchmarks

 We have profiled models in different scenarios:

-
 | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
 |-----------|-----|---|---|----|----------|------------|
 | H100 | 189 | 0 | 0 | 0 | 48 | 0 |
 | L40s | 79 | 0 | 0 | 0 | 42 | 0 |


-
 | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
 |-----------|-----|---|---|----|----------|------------|
 | H100 | 189 | 0 | 0 | 0 | 48 | 0 |
 | L40s | 79 | 0 | 0 | 0 | 42 | 0 |


 ## Links

 * __Platform__: [app.thestage.ai](app.thestage.ai)
 * __Elastic models Github__: [app.thestage.ai](app.thestage.ai)
 * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
-* __Contact email__: contact@thestage.ai
@@ -7,8 +7,7 @@ pipeline_tag: text2text-generation

 # Elastic models

+Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

 * __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

@@ -25,10 +24,10 @@ __Goals of elastic models:__
 * Provide clear quality and latency benchmarks
 * Provide interface of HF libraries: transformers and diffusers with a single line of code
 * Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
+* Provide the best models and service for self-hosting.

 > It's important to note that specific quality degradation can vary from model to model. For instance, with an S model, you can have 0.5% degradation as well.

 ## Inference

 To infer our models, you just need to replace `transformers` import with `elastic_models.transformers`:
@@ -38,21 +37,28 @@ import torch
 from transformers import AutoTokenizer
 from elastic_models.transformers import AutoModelForCausalLM

+# Currently we require your HF token
+# as we use original weights for part of the layers and
+# the model configuration as well
 model_name = "mistralai/Mistral-7B-Instruct-v0.3"
+hf_token = ''
+hf_cache_dir = ''
 device = torch.device("cuda")

+# Create model
+tokenizer = AutoTokenizer.from_pretrained(
+    model_name, token=hf_token
+)
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
+    token=hf_token,
+    cache_dir=hf_cache_dir,
     torch_dtype=torch.bfloat16,
     attn_implementation="sdpa"
 ).to(device)
 model.generation_config.pad_token_id = tokenizer.eos_token_id

+# Inference is as simple as with the transformers library
 prompt = "Describe basics of DNNs quantization."
 inputs = tokenizer(prompt, return_tensors="pt")
 inputs.to(device)
@@ -65,12 +71,13 @@ output = tokenizer.batch_decode(
     skip_special_tokens=True,
     clean_up_tokenization_spaces=False
 )[0]
+
+# Validate answer
 print(f"# Q:\n{prompt}\n")
 print(f"# A:\n{output}\n")
 ```
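The snippet above leaves `hf_token` and `hf_cache_dir` as empty strings. One way to avoid hard-coding them is sketched here, assuming the token is exported as `HF_TOKEN` and that an unset cache directory should fall back to the library default (`None`). `read_hf_settings` and the `ELASTIC_CACHE_DIR` variable are hypothetical names used only for illustration; they are not part of `elastic_models`.

```python
import os
from typing import Mapping, Optional, Tuple


def read_hf_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, Optional[str]]:
    """Read the Hugging Face token and an optional cache directory from the environment."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before loading the model")
    # ELASTIC_CACHE_DIR is a hypothetical variable name used only in this sketch.
    cache_dir = env.get("ELASTIC_CACHE_DIR") or None
    return token, cache_dir


if __name__ == "__main__":
    # Demo with a fake environment, so the sketch runs without real credentials.
    hf_token, hf_cache_dir = read_hf_settings({"HF_TOKEN": "hf_example"})
    print(bool(hf_token), hf_cache_dir)
```

The returned values could then be passed as `token=hf_token` and `cache_dir=hf_cache_dir` in the `from_pretrained` calls, assuming `elastic_models.transformers` accepts `cache_dir=None` the way `transformers` does.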

+### Installation

 __GPUs__: H100, L40s

@@ -78,17 +85,14 @@ __OS__: Linux #TODO

 __Python__: 3.10-3.12

+To work with our models, install the required packages:

 ```shell
 pip install thestage
 pip install elastic_models
 ```

+Then go to app.thestage.ai, log in and generate an API token from your profile page. Set up the API token as follows:

 ```shell
 thestage config set --api-token <YOUR_API_TOKEN>
@@ -96,6 +100,7 @@ thestage config set --api-token <YOUR_API_TOKEN>

 Congrats, now you can use accelerated models!

+----

 ## Benchmarks

@@ -113,7 +118,7 @@ For quality evaluation we have used: #TODO link to github
 | Winogrande | 0 | 0 | 0 | 0 | 0 | 0 |


+> __MMLU__: Evaluates/shows {MMLU}

 > __MMLU__: Evaluates/shows ...

@@ -121,28 +126,34 @@ For quality evaluation we have used: #TODO link to github

 > __PIQA__: Evaluates/shows ...

 ### Latency benchmarks

 We have profiled models in different scenarios:

+<table>
+<tr><th> 100 input/300 output; tok/s </th><th> 1000 input/1000 output; tok/s </th></tr>
+<tr><td>
+
 | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
 |-----------|-----|---|---|----|----------|------------|
 | H100 | 189 | 0 | 0 | 0 | 48 | 0 |
 | L40s | 79 | 0 | 0 | 0 | 42 | 0 |


+
+</td><td>
+
 | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
 |-----------|-----|---|---|----|----------|------------|
 | H100 | 189 | 0 | 0 | 0 | 48 | 0 |
 | L40s | 79 | 0 | 0 | 0 | 42 | 0 |

+</td></tr> </table>
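The tables report throughput in tokens per second for a given input/output length. A rough sketch of that arithmetic follows, assuming tok/s means generated tokens divided by wall-clock generation time (the README does not state how it was measured). `tokens_per_second` and `speedup` are hypothetical helpers, and the figures are the S and Original columns copied from the table above.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int]) -> float:
    """Time one call to `generate`, which must return the number of new tokens."""
    start = time.perf_counter()
    new_tokens = generate()
    elapsed = time.perf_counter() - start
    return new_tokens / elapsed


def speedup(optimized_tps: float, original_tps: float) -> float:
    """Ratio of optimized to original throughput."""
    return optimized_tps / original_tps


# S vs. Original throughput from the table above (the other columns are still 0).
table = {"H100": (189, 48), "L40s": (79, 42)}
for gpu, (s_tps, original_tps) in table.items():
    print(f"{gpu}: {speedup(s_tps, original_tps):.2f}x")
```

With a real model, `generate` could wrap the `model.generate(...)` call and return the number of tokens produced beyond the prompt; that wiring is left out here because it depends on the setup used for profiling.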
+

 ## Links

 * __Platform__: [app.thestage.ai](app.thestage.ai)
 * __Elastic models Github__: [app.thestage.ai](app.thestage.ai)
 * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
+* __Contact email__: contact@thestage.ai