psynote123 committed on
Commit
a63d1ec
·
verified ·
1 Parent(s): e28a350

Upload README.md with huggingface_hub

  1. README.md +179 -3
README.md CHANGED
@@ -1,3 +1,179 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ base_model:
+ - mistralai/Mistral-Nemo-Instruct-2407
+ base_model_relation: quantized
+ pipeline_tag: text2text-generation
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ ---
+
+ # Elastic model: Mistral-Nemo-Instruct-2407. Fastest and most flexible models for self-hosting.
+
+ Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you trade off model size, latency, and quality by moving a single slider. For each model, ANNA produces a series of optimized variants:
+
+ * __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.
+
+ * __L__: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.
+
+ * __M__: Faster model, with less than 1.5% accuracy degradation.
+
+ * __S__: The fastest model, with less than 2% accuracy degradation.
+
+
+ __Goals of elastic models:__
+
+ * Provide flexibility in choosing between cost and quality for inference.
+ * Provide clear quality and latency benchmarks.
+ * Provide the interface of the HF libraries `transformers` and `diffusers` with a single line of code.
+ * Provide pre-compiled models that run on a wide range of hardware and require no JIT.
+ * Provide the best models and service for self-hosting.
+
+ > Note that the actual quality degradation varies from model to model. For instance, an S model can show as little as 0.5% degradation.
+
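The tier descriptions above can be read as a simple lookup from a quality budget to a tier. The following is a minimal sketch, not part of `elastic_models`: `pick_mode` and both constants are hypothetical names, the bounds are the "less than" figures quoted in the list, and the ordering of L and XL by speed is an assumption (the README only states that S is the fastest and M is faster).

```python
# Upper bounds on accuracy degradation (percent) quoted in the tier list.
# XL is described as mathematically equivalent, so its bound is taken as 0.
DEGRADATION_BOUND = {"XL": 0.0, "L": 1.0, "M": 1.5, "S": 2.0}

# Assumed speed ordering, fastest first (the L vs XL order is an assumption).
FASTEST_FIRST = ["S", "M", "L", "XL"]


def pick_mode(max_degradation_pct: float) -> str:
    """Return the fastest tier whose quoted bound fits within the budget."""
    if max_degradation_pct < 0:
        raise ValueError("degradation budget must be non-negative")
    for mode in FASTEST_FIRST:
        if DEGRADATION_BOUND[mode] <= max_degradation_pct:
            return mode
    return "XL"  # not reached for non-negative budgets; kept as a safe default


if __name__ == "__main__":
    for budget in (0.0, 1.0, 1.5, 2.0):
        print(budget, pick_mode(budget))
```

The returned string is meant to be passed as the `mode` argument of `from_pretrained`; the README itself only shows `mode='S'`, so treating the other tier names as valid values is an assumption. Since the quoted bounds are not guarantees for every model, the per-model quality table remains the better source when it is available.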
+ ![Performance Graph](images/performance_graph.png)
+
+ -----
+
+ ## Inference
+
+ To run inference with our models, replace the `transformers` import with `elastic_models.transformers`:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer
+ from elastic_models.transformers import AutoModelForCausalLM
+
+ # An HF token is currently required because the original weights
+ # are used for some of the layers, and the original model
+ # configuration is loaded as well.
+ model_name = "mistralai/Mistral-Nemo-Instruct-2407"
+ hf_token = ''
+ device = torch.device("cuda")
+
+ # Create the tokenizer and the model
+ tokenizer = AutoTokenizer.from_pretrained(
+     model_name, token=hf_token
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     token=hf_token,
+     torch_dtype=torch.bfloat16,
+     attn_implementation="sdpa",
+     mode='S'
+ ).to(device)
+ model.generation_config.pad_token_id = tokenizer.eos_token_id
+
+ # Inference is as simple as with the transformers library
+ prompt = "Describe basics of DNNs quantization."
+ messages = [
+     {
+         "role": "system",
+         "content": "You are a search bot, answer on user text queries."
+     },
+     {
+         "role": "user",
+         "content": prompt
+     }
+ ]
+
+ chat_prompt = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, tokenize=False
+ )
+
+ inputs = tokenizer(chat_prompt, return_tensors="pt")
+ inputs = inputs.to(device)
+
+ with torch.inference_mode():
+     generate_ids = model.generate(**inputs, max_length=500)
+
+ # Keep only the newly generated tokens
+ input_len = inputs['input_ids'].shape[1]
+ generate_ids = generate_ids[:, input_len:]
+ output = tokenizer.batch_decode(
+     generate_ids,
+     skip_special_tokens=True,
+     clean_up_tokenization_spaces=False
+ )[0]
+
+ # Print the question and the answer
+ print(f"# Q:\n{prompt}\n")
+ print(f"# A:\n{output}\n")
+ ```
+
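The example above leaves `hf_token` as an empty string to be filled in. One way to avoid pasting the token into source code is to read it from the environment. This is a small sketch under stated assumptions: `read_hf_token` is a hypothetical helper, and storing the token in a variable named `HF_TOKEN` is an assumption about your setup, not something this README prescribes.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in `env_var`, or raise if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token"
        )
    return token
```

With this helper, the assignment in the example would become `hf_token = read_hf_token()`, and a missing token fails early with a clear message instead of during model download.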
+ __System requirements:__
+ * GPUs: H100, L40S
+ * CPU: AMD, Intel
+ * Python: 3.10-3.12
+
+
+ To work with our models, run these commands in your terminal:
+
+ ```shell
+ pip install thestage
+ pip install "elastic_models[nvidia]" \
+     --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
+     --extra-index-url https://pypi.nvidia.com \
+     --extra-index-url https://pypi.org/simple
+
+ pip install flash_attn==2.7.3 --no-build-isolation
+ pip uninstall apex
+ ```
+
+ Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token on your profile page. Set up the API token as follows:
+
+ ```shell
+ thestage config set --api-token <YOUR_API_TOKEN>
+ ```
+
+ Congrats, now you can use accelerated models!
+
+ ----
+
+ ## Benchmarks
+
+ Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column shows the result of applying W8A8 quantization with the int8 data type to all linear layers, using the same calibration data as for ANNA. The S model achieves practically identical speed but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!
+
+ ### Quality benchmarks
+
+ | Metric/Model | S | M | L | XL | Original | W8A8, int8 |
+ |---------------|---|---|---|----|----------|------------|
+ | arc_challenge | 55.00 | 54.70 | 56.30 | 56.20 | 56.20 | 55.00 |
+ | mmlu | 65.40 | 65.60 | 66.70 | 66.90 | 66.90 | 65.40 |
+ | piqa | 80.00 | 81.10 | 80.80 | 80.80 | 80.80 | 80.00 |
+ | winogrande | 74.20 | 74.30 | 75.10 | 75.10 | 75.10 | 74.20 |
+
+ * **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows the model's ability to handle diverse academic topics.
+ * **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows the model's understanding of real-world physics concepts.
+ * **Arc Challenge**: Evaluates grade-school-level multiple-choice questions that require reasoning. Shows the model's ability to solve complex reasoning tasks.
+ * **Winogrande**: Evaluates commonsense reasoning through sentence completion tasks. Shows the model's ability to understand context and resolve ambiguity.
+
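To compare a tier against the original model, the scores in the quality table can be turned into a per-task drop. The sketch below copies the numbers from the table as printed. `degradation` is a hypothetical helper, and because the README does not say whether its degradation bounds are absolute points, relative percentages, or averages over tasks, the function reports both the absolute and the relative figure without claiming either is the intended one.

```python
# Scores copied from the quality table (columns as printed there).
SCORES = {
    "arc_challenge": {"S": 55.00, "M": 54.70, "L": 56.30, "XL": 56.20,
                      "Original": 56.20, "W8A8, int8": 55.00},
    "mmlu": {"S": 65.40, "M": 65.60, "L": 66.70, "XL": 66.90,
             "Original": 66.90, "W8A8, int8": 65.40},
    "piqa": {"S": 80.00, "M": 81.10, "L": 80.80, "XL": 80.80,
             "Original": 80.80, "W8A8, int8": 80.00},
    "winogrande": {"S": 74.20, "M": 74.30, "L": 75.10, "XL": 75.10,
                   "Original": 75.10, "W8A8, int8": 74.20},
}


def degradation(task: str, mode: str) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent) vs. Original.

    Negative values mean the variant scored higher than the original model.
    """
    base = SCORES[task]["Original"]
    drop = base - SCORES[task][mode]
    return drop, 100.0 * drop / base


if __name__ == "__main__":
    for task in SCORES:
        points, percent = degradation(task, "S")
        print(f"{task}: {points:.2f} points, {percent:.2f}% relative")
```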
+ ### Latency benchmarks
+
+ __100 input tokens / 300 output tokens; throughput in tok/s:__
+
+ | GPU/Model | S | M | L | XL | Original | W8A8, int8 |
+ |-----------|-----|---|---|----|----------|------------|
+ | H100 | -1 | -1 | -1 | -1 | 40 | -1 |
+ | L40S | -1 | -1 | -1 | -1 | 27 | -1 |
+
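The tok/s figures in the table could be reproduced locally with a small timing harness. This is a sketch, not the procedure used for the table: `tokens_per_second` and its `generate` argument are hypothetical, and the warmup and run counts are arbitrary defaults.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int], warmup: int = 1, runs: int = 3) -> float:
    """Average generated tokens per second over `runs` timed calls.

    `generate` is a zero-argument callable that performs one generation
    and returns the number of newly generated tokens.
    """
    # Untimed warmup calls so one-time setup costs do not skew the result.
    for _ in range(warmup):
        generate()
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```

For a GPU model, `generate` would wrap the `model.generate(...)` call from the Inference section and return the number of new tokens (output length minus input length). Calling `torch.cuda.synchronize()` before returning is advisable so that the timer covers work still queued on the device.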
+ ## Links
+
+ * __Platform__: [app.thestage.ai](https://app.thestage.ai)
+ * __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
+ <!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
+ * __Contact email__: contact@thestage.ai