fraseque committed · verified
Commit 626cae2 · Parent(s): a4c9bfa

Update README.md

Files changed (1): README.md (+84 -161)
 
tags:
- fp8
pipeline_tag: text-generation
---

# Llama-3.3-70B-FP8-Instruct-Neuron

This is an FP8-quantized version of Meta's Llama 3.3 70B Instruct model, optimized for efficient inference on AWS Neuron accelerators (Inferentia2 and Trainium). The model has been quantized and compiled with the AWS Neuron SDK to take advantage of the dedicated AI acceleration on Inferentia and Trainium chips.

## Model Details

### Model Description

This model is a deployment-optimized version of Llama 3.3 70B Instruct, quantized to FP8 precision and compiled for AWS Neuron devices. AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium chips, which are purpose-built machine learning accelerators.
**Key Features:**

* Reduced memory footprint through FP8 quantization (from 16-bit to 8-bit floating point)
* Optimized for AWS Inferentia2 instances
* Pre-compiled for tensor parallelism across 24 NeuronCores
* Maintains the instruction-following capabilities of the base model

**Base Model:** [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)

**Quantization:** FP8 E4M3 (8-bit floating point: 4 exponent bits, 3 mantissa bits)

**Optimization Target:** [AWS Inferentia2](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inferentia2.html) NeuronCores

**Tensor Parallelism Degree:** 24

**Recommended Hardware:** AWS inf2.48xlarge (12 Inferentia2 devices with 2 NeuronCores each, 24 NeuronCores in total)

**Developed by:** Fraser Sequeira

## Quick Start

This model requires the AWS Neuron runtime and the matching Neuron compiler. To use it:

### Prerequisites

```bash
# Install the Neuron inference libraries and required packages
# (Neuron packages are served from the AWS Neuron pip repository)
pip install --extra-index-url https://pip.repos.neuron.amazonaws.com \
    neuronx-distributed-inference transformers huggingface_hub
```

```python
import os

from transformers import AutoTokenizer, GenerationConfig
from huggingface_hub import snapshot_download
from neuronx_distributed_inference.models.config import NeuronConfig
from neuronx_distributed_inference.models.llama.modeling_llama import (
    LlamaInferenceConfig,
    NeuronLlamaForCausalLM,
)
from neuronx_distributed_inference.utils.accuracy import get_generate_outputs
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

# Model setup
model_id = "fraseque/Llama-3.3-70B-FP8-Instruct-Neuron"
compiled_model_path = os.getenv(
    "COMPILED_MODEL_PATH", "/tmp/compiled_llama-3.3-70B-FP8-Instruct-Neuron"
)

# Download the model weights from the Hugging Face Hub
model_path = snapshot_download(repo_id=model_id)

# Configure for Neuron (TP degree and sequence length must match the compiled artifacts)
neuron_config = NeuronConfig(tp_degree=24, seq_len=8192)
config = LlamaInferenceConfig(neuron_config, load_config=load_pretrained_config(model_path))

# Compile and load the model onto the Neuron devices
model = NeuronLlamaForCausalLM(model_path, config)
model.compile(compiled_model_path)
model.load(compiled_model_path)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side=neuron_config.padding_side)

# Generation config
generation_config = GenerationConfig.from_pretrained(model_id)
generation_config.max_new_tokens = 100
generation_config.temperature = 0.7

# Generate text
prompt = "[INST] Hello, what's the capital of Australia? [/INST]"
_, outputs = get_generate_outputs(model, [prompt], tokenizer, is_hf=False, generation_config=generation_config)
print(outputs[0])
```

Compiling a 70B model can take considerable time; the compiled artifacts are cached at `compiled_model_path`, so subsequent runs can typically call `model.load(...)` without recompiling.
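The `[INST]`-style prompt above is a generic wrapper; Llama 3.3 defines its own chat formatting via the tokenizer's chat template. As a hedged alternative (assuming this repo ships the base model's tokenizer config), the same generation call can be driven through `apply_chat_template`:

```python
# Alternative prompt construction via the tokenizer's chat template
# (assumes the tokenizer config in this repo includes Llama 3.3's template).
messages = [{"role": "user", "content": "Hello, what's the capital of Australia?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

_, outputs = get_generate_outputs(model, [prompt], tokenizer, is_hf=False, generation_config=generation_config)
print(outputs[0])
```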

## Quantization Details

- Tensor Parallelism (TP) degree: 24
- Target accelerator: AWS Inferentia2
- Instance type: AWS inf2.48xlarge
- Sequence length: 8192 tokens
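For intuition about what the FP8 cast does to individual weight values, here is a small PyTorch sketch that simulates the quantization round-trip on CPU. It is illustrative only: the actual quantization is performed by the Neuron compiler, and `torch.float8_e4m3fnuz` is used here simply because it shares the ±240 max-normal range of the Neuron FP8 format noted in the limitations below.

```python
import torch

# Illustrative only: simulate an FP8 E4M3 round-trip on CPU.
# The real quantization happens inside the Neuron compiler.
w = torch.randn(1024, dtype=torch.float16)

w_fp8 = w.to(torch.float8_e4m3fnuz)   # lossy 16-bit -> 8-bit cast
w_rt = w_fp8.to(torch.float16)        # cast back to inspect the error

print("bytes per element:", w_fp8.element_size(), "vs", w.element_size())  # 1 vs 2
print("max round-trip error:", (w - w_rt).abs().max().item())
print("FP8 max normal value:", torch.finfo(torch.float8_e4m3fnuz).max)     # 240.0
```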
 
## Uses

### Direct Use

This model is intended for:

* Production inference deployments on AWS Inferentia2 instances
* Cost-effective LLM serving with reduced computational requirements
* Conversational AI applications requiring instruction-following capabilities
* Text generation tasks including question answering, summarization, and creative writing

**The FP8 quantization enables:**

* ~50% reduction in memory footprint compared to FP16 (see the quick check below)
* Improved throughput on Neuron accelerators
* Lower inference costs on AWS infrastructure
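A rough sanity check on the memory claim, counting weight bytes only (ignoring the KV cache, activations, and runtime overhead; the parameter count is rounded to 70B):

```python
params = 70e9                        # ~70B parameters (rounded)
fp16_gib = params * 2 / 2**30        # 2 bytes per param -> ~130 GiB
fp8_gib = params * 1 / 2**30         # 1 byte per param  -> ~65 GiB
per_core_gib = fp8_gib / 24          # weights sharded across 24 NeuronCores

print(f"FP16: {fp16_gib:.0f} GiB, FP8: {fp8_gib:.0f} GiB "
      f"(~{fp8_gib / fp16_gib:.0%}), per NeuronCore: {per_core_gib:.1f} GiB")
```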

### Out-of-Scope Use

This model is NOT suitable for:

* Deployment on non-Neuron hardware (GPUs, CPUs) without recompilation
 
 
## Bias, Risks, and Limitations

**Technical Limitations:**

- **Quantization artifacts:** FP8 quantization may introduce minor accuracy degradation compared to the full-precision base model
- **Numerical range:** The Neuron FP8 E4M3 format has a limited range (±240), which may cause NaNs for extreme values
- **Hardware dependency:** The model is compiled specifically for Neuron devices and cannot run on standard GPU/CPU infrastructure without recompilation
- **Fixed compilation:** The model is compiled with TP degree 24 and sequence length 8192; different configurations require recompilation (see the sketch after this list)
- **Inherited limitations:** This model inherits all limitations of the base Llama 3.3 70B Instruct model
- **AWS Neuron specific:**
  - Requires the AWS Neuron SDK and compatible instance types
  - Performance characteristics differ from GPU-based deployments
  - Optimal performance is achieved on inf2.48xlarge instances
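As a hedged sketch of the recompilation path flagged above, the Quick Start configuration can be rebuilt with different settings. The values below (sequence length 4096, output path) are illustrative assumptions, and the TP degree must still match the NeuronCores available on the instance:

```python
# Hypothetical recompilation for a shorter context window; reuses the imports,
# model_path, and classes from the Quick Start. seq_len and the output path
# are illustrative, not shipped defaults.
neuron_config = NeuronConfig(tp_degree=24, seq_len=4096)
config = LlamaInferenceConfig(neuron_config, load_config=load_pretrained_config(model_path))

model = NeuronLlamaForCausalLM(model_path, config)
model.compile("/tmp/compiled_llama_tp24_seq4096")  # writes fresh artifacts for this config
model.load("/tmp/compiled_llama_tp24_seq4096")
```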

## Hardware

- **Hardware Type:** [inf2.48xlarge](https://instances.vantage.sh/aws/ec2/inf2.48xlarge?currency=USD)
- **Cloud Provider:** AWS
- **Compute Region:** US-EAST

## Model Card Authors

* [Fraser Sequeira](https://www.linkedin.com/in/fraser-sequeira)