Update README.md
### Model Description

- This model is a deployment-optimized version of Llama 3.2 1B that has been quantized to FP8 precision and compiled for AWS Neuron devices. AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium chips, which are purpose-built machine learning accelerators.
- **Note:** For better performance, set `tp_degree=8` on an Inf2.24xlarge instance (total token throughput ≈ 2.5k tokens/sec).
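As a rough illustration of the tensor-parallelism note above, the sketch below loads a Neuron-compiled Llama checkpoint with `tp_degree=8` using the transformers-neuronx `LlamaForSampling` API. This is an assumption about the serving stack, not a recipe from this repository: the model path, prompt, and generation settings are illustrative, and the code only runs on an Inf2/Trn1 host with the Neuron SDK installed.

```python
# Hedged sketch: serving the checkpoint with transformers-neuronx on an
# Inf2.24xlarge. Requires AWS Neuron hardware; will not run elsewhere.
import torch
from transformers import AutoTokenizer
from transformers_neuronx import LlamaForSampling

model_dir = "./llama-3.2-1b-fp8-neuron"  # hypothetical local path

# tp_degree=8 shards the model across 8 NeuronCores, the configuration
# behind the ~2.5k tokens/sec aggregate throughput figure above.
model = LlamaForSampling.from_pretrained(
    model_dir,
    batch_size=1,
    tp_degree=8,
)
model.to_neuron()  # compile/load the Neuron graph onto the devices

tokenizer = AutoTokenizer.from_pretrained(model_dir)
input_ids = tokenizer("Hello, Neuron!", return_tensors="pt").input_ids
with torch.inference_mode():
    output = model.sample(input_ids, sequence_length=128)
print(tokenizer.decode(output[0]))
```

Lower `tp_degree` values (e.g. 2 or 4) also work on smaller Inf2 instances, trading throughput for fewer NeuronCores.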
### Key Features