Update README.md
### Model Description

- This model is a deployment-optimized version of Llama 3.2 1B that has been quantized to FP8 precision and compiled for AWS Neuron devices. AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium chips, which are purpose-built machine learning accelerators.
- **Note:** For better performance, set `tp_degree=8` on an Inf2.24xlarge instance (total token throughput ≈ 2.5k tokens/sec).
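As a rough illustration of the tensor-parallelism note above, the sketch below loads a Neuron-compiled Llama checkpoint with `tp_degree=8` using the transformers-neuronx `LlamaForSampling` API. This is an assumption about the serving stack, not a recipe from this repository: the model path, prompt, and generation settings are illustrative, and the code only runs on an Inf2/Trn1 host with the Neuron SDK installed.

```python
# Hedged sketch: serving the checkpoint with transformers-neuronx on an
# Inf2.24xlarge. Requires AWS Neuron hardware; will not run elsewhere.
import torch
from transformers import AutoTokenizer
from transformers_neuronx import LlamaForSampling

model_dir = "./llama-3.2-1b-fp8-neuron"  # hypothetical local path

# tp_degree=8 shards the model across 8 NeuronCores, the configuration
# behind the ~2.5k tokens/sec aggregate throughput figure above.
model = LlamaForSampling.from_pretrained(
    model_dir,
    batch_size=1,
    tp_degree=8,
)
model.to_neuron()  # compile/load the Neuron graph onto the devices

tokenizer = AutoTokenizer.from_pretrained(model_dir)
input_ids = tokenizer("Hello, Neuron!", return_tensors="pt").input_ids
with torch.inference_mode():
    output = model.sample(input_ids, sequence_length=128)
print(tokenizer.decode(output[0]))
```

Lower `tp_degree` values (e.g. 2 or 4) also work on smaller Inf2 instances, trading throughput for fewer NeuronCores.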
### Key Features