fraseque committed on
Commit 9043499 · verified · 1 Parent(s): df3c3cc

Update README.md

Files changed (1)
  1. README.md +1 -0
README.md CHANGED
@@ -24,6 +24,7 @@ This is an FP8-quantized version of Meta's Llama 3.2 1B model, specifically opti
 ### Model Description
 
 This model is a deployment-optimized version of Llama 3.2 1B that has been quantized to FP8 precision and compiled for AWS Neuron devices. AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium chips, which are purpose-built machine learning accelerators.
+For better performance, set tp_degree=8 on an inf2.24xlarge instance [Total Token Throughput = ~2.5k tokens/sec]
 
 ### Key Features
 
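The `tp_degree=8` advice in the added line corresponds to the tensor-parallelism degree passed when loading a Llama checkpoint with AWS's transformers-neuronx library. A minimal sketch of how that setting might be applied — the model path, `amp` dtype, and generation parameters here are assumptions, and the code only runs on an Inf2/Trn1 instance with the Neuron SDK installed:

```python
# Sketch: load a Neuron-compiled Llama checkpoint sharded across 8 NeuronCores.
# Assumes an inf2.24xlarge with transformers-neuronx installed; the model path
# below is a hypothetical placeholder, not the actual repository layout.
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

model_path = "./llama-3.2-1b-neuron"  # hypothetical local checkpoint directory

# tp_degree=8 shards the model across 8 NeuronCores, matching the README's advice.
model = LlamaForSampling.from_pretrained(model_path, tp_degree=8, amp="f16")
model.to_neuron()  # compile and move the model onto the Neuron cores

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
generated = model.sample(input_ids, sequence_length=128)
print(tokenizer.decode(generated[0]))
```

The throughput figure quoted in the commit (~2.5k tokens/sec total) would depend on batch size and sequence length in addition to the tensor-parallel degree.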