fraseque/llama-3.2-1B-FP8-Neuron

Text Generation


fraseque committed on Oct 26, 2025

Commit 563d128 · verified · 1 Parent(s): adefff8

Update README.md

Files changed (1)

README.md +23 -0


@@ -69,6 +69,25 @@ docker run \
   bash
 ```
+**On Inf2.24xlarge [Optional]**
+```bash
+docker run \
+  -it \
+  --device=/dev/neuron0 \
+  --device=/dev/neuron1 \
+  --device=/dev/neuron2 \
+  --device=/dev/neuron3 \
+  --device=/dev/neuron4 \
+  --device=/dev/neuron5 \
+  --cap-add SYS_ADMIN \
+  --cap-add IPC_LOCK \
+  -p 8080:8080 \
+  --name llama3-2-1B \
+  public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py311-sdk2.26.0-ubuntu22.04 \
+  bash
+```
 2. Install Dependencies
 # Install required dependencies
 ```bash
@@ -142,6 +161,10 @@ python3 benchmark_serving.py --backend vllm --base-url http://127.0.0.1:8080 --d
 ![Screenshot 2025-10-21 at 3.07.47 pm](https://cdn-uploads.huggingface.co/production/uploads/64ccd3db7a4f236357524396/XiJqFZMH995uUHJ_xbbvI.png)
+Results on Inf2.24xlarge [6 neuron cores]
+![Screenshot 2025-10-23 at 12.52.57 pm](https://cdn-uploads.huggingface.co/production/uploads/64ccd3db7a4f236357524396/CP7xB8TdEF1PLrIpF7gRD.png)
 **Quantization Details**
 Quantization Format	FP8 E4M3 (8-bit floating point)
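
The added `docker run` command lists six `--device` flags by hand. As a minimal sketch, the same flags could be generated from a device count, which makes the command easier to adapt to an instance with a different number of Neuron devices. The helper name `neuron_device_flags` is hypothetical and not part of the README, and the sketch assumes the devices are numbered contiguously from `/dev/neuron0`, as they are in the command above. It only prints the assembled command; it does not run it.

```bash
#!/bin/sh
# Hypothetical helper (not part of the README): print one --device flag
# per Neuron device index, 0 .. count-1, one flag per line.
# The function body runs in a subshell so its variables do not leak.
neuron_device_flags() (
  count="$1"
  i=0
  while [ "$i" -lt "$count" ]; do
    printf -- '--device=/dev/neuron%d\n' "$i"
    i=$((i + 1))
  done
)

# Print (do not execute) the docker command for six devices.
# Word splitting is intended: each generated line becomes one argument.
# shellcheck disable=SC2046
echo docker run -it \
  $(neuron_device_flags 6) \
  --cap-add SYS_ADMIN \
  --cap-add IPC_LOCK \
  -p 8080:8080 \
  --name llama3-2-1B \
  public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py311-sdk2.26.0-ubuntu22.04 \
  bash
```

Replacing `echo docker` with `docker` would run the command. Checking `ls /dev/neuron*` on the host first shows how many devices are actually present.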