---
license: llama3.2
base_model:
- meta-llama/Llama-3.2-1B
tags:
- Neuron
- Inferentia2
- AWS
- text-generation
- fp8
- quantized
- vllm
pipeline_tag: text-generation
language:
- en
---

# Llama-3.2-1B-FP8-Neuron

This is an FP8-quantized version of Meta's Llama 3.2 1B model, optimized for efficient inference on AWS Neuron accelerators (Inferentia2). The model has been quantized and compiled with the AWS Neuron SDK to take advantage of the specialized AI acceleration on AWS Neuron chips.

## Model Details

### Model Description

- This model is a deployment-optimized version of Llama 3.2 1B that has been quantized to FP8 precision and compiled for AWS Neuron devices. AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium chips, which are purpose-built machine learning accelerators.
- **Note:** For higher throughput, set `tp_degree=8` when serving on an inf2.24xlarge instance (total token throughput ≈ 2.5k tokens/sec).

### Key Features

* **Reduced memory footprint** through FP8 quantization (~50% reduction from FP16)
* **Optimized for AWS Inferentia2** instances
* **Pre-compiled** for tensor parallelism across 2 NeuronCores
* **Maintains instruction-following capabilities** of the base model
* **Cost-effective** LLM serving with improved throughput

### Model Specifications

| Specification | Value |
|--------------|-------|
| **Base Model** | [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) |
| **Quantization** | FP8 E4M3 (IEEE-754 FP8_EXP4 format) |
| **Optimization Target** | [AWS Inferentia2](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inferentia2.html) NeuronCores |
| **Tensor Parallelism Degree** | 2 |
| **Recommended Hardware** | AWS inf2.8xlarge |
| **Max Sequence Length** | 8192 tokens |
| **Developed by** | [Fraser Sequeira](https://www.linkedin.com/in/fraser-sequeira) |

## Quick Start

### Prerequisites

1. Launch an **inf2.8xlarge** Ubuntu EC2 instance on AWS
2. Select the **'Deep Learning AMI Neuron (Ubuntu 22.04)'** AMI

### Installation & Setup

#### 1. Launch Docker Container

```bash
docker run \
  -it \
  --device=/dev/neuron0 \
  --cap-add SYS_ADMIN \
  --cap-add IPC_LOCK \
  -p 8080:8080 \
  --name llama3-2-1B \
  public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py311-sdk2.26.0-ubuntu22.04 \
  bash
```

**Optional: on inf2.24xlarge (exposes all six Neuron devices to the container)**
```bash
docker run \
  -it \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  --device=/dev/neuron2 \
  --device=/dev/neuron3 \
  --device=/dev/neuron4 \
  --device=/dev/neuron5 \
  --cap-add SYS_ADMIN \
  --cap-add IPC_LOCK \
  -p 8080:8080 \
  --name llama3-2-1B \
  public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py311-sdk2.26.0-ubuntu22.04 \
  bash
```
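
Before installing anything else, you can confirm that the container actually sees the Inferentia2 devices. `neuron-ls` ships with the AWS Neuron SDK and should be available in the image above; this is just an optional sanity check, not a required step.

```bash
# Optional sanity check: list the Neuron devices visible inside the container
neuron-ls
```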


#### 2. Install Dependencies

```bash
# Install required dependencies
pip install -U "huggingface_hub[cli]"

# Optional dependencies for benchmarking
pip install pandas datasets
```

#### 3. Configure Hugging Face Access
```bash
export HF_TOKEN=<your-huggingface-token>
```
#### 4. Download the Model
```bash
hf download fraseque/llama-3.2-1B-FP8-Neuron
```

#### 5. Set Model Path

The model is typically saved to:
```bash
/root/.cache/huggingface/hub/models--fraseque--llama-3.2-1B-FP8-Neuron/snapshots/{{uuid}}
```
Replace `{{uuid}}` with the actual snapshot ID:
```bash
export MODEL_PATH=/root/.cache/huggingface/hub/models--fraseque--llama-3.2-1B-FP8-Neuron/snapshots/{{uuid}}
```
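
If you prefer not to look up the snapshot ID by hand, a minimal sketch like the following resolves the snapshot directory automatically (it assumes exactly one snapshot has been downloaded):

```bash
# Hedged helper: pick the (single) downloaded snapshot directory and export it as MODEL_PATH
export MODEL_PATH=$(ls -d /root/.cache/huggingface/hub/models--fraseque--llama-3.2-1B-FP8-Neuron/snapshots/*/ | head -n 1)
echo "MODEL_PATH=$MODEL_PATH"
```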

#### 6. (Optional) Use Pre-compiled Artifacts

To skip compilation (which takes 5-10 minutes), you can use the pre-compiled artifacts included in this repository:
```bash
export NEURON_COMPILED_ARTIFACTS=$MODEL_PATH/neuron-compiled-artifacts/0a7a59fd2142874207e2f96474f27309
```
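
If you are unsure which artifact directory the repository ships, you can list what is available before exporting the path:

```bash
# List the pre-compiled Neuron artifact directories bundled with the model
ls "$MODEL_PATH/neuron-compiled-artifacts/"
```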

#### 7. Serve the Model
```bash
VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference' python -m vllm.entrypoints.openai.api_server \
--model "$MODEL_PATH" \
--device "neuron" \
--tensor-parallel-size 2 \
--max-num-seqs 16 \
--max-model-len 8192 \
--port 8080 \
--override-neuron-config "{\"enable_bucketing\": true, \"context_encoding_buckets\": [128,512,1024,2048,4096,8192], \"token_generation_buckets\": [128,512,1024,2048,4096,8192], \"max_context_length\": 8192, \"use-v2-block-manager\": true, \"modules_to_not_convert\": [\"lm_head\", \"embed_tokens\"], \"seq_len\": 8192, \"quantization_dtype\":\"f8e4m3\", \"quantization_type\": \"per_channel_symmetric\", \"quantized_checkpoints_path\":\"$MODEL_PATH\", \"quantized\": true, \"batch_size\": 1, \"ctx_batch_size\": 1, \"tkg_batch_size\": 1, \"attn_kernel_enabled\": true, \"sequence_parallel_enabled\": true, \"is_continuous_batching\": true}"

```
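
Compilation and model loading can take several minutes on the first start. A small, hedged readiness check against the OpenAI-compatible `/v1/models` endpoint looks like this:

```bash
# Poll the OpenAI-compatible endpoint until the server is ready, then show the served model
until curl -sf http://localhost:8080/v1/models > /dev/null; do
  echo "waiting for vLLM server..."
  sleep 10
done
curl -s http://localhost:8080/v1/models
```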
### Making Inference Requests

Once the server is running on port 8080, open another terminal and run the following curl request:

```bash
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
  "prompt": "<|system|>You are a helpful AI assistant.<|user|>What is the capital of France?<|assistant|>",
  "max_tokens": 100,
  "temperature": 0.1,
  "top_p": 0.9,
  "stop": ["<|system|>", "<|user|>", "<|assistant|>", "<|end|>", "\n\n"]
}'
```
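
The completions endpoint also supports token streaming via the standard OpenAI-style `stream` parameter; a minimal sketch (with an arbitrary prompt) is shown below:

```bash
# Stream tokens as they are generated (-N disables curl's output buffering)
curl -N http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a haiku about the ocean.",
    "max_tokens": 64,
    "temperature": 0.7,
    "stream": true
  }'
```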


### Benchmarking Performance

Open another terminal, set `MODEL_PATH` as in step 5, and run the following benchmark command:
```bash
cd /opt/vllm/benchmarks
python3 benchmark_serving.py --backend vllm --base-url http://127.0.0.1:8080 --dataset-name=random --model $MODEL_PATH --num-prompts 20 --max-concurrency 5 --request-rate inf --random-input-len 4000 --random-output-len 500 --seed 12345
```
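
To keep results for later comparison, vLLM's `benchmark_serving.py` can also write a JSON summary; the flags below (`--save-result`, `--result-dir`) are assumed from the script's standard options and may differ slightly across vLLM versions:

```bash
# Same benchmark, but persist a JSON summary of the run (flag names may vary by vLLM version)
python3 benchmark_serving.py --backend vllm --base-url http://127.0.0.1:8080 \
  --dataset-name=random --model $MODEL_PATH --num-prompts 20 --max-concurrency 5 \
  --request-rate inf --random-input-len 4000 --random-output-len 500 --seed 12345 \
  --save-result --result-dir ./benchmark-results
```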

![Screenshot 2025-10-21 at 3.07.47 pm](https://cdn-uploads.huggingface.co/production/uploads/64ccd3db7a4f236357524396/XiJqFZMH995uUHJ_xbbvI.png)

Results on inf2.24xlarge (6 Neuron devices):

![Screenshot 2025-10-23 at 12.52.57 pm](https://cdn-uploads.huggingface.co/production/uploads/64ccd3db7a4f236357524396/CP7xB8TdEF1PLrIpF7gRD.png)


## Quantization Details

| Specification | Value |
|--------------|-------|
| **Quantization Format** | FP8 E4M3 (8-bit floating point) |
| **Quantization Type** | Per-channel symmetric |
| **Tensor Parallelism (TP)** | 2 |
| **Target Accelerator** | AWS Inferentia2 |
| **Instance Type** | inf2.8xlarge |
| **Sequence Length** | 8192 tokens |
## Use Cases

### Intended Use

This model is optimized for:

- ✅ Production inference deployments on AWS Inferentia2 instances
- ✅ Cost-effective LLM serving with reduced computational requirements
- ✅ Conversational AI applications requiring instruction-following
- ✅ Text generation tasks (Q&A, summarization, creative writing)
- ✅ Low-latency inference requirements

### Benefits of FP8 Quantization

- ~50% memory reduction compared to FP16
- Improved throughput on Neuron accelerators
- Lower inference costs on AWS infrastructure
- Maintained accuracy with minimal degradation

### Out-of-Scope Use

This model is NOT suitable for:

- ❌ Deployment on non-Neuron hardware (GPUs, CPUs) without recompilation

## Limitations and Considerations

- **Quantization artifacts:** FP8 quantization may introduce minor accuracy degradation compared to full-precision models
- **Hardware dependency:** Compiled specifically for Neuron devices; requires recompilation for other hardware
- **Max sequence length:** 8192 tokens

## Citation

```bibtex
@misc{llama32-1b-fp8-neuron,
  author = {Sequeira, Fraser},
  title = {Llama-3.2-1B-FP8-Neuron},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/fraseque/llama-3.2-1B-FP8-Neuron}}
}
```

## Model Card Authors
- Fraser Sequeira

## Acknowledgments

- Base model: Meta's Llama 3.2 1B
- Quantization and compilation: AWS Neuron SDK (neuronx-distributed-inference)
- Inference framework: vLLM with Neuron support

## License

This model inherits the Llama 3.2 license from Meta. Please refer to the official license for terms and conditions.

---