fraseque committed · verified
Commit 626cae2 · Parent(s): a4c9bfa

Update README.md

Files changed (1): README.md (+84 -161)
 
tags:
- fp8
pipeline_tag: text-generation
---

# Llama-3.3-70B-FP8-Instruct-Neuron

This is an FP8-quantized version of Meta's Llama 3.3 70B Instruct model, optimized for efficient inference on AWS Neuron accelerators (Inferentia2 and Trainium). The model has been quantized and compiled with the AWS Neuron SDK to take advantage of the dedicated AI acceleration on Inferentia and Trainium chips.

## Model Details

### Model Description

This model is a deployment-optimized version of Llama 3.3 70B Instruct, quantized to FP8 precision and compiled for AWS Neuron devices. AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium chips, which are purpose-built machine learning accelerators.
**Key Features:**

* Reduced memory footprint through FP8 quantization (from 16-bit to 8-bit floating point)
* Optimized for AWS Inferentia2 instances
* Pre-compiled for tensor parallelism across 24 NeuronCores
* Maintains the instruction-following capabilities of the base model

**Base Model:** [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)

**Quantization:** FP8 E4M3 (8-bit floating point: 4 exponent bits, 3 mantissa bits)

**Optimization Target:** [AWS Inferentia2](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inferentia2.html) NeuronCores

**Tensor Parallelism Degree:** 24

**Recommended Hardware:** AWS inf2.48xlarge (12 Inferentia2 devices with 2 NeuronCores each, 24 NeuronCores in total)

**Developed by:** Fraser Sequeira

## Quick Start

This model requires the AWS Neuron runtime and the matching Neuron compiler. To use it:

### Prerequisites

```bash
# Install the Neuron inference libraries and required packages
# (Neuron packages are served from the AWS Neuron pip repository)
pip install --extra-index-url https://pip.repos.neuron.amazonaws.com \
    neuronx-distributed-inference transformers huggingface_hub
```

```python
import os

from transformers import AutoTokenizer, GenerationConfig
from huggingface_hub import snapshot_download
from neuronx_distributed_inference.models.config import NeuronConfig
from neuronx_distributed_inference.models.llama.modeling_llama import (
    LlamaInferenceConfig,
    NeuronLlamaForCausalLM,
)
from neuronx_distributed_inference.utils.accuracy import get_generate_outputs
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

# Model setup
model_id = "fraseque/Llama-3.3-70B-FP8-Instruct-Neuron"
compiled_model_path = os.getenv(
    "COMPILED_MODEL_PATH", "/tmp/compiled_llama-3.3-70B-FP8-Instruct-Neuron"
)

# Download the model weights from the Hugging Face Hub
model_path = snapshot_download(repo_id=model_id)

# Configure for Neuron (TP degree and sequence length must match the compiled artifacts)
neuron_config = NeuronConfig(tp_degree=24, seq_len=8192)
config = LlamaInferenceConfig(neuron_config, load_config=load_pretrained_config(model_path))

# Compile and load the model onto the Neuron devices
model = NeuronLlamaForCausalLM(model_path, config)
model.compile(compiled_model_path)
model.load(compiled_model_path)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side=neuron_config.padding_side)

# Generation config
generation_config = GenerationConfig.from_pretrained(model_id)
generation_config.max_new_tokens = 100
generation_config.temperature = 0.7

# Generate text
prompt = "[INST] Hello, what's the capital of Australia? [/INST]"
_, outputs = get_generate_outputs(model, [prompt], tokenizer, is_hf=False, generation_config=generation_config)
print(outputs[0])
```

Compiling a 70B model can take considerable time; the compiled artifacts are cached at `compiled_model_path`, so subsequent runs can typically call `model.load(...)` without recompiling.
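The `[INST]`-style prompt above is a generic wrapper; Llama 3.3 defines its own chat formatting via the tokenizer's chat template. As a hedged alternative (assuming this repo ships the base model's tokenizer config), the same generation call can be driven through `apply_chat_template`:

```python
# Alternative prompt construction via the tokenizer's chat template
# (assumes the tokenizer config in this repo includes Llama 3.3's template).
messages = [{"role": "user", "content": "Hello, what's the capital of Australia?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

_, outputs = get_generate_outputs(model, [prompt], tokenizer, is_hf=False, generation_config=generation_config)
print(outputs[0])
```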

## Quantization Details

- Tensor Parallelism (TP) degree: 24
- Target accelerator: AWS Inferentia2
- Instance type: AWS inf2.48xlarge
- Sequence length: 8192 tokens
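For intuition about what the FP8 cast does to individual weight values, here is a small PyTorch sketch that simulates the quantization round-trip on CPU. It is illustrative only: the actual quantization is performed by the Neuron compiler, and `torch.float8_e4m3fnuz` is used here simply because it shares the ±240 max-normal range of the Neuron FP8 format noted in the limitations below.

```python
import torch

# Illustrative only: simulate an FP8 E4M3 round-trip on CPU.
# The real quantization happens inside the Neuron compiler.
w = torch.randn(1024, dtype=torch.float16)

w_fp8 = w.to(torch.float8_e4m3fnuz)   # lossy 16-bit -> 8-bit cast
w_rt = w_fp8.to(torch.float16)        # cast back to inspect the error

print("bytes per element:", w_fp8.element_size(), "vs", w.element_size())  # 1 vs 2
print("max round-trip error:", (w - w_rt).abs().max().item())
print("FP8 max normal value:", torch.finfo(torch.float8_e4m3fnuz).max)     # 240.0
```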
 
## Uses

### Direct Use

This model is intended for:

* Production inference deployments on AWS Inferentia2 instances
* Cost-effective LLM serving with reduced computational requirements
* Conversational AI applications requiring instruction-following capabilities
* Text generation tasks including question answering, summarization, and creative writing

**The FP8 quantization enables:**

* ~50% reduction in memory footprint compared to FP16 (see the quick check below)
* Improved throughput on Neuron accelerators
* Lower inference costs on AWS infrastructure
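A rough sanity check on the memory claim, counting weight bytes only (ignoring the KV cache, activations, and runtime overhead; the parameter count is rounded to 70B):

```python
params = 70e9                        # ~70B parameters (rounded)
fp16_gib = params * 2 / 2**30        # 2 bytes per param -> ~130 GiB
fp8_gib = params * 1 / 2**30         # 1 byte per param  -> ~65 GiB
per_core_gib = fp8_gib / 24          # weights sharded across 24 NeuronCores

print(f"FP16: {fp16_gib:.0f} GiB, FP8: {fp8_gib:.0f} GiB "
      f"(~{fp8_gib / fp16_gib:.0%}), per NeuronCore: {per_core_gib:.1f} GiB")
```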

### Out-of-Scope Use

This model is NOT suitable for:

* Deployment on non-Neuron hardware (GPUs, CPUs) without recompilation
 
 
## Bias, Risks, and Limitations

**Technical Limitations:**

- **Quantization artifacts:** FP8 quantization may introduce minor accuracy degradation compared to the full-precision base model
- **Numerical range:** The Neuron FP8 E4M3 format has a limited range (±240), which may cause NaNs for extreme values
- **Hardware dependency:** The model is compiled specifically for Neuron devices and cannot run on standard GPU/CPU infrastructure without recompilation
- **Fixed compilation:** The model is compiled with TP degree 24 and sequence length 8192; different configurations require recompilation (see the sketch after this list)
- **Inherited limitations:** This model inherits all limitations of the base Llama 3.3 70B Instruct model
- **AWS Neuron specific:**
  - Requires the AWS Neuron SDK and compatible instance types
  - Performance characteristics differ from GPU-based deployments
  - Optimal performance is achieved on inf2.48xlarge instances
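As a hedged sketch of the recompilation path flagged above, the Quick Start configuration can be rebuilt with different settings. The values below (sequence length 4096, output path) are illustrative assumptions, and the TP degree must still match the NeuronCores available on the instance:

```python
# Hypothetical recompilation for a shorter context window; reuses the imports,
# model_path, and classes from the Quick Start. seq_len and the output path
# are illustrative, not shipped defaults.
neuron_config = NeuronConfig(tp_degree=24, seq_len=4096)
config = LlamaInferenceConfig(neuron_config, load_config=load_pretrained_config(model_path))

model = NeuronLlamaForCausalLM(model_path, config)
model.compile("/tmp/compiled_llama_tp24_seq4096")  # writes fresh artifacts for this config
model.load("/tmp/compiled_llama_tp24_seq4096")
```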

## Hardware

- **Hardware Type:** [inf2.48xlarge](https://instances.vantage.sh/aws/ec2/inf2.48xlarge?currency=USD)
- **Cloud Provider:** AWS
- **Compute Region:** US-EAST

## Model Card Authors

* [Fraser Sequeira](https://www.linkedin.com/in/fraser-sequeira)