---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- llama
- quantized
- nvidia-modeloptimizer
- fp8
library_name: nvidia-modeloptimizer
---

# Llama-3.1-8B-Instruct Quantized (ModelOpt FP8)

This is a quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), produced with [modelopt](https://github.com/NVIDIA/Model-Optimizer) using FP8 post-training quantization of the weights.

## Model Details

- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** modelopt FP8 post-training quantization (PTQ)
- **Weight Precision:** FP8
- **Original Size:** ~16 GB (bfloat16)
- **Quantized Size:** ~9 GB

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# Load the tokenizer and run a short generation
tokenizer = AutoTokenizer.from_pretrained("tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8")
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## License

This model inherits the [Llama 3.1 Community License](https://llama.meta.com/llama3_1/license/).
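
## Reproducing the Quantization (Sketch)

The card above describes a modelopt FP8 PTQ run but does not include the recipe. The following is a minimal sketch of how a comparable run can be set up with modelopt's `modelopt.torch.quantization` API (`mtq.quantize` with `mtq.FP8_DEFAULT_CFG`). It is not the exact procedure used for this checkpoint: the calibration prompts in `calib_prompts` and the output path are illustrative placeholders, and a real run would calibrate on a few hundred representative samples.

```python
import torch
import modelopt.torch.quantization as mtq
import modelopt.torch.opt as mto
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base)

# Placeholder calibration data; a real run would use a larger,
# representative prompt set.
calib_prompts = [
    "Hello, my name is",
    "The capital of France is",
]

def forward_loop(m):
    # Run calibration batches through the model so modelopt can
    # collect the activation statistics needed for the FP8 scales.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        with torch.no_grad():
            m(**inputs)

# Apply FP8 post-training quantization in place.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Save the quantized model state (path is illustrative).
mto.save(model, "llama-3.1-8b-fp8-modelopt.pth")
```

A state saved this way can later be reloaded onto a freshly instantiated base model with `mto.restore`.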