	---
	license: llama3.1
	base_model: meta-llama/Llama-3.1-8B-Instruct
	tags:
	- llama
	- quantized
	- nvidia-modeloptimizer
	- FP8
	- QAT
	library_name: nvidia-modeloptimizer
	---

	# Llama-3.1-8B-Instruct Quantized with ModelOpt FP8 QAT

	This is a quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), produced with [ModelOpt](https://github.com/NVIDIA/Model-Optimizer) using FP8 weight quantization and quantization-aware training (QAT).

	## Model Details

	- Base Model: meta-llama/Llama-3.1-8B-Instruct
	- Quantization Method: ModelOpt FP8 Quantization-Aware Training (QAT)
	- Weight Precision: FP8
	- Original Size: ~16 GB (bfloat16)
	- Quantized Size: ~6 GB (fp8)
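
To make the "Weight Precision: FP8" entry concrete, here is a minimal pure-Python sketch of per-tensor FP8 fake quantization, the quantize-then-dequantize step that QAT typically applies to weights in the forward pass. This is not ModelOpt's implementation. The E4M3 format, the per-tensor scale, and the helper names `round_to_e4m3` and `fake_quantize_fp8` are assumptions for illustration; this card does not state which FP8 format or scaling granularity was used.

```python
import math

E4M3_MAX = 448.0         # largest finite value in the E4M3 format
E4M3_MIN_EXP = -6        # exponent of the smallest normal value
E4M3_MANTISSA_BITS = 3   # explicit mantissa bits


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp returns a = m * 2**e with m in [0.5, 1), so floor(log2(a)) == e - 1.
    _, e = math.frexp(a)
    # Values below 2**-6 share the subnormal spacing of the smallest exponent.
    exponent = max(e - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exponent - E4M3_MANTISSA_BITS)
    # Python's round() resolves ties to the even neighbour.
    return sign * min(round(a / step) * step, E4M3_MAX)


def fake_quantize_fp8(weights: list[float]) -> list[float]:
    """Quantize to E4M3 with one scale per tensor, then dequantize (assumed scheme)."""
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return list(weights)
    scale = amax / E4M3_MAX  # maps the largest magnitude onto the FP8 maximum
    return [round_to_e4m3(w / scale) * scale for w in weights]


if __name__ == "__main__":
    weights = [0.013, -0.42, 0.25, 1.7]  # hypothetical weight values
    print(fake_quantize_fp8(weights))
```

With only 3 mantissa bits, each value snaps to one of 8 steps per power of two. Training with this rounding in the loop is what lets the model adapt its weights to the reduced precision.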

	## Usage

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Use one repository ID for both the weights and the tokenizer
	model_id = "tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8"

	# Load the quantized checkpoint
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    low_cpu_mem_usage=True,
	)

	# Load the tokenizer and generate
	tokenizer = AutoTokenizer.from_pretrained(model_id)

	# Move the inputs to the same device as the model before generating
	inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=10)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	## License

	This model inherits the [Llama 3.1 Community License](https://llama.meta.com/llama3_1/license/).