---
library_name: transformers
language:
- en
pipeline_tag: text-generation
---

# Meta-Llama-3-8B-Instruct-4bit

A 4-bit quantized version of Meta-Llama-3-8B-Instruct, produced with bitsandbytes.

## Quantization Configuration

- **load_in_4bit:** True
- **llm_int8_threshold:** 6.0
- **bnb_4bit_quant_type:** nf4
- **bnb_4bit_use_double_quant:** True
- **bnb_4bit_compute_dtype:** bfloat16
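
The settings above map directly onto a `BitsAndBytesConfig` in transformers. As a sketch (not necessarily the exact script used to produce this checkpoint), this is how the base model would be quantized on the fly with this configuration; it requires a CUDA GPU and access to the gated base repo:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Reconstruction of the quantization settings listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize the FP16 base model while loading it.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```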
## How to use

### Install the required libraries

```bash
pip install transformers peft
pip install -U bitsandbytes
```

### Load model directly

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "SwastikM/Meta-Llama-3-8B-Instruct_bitsandbytes_4bit",
    device_map="auto",  # place the 4-bit weights on the available GPU
)

messages = [
    {"role": "system", "content": "You are a Coder."},
    {"role": "user", "content": "How to create a list in Python?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
)

# Strip the prompt tokens and decode only the newly generated response.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```

### Output

```
In Python, you can create a list in several ways:

1. Using the `list()` function:

my_list = list()

This creates an empty list.

2. Using square brackets `[]`:

my_list = []

This also creates an empty list.

3. Using the `list()` function with an iterable (such as a string or a tuple):

my_list = list("hello")
print(my_list) # Output: ['h', 'e', 'l', 'l', 'o']

4. Using the `list()` function with a range of numbers:

my_list = list(range(1, 6))
print(my_list) # Output: [1, 2, 3, 4, 5]

5. Using the `list()` function with a dictionary:

my_dict = {"a": 1, "b": 2, "c": 3}
my_list = list(my_dict.keys())
print(my_list) # Output: ['a', 'b', 'c']

Note that in Python, lists are mutable, meaning you can add, remove, or modify elements after creating the list.
```

## Size Comparison

The table below compares the VRAM required to load and train the FP16
base model and the 4-bit bitsandbytes-quantized model with PEFT.
The base-model figure is taken from Hugging Face's
[Model Memory Calculator](https://huggingface.co/docs/accelerate/main/en/usage_guides/model_size_estimator).

| Model         | Total Size |
|---------------|------------|
| Base Model    | 28 GB      |
| 4bitQuantized | 5.21 GB    |
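
As a rough sanity check on these figures, the weights-only memory can be estimated from the parameter count and bit width. The helper below is illustrative (not from this card) and deliberately ignores training overhead such as gradients and optimizer state, non-quantized layers, and double-quantization constants, so it is a lower bound on the table's totals:

```python
def weight_size_gb(n_params: float, bits_per_param: int) -> float:
    """Weights-only memory in GiB: parameters x bits, converted to bytes."""
    return n_params * bits_per_param / 8 / 1024**3

n = 8.03e9  # approximate parameter count of Llama-3-8B

print(f"FP16 weights: {weight_size_gb(n, 16):.2f} GiB")  # roughly 15 GiB
print(f"NF4 weights:  {weight_size_gb(n, 4):.2f} GiB")   # roughly 3.7 GiB
```

For an exact measurement of a loaded model, transformers also exposes `model.get_memory_footprint()`, which reports the in-memory size in bytes.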
## Acknowledgment

- Thanks to [Merve Noyan](https://huggingface.co/blog/merve/quantization) for the concise introduction to quantization.
- Thanks to the [Hugging Face team](https://huggingface.co/blog/4bit-transformers-bitsandbytes) for the blog post on 4-bit transformers with bitsandbytes.
- Thanks to [Meta](https://huggingface.co/meta-llama) for the open-source model.

## Model Card Authors

Swastik Maiti