harshagale
/

llm-upload

Text Generation

Model card Files Files and versions

llm-upload / README.md

harshagale's picture

Update README.md

af4a212 verified 8 days ago

|

history blame contribute delete

2.98 kB

	---
	license: apache-2.0
	base_model: NousResearch/Llama-2-7b-chat-hf
	tags:
	- loRA
	- qloRA
	- peft
	- causal-lm
	- text-generation
	- fine-tuned
	datasets:
	- mlabonne/guanaco-llama2-1k
	pipeline_tag: text-generation
	language:
	- en
	---

	# Llama-2-7b-chat-hf Fine-Tuned with QLoRA

	This model is a fine-tuned version of `NousResearch/Llama-2-7b-chat-hf` using Parameter-Efficient Fine-Tuning (PEFT) via QLoRA (4-bit quantization). It was trained on the `mlabonne/guanaco-llama2-1k` dataset.

	> Note: This repository contains only the adapter weights. To use this model, you need to load the base model (`NousResearch/Llama-2-7b-chat-hf`) and apply these LoRA adapters on top of it.

	## Model Details

	- Developed by: Harsh Agale
	- Base Model: `NousResearch/Llama-2-7b-chat-hf`
	- Method: QLoRA (4-bit Quantization + LoRA)
	- Language(s): English
	- License: Apache 2.0
	- Task: Causal Language Modeling / Text Generation

	## Training Hyperparameters

	The model was trained using the following configuration:
	* Quantization: 4-bit NormalFloat (`nf4`) with double quantization
	* Compute Dtype: `float16`
	* LoRA Rank (r): 8
	* LoRA Alpha: 16
	* Target Modules: `q_proj`, `v_proj`
	* LoRA Dropout: 0.05
	* Learning Rate: 2e-4
	* Optimizer: `paged_adamw_8bit`
	* Batch Size: 1 (with 4 Gradient Accumulation Steps)
	* Epochs: 1

	## Project Purpose

	This project was created to learn and experiment with:
	- QLoRA fine-tuning
	- PEFT adapters
	- 4-bit quantization
	- Efficient LLM training
	- Hugging Face ecosystem

	## Limitations

	- Trained on a small dataset
	- May produce hallucinated responses
	- Intended for educational and research purposes

	## How to Load and Use This Model

	You can easily load this model and its adapters using the `transformers` and `peft` libraries:

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
	from peft import PeftModel

	model_id = "NousResearch/Llama-2-7b-chat-hf"
	adapter_id = "harshagale/llm-upload"

	# 1. You must use the same 4-bit config to load the base model
	bnb_config = BitsAndBytesConfig(
	load_in_4bit=True,
	bnb_4bit_quant_type="nf4",
	bnb_4bit_compute_dtype=torch.float16,
	bnb_4bit_use_double_quant=True
	)

	# 2. Load the base tokenizer and configure the padding token
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	tokenizer.pad_token = tokenizer.eos_token

	# 3. Load the quantized base model
	base_model = AutoModelForCausalLM.from_pretrained(
	model_id,
	quantization_config=bnb_config,
	device_map="auto"
	)

	# 4. Merge the PEFT adapter weights onto the base model
	model = PeftModel.from_pretrained(base_model, adapter_id)

	# 5. Quick inference test
	prompt = "Human: Tell me a joke.\nAssistant:"
	inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

	with torch.no_grad():
	outputs = model.generate(**inputs, max_new_tokens=50)

	print(tokenizer.decode(outputs[0], skip_special_tokens=True))