---
license: apache-2.0
base_model: NousResearch/Llama-2-7b-chat-hf
tags:
- loRA
- qloRA
- peft
- causal-lm
- text-generation
- fine-tuned
datasets:
- mlabonne/guanaco-llama2-1k
pipeline_tag: text-generation
language:
- en
---

# Llama-2-7b-chat-hf Fine-Tuned with QLoRA

This model is a fine-tuned version of `NousResearch/Llama-2-7b-chat-hf` using Parameter-Efficient Fine-Tuning (PEFT) via **QLoRA** (4-bit quantization). It was trained on the `mlabonne/guanaco-llama2-1k` dataset.

> **Note:** This repository contains **only the adapter weights**. To use this model, you need to load the base model (`NousResearch/Llama-2-7b-chat-hf`) and apply these LoRA adapters on top of it.

## Model Details

- **Developed by:** Harsh Agale
- **Base Model:** `NousResearch/Llama-2-7b-chat-hf`
- **Method:** QLoRA (4-bit Quantization + LoRA)
- **Language(s):** English
- **License:** Apache 2.0
- **Task:** Causal Language Modeling / Text Generation

## Training Hyperparameters

The model was trained using the following configuration:
* **Quantization:** 4-bit NormalFloat (`nf4`) with double quantization
* **Compute Dtype:** `float16`
* **LoRA Rank (r):** 8
* **LoRA Alpha:** 16
* **Target Modules:** `q_proj`, `v_proj`
* **LoRA Dropout:** 0.05
* **Learning Rate:** 2e-4
* **Optimizer:** `paged_adamw_8bit`
* **Batch Size:** 1 (with 4 Gradient Accumulation Steps)
* **Epochs:** 1

## Project Purpose

This project was created to learn and experiment with:
- QLoRA fine-tuning
- PEFT adapters
- 4-bit quantization
- Efficient LLM training
- Hugging Face ecosystem

## Limitations

- Trained on a small dataset
- May produce hallucinated responses
- Intended for educational and research purposes

## How to Load and Use This Model

You can easily load this model and its adapters using the `transformers` and `peft` libraries:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

model_id = "NousResearch/Llama-2-7b-chat-hf"
adapter_id = "harshagale/llm-upload"

# 1. You must use the same 4-bit config to load the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# 2. Load the base tokenizer and configure the padding token
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# 3. Load the quantized base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# 4. Merge the PEFT adapter weights onto the base model
model = PeftModel.from_pretrained(base_model, adapter_id)

# 5. Quick inference test
prompt = "Human: Tell me a joke.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))