Update README.md

af4a212 verified 5 days ago

2.98 kB

license: apache-2.0
base_model: NousResearch/Llama-2-7b-chat-hf
tags:
  - loRA
  - qloRA
  - peft
  - causal-lm
  - text-generation
  - fine-tuned
datasets:
  - mlabonne/guanaco-llama2-1k
pipeline_tag: text-generation
language:
  - en

Llama-2-7b-chat-hf Fine-Tuned with QLoRA

This model is a fine-tuned version of NousResearch/Llama-2-7b-chat-hf using Parameter-Efficient Fine-Tuning (PEFT) via QLoRA (4-bit quantization). It was trained on the mlabonne/guanaco-llama2-1k dataset.

Note: This repository contains only the adapter weights. To use this model, you need to load the base model (NousResearch/Llama-2-7b-chat-hf) and apply these LoRA adapters on top of it.

Model Details

Developed by: Harsh Agale
Base Model: NousResearch/Llama-2-7b-chat-hf
Method: QLoRA (4-bit Quantization + LoRA)
Language(s): English
License: Apache 2.0
Task: Causal Language Modeling / Text Generation

Training Hyperparameters

The model was trained using the following configuration:

Quantization: 4-bit NormalFloat (nf4) with double quantization
Compute Dtype: float16
LoRA Rank (r): 8
LoRA Alpha: 16
Target Modules: q_proj, v_proj
LoRA Dropout: 0.05
Learning Rate: 2e-4
Optimizer: paged_adamw_8bit
Batch Size: 1 (with 4 Gradient Accumulation Steps)
Epochs: 1

Project Purpose

This project was created to learn and experiment with:

QLoRA fine-tuning
PEFT adapters
4-bit quantization
Efficient LLM training
Hugging Face ecosystem

Limitations

Trained on a small dataset
May produce hallucinated responses
Intended for educational and research purposes

How to Load and Use This Model

You can easily load this model and its adapters using the transformers and peft libraries:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

model_id = "NousResearch/Llama-2-7b-chat-hf"
adapter_id = "harshagale/llm-upload"

# 1. You must use the same 4-bit config to load the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# 2. Load the base tokenizer and configure the padding token
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# 3. Load the quantized base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# 4. Merge the PEFT adapter weights onto the base model
model = PeftModel.from_pretrained(base_model, adapter_id)

# 5. Quick inference test
prompt = "Human: Tell me a joke.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))