---
license: apache-2.0
base_model: NousResearch/Llama-2-7b-chat-hf
tags:
- loRA
- qloRA
- peft
- causal-lm
- text-generation
- fine-tuned
datasets:
- mlabonne/guanaco-llama2-1k
pipeline_tag: text-generation
language:
- en
---

# Llama-2-7b-chat-hf Fine-Tuned with QLoRA

This model is a fine-tuned version of `NousResearch/Llama-2-7b-chat-hf`, trained with Parameter-Efficient Fine-Tuning (PEFT) via **QLoRA** (LoRA adapters on a 4-bit quantized base model). It was trained on the `mlabonne/guanaco-llama2-1k` dataset.

> **Note:** This repository contains **only the adapter weights**. To use this model, you need to load the base model (`NousResearch/Llama-2-7b-chat-hf`) and apply these LoRA adapters on top of it.

## Model Details

- **Developed by:** Harsh Agale
- **Base Model:** `NousResearch/Llama-2-7b-chat-hf`
- **Method:** QLoRA (4-bit Quantization + LoRA)
- **Language(s):** English
- **License:** Apache 2.0
- **Task:** Causal Language Modeling / Text Generation

## Training Hyperparameters

The model was trained using the following configuration:

* **Quantization:** 4-bit NormalFloat (`nf4`) with double quantization
* **Compute Dtype:** `float16`
* **LoRA Rank (r):** 8
* **LoRA Alpha:** 16
* **Target Modules:** `q_proj`, `v_proj`
* **LoRA Dropout:** 0.05
* **Learning Rate:** 2e-4
* **Optimizer:** `paged_adamw_8bit`
* **Batch Size:** 1 (with 4 Gradient Accumulation Steps)
* **Epochs:** 1

## Project Purpose

This project was created to learn and experiment with:

- QLoRA fine-tuning
- PEFT adapters
- 4-bit quantization
- Efficient LLM training
- Hugging Face ecosystem

## Limitations

- Trained on a small dataset
- May produce hallucinated responses
- Intended for educational and research purposes

## How to Load and Use This Model

You can load the base model and apply the adapter with the `transformers` and `peft` libraries:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

model_id = "NousResearch/Llama-2-7b-chat-hf"
adapter_id = "harshagale/llm-upload"

# 1. Load the base model with the same 4-bit config that was used for training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# 2. Load the base tokenizer and reuse the EOS token as the padding token
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# 3. Load the quantized base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# 4. Attach the LoRA adapter to the base model
#    (the adapter is applied at runtime; the weights are not merged)
model = PeftModel.from_pretrained(base_model, adapter_id)

# 5. Quick inference test
prompt = "Human: Tell me a joke.\nAssistant:"
# Move the inputs to wherever device_map="auto" placed the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```