Spaces: Ghaithhmz / SSFT_Audit_Report_generator

Ghaithhmz committed on Feb 11

Commit

6202f17

0 Parent(s):

model trained on few data


Files changed (2)

.gitignore +2 -0
audit_model_finetuning.ipynb +679 -0

.gitignore ADDED

	@@ -0,0 +1,2 @@


+ data/*
+ docs/*

audit_model_finetuning.ipynb ADDED

	@@ -0,0 +1,679 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Self-Supervised Fine-Tuning of Mistral-7B on Audit Reports\n",
+    "\n",
+    "This notebook demonstrates how to adapt a Large Language Model (Mistral-7B) to the domain of professional audit reports using self-supervised fine-tuning (continued pretraining). \n",
+    "\n",
+    "- **Objective**: Enhance the model's domain fluency, vocabulary, and stylistic consistency for audit documentation.\n",
+    "- **Method**: Causal Language Modeling (Next-Token Prediction) on raw text extracted from PDF reports.\n",
+    "- **Hardware**: Optimized for a T4 GPU (Google Colab free tier compatible) using QLoRA (4-bit quantization + LoRA).\n",
+    "\n",
+    "## 1. Setup and Installation\n",
+    "We need to install the necessary libraries for PDF extraction, efficient model loading, and training.\n",
+    "\n",
+    "**IMPORTANT**: After running the installation cell below, you MUST restart the runtime/session (Runtime > Restart session) for the updates to take effect, then run the cells starting from the imports."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Installation complete. Please RESTART the runtime (Runtime > Restart session) to apply changes, then run the next cells.\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Install all key dependencies including PyTorch components to ensure version compatibility\n",
+    "!pip install -q -U torch torchvision torchaudio transformers peft datasets bitsandbytes trl pdfplumber accelerate\n",
+    "\n",
+    "print(\"Installation complete. Please RESTART the runtime (Runtime > Restart session) to apply changes, then run the next cells.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<torch._C.Generator at 0x7b00d3b86a30>"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import os\n",
+    "import glob\n",
+    "import pdfplumber\n",
+    "import torch\n",
+    "from datasets import Dataset, DatasetDict\n",
+    "from transformers import (\n",
+    "    AutoModelForCausalLM,\n",
+    "    AutoTokenizer,\n",
+    "    BitsAndBytesConfig,\n",
+    "    TrainingArguments,\n",
+    "    Trainer,\n",
+    "    DataCollatorForLanguageModeling\n",
+    ")\n",
+    "from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType\n",
+    "import re\n",
+    "\n",
+    "# Set seed for reproducibility\n",
+    "torch.manual_seed(42)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n",
+      "Mounted Google Drive. DATA_DIR set to: /content/drive/MyDrive/Data\n"
+     ]
+    }
+   ],
+   "source": [
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Check if running in Colab\n",
+    "if 'google.colab' in sys.modules:\n",
+    "    from google.colab import drive\n",
+    "    try:\n",
+    "        drive.mount('/content/drive')\n",
+    "    except Exception as e:\n",
+    "        print(f\"Could not mount Google Drive: {e}\")\n",
+    "    # Update DATA_DIR to point to mounted Drive\n",
+    "    # Make sure you have uploaded the Data folder to your Google Drive root\n",
+    "    DATA_DIR = Path('/content/drive/MyDrive/Data')\n",
+    "    print(f\"Mounted Google Drive. DATA_DIR set to: {DATA_DIR}\")\n",
+    "else:\n",
+    "    DATA_DIR = Path(\"./Data\")\n",
+    "    print(f\"Not running in Colab. Using local Data directory: {DATA_DIR}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Data Preparation\n",
+    "\n",
+    "We will extract text from the PDF audit reports located in the `Data` directory. \n",
+    "\n",
+    "**Cleaning Steps**:\n",
+    "- Extract text using `pdfplumber`.\n",
+    "- Remove potential headers and footers (heuristic: very short lines at top/bottom of pages).\n",
+    "- Normalize whitespace.\n",
+    "- Anonymize sensitive patterns (placeholder implementation)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Searching for PDFs in: /content/drive/MyDrive/Data\n",
+      "Processing /content/drive/MyDrive/Data/Annual_Review_of_Audit_Quality_2025.pdf...\n",
+      "Processing /content/drive/MyDrive/Data/BDO_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...\n",
+      "Processing /content/drive/MyDrive/Data/Deloitte_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...\n",
+      "Processing /content/drive/MyDrive/Data/Ernst__Young_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...\n",
+      "Processing /content/drive/MyDrive/Data/Forvis_Mazars_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...\n",
+      "Processing /content/drive/MyDrive/Data/KPMG_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...\n",
+      "Processing /content/drive/MyDrive/Data/PricewaterhouseCoopers_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...\n",
+      "Processing /content/drive/MyDrive/Data/Annual_Review_of_Audit_Quality_2024_7yhxTsi.pdf...\n",
+      "Processing /content/drive/MyDrive/Data/Tier_1_Firms__Overview_2023.pdf...\n",
+      "Processing /content/drive/MyDrive/Data/FRC_Audit_Quality_Inspection_and_Supervision_Public_Report_2022_-_Tier_1_Firms_Overview.pdf...\n",
+      "Processing /content/drive/MyDrive/Data/Individual_Rights_Data_Privacy_Policy.pdf...\n",
+      "\n",
+      "Successfully loaded 11 documents.\n"
+     ]
+    }
+   ],
+   "source": [
+    "def extract_text_from_pdf(pdf_path):\n",
+    "    text_content = []\n",
+    "    with pdfplumber.open(pdf_path) as pdf:\n",
+    "        for page in pdf.pages:\n",
+    "            # Extract text\n",
+    "            text = page.extract_text()\n",
+    "            if not text:\n",
+    "                continue\n",
+    "            \n",
+    "            lines = text.split('\\n')\n",
+    "            \n",
+    "            # Basic Heuristic: Remove first and last lines if they likely resemble headers/footers (e.g., page numbers or short titles)\n",
+    "            # Adjust this logic based on your specific PDF layout\n",
+    "            if len(lines) > 2:\n",
+    "                # Remove header if short (arbitrary length < 50 chars as a heuristic)\n",
+    "                if len(lines[0]) < 50:\n",
+    "                    lines = lines[1:]\n",
+    "                # Remove footer if short (< 20 chars), e.g. a bare page number\n",
+    "                if lines and len(lines[-1]) < 20:\n",
+    "                    lines = lines[:-1]\n",
+    "            \n",
+    "            page_text = \"\\n\".join(lines)\n",
+    "            text_content.append(page_text)\n",
+    "    \n",
+    "    full_text = \"\\n\\n\".join(text_content)\n",
+    "    return full_text\n",
+    "\n",
+    "def clean_data(text):\n",
+    "    # Normalize whitespace\n",
+    "    text = re.sub(r'\\s+', ' ', text).strip()\n",
+    "    \n",
+    "    # Placeholder for anonymization (e.g., replace emails, phone numbers)\n",
+    "    # This regex is a simple example and should be expanded for real production use\n",
+    "    text = re.sub(r'[\\w\\.-]+@[\\w\\.-]+', '[EMAIL]', text)\n",
+    "    \n",
+    "    return text\n",
+    "\n",
+    "# Main Data Loading Loop\n",
+    "try:\n",
+    "    data_dir = DATA_DIR\n",
+    "except NameError:\n",
+    "    data_dir = \"./Data\"\n",
+    "    \n",
+    "print(f\"Searching for PDFs in: {data_dir}\")\n",
+    "pdf_files = glob.glob(os.path.join(str(data_dir), \"*.pdf\"))\n",
+    "\n",
+    "raw_texts = []\n",
+    "for pdf_file in pdf_files:\n",
+    "    print(f\"Processing {pdf_file}...\")\n",
+    "    try:\n",
+    "        raw_text = extract_text_from_pdf(pdf_file)\n",
+    "        cleaned_text = clean_data(raw_text)\n",
+    "        if len(cleaned_text) > 500: # Only keep documents with substantial content\n",
+    "            raw_texts.append(cleaned_text)\n",
+    "    except Exception as e:\n",
+    "        print(f\"Error processing {pdf_file}: {e}\")\n",
+    "\n",
+    "print(f\"\\nSuccessfully loaded {len(raw_texts)} documents.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Dataset Tokenization and Chunking\n",
+    "\n",
+    "We need to process the text into chunks suitable for the model's context window. \n",
+    "- **Context Window**: 1024 tokens (Reduced from 2048 to save VRAM).\n",
+    "- **Overlap**: None; chunks are packed back to back.\n",
+    "- **Format**: Prepare as a Hugging Face Dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "DatasetDict({\n",
+      "    train: Dataset({\n",
+      "        features: ['text'],\n",
+      "        num_rows: 9\n",
+      "    })\n",
+      "    test: Dataset({\n",
+      "        features: ['text'],\n",
+      "        num_rows: 2\n",
+      "    })\n",
+      "})\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_auth.py:104: UserWarning: \n",
+      "Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.\n",
+      "You are not authenticated with the Hugging Face Hub in this notebook.\n",
+      "If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).\n",
+      "  warnings.warn(\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "d7d2664df18f403cbc36eab98e88f893",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Chunking and Tokenizing:   0%|          | 0/9 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "164effd8479149a2b96f05a0d4af7ac1",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Chunking and Tokenizing:   0%|          | 0/2 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Train chunks: 107\n",
+      "Test chunks: 23\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Create HF Dataset\n",
+    "dataset = Dataset.from_dict({\"text\": raw_texts})\n",
+    "\n",
+    "# Split by document into train and held-out test sets (test_size=0.1 rounds up to 2 of the 11 documents)\n",
+    "dataset = dataset.train_test_split(test_size=0.1, seed=42)\n",
+    "print(dataset)\n",
+    "\n",
+    "# Load Tokenizer\n",
+    "model_id = \"mistralai/Mistral-7B-v0.1\"\n",
+    "tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)\n",
+    "tokenizer.pad_token = tokenizer.eos_token # Mistral has no pad token by default\n",
+    "\n",
+    "def chunk_and_tokenize(examples):\n",
+    "    # Concatenate all tokenized texts into one long token sequence, then cut it into fixed-size chunks\n",
+    "    chunk_size = 1024 # Reduced from 2048 to save VRAM\n",
+    "    \n",
+    "    # Tokenize without padding or truncation\n",
+    "    tokens = tokenizer(examples[\"text\"], truncation=False, return_attention_mask=False)[\"input_ids\"]\n",
+    "    \n",
+    "    # Flatten list of lists into one big list of tokens\n",
+    "    concatenated_tokens = [tok for doc in tokens for tok in doc]\n",
+    "    \n",
+    "    # Calculate total length divisible by chunk_size\n",
+    "    # We drop the small remainder at the very end of the entire dataset\n",
+    "    total_length = len(concatenated_tokens)\n",
+    "    if total_length >= chunk_size:\n",
+    "        total_length = (total_length // chunk_size) * chunk_size\n",
+    "    else:\n",
+    "        # Handle highly unlikely case where entire dataset < chunk_size tokens\n",
+    "        # Pad to chunk_size\n",
+    "        concatenated_tokens += [tokenizer.eos_token_id] * (chunk_size - total_length)\n",
+    "        total_length = chunk_size\n",
+    "\n",
+    "    # Split into chunks of chunk_size tokens; labels mirror input_ids for causal LM\n",
+    "    result = {\n",
+    "        \"input_ids\": [concatenated_tokens[i : i + chunk_size] for i in range(0, total_length, chunk_size)],\n",
+    "        \"labels\": [concatenated_tokens[i : i + chunk_size] for i in range(0, total_length, chunk_size)]\n",
+    "    }\n",
+    "    \n",
+    "    return result\n",
+    "\n",
+    "# Apply processing\n",
+    "tokenized_dataset = dataset.map(\n",
+    "    chunk_and_tokenize,\n",
+    "    batched=True,\n",
+    "    remove_columns=dataset[\"train\"].column_names,\n",
+    "    desc=\"Chunking and Tokenizing\"\n",
+    ")\n",
+    "\n",
+    "print(f\"Train chunks: {len(tokenized_dataset['train'])}\")\n",
+    "print(f\"Test chunks: {len(tokenized_dataset['test'])}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Model Loading with QLoRA\n",
+    "\n",
+    "We load Mistral-7B in 4-bit quantization to fit on a T4 GPU.\n",
+    "Then we attach LoRA adapters for parameter-efficient fine-tuning."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "0ae545926d92459b8ab84e155ad3685e",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Loading weights:   0%|          | 0/291 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "trainable params: 41,943,040 || all params: 7,283,675,136 || trainable%: 0.5758\n"
+     ]
+    }
+   ],
+   "source": [
+    "# 4-bit Quantization Config\n",
+    "bnb_config = BitsAndBytesConfig(\n",
+    "    load_in_4bit=True,\n",
+    "    bnb_4bit_quant_type=\"nf4\",\n",
+    "    bnb_4bit_compute_dtype=torch.float16, # or bfloat16 if supported by hardware\n",
+    "    bnb_4bit_use_double_quant=False,\n",
+    ")\n",
+    "\n",
+    "# Load Base Model\n",
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    model_id,\n",
+    "    quantization_config=bnb_config,\n",
+    "    device_map=\"auto\",\n",
+    "    trust_remote_code=True\n",
+    ")\n",
+    "\n",
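+    "# Enable gradient checkpointing to save memory; pass use_reentrant explicitly, as newer PyTorch versions require\n",
+    "model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={\"use_reentrant\": False})\n",
+    "model = prepare_model_for_kbit_training(model, gradient_checkpointing_kwargs={\"use_reentrant\": False})\n",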
+    "\n",
+    "# LoRA Configuration\n",
+    "peft_config = LoraConfig(\n",
+    "    r=16, # Rank\n",
+    "    lora_alpha=32,\n",
+    "    lora_dropout=0.05,\n",
+    "    bias=\"none\",\n",
+    "    task_type=\"CAUSAL_LM\",\n",
+    "    # Target all linear layers for better adaptation\n",
+    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"]\n",
+    ")\n",
+    "\n",
+    "model = get_peft_model(model, peft_config)\n",
+    "model.print_trainable_parameters()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Training\n",
+    "\n",
+    "We use the basic `Trainer` with `DataCollatorForLanguageModeling`. \n",
+    "The objective is purely self-supervised next-token prediction."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.\n",
+      "  return fn(*args, **kwargs)\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "    <div>\n",
+       "      \n",
+       "      <progress value='42' max='42' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+       "      [42/42 30:01, Epoch 3/3]\n",
+       "    </div>\n",
+       "    <table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       " <tr style=\"text-align: left;\">\n",
+       "      <th>Step</th>\n",
+       "      <th>Training Loss</th>\n",
+       "      <th>Validation Loss</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <td>10</td>\n",
+       "      <td>2.144592</td>\n",
+       "      <td>2.107712</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>20</td>\n",
+       "      <td>1.789090</td>\n",
+       "      <td>1.951720</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>30</td>\n",
+       "      <td>1.445677</td>\n",
+       "      <td>1.882343</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>40</td>\n",
+       "      <td>1.282020</td>\n",
+       "      <td>1.864601</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table><p>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.\n",
+      "  return fn(*args, **kwargs)\n",
+      "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.\n",
+      "  return fn(*args, **kwargs)\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "TrainOutput(global_step=42, training_loss=1.6343639180773781, metrics={'train_runtime': 1842.9007, 'train_samples_per_second': 0.174, 'train_steps_per_second': 0.023, 'total_flos': 1.4106535567294464e+16, 'train_loss': 1.6343639180773781, 'epoch': 3.0})"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Clear cache before training\n",
+    "torch.cuda.empty_cache()\n",
+    "\n",
+    "# Data Collator\n",
+    "data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)\n",
+    "\n",
+    "# Training Arguments\n",
+    "training_args = TrainingArguments(\n",
+    "    output_dir=\"./audit-mistral-finetuned\",\n",
+    "    per_device_train_batch_size=1, # Reduced to 1 to fit T4 VRAM\n",
+    "    gradient_accumulation_steps=8, # Accumulate 8 micro-batches for an effective batch size of 8\n",
+    "    learning_rate=2e-4,\n",
+    "    logging_steps=10,\n",
+    "    num_train_epochs=3, # Increased to 3 epochs\n",
+    "    save_strategy=\"epoch\",\n",
+    "    eval_strategy=\"steps\", # Evaluate more frequently\n",
+    "    eval_steps=10,\n",
+    "    fp16=True,\n",
+    "    optim=\"paged_adamw_8bit\", # Memory efficient optimizer\n",
+    "    report_to=\"none\"\n",
+    ")\n",
+    "\n",
+    "# Initialize Trainer\n",
+    "trainer = Trainer(\n",
+    "    model=model,\n",
+    "    args=training_args,\n",
+    "    train_dataset=tokenized_dataset[\"train\"],\n",
+    "    eval_dataset=tokenized_dataset[\"test\"],\n",
+    "    data_collator=data_collator,\n",
+    ")\n",
+    "\n",
+    "# Start Training\n",
+    "trainer.train()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Evaluation\n",
+    "\n",
+    "We calculate Perplexity as a quantitative metric of how well the model predicts the domain text."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "    <div>\n",
+       "      \n",
+       "      <progress value='3' max='3' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+       "      [3/3 00:21]\n",
+       "    </div>\n",
+       "    "
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Perplexity: 6.47\n"
+     ]
+    }
+   ],
+   "source": [
+    "import math\n",
+    "\n",
+    "eval_results = trainer.evaluate()\n",
+    "perplexity = math.exp(eval_results['eval_loss'])\n",
+    "print(f\"Perplexity: {perplexity:.2f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Inference\n",
+    "\n",
+    "Test the model's generation capabilities on an audit-related prompt."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The audit of the financial statements reveals that the audited entity is likely to be impacted by a significant risk related to climate change. The entity must include a statement in its annual report, as required under the Companies Act 2006, to that effect. In this example, the entity’s statement is included on the inside front cover of its annual report and sets out the following: “Climate change is one of the greatest threats to our future prosperity. We are taking action to reduce our impact on the environment and to prepare for the opportunities and risks of climate change. We are setting science-based targets to reduce our emissions and to increase our use of renewable energy. We are developing our climate risk disclosures to provide greater transparency to our stakeholders\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Save the model (adapters only)\n",
+    "trainer.save_model(\"/content/drive/MyDrive/Self_Supervised_finetuning_Model/audit-mistral-7b-qlora\")\n",
+    "\n",
+    "# Inference Prompt\n",
+    "prompt = \"The audit of the financial statements reveals that\"\n",
+    "inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
+    "\n",
+    "# Generate\n",
+    "with torch.no_grad():\n",
+    "    outputs = model.generate(\n",
+    "        **inputs,\n",
+    "        max_new_tokens=150,\n",
+    "        temperature=0.7,\n",
+    "        top_p=0.9,\n",
+    "        do_sample=True\n",
+    "    )\n",
+    "\n",
+    "print(tokenizer.decode(outputs[0], skip_special_tokens=True))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "trainer.save_model(\"/content/drive/MyDrive/Self_Supervised_finetuning_Model/audit-mistral-7b-qlora\")\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
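
The packing step in section 3 of the notebook and the step count logged in section 5 can be checked in isolation. The following is a minimal sketch, not the notebook's own code: `pack_into_chunks` and `optimizer_steps` are hypothetical helper names, the token ids are toy values, and the step formula assumes that a final partial accumulation group still counts as one optimizer step, which is consistent with the 42 steps shown in the training log for 107 train chunks.

```python
import math
from typing import List


def pack_into_chunks(token_lists: List[List[int]], chunk_size: int, pad_id: int) -> List[List[int]]:
    """Concatenate tokenized documents and cut them into fixed-size chunks.

    Hypothetical helper mirroring the notebook's chunk_and_tokenize logic,
    but operating on plain lists of token ids instead of a tokenizer output.
    """
    # Flatten the per-document token lists into one long sequence.
    flat = [tok for doc in token_lists for tok in doc]
    if len(flat) < chunk_size:
        # Too short for a single chunk: pad up to exactly chunk_size.
        flat = flat + [pad_id] * (chunk_size - len(flat))
    # Keep only whole chunks; the trailing remainder is dropped.
    usable = (len(flat) // chunk_size) * chunk_size
    return [flat[i:i + chunk_size] for i in range(0, usable, chunk_size)]


def optimizer_steps(num_chunks: int, per_device_batch: int, grad_accum: int, epochs: int) -> int:
    """Estimate total optimizer steps (assumes partial groups are rounded up)."""
    batches_per_epoch = math.ceil(num_chunks / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


# Toy example: two "documents" of 3 and 4 tokens, packed into chunks of 3.
chunks = pack_into_chunks([[1, 2, 3], [4, 5, 6, 7]], chunk_size=3, pad_id=0)
print(chunks)  # [[1, 2, 3], [4, 5, 6]]  (token 7 is the dropped remainder)

# Values from the notebook: 107 train chunks, batch size 1, accumulation 8, 3 epochs.
print(optimizer_steps(107, 1, 8, 3))  # 42
```

With only 107 chunks, each epoch amounts to 14 optimizer updates, so the evaluation interval of 10 steps yields roughly one validation point per epoch.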