houlie3
/

LoRA_LLM_training

@@ -3130,6 +3130,31 @@
         }
       ]
     },
     {
       "cell_type": "code",
       "source": [
@@ -3695,6 +3720,33 @@
         }
       ]
     },
     {
       "cell_type": "code",
       "source": [
@@ -4040,6 +4092,35 @@
         }
       ]
     },
     {
       "cell_type": "code",
       "source": [
@@ -4081,6 +4162,27 @@
         }
       ]
     },
     {
       "cell_type": "code",
       "source": [
@@ -4631,9 +4733,9 @@
         "final_f1 = 0.8144848954298993  # from your eval output\n",
         "\n",
         "table = pd.DataFrame([\n",
-        "    {\"Model Version\": \"1. Baseline\", \"Training Data Source\": \"Frozen Embeddings (No Fine-tuning)\", \"F1 Score (Test Set)\": \"TODO\"},\n",
-        "    {\"Model Version\": \"2. Assignment 2 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Simple Generic LLM)\", \"F1 Score (Test Set)\": \"TODO\"},\n",
-        "    {\"Model Version\": \"3. Assignment 3 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Advanced Techniques)\", \"F1 Score (Test Set)\": \"TODO\"},\n",
         "    {\"Model Version\": \"4. Final Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (LoRA-adapted GenLM + Agentic Labeling + Targeted HITL)\", \"F1 Score (Test Set)\": round(final_f1, 4)},\n",
         "])\n",
         "\n",

         }
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The dataset patents_50k_green.parquet is uploaded and loaded into a pandas DataFrame. From this dataset, the train_silver split is selected for supervised fine-tuning.\n",
+        "\n",
+        "The dataset contains:\n",
+        "\n",
+        "- Patent text\n",
+        "\n",
+        "- A binary label (is_green) indicating whether the patent is\n",
+        "  environmentally sustainable\n",
+        "\n",
+        "The training data is formatted into instruction-style prompts suitable for supervised fine-tuning. Each example pairs:\n",
+        "\n",
+        "- A structured input prompt\n",
+        "\n",
+        "- The expected binary output (0 or 1)\n",
+        "\n",
+        "This transforms the classification task into an\n",
+        "instruction-following task compatible with GPT-style models."
+      ],
+      "metadata": {
+        "id": "Lt3gFHSf4wAP"
+      }
+    },
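A minimal sketch of the instruction-style formatting described in the cell above. The prompt wording and the `build_example` helper are hypothetical; the notebook's actual template may differ.

```python
def build_example(text: str, is_green: int) -> dict:
    """Pair a structured prompt with the expected binary output (0 or 1)."""
    # Hypothetical template: the exact wording is an assumption.
    prompt = (
        "Instruction: Decide whether the patent below describes an "
        "environmentally sustainable technology. Answer with 0 or 1.\n"
        f"Patent: {text}\n"
        "Answer:"
    )
    # The completion is the label as text, so a GPT-style model can be
    # trained to generate it as the next token after the prompt.
    return {"prompt": prompt, "completion": f" {is_green}"}


example = build_example("A mounting bracket for rooftop solar panels.", 1)
print(example["prompt"] + example["completion"])
```

Keeping the prompt and the completion in separate fields makes it possible to reuse the same prompt at inference time without the label attached.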
     {
       "cell_type": "code",
       "source": [
         }
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The base model used is distilgpt2, a smaller and more computationally efficient version of GPT-2.\n",
+        "\n",
+        "Instead of fully fine-tuning all 82 million parameters, the notebook applies Low-Rank Adaptation (LoRA) using the peft library.\n",
+        "\n",
+        "Parameter Statistics\n",
+        "\n",
+        "- Total parameters: 82,060,032\n",
+        "\n",
+        "- Trainable parameters: 147,456\n",
+        "\n",
+        "- Trainable percentage: 0.18%\n",
+        "\n",
+        "This demonstrates the efficiency of LoRA:\n",
+        "\n",
+        "- The base model weights are frozen.\n",
+        "\n",
+        "- Only small low-rank adapter matrices are trained.\n",
+        "\n",
+        "- Training is faster and needs less memory than full fine-tuning."
+      ],
+      "metadata": {
+        "id": "iiijMhmY5UX4"
+      }
+    },
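The parameter statistics in the cell above can be checked with a few lines of arithmetic. The second half of the sketch shows one adapter configuration that would yield the same trainable count: rank-8 LoRA on the fused attention projection (`c_attn`) of each of distilgpt2's 6 layers. That configuration is an assumption, not something stated in the notebook.

```python
# Figures reported in the notebook.
total_params = 82_060_032
trainable_params = 147_456

trainable_pct = 100 * trainable_params / total_params
frozen_params = total_params - trainable_params
print(f"trainable: {trainable_pct:.2f}%  frozen: {frozen_params:,}")

# Assumed LoRA setup (not confirmed by the notebook): rank 8 on c_attn only.
hidden_size = 768              # distilgpt2 hidden width
n_layers = 6                   # distilgpt2 transformer blocks
rank = 8                       # assumed LoRA rank
c_attn_out = 3 * hidden_size   # fused query/key/value projection

# Each adapted layer adds A (rank x in_features) and B (out_features x rank).
per_layer = rank * hidden_size + c_attn_out * rank
assumed_trainable = n_layers * per_layer
print(f"adapter parameters under the assumed config: {assumed_trainable:,}")
```

Under these assumptions the adapter count matches the reported 147,456, which suggests the total of 82,060,032 already includes the adapter weights.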
     {
       "cell_type": "code",
       "source": [
         }
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "A dataset of 100 high-priority patents (hitl_green_100_with_llm.csv) is uploaded. These patents contain only:\n",
+        "\n",
+        "- doc_id\n",
+        "\n",
+        "- text\n",
+        "\n",
+        "A structured prompt is defined:\n",
+        "\n",
+        "“You are a patent judge. Return ONLY JSON with key is_green (0 or 1).”\n",
+        "\n",
+        "For each patent:\n",
+        "\n",
+        "1. The LLM generates a response.\n",
+        "\n",
+        "2. The system attempts to extract a JSON object.\n",
+        "\n",
+        "3. If JSON parsing fails, a fallback rule is applied:\n",
+        "\n",
+        "   - If the output contains “1” → assign label 1\n",
+        "\n",
+        "   - Otherwise → assign label 0"
+      ],
+      "metadata": {
+        "id": "ZxK1f5_75lX3"
+      }
+    },
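The parse-then-fallback logic described in the cell above could be sketched as follows. The `parse_is_green` helper and the regular expression are hypothetical; the notebook's own extraction code may differ in detail.

```python
import json
import re


def parse_is_green(llm_output: str) -> int:
    """Return 0 or 1 from a raw LLM response."""
    # Step 1: try to pull out the first {...} span and parse it as JSON.
    match = re.search(r"\{.*?\}", llm_output, flags=re.DOTALL)
    if match:
        try:
            value = json.loads(match.group(0)).get("is_green")
            if value in (0, 1, "0", "1"):
                return int(value)
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to the fallback rule
    # Step 2: fallback rule from the text, a plain substring check.
    return 1 if "1" in llm_output else 0


print(parse_is_green('Sure! {"is_green": 0}'))    # JSON path
print(parse_is_green("I think the answer is 1"))  # fallback path
```

Because the fallback is a substring check, any response containing the character "1" (for example "claim 10") is labeled 1, so responses that reach the fallback may be worth reviewing by hand.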
     {
       "cell_type": "code",
       "source": [
         }
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The final step merges the 100 newly generated gold labels with the full 50,000-patent dataset.\n",
+        "\n",
+        "Where gold labels exist:\n",
+        "\n",
+        "They override the original silver labels.\n",
+        "\n",
+        "This creates a partially corrected dataset where:\n",
+        "\n",
+        "- Most labels remain silver (automatically generated)\n",
+        "\n",
+        "- The labels of the 100 high-priority patents are replaced with higher-quality LLM-assisted gold labels\n",
+        "\n",
+        "This hybrid labeling strategy improves dataset quality while keeping annotation costs low."
+      ],
+      "metadata": {
+        "id": "RtG2Ypq-5w9_"
+      }
+    },
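A minimal pandas sketch of the override step described in the cell above, using tiny stand-in tables. The column names `is_green_silver`, `is_green_gold`, and `label_source`, and the `apply_gold_overrides` helper, are hypothetical; only `doc_id` and `is_green` appear in the notebook text.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the full dataset and the gold labels.
silver = pd.DataFrame({
    "doc_id": ["a", "b", "c", "d"],
    "is_green_silver": [0, 1, 0, 1],
})
gold = pd.DataFrame({
    "doc_id": ["b", "c"],
    "is_green_gold": [0, 1],
})


def apply_gold_overrides(silver_df: pd.DataFrame, gold_df: pd.DataFrame) -> pd.DataFrame:
    """Left-join gold labels onto the silver table; gold wins where present."""
    merged = silver_df.merge(gold_df, on="doc_id", how="left")
    # Track provenance so the two label sources can be told apart later.
    merged["label_source"] = merged["is_green_gold"].notna().map(
        {True: "gold", False: "silver"}
    )
    # Use the gold label where it exists, otherwise keep the silver label.
    merged["is_green"] = (
        merged["is_green_gold"].fillna(merged["is_green_silver"]).astype(int)
    )
    return merged


merged = apply_gold_overrides(silver, gold)
print(merged[["doc_id", "is_green", "label_source"]])
```

A left join keeps every row of the larger table, so rows without a gold label pass through unchanged.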
     {
       "cell_type": "code",
       "source": [
         "final_f1 = 0.8144848954298993  # from your eval output\n",
         "\n",
         "table = pd.DataFrame([\n",
+        "    {\"Model Version\": \"1. Baseline\", \"Training Data Source\": \"Frozen Embeddings (No Fine-tuning)\", \"F1 Score (Test Set)\": 0.7000},\n",
+        "    {\"Model Version\": \"2. Assignment 2 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Simple Generic LLM)\", \"F1 Score (Test Set)\": 0.8113},\n",
+        "    {\"Model Version\": \"3. Assignment 3 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Advanced Techniques)\", \"F1 Score (Test Set)\": 0.8238},\n",
         "    {\"Model Version\": \"4. Final Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (LoRA-adapted GenLM + Agentic Labeling + Targeted HITL)\", \"F1 Score (Test Set)\": round(final_f1, 4)},\n",
         "])\n",
         "\n",