new updated
- Assignment_4.ipynb +105 -3
Assignment_4.ipynb
CHANGED
|
@@ -3130,6 +3130,31 @@
|
|
| 3130 |
}
|
| 3131 |
]
|
| 3132 |
},
|
| 3133 |
{
|
| 3134 |
"cell_type": "code",
|
| 3135 |
"source": [
|
|
@@ -3695,6 +3720,33 @@
|
|
| 3695 |
}
|
| 3696 |
]
|
| 3697 |
},
|
| 3698 |
{
|
| 3699 |
"cell_type": "code",
|
| 3700 |
"source": [
|
|
@@ -4040,6 +4092,35 @@
|
|
| 4040 |
}
|
| 4041 |
]
|
| 4042 |
},
|
| 4043 |
{
|
| 4044 |
"cell_type": "code",
|
| 4045 |
"source": [
|
|
@@ -4081,6 +4162,27 @@
|
|
| 4081 |
}
|
| 4082 |
]
|
| 4083 |
},
|
| 4084 |
{
|
| 4085 |
"cell_type": "code",
|
| 4086 |
"source": [
|
|
@@ -4631,9 +4733,9 @@
|
|
| 4631 |
"final_f1 = 0.8144848954298993 # from your eval output\n",
|
| 4632 |
"\n",
|
| 4633 |
"table = pd.DataFrame([\n",
|
| 4634 |
-
" {\"Model Version\": \"1. Baseline\", \"Training Data Source\": \"Frozen Embeddings (No Fine-tuning)\", \"F1 Score (Test Set)\": \"
|
| 4635 |
-
" {\"Model Version\": \"2. Assignment 2 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Simple Generic LLM)\", \"F1 Score (Test Set)\": \"
|
| 4636 |
-
" {\"Model Version\": \"3. Assignment 3 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Advanced Techniques)\", \"F1 Score (Test Set)\": \"
|
| 4637 |
" {\"Model Version\": \"4. Final Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (LoRA-adapted GenLM + Agentic Labeling + Targeted HITL)\", \"F1 Score (Test Set)\": round(final_f1, 4)},\n",
|
| 4638 |
"])\n",
|
| 4639 |
"\n",
|
|
|
|
| 3130 |
}
|
| 3131 |
]
|
| 3132 |
},
|
| 3133 |
+
{
|
| 3134 |
+
"cell_type": "markdown",
|
| 3135 |
+
"source": [
|
| 3136 |
+
"The dataset patents_50k_green.parquet is uploaded and loaded into a pandas DataFrame. From this dataset, the train_silver split is selected for supervised fine-tuning.\n",
|
| 3137 |
+
"\n",
|
| 3138 |
+
"The dataset contains:\n",
|
| 3139 |
+
"\n",
|
| 3140 |
+
"Patent text\n",
|
| 3141 |
+
"\n",
|
| 3142 |
+
"A binary label (is_green) indicating whether the patent is\n",
|
| 3143 |
+
"environmentally sustainable\n",
|
| 3144 |
+
"\n",
|
| 3145 |
+
"The training data is formatted into instruction-style prompts suitable for supervised fine-tuning. Each example pairs:\n",
|
| 3146 |
+
"\n",
|
| 3147 |
+
"A structured input prompt\n",
|
| 3148 |
+
"\n",
|
| 3149 |
+
"The expected binary output (0 or 1)\n",
|
| 3150 |
+
"\n",
|
| 3151 |
+
"This transforms the classification task into an\n",
|
| 3152 |
+
"instruction-following task compatible with GPT-style models."
|
| 3153 |
+
],
|
| 3154 |
+
"metadata": {
|
| 3155 |
+
"id": "Lt3gFHSf4wAP"
|
| 3156 |
+
}
|
| 3157 |
+
},
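The prompt formatting described in this cell can be sketched as follows. This is a minimal illustration only: the prompt wording, the `build_example` helper, and the `prompt`/`completion` keys are assumptions made for the example, not the notebook's actual template. Only the `text` and `is_green` fields come from the description above.

```python
def build_example(text: str, is_green: int) -> dict:
    """Pair an instruction-style prompt with its expected binary completion."""
    # Hypothetical template: the notebook's real wording is not shown in this diff.
    prompt = (
        "Instruction: Decide whether the patent below describes green technology. "
        "Answer with 0 or 1.\n"
        f"Patent: {text.strip()}\n"
        "Answer:"
    )
    # The leading space keeps the label as a separate token for GPT-style tokenizers.
    return {"prompt": prompt, "completion": f" {int(is_green)}"}


# Toy rows standing in for the train_silver split.
rows = [
    {"text": "A wind turbine blade with reduced drag.", "is_green": 1},
    {"text": "A stapler with an improved spring.", "is_green": 0},
]
examples = [build_example(r["text"], r["is_green"]) for r in rows]
print(examples[0]["prompt"] + examples[0]["completion"])
```

Ending the prompt with a fixed cue such as `Answer:` means the model only has to generate a single label token at inference time.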
|
| 3158 |
{
|
| 3159 |
"cell_type": "code",
|
| 3160 |
"source": [
|
|
|
|
| 3720 |
}
|
| 3721 |
]
|
| 3722 |
},
|
| 3723 |
+
{
|
| 3724 |
+
"cell_type": "markdown",
|
| 3725 |
+
"source": [
|
| 3726 |
+
"The base model used is distilgpt2, a smaller and more computationally efficient version of GPT-2.\n",
|
| 3727 |
+
"\n",
|
| 3728 |
+
"Instead of fully fine-tuning all 82 million parameters, the notebook applies Low-Rank Adaptation (LoRA) using the peft library.\n",
|
| 3729 |
+
"\n",
|
| 3730 |
+
"Parameter Statistics\n",
|
| 3731 |
+
"\n",
|
| 3732 |
+
"Total parameters: 82,060,032\n",
|
| 3733 |
+
"\n",
|
| 3734 |
+
"Trainable parameters: 147,456\n",
|
| 3735 |
+
"\n",
|
| 3736 |
+
"Trainable percentage: 0.18%\n",
|
| 3737 |
+
"\n",
|
| 3738 |
+
"This demonstrates the efficiency of LoRA:\n",
|
| 3739 |
+
"\n",
|
| 3740 |
+
"The base model weights are frozen.\n",
|
| 3741 |
+
"\n",
|
| 3742 |
+
"Only small low-rank adapter matrices are trained.\n",
|
| 3743 |
+
"\n",
|
| 3744 |
+
"Training is significantly faster and requires fewer resources."
|
| 3745 |
+
],
|
| 3746 |
+
"metadata": {
|
| 3747 |
+
"id": "iiijMhmY5UX4"
|
| 3748 |
+
}
|
| 3749 |
+
},
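The parameter statistics in this cell can be checked with a little arithmetic. The sketch below assumes LoRA rank 8 applied only to the fused attention projection (`c_attn`) in each of distilgpt2's 6 blocks. That configuration is a guess that happens to reproduce the reported count; the diff does not show the notebook's actual `peft` settings.

```python
# Published distilgpt2 dimensions: 6 transformer blocks, hidden size 768.
N_LAYERS = 6
HIDDEN = 768

# Assumption (not confirmed by the notebook): rank-8 adapters on c_attn only,
# which maps 768 inputs to 3 * 768 outputs (query, key, value fused).
RANK = 8
OUT_FEATURES = 3 * HIDDEN


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


trainable = N_LAYERS * lora_params(HIDDEN, OUT_FEATURES, RANK)
total = 82_060_032  # total reported in the cell above
pct = 100 * trainable / total

print(trainable, f"{pct:.2f}%")  # 147456 0.18%
```

Under that assumption each block adds 24,576 adapter weights, which gives the 147,456 trainable parameters and the 0.18% share quoted above.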
|
| 3750 |
{
|
| 3751 |
"cell_type": "code",
|
| 3752 |
"source": [
|
|
|
|
| 4092 |
}
|
| 4093 |
]
|
| 4094 |
},
|
| 4095 |
+
{
|
| 4096 |
+
"cell_type": "markdown",
|
| 4097 |
+
"source": [
|
| 4098 |
+
"A dataset of 100 high-priority patents (hitl_green_100_with_llm.csv) is uploaded. Each row contains only:\n",
|
| 4099 |
+
"\n",
|
| 4100 |
+
"doc_id\n",
|
| 4101 |
+
"\n",
|
| 4102 |
+
"text\n",
|
| 4103 |
+
"\n",
|
| 4104 |
+
"A structured prompt is defined:\n",
|
| 4105 |
+
"\n",
|
| 4106 |
+
"“You are a patent judge. Return ONLY JSON with key is_green (0 or 1).”\n",
|
| 4107 |
+
"\n",
|
| 4108 |
+
"For each patent:\n",
|
| 4109 |
+
"\n",
|
| 4110 |
+
"The LLM generates a response.\n",
|
| 4111 |
+
"\n",
|
| 4112 |
+
"The system attempts to extract a JSON object.\n",
|
| 4113 |
+
"\n",
|
| 4114 |
+
"If JSON parsing fails, a fallback rule is applied:\n",
|
| 4115 |
+
"\n",
|
| 4116 |
+
"If the output contains “1” → assign label 1\n",
|
| 4117 |
+
"\n",
|
| 4118 |
+
"Otherwise → assign label 0"
|
| 4119 |
+
],
|
| 4120 |
+
"metadata": {
|
| 4121 |
+
"id": "ZxK1f5_75lX3"
|
| 4122 |
+
}
|
| 4123 |
+
},
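The extraction-and-fallback logic described in this cell might look roughly like the sketch below. The function name `parse_is_green` and the regular expression are illustrative assumptions; only the two-stage behaviour (try JSON first, then look for the character "1") is taken from the description.

```python
import json
import re


def parse_is_green(raw: str) -> int:
    """Extract the is_green label from raw LLM output, with a string fallback."""
    # Stage 1: look for the first {...} span and try to parse it as JSON.
    match = re.search(r"\{.*?\}", raw, flags=re.DOTALL)
    if match:
        try:
            value = json.loads(match.group(0)).get("is_green")
            if str(value) in ("0", "1"):
                return int(value)
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to the fallback rule
    # Stage 2: fallback rule from the cell above.
    return 1 if "1" in raw else 0


print(parse_is_green('Sure! {"is_green": 0} done'))  # 0
print(parse_is_green("I think the answer is 1"))     # 1
print(parse_is_green("Claim 10 is unrelated"))       # 1
```

The last call shows why the fallback is fragile: any digit "1" in free text, such as a claim number, is read as a positive label. It may be worth counting how many of the 100 rows reach the fallback.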
|
| 4124 |
{
|
| 4125 |
"cell_type": "code",
|
| 4126 |
"source": [
|
|
|
|
| 4162 |
}
|
| 4163 |
]
|
| 4164 |
},
|
| 4165 |
+
{
|
| 4166 |
+
"cell_type": "markdown",
|
| 4167 |
+
"source": [
|
| 4168 |
+
"The final step merges the 100 newly generated gold labels with the full 50,000-patent dataset.\n",
|
| 4169 |
+
"\n",
|
| 4170 |
+
"Where gold labels exist:\n",
|
| 4171 |
+
"\n",
|
| 4172 |
+
"They override the original silver labels.\n",
|
| 4173 |
+
"\n",
|
| 4174 |
+
"This creates a partially corrected dataset where:\n",
|
| 4175 |
+
"\n",
|
| 4176 |
+
"Most labels remain silver (automatically generated)\n",
|
| 4177 |
+
"\n",
|
| 4178 |
+
"The top 100 are replaced with higher-quality LLM-assisted gold labels\n",
|
| 4179 |
+
"\n",
|
| 4180 |
+
"This hybrid labeling strategy aims to improve dataset quality while keeping annotation costs low."
|
| 4181 |
+
],
|
| 4182 |
+
"metadata": {
|
| 4183 |
+
"id": "RtG2Ypq-5w9_"
|
| 4184 |
+
}
|
| 4185 |
+
},
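The override step described in this cell can be sketched with pandas as below. The toy frames and the `is_green_gold` and `label_source` column names are hypothetical. The notebook's actual merge code is not visible in this diff, so this is one plausible way to do it rather than the author's implementation.

```python
import pandas as pd

# Toy stand-ins: the full silver-labelled set and the small gold-labelled subset.
silver = pd.DataFrame({"doc_id": ["a", "b", "c", "d"], "is_green": [0, 1, 0, 1]})
gold = pd.DataFrame({"doc_id": ["b", "c"], "is_green_gold": [0, 1]})

# A left join keeps every silver row; rows without a gold label get NaN.
merged = silver.merge(gold, on="doc_id", how="left")
has_gold = merged["is_green_gold"].notna()

# Record where each label came from, then let gold override silver.
merged["label_source"] = has_gold.map({True: "gold", False: "silver"})
merged["is_green"] = (
    merged["is_green_gold"].where(has_gold, merged["is_green"]).astype(int)
)
merged = merged.drop(columns="is_green_gold")

print(merged)
```

Keeping a provenance column such as `label_source` makes it easy to confirm afterwards that exactly the expected number of rows were overridden.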
|
| 4186 |
{
|
| 4187 |
"cell_type": "code",
|
| 4188 |
"source": [
|
|
|
|
| 4733 |
"final_f1 = 0.8144848954298993 # from your eval output\n",
|
| 4734 |
"\n",
|
| 4735 |
"table = pd.DataFrame([\n",
|
| 4736 |
+
" {\"Model Version\": \"1. Baseline\", \"Training Data Source\": \"Frozen Embeddings (No Fine-tuning)\", \"F1 Score (Test Set)\": \"0.7000\"},\n",
|
| 4737 |
+
" {\"Model Version\": \"2. Assignment 2 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Simple Generic LLM)\", \"F1 Score (Test Set)\": \"0.8113\"},\n",
|
| 4738 |
+
" {\"Model Version\": \"3. Assignment 3 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Advanced Techniques)\", \"F1 Score (Test Set)\": \"0.8238\"},\n",
|
| 4739 |
" {\"Model Version\": \"4. Final Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (LoRA-adapted GenLM + Agentic Labeling + Targeted HITL)\", \"F1 Score (Test Set)\": round(final_f1, 4)},\n",
|
| 4740 |
"])\n",
|
| 4741 |
"\n",
|