houlie3 committed on
Commit 91386d3 · verified
1 Parent(s): 35c77db

new updated

Files changed (1)
  1. Assignment_4.ipynb +105 -3
Assignment_4.ipynb CHANGED
@@ -3130,6 +3130,31 @@
3130
  }
3131
  ]
3132
  },
3133
  {
3134
  "cell_type": "code",
3135
  "source": [
@@ -3695,6 +3720,33 @@
3695
  }
3696
  ]
3697
  },
3698
  {
3699
  "cell_type": "code",
3700
  "source": [
@@ -4040,6 +4092,35 @@
4040
  }
4041
  ]
4042
  },
4043
  {
4044
  "cell_type": "code",
4045
  "source": [
@@ -4081,6 +4162,27 @@
4081
  }
4082
  ]
4083
  },
4084
  {
4085
  "cell_type": "code",
4086
  "source": [
@@ -4631,9 +4733,9 @@
4631
  "final_f1 = 0.8144848954298993 # from your eval output\n",
4632
  "\n",
4633
  "table = pd.DataFrame([\n",
4634
- " {\"Model Version\": \"1. Baseline\", \"Training Data Source\": \"Frozen Embeddings (No Fine-tuning)\", \"F1 Score (Test Set)\": \"TODO\"},\n",
4635
- " {\"Model Version\": \"2. Assignment 2 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Simple Generic LLM)\", \"F1 Score (Test Set)\": \"TODO\"},\n",
4636
- " {\"Model Version\": \"3. Assignment 3 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Advanced Techniques)\", \"F1 Score (Test Set)\": \"TODO\"},\n",
4637
  " {\"Model Version\": \"4. Final Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (LoRA-adapted GenLM + Agentic Labeling + Targeted HITL)\", \"F1 Score (Test Set)\": round(final_f1, 4)},\n",
4638
  "])\n",
4639
  "\n",
 
3130
  }
3131
  ]
3132
  },
3133
+ {
3134
+ "cell_type": "markdown",
3135
+ "source": [
3136
+ "The dataset patents_50k_green.parquet is uploaded and loaded into a pandas DataFrame. From this dataset, the train_silver split is selected for supervised fine-tuning.\n",
3137
+ "\n",
3138
+ "The dataset contains:\n",
3139
+ "\n",
3140
+ "Patent text\n",
3141
+ "\n",
3142
+ "A binary label (is_green) indicating whether the patent is\n",
3143
+ "environmentally sustainable\n",
3144
+ "\n",
3145
+ "The training data is formatted into instruction-style prompts suitable for supervised fine-tuning. Each example pairs:\n",
3146
+ "\n",
3147
+ "A structured input prompt\n",
3148
+ "\n",
3149
+ "The expected binary output (0 or 1)\n",
3150
+ "\n",
3151
+ "This transforms the classification task into an\n",
3152
+ "instruction-following task compatible with GPT-style models."
3153
+ ],
3154
+ "metadata": {
3155
+ "id": "Lt3gFHSf4wAP"
3156
+ }
3157
+ },
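The prompt formatting described in the markdown cell above can be sketched as follows. This is a minimal illustration, not the notebook's actual template: the function name `format_example`, the prompt wording, and the `prompt`/`completion` keys are assumptions; only the `is_green` label and the 0/1 output come from the cell text.

```python
def format_example(text: str, is_green: int) -> dict:
    """Build one instruction-style training example (hypothetical template)."""
    prompt = (
        "Instruction: Decide whether the patent below is environmentally "
        "sustainable. Answer 1 for yes or 0 for no.\n"
        f"Patent: {text.strip()}\n"
        "Answer:"
    )
    # The completion is the binary label as text, so a causal LM can learn
    # to emit it as the next token after "Answer:".
    return {"prompt": prompt, "completion": f" {int(is_green)}"}


# Two made-up rows standing in for the train_silver split.
examples = [
    format_example("A wind turbine blade with reduced drag. ", 1),
    format_example("A decorative phone case.", 0),
]
```

Ending the prompt with a fixed cue such as `Answer:` keeps the expected output to a single short token, which makes the generated label easier to parse at inference time.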
3158
  {
3159
  "cell_type": "code",
3160
  "source": [
 
3720
  }
3721
  ]
3722
  },
3723
+ {
3724
+ "cell_type": "markdown",
3725
+ "source": [
3726
+ "The base model used is distilgpt2, a smaller and more computationally efficient version of GPT-2.\n",
3727
+ "\n",
3728
+ "Instead of fully fine-tuning all 82 million parameters, the notebook applies Low-Rank Adaptation (LoRA) using the peft library.\n",
3729
+ "\n",
3730
+ "Parameter Statistics\n",
3731
+ "\n",
3732
+ "Total parameters: 82,060,032\n",
3733
+ "\n",
3734
+ "Trainable parameters: 147,456\n",
3735
+ "\n",
3736
+ "Trainable percentage: 0.18%\n",
3737
+ "\n",
3738
+ "This demonstrates the efficiency of LoRA:\n",
3739
+ "\n",
3740
+ "The base model weights are frozen.\n",
3741
+ "\n",
3742
+ "Only small low-rank adapter matrices are trained.\n",
3743
+ "\n",
3744
+ "Training is significantly faster and requires fewer resources."
3745
+ ],
3746
+ "metadata": {
3747
+ "id": "iiijMhmY5UX4"
3748
+ }
3749
+ },
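The parameter statistics in the cell above can be checked with a short calculation. The LoRA configuration is not visible in this diff, so the settings below are an assumption: rank 8 adapters on the fused attention projection (`c_attn`, 768 to 2304) in each of distilgpt2's 6 transformer blocks. That choice happens to reproduce the reported trainable count.

```python
hidden = 768          # distilgpt2 hidden size
qkv_out = 3 * hidden  # c_attn projects to query, key and value at once
n_layers = 6          # distilgpt2 has 6 transformer blocks
rank = 8              # assumed LoRA rank (not shown in the diff)

# Each adapted layer adds A (hidden x rank) and B (rank x qkv_out).
per_layer = rank * (hidden + qkv_out)
trainable = per_layer * n_layers

total = 82_060_032    # total reported in the cell (base weights + adapters)
pct = 100 * trainable / total

print(trainable)        # 147456
print(f"{pct:.2f}%")    # 0.18%
```

Under these assumptions the frozen base accounts for the remaining 81,912,576 parameters.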
3750
  {
3751
  "cell_type": "code",
3752
  "source": [
 
4092
  }
4093
  ]
4094
  },
4095
+ {
4096
+ "cell_type": "markdown",
4097
+ "source": [
4098
+ "A dataset of 100 high-priority patents (hitl_green_100_with_llm.csv) is uploaded. Each record contains only:\n",
4099
+ "\n",
4100
+ "doc_id\n",
4101
+ "\n",
4102
+ "text\n",
4103
+ "\n",
4104
+ "A structured prompt is defined:\n",
4105
+ "\n",
4106
+ "“You are a patent judge. Return ONLY JSON with key is_green (0 or 1).”\n",
4107
+ "\n",
4108
+ "For each patent:\n",
4109
+ "\n",
4110
+ "The LLM generates a response.\n",
4111
+ "\n",
4112
+ "The system attempts to extract a JSON object.\n",
4113
+ "\n",
4114
+ "If JSON parsing fails, a fallback rule is applied:\n",
4115
+ "\n",
4116
+ "If the output contains “1” → assign label 1\n",
4117
+ "\n",
4118
+ "Otherwise → assign label 0"
4119
+ ],
4120
+ "metadata": {
4121
+ "id": "ZxK1f5_75lX3"
4122
+ }
4123
+ },
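The parse-then-fallback logic described in the cell above could look roughly like this. It is a sketch under stated assumptions: the function name `parse_is_green` and the regular expression are illustrative, and the notebook's actual extraction code is not shown in this diff.

```python
import json
import re


def parse_is_green(raw: str) -> int:
    """Extract is_green from an LLM response, with a substring fallback."""
    # Look for the first {...} span, since models often wrap JSON in prose.
    match = re.search(r"\{.*?\}", raw, flags=re.DOTALL)
    if match:
        try:
            value = json.loads(match.group(0)).get("is_green")
            if str(value) in ("0", "1"):
                return int(value)
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to the fallback rule
    # Fallback rule from the cell text: any "1" in the output means label 1.
    return 1 if "1" in raw else 0


labels = [
    parse_is_green('{"is_green": 1}'),
    parse_is_green('Sure! {"is_green": 0}'),
    parse_is_green("The answer is 1"),
    parse_is_green("not green"),
]
```

Note that the substring fallback labels any output containing the digit 1 as green, for example "claim 10 is not green", so it may be worth counting how often the fallback fires before trusting the resulting labels.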
4124
  {
4125
  "cell_type": "code",
4126
  "source": [
 
4162
  }
4163
  ]
4164
  },
4165
+ {
4166
+ "cell_type": "markdown",
4167
+ "source": [
4168
+ "The final step merges the 100 newly generated gold labels with the full 50,000-patent dataset.\n",
4169
+ "\n",
4170
+ "Where gold labels exist:\n",
4171
+ "\n",
4172
+ "They override the original silver labels.\n",
4173
+ "\n",
4174
+ "This creates a partially corrected dataset where:\n",
4175
+ "\n",
4176
+ "Most labels remain silver (automatically generated)\n",
4177
+ "\n",
4178
+ "The 100 high-priority patents receive higher-quality LLM-assisted gold labels\n",
4179
+ "\n",
4180
+ "This hybrid labeling strategy improves dataset quality while keeping annotation costs low."
4181
+ ],
4182
+ "metadata": {
4183
+ "id": "RtG2Ypq-5w9_"
4184
+ }
4185
+ },
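The override step described in the cell above amounts to "use the gold label where one exists, otherwise keep the silver label". The sketch below shows that rule with plain dictionaries to stay dependency-free; the notebook presumably does this with a pandas merge. Keying both label sets by `doc_id` and the made-up ids and labels are assumptions for illustration.

```python
# Silver labels for the full dataset (three made-up rows stand in for 50,000).
silver = {"p1": 0, "p2": 1, "p3": 0}

# Gold labels for the reviewed subset (two made-up rows stand in for 100).
gold = {"p2": 0, "p3": 1}

# Gold overrides silver wherever a gold label exists.
merged = {doc_id: gold.get(doc_id, label) for doc_id, label in silver.items()}

# Track provenance so later analysis can separate the two label sources.
source = {doc_id: "gold" if doc_id in gold else "silver" for doc_id in silver}
```

Keeping a provenance field alongside the merged label makes it possible to evaluate or weight gold and silver rows differently later on.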
4186
  {
4187
  "cell_type": "code",
4188
  "source": [
 
4733
  "final_f1 = 0.8144848954298993 # from your eval output\n",
4734
  "\n",
4735
  "table = pd.DataFrame([\n",
4736
+ " {\"Model Version\": \"1. Baseline\", \"Training Data Source\": \"Frozen Embeddings (No Fine-tuning)\", \"F1 Score (Test Set)\": \"0.7000\"},\n",
4737
+ " {\"Model Version\": \"2. Assignment 2 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Simple Generic LLM)\", \"F1 Score (Test Set)\": \"0.8113\"},\n",
4738
+ " {\"Model Version\": \"3. Assignment 3 Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (Advanced Techniques)\", \"F1 Score (Test Set)\": \"0.8238\"},\n",
4739
  " {\"Model Version\": \"4. Final Model\", \"Training Data Source\": \"Fine-tuned on Silver + Gold (LoRA-adapted GenLM + Agentic Labeling + Targeted HITL)\", \"F1 Score (Test Set)\": round(final_f1, 4)},\n",
4740
  "])\n",
4741
  "\n",