# Green Patent Detection using PatentSBERTa with Active Learning and HITL
## Project Overview
This project builds a binary classifier that identifies green technology patents (climate-change mitigation, renewable energy, pollution reduction) from patent claim text using a Human-in-the-Loop (HITL) active learning pipeline.
## Pipeline

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Part A    │     │    Part B    │     │    Part C    │     │    Part D    │     │    Upload    │
│   Baseline   │────▶│ Uncertainty  │────▶│ LLM + Human  │────▶│  Fine-Tune   │────▶│    HF Hub    │
│   (Frozen)   │     │   Sampling   │     │ HITL Labels  │     │ PatentSBERTa │     │              │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
```
- Silver labels from CPC Y02* codes provide initial training signal
- Frozen embeddings + Logistic Regression give a strong baseline
- Uncertainty sampling identifies the 100 hardest examples
- An LLM (LLaMA 3.1) pre-labels those examples; a human expert reviews and corrects
- The model is fine-tuned once on the improved dataset
## Dataset
| Property | Value |
|---|---|
| Source | AI-Growth-Lab/patents_claims_1.5m_traim_test |
| Balanced subset | 50,000 patents (25k green + 25k non-green) |
| Train split | 40,000 rows (80%) |
| Eval split | 5,000 rows (10%) |
| Pool split | 5,000 rows (10%) |
| Silver labels | CPC Y02* codes (auto-assigned) |
| Gold labels | 100 human-verified via HITL pipeline |
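
The silver-labeling rule can be sketched as a simple CPC prefix check (a minimal illustration of the idea; the exact logic in `create_dataset.py` may differ):

```python
def silver_label(cpc_codes: list[str]) -> int:
    """Assign a silver label: 1 (green) if any CPC code falls under Y02*, else 0."""
    return int(any(code.startswith("Y02") for code in cpc_codes))

print(silver_label(["Y02E 10/50", "H01L 31/042"]))  # 1 (green)
print(silver_label(["G06F 3/01"]))                  # 0 (non-green)
```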
## Model
| Property | Value |
|---|---|
| Base model | AI-Growth-Lab/PatentSBERTa |
| Architecture | BERT + classification head (2 labels) |
| Max sequence length | 256 tokens |
| Fine-tuning epochs | 1 |
| Learning rate | 2e-5 |
| Batch size | 16 |
## Project Structure

```
green-patent-detector/
├── create_dataset.py         # Build balanced 50k dataset
├── baseline_model.py         # Frozen embeddings + LogisticRegression
├── uncertainty_sampling.py   # Select 100 most uncertain examples
├── hitl_pipeline.py          # LLM labeling + human review
├── finetune_model.py         # Fine-tune PatentSBERTa
├── upload_to_huggingface.py  # Upload model + dataset to HF Hub
├── requirements.txt
├── README.md
└── .env                      # API key (not committed to git)
```
## How to Run

### Step 0 – Install dependencies

```bash
pip install -r requirements.txt
```

### Step 1 – Create the dataset

```bash
python create_dataset.py
```

### Step 2 – Train baseline model (Part A)

```bash
python Part_A.py
```

### Step 3 – Uncertainty sampling (Part B)

```bash
python Part_B.py
```

### Step 4 – LLM + Human labeling (Part C)

```bash
echo "GROQ_API_KEY=your_key_here" > .env
python Part_C.py
# → Fill in is_green_human in hitl_green_100_gold.csv, then re-run:
python Part_C.py
```

### Step 5 – Fine-tune (Part D)

```bash
python Part_D.py
```

### Step 6 – Upload to Hugging Face

```bash
huggingface-cli login
python upload_to_huggingface.py
```
## Results

### Part A – Baseline (Frozen Embeddings + Logistic Regression)
Evaluated on eval_silver (5,000 examples):
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Non-Green (0) | 0.76 | 0.78 | 0.77 | 2,500 |
| Green (1) | 0.78 | 0.76 | 0.77 | 2,500 |
| Accuracy | | | 0.77 | 5,000 |
| Macro Avg | 0.77 | 0.77 | 0.77 | 5,000 |
| Weighted Avg | 0.77 | 0.77 | 0.77 | 5,000 |
Summary: Precision = 0.7787, Recall = 0.7588, F1 = 0.7686
### Part D – Fine-Tuned PatentSBERTa
Evaluated on eval_silver (5,000 examples):
| Metric | Value |
|---|---|
| Loss | 0.4154 |
| Accuracy | 0.8082 |
| Precision | 0.8086 |
| Recall | 0.8076 |
| F1 | 0.8081 |
Evaluated on gold_100 (100 examples):
| Metric | Value |
|---|---|
| Loss | 0.6888 |
| Accuracy | 0.6300 |
| Precision | 0.0513 |
| Recall | 1.0000 |
| F1 | 0.0976 |
Classification report (gold_100):
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Non-Green | 1.00 | 0.62 | 0.77 | 98 |
| Green | 0.05 | 1.00 | 0.10 | 2 |
| Accuracy | | | 0.63 | 100 |
| Macro Avg | 0.53 | 0.81 | 0.43 | 100 |
| Weighted Avg | 0.98 | 0.63 | 0.75 | 100 |
### Baseline vs Fine-Tuned Comparison (eval_silver)
| Metric | Baseline (LogReg) | Fine-Tuned | Improvement |
|---|---|---|---|
| Precision | 0.7787 | 0.8086 | +0.0299 |
| Recall | 0.7588 | 0.8076 | +0.0488 |
| F1 | 0.7686 | 0.8081 | +0.0395 |
Key findings:
- Fine-tuning improved F1 by +3.95 percentage points on eval_silver
- The model achieves 80.8% accuracy on the held-out evaluation set
- On gold_100, the extreme class imbalance (98 non-green vs 2 green) makes the metrics unreliable: the model catches both green patents (recall = 1.0) but heavily over-predicts green (precision = 0.05)
### Note on gold_100 Results
The gold_100 evaluation set contains only 2 green and 98 non-green examples out of 100. This heavy imbalance occurred because most of the uncertain patents near the decision boundary turned out to be non-green after human review. With only 2 positive examples, the per-class metrics for "Green" are statistically unreliable. The eval_silver results (5,000 balanced examples) are a more trustworthy measure of model quality.
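
The reported gold_100 metrics are consistent with a confusion matrix of TP = 2, FP = 37, FN = 0, TN = 61 (these counts are inferred here from the metrics, not taken directly from the project's output):

```python
# Confusion-matrix counts inferred from the reported gold_100 metrics
tp, fp, fn, tn = 2, 37, 0, 61

precision = tp / (tp + fp)                                 # 2/39   ≈ 0.0513
recall    = tp / (tp + fn)                                 # 2/2    = 1.0
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.0976
accuracy  = (tp + tn) / (tp + fp + fn + tn)                # 63/100 = 0.63

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 2))
```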
## Part C – HITL Override Report

I examined all 100 examples that were pre-labeled by the LLM (LLaMA 3.1 8B). The LLM saw only the claim text, with no CPC codes or metadata.
| Metric | Value |
|---|---|
| Total examples labeled | 100 |
| Human agrees with LLM | 97 |
| Human overrides LLM | 3 |
### Override Examples

**Example 1 (doc_id: 46226)**
- LLM suggested: 0 (non-green, high confidence)
- Human label: 1 (green)
- Human notes: "Describes a recycling process, which contributes to waste reduction and sustainability."

**Example 2 (doc_id: 46852)**
- LLM suggested: 1 (green, high confidence)
- Human label: 0 (non-green)
- Human notes: "Describes general technology lacking explicit environmental, renewable, or sustainability applications."

**Example 3 (doc_id: 49570)**
- LLM suggested: 1 (green, high confidence)
- Human label: 0 (non-green)
- Human notes: "Describes general technology lacking explicit environmental, renewable, or sustainability applications."
## Technical Notes

- Silver labels: CPC Y02* codes (reliable but not perfect)
- Gold labels: human-verified via the HITL pipeline (100 examples)
- Uncertainty formula: `u = 1 - 2 * |p - 0.5|`, where 1.0 = maximum uncertainty
- Random seed: 42 used everywhere for reproducibility
- LLM: LLaMA 3.1 8B via Groq API (temperature=0, deterministic)
- Labeling rule: LLM and human used only the claim text, no CPC codes
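
The uncertainty-sampling step can be sketched as follows (a minimal version of the idea; the actual selection code in `Part_B.py` may differ in details):

```python
def uncertainty(p: float) -> float:
    """u = 1 - 2*|p - 0.5|: equals 1.0 at p = 0.5 (maximum uncertainty), 0.0 at p = 0 or 1."""
    return 1.0 - 2.0 * abs(p - 0.5)

def select_most_uncertain(probs: list[float], k: int = 100) -> list[int]:
    """Return indices of the k examples whose predicted P(green) is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: uncertainty(probs[i]), reverse=True)
    return ranked[:k]

probs = [0.02, 0.51, 0.97, 0.48, 0.30]
print(select_most_uncertain(probs, k=2))  # [1, 3]: 0.51 and 0.48 are closest to 0.5
```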
## Hugging Face Links
- Model: mehta7408/green-patent-sberta
- Dataset: mehta7408/green-patent-dataset-50k
## How to Use the Fine-Tuned Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
model_name = "your-username/green-patent-sberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Classify a patent claim
claim = "A solar panel system comprising photovoltaic cells arranged to maximize energy capture..."
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1).item()
print("Green" if prediction == 1 else "Non-Green")
```
## References
- AI-Growth-Lab β PatentSBERTa model
- AI-Growth-Lab β Patent Claims 1.5M Dataset