# Green Patent Detection using PatentSBERTa with Active Learning and HITL
## Project Overview
This project builds a binary classifier that identifies green technology patents (climate-change mitigation, renewable energy, pollution reduction) from patent claim text using a Human-in-the-Loop (HITL) active learning pipeline.
## Pipeline

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Part A    │     │    Part B    │     │    Part C    │     │    Part D    │     │    Upload    │
│   Baseline   │────▶│ Uncertainty  │────▶│ LLM + Human  │────▶│  Fine-Tune   │────▶│    HF Hub    │
│   (Frozen)   │     │   Sampling   │     │ HITL Labels  │     │ PatentSBERTa │     │              │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
```
- Silver labels from CPC Y02* codes provide initial training signal
- Frozen embeddings + Logistic Regression give a strong baseline
- Uncertainty sampling identifies the 100 hardest examples
- An LLM (LLaMA 3.1) pre-labels those examples; a human expert reviews and corrects
- The model is fine-tuned once on the improved dataset
## Dataset
| Property | Value |
|---|---|
| Source | AI-Growth-Lab/patents_claims_1.5m_traim_test |
| Balanced subset | 50,000 patents (25k green + 25k non-green) |
| Train split | 40,000 rows (80%) |
| Eval split | 5,000 rows (10%) |
| Pool split | 5,000 rows (10%) |
| Silver labels | CPC Y02* codes (auto-assigned) |
| Gold labels | 100 human-verified via HITL pipeline |
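
The silver-labeling rule can be sketched as a simple CPC prefix check (a minimal illustration of the idea; the exact logic in `create_dataset.py` may differ):

```python
def silver_label(cpc_codes: list[str]) -> int:
    """Assign a silver label: 1 (green) if any CPC code falls under Y02*, else 0."""
    return int(any(code.startswith("Y02") for code in cpc_codes))

print(silver_label(["Y02E 10/50", "H01L 31/042"]))  # 1 (green)
print(silver_label(["G06F 3/01"]))                  # 0 (non-green)
```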
## Model
| Property | Value |
|---|---|
| Base model | AI-Growth-Lab/PatentSBERTa |
| Architecture | BERT + classification head (2 labels) |
| Max sequence length | 256 tokens |
| Fine-tuning epochs | 1 |
| Learning rate | 2e-5 |
| Batch size | 16 |
## Project Structure

```
green-patent-detector/
├── create_dataset.py         # Build balanced 50k dataset
├── baseline_model.py         # Frozen embeddings + LogisticRegression
├── uncertainty_sampling.py   # Select 100 most uncertain examples
├── hitl_pipeline.py          # LLM labeling + human review
├── finetune_model.py         # Fine-tune PatentSBERTa
├── upload_to_huggingface.py  # Upload model + dataset to HF Hub
├── requirements.txt
├── README.md
└── .env                      # API key (not committed to git)
```
## How to Run

### Step 0 – Install dependencies

```bash
pip install -r requirements.txt
```

### Step 1 – Create the dataset

```bash
python create_dataset.py
```

### Step 2 – Train baseline model (Part A)

```bash
python Part_A.py
```

### Step 3 – Uncertainty sampling (Part B)

```bash
python Part_B.py
```

### Step 4 – LLM + Human labeling (Part C)

```bash
echo "GROQ_API_KEY=your_key_here" > .env
python Part_C.py
# → Fill in is_green_human in hitl_green_100_gold.csv, then re-run:
python Part_C.py
```

### Step 5 – Fine-tune (Part D)

```bash
python Part_D.py
```

### Step 6 – Upload to Hugging Face

```bash
huggingface-cli login
python upload_to_huggingface.py
```
## Results

### Part A – Baseline (Frozen Embeddings + Logistic Regression)
Evaluated on eval_silver (5,000 examples):
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Non-Green (0) | 0.76 | 0.78 | 0.77 | 2,500 |
| Green (1) | 0.78 | 0.76 | 0.77 | 2,500 |
| Accuracy | | | 0.77 | 5,000 |
| Macro Avg | 0.77 | 0.77 | 0.77 | 5,000 |
| Weighted Avg | 0.77 | 0.77 | 0.77 | 5,000 |
Summary: Precision = 0.7787, Recall = 0.7588, F1 = 0.7686
### Part D – Fine-Tuned PatentSBERTa
Evaluated on eval_silver (5,000 examples):
| Metric | Value |
|---|---|
| Loss | 0.4154 |
| Accuracy | 0.8082 |
| Precision | 0.8086 |
| Recall | 0.8076 |
| F1 | 0.8081 |
Evaluated on gold_100 (100 examples):
| Metric | Value |
|---|---|
| Loss | 0.6888 |
| Accuracy | 0.6300 |
| Precision | 0.0513 |
| Recall | 1.0000 |
| F1 | 0.0976 |
Classification report (gold_100):
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Non-Green | 1.00 | 0.62 | 0.77 | 98 |
| Green | 0.05 | 1.00 | 0.10 | 2 |
| Accuracy | | | 0.63 | 100 |
| Macro Avg | 0.53 | 0.81 | 0.43 | 100 |
| Weighted Avg | 0.98 | 0.63 | 0.75 | 100 |
### Baseline vs Fine-Tuned Comparison (eval_silver)
| Metric | Baseline (LogReg) | Fine-Tuned | Improvement |
|---|---|---|---|
| Precision | 0.7787 | 0.8086 | +0.0299 |
| Recall | 0.7588 | 0.8076 | +0.0488 |
| F1 | 0.7686 | 0.8081 | +0.0395 |
Key findings:
- Fine-tuning improved F1 by +3.95 percentage points on eval_silver
- The model achieves 80.8% accuracy on the held-out evaluation set
- On gold_100, the extreme class imbalance (98 non-green vs 2 green) makes the metrics unreliable: the model catches both green patents (recall = 1.0) but heavily over-predicts green (precision = 0.05)
### Note on gold_100 Results
The gold_100 evaluation set contains only 2 green and 98 non-green examples out of 100. This heavy imbalance occurred because most of the uncertain patents near the decision boundary turned out to be non-green after human review. With only 2 positive examples, the per-class metrics for "Green" are statistically unreliable. The eval_silver results (5,000 balanced examples) are a more trustworthy measure of model quality.
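
The reported gold_100 metrics are consistent with a confusion matrix of TP = 2, FP = 37, FN = 0, TN = 61 (these counts are inferred here from the metrics, not taken directly from the project's output):

```python
# Confusion-matrix counts inferred from the reported gold_100 metrics
tp, fp, fn, tn = 2, 37, 0, 61

precision = tp / (tp + fp)                                 # 2/39   ≈ 0.0513
recall    = tp / (tp + fn)                                 # 2/2    = 1.0
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.0976
accuracy  = (tp + tn) / (tp + fp + fn + tn)                # 63/100 = 0.63

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 2))
```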
## Part C – HITL Override Report

I examined all 100 examples that were pre-labeled by the LLM (LLaMA 3.1 8B). The LLM saw only the claim text, with no CPC codes or metadata.
| Metric | Value |
|---|---|
| Total examples labeled | 100 |
| Human agrees with LLM | 97 |
| Human overrides LLM | 3 |
### Override Examples

**Example 1 (doc_id: 46226)**
- LLM suggested: 0 (non-green, high confidence)
- Human label: 1 (green)
- Human notes: "Describes a recycling process, which contributes to waste reduction and sustainability."

**Example 2 (doc_id: 46852)**
- LLM suggested: 1 (green, high confidence)
- Human label: 0 (non-green)
- Human notes: "Describes general technology lacking explicit environmental, renewable, or sustainability applications."

**Example 3 (doc_id: 49570)**
- LLM suggested: 1 (green, high confidence)
- Human label: 0 (non-green)
- Human notes: "Describes general technology lacking explicit environmental, renewable, or sustainability applications."
## Technical Notes

- Silver labels: CPC Y02* codes (reliable but not perfect)
- Gold labels: human-verified via the HITL pipeline (100 examples)
- Uncertainty formula: `u = 1 - 2 * |p - 0.5|`, where 1.0 = maximum uncertainty
- Random seed: 42 used everywhere for reproducibility
- LLM: LLaMA 3.1 8B via Groq API (temperature=0, deterministic)
- Labeling rule: LLM and human used only the claim text, no CPC codes
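
The uncertainty-sampling step can be sketched as follows (a minimal version of the idea; the actual selection code in `Part_B.py` may differ in details):

```python
def uncertainty(p: float) -> float:
    """u = 1 - 2*|p - 0.5|: equals 1.0 at p = 0.5 (maximum uncertainty), 0.0 at p = 0 or 1."""
    return 1.0 - 2.0 * abs(p - 0.5)

def select_most_uncertain(probs: list[float], k: int = 100) -> list[int]:
    """Return indices of the k examples whose predicted P(green) is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: uncertainty(probs[i]), reverse=True)
    return ranked[:k]

probs = [0.02, 0.51, 0.97, 0.48, 0.30]
print(select_most_uncertain(probs, k=2))  # [1, 3]: 0.51 and 0.48 are closest to 0.5
```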
## Hugging Face Links
- Model: mehta7408/green-patent-sberta
- Dataset: mehta7408/green-patent-dataset-50k
## How to Use the Fine-Tuned Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
model_name = "your-username/green-patent-sberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Classify a patent claim
claim = "A solar panel system comprising photovoltaic cells arranged to maximize energy capture..."
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1).item()
print("Green" if prediction == 1 else "Non-Green")
```
## References
- AI-Growth-Lab β PatentSBERTa model
- AI-Growth-Lab β Patent Claims 1.5M Dataset