
Green Patent Detection using PatentSBERTa with Active Learning and HITL

Project Overview

This project builds a binary classifier that identifies green technology patents (climate-change mitigation, renewable energy, pollution reduction) from patent claim text using a Human-in-the-Loop (HITL) active learning pipeline.

Pipeline

┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  Part A       │     │  Part B       │     │  Part C       │     │  Part D       │     │  Upload       │
│  Baseline     │────▶│  Uncertainty  │────▶│  LLM + Human  │────▶│  Fine-Tune    │────▶│  HF Hub       │
│  (Frozen)     │     │  Sampling     │     │  HITL Labels  │     │  PatentSBERTa │     │               │
└───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘
  1. Silver labels from CPC Y02* codes provide the initial training signal (sketched just below)
  2. Frozen embeddings + Logistic Regression give a strong baseline
  3. Uncertainty sampling identifies the 100 hardest examples
  4. An LLM (LLaMA 3.1) pre-labels those examples; a human expert reviews and corrects
  5. The model is fine-tuned once on the improved dataset
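
For concreteness, a minimal sketch of how step 1's silver labeling could look. The per-patent list of CPC code strings is an assumed input shape, not the dataset's actual schema:

# Hypothetical helper: a patent is green (1) if any CPC code falls under Y02*.
def silver_label(cpc_codes: list[str]) -> int:
    return int(any(code.startswith("Y02") for code in cpc_codes))

assert silver_label(["Y02E 10/50", "H01L 31/04"]) == 1  # solar PV code -> green
assert silver_label(["G06F 3/01"]) == 0                 # generic computing -> non-green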

Dataset

Property          Value
Source            AI-Growth-Lab/patents_claims_1.5m_traim_test
Balanced subset   50,000 patents (25k green + 25k non-green)
Train split       40,000 rows (80%)
Eval split        5,000 rows (10%)
Pool split        5,000 rows (10%)
Silver labels     CPC Y02* codes (auto-assigned)
Gold labels       100 human-verified via HITL pipeline
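
The 80/10/10 split can be reproduced with a two-stage stratified split at seed 42. Below is a sketch on a toy DataFrame; the real column names in create_dataset.py are assumptions here:

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the balanced subset; "text"/"label" names are assumed.
df = pd.DataFrame({"text": [f"claim {i}" for i in range(1000)],
                   "label": [i % 2 for i in range(1000)]})

# 80% train, then the remaining 20% split evenly into eval and pool.
train_df, rest_df = train_test_split(df, test_size=0.20, stratify=df["label"], random_state=42)
eval_df, pool_df = train_test_split(rest_df, test_size=0.50, stratify=rest_df["label"], random_state=42)
print(len(train_df), len(eval_df), len(pool_df))  # 800 100 100 at this toy scale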

Model

Property             Value
Base model           AI-Growth-Lab/PatentSBERTa
Architecture         BERT + classification head (2 labels)
Max sequence length  256 tokens
Fine-tuning epochs   1
Learning rate        2e-5
Batch size           16
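
In Hugging Face Trainer terms, this table corresponds roughly to the configuration below. This is a sketch: the output path, the "text" field name, and the dataset wiring are placeholders; only the hyperparameters come from the table:

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("AI-Growth-Lab/PatentSBERTa")
model = AutoModelForSequenceClassification.from_pretrained(
    "AI-Growth-Lab/PatentSBERTa", num_labels=2)  # fresh 2-label head

def tokenize(batch):  # claims truncated to the 256-token budget
    return tokenizer(batch["text"], truncation=True, max_length=256)

args = TrainingArguments(
    output_dir="./green-patent-sberta",   # placeholder output path
    num_train_epochs=1,                   # single fine-tuning pass
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    seed=42,
)
# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()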

Project Structure

green-patent-detector/
├── create_dataset.py          # Build balanced 50k dataset
├── baseline_model.py          # Frozen embeddings + LogisticRegression
├── uncertainty_sampling.py    # Select 100 most uncertain examples
├── hitl_pipeline.py           # LLM labeling + human review
├── finetune_model.py          # Fine-tune PatentSBERTa
├── upload_to_huggingface.py   # Upload model + dataset to HF Hub
├── requirements.txt
├── README.md
└── .env                       # API key (not committed to git)

How to Run

Step 0 – Install dependencies

pip install -r requirements.txt

Step 1 – Create the dataset

python create_dataset.py

Step 2 – Train baseline model (Part A)

python baseline_model.py
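
The baseline presumably amounts to something like the sketch below: encode claims with frozen PatentSBERTa embeddings, then fit scikit-learn's LogisticRegression. The tiny inline training set is illustrative only; the real script uses the 40k train split:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("AI-Growth-Lab/PatentSBERTa")  # weights stay frozen

texts = ["A photovoltaic module comprising...", "A database indexing method..."]
labels = [1, 0]

X = encoder.encode(texts)
clf = LogisticRegression(max_iter=1000, random_state=42).fit(X, labels)
print(clf.predict_proba(encoder.encode(["A wind turbine blade..."])))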

Step 3 – Uncertainty sampling (Part B)

python uncertainty_sampling.py

Step 4 – LLM + Human labeling (Part C)

echo "GROQ_API_KEY=your_key_here" > .env
python hitl_pipeline.py
# → Fill in is_green_human in hitl_green_100_gold.csv
python hitl_pipeline.py
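
A minimal sketch of the LLM call inside hitl_pipeline.py using the Groq Python SDK. The model id "llama-3.1-8b-instant" and the prompt wording are assumptions; the README only fixes LLaMA 3.1 8B at temperature 0:

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])  # loaded from .env in practice

def llm_pre_label(claim: str) -> str:
    # Ask for a bare 0/1 judgment from the claim text alone (no CPC codes).
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumed Groq model id
        temperature=0,                 # deterministic, per Technical Notes
        messages=[{"role": "user",
                   "content": "Answer only 0 or 1. Is this patent claim a green "
                              "technology (climate mitigation, renewables, "
                              "pollution reduction)?\n\n" + claim}],
    )
    return response.choices[0].message.content.strip()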

Step 5 – Fine-tune (Part D)

python finetune_model.py

Step 6 – Upload to Hugging Face

huggingface-cli login
python upload_to_huggingface.py
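
upload_to_huggingface.py likely boils down to push_to_hub calls such as the sketch below; the local path and repo id are placeholders matching the usage example further down:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo_id = "your-username/green-patent-sberta"  # placeholder repo id
model = AutoModelForSequenceClassification.from_pretrained("./green-patent-sberta")
tokenizer = AutoTokenizer.from_pretrained("./green-patent-sberta")

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)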

Results

Part A – Baseline (Frozen Embeddings + Logistic Regression)

Evaluated on eval_silver (5,000 examples):

Class          Precision  Recall  F1-Score  Support
Non-Green (0)  0.76       0.78    0.77      2,500
Green (1)      0.78       0.76    0.77      2,500
Accuracy                          0.77      5,000
Macro Avg      0.77       0.77    0.77      5,000
Weighted Avg   0.77       0.77    0.77      5,000

Summary: Precision = 0.7787, Recall = 0.7588, F1 = 0.7686

Part D – Fine-Tuned PatentSBERTa

Evaluated on eval_silver (5,000 examples):

Metric     Value
Loss       0.4154
Accuracy   0.8082
Precision  0.8086
Recall     0.8076
F1         0.8081

Evaluated on gold_100 (100 examples):

Metric     Value
Loss       0.6888
Accuracy   0.6300
Precision  0.0513
Recall     1.0000
F1         0.0976

Classification report (gold_100):

Class         Precision  Recall  F1-Score  Support
Non-Green     1.00       0.62    0.77      98
Green         0.05       1.00    0.10      2
Accuracy                         0.63      100
Macro Avg     0.53       0.81    0.43      100
Weighted Avg  0.98       0.63    0.75      100

Baseline vs Fine-Tuned Comparison (eval_silver)

Metric     Baseline (LogReg)  Fine-Tuned  Improvement
Precision  0.7787             0.8086      +0.0299
Recall     0.7588             0.8076      +0.0488
F1         0.7686             0.8081      +0.0395

Key findings:

  • Fine-tuning improved F1 by +3.95 percentage points on eval_silver
  • The model achieves 80.8% accuracy on the held-out evaluation set
  • On gold_100, the extreme class imbalance (98 non-green vs 2 green) makes metrics unreliable β€” the model catches both green patents (recall=1.0) but also over-predicts green (precision=0.05)

Note on gold_100 Results

The gold_100 evaluation set contains only 2 green and 98 non-green examples out of 100. This heavy imbalance occurred because most of the uncertain patents near the decision boundary turned out to be non-green after human review. With only 2 positive examples, the per-class metrics for "Green" are statistically unreliable. The eval_silver results (5,000 balanced examples) are a more trustworthy measure of model quality.

Part C – HITL Override Report

I examined all 100 examples that were pre-labeled by the LLM (LLaMA 3.1 8B). The LLM saw only the claim text, with no CPC codes or metadata.

Metric                  Value
Total examples labeled  100
Human agrees with LLM   97
Human overrides LLM     3

Override Examples

Example 1 (doc_id: 46226)

  • LLM suggested: 0 (non-green, high confidence)
  • Human label: 1 (green)
  • Human notes: "Describes a recycling process, which contributes to waste reduction and sustainability."

Example 2 (doc_id: 46852)

  • LLM suggested: 1 (green, high confidence)
  • Human label: 0 (non-green)
  • Human notes: "Describes general technology lacking explicit environmental, renewable, or sustainability applications."

Example 3 (doc_id: 49570)

  • LLM suggested: 1 (green, high confidence)
  • Human label: 0 (non-green)
  • Human notes: "Describes general technology lacking explicit environmental, renewable, or sustainability applications."

Technical Notes

  • Silver labels: CPC Y02* codes β€” reliable but not perfect
  • Gold labels: Human-verified via HITL pipeline (100 examples)
  • Uncertainty formula: u = 1 - 2 * |p - 0.5| where 1.0 = maximum uncertainty
  • Random seed: 42 used everywhere for reproducibility
  • LLM: LLaMA 3.1 8B via Groq API (temperature=0, deterministic)
  • Labeling rule: LLM and human used only claim text β€” no CPC codes

Hugging Face Links

How to Use the Fine-Tuned Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
model_name = "your-username/green-patent-sberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Classify a patent claim
claim = "A solar panel system comprising photovoltaic cells arranged to maximize energy capture..."
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1).item()

print("Green" if prediction == 1 else "Non-Green")

