PatentSBERTa Green Patent Classifier — Assignment 2

Binary classifier for green patent detection (Y02 CPC codes), fine-tuned from AI-Growth-Lab/PatentSBERTa with active learning and GPT-4o-mini human-in-the-loop (HITL) gold labels.

Training

  • Base model: AI-Growth-Lab/PatentSBERTa (MPNet-based)
  • Task: Binary classification — is_green (Y02 CPC codes)
  • Training data: 35,000 silver labels (CPC Y02*) + 100 gold labels (GPT-4o-mini HITL)
  • HITL process: Uncertainty sampling selected the 100 most uncertain pool claims; GPT-4o-mini labeled each with a confidence score and rationale; a simulated human review agreed with 90% of the labels (10 overrides on low-confidence cases)
  • Fine-tuning: 1 epoch, lr=2e-5, max_length=256, batch_size=16, fp16
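The uncertainty-sampling step above can be sketched as follows. This is a minimal illustration, not the assignment's exact code; the function name and pool probabilities are assumptions:

```python
import numpy as np

def select_most_uncertain(green_probs: np.ndarray, k: int) -> np.ndarray:
    """Pick the k pool claims whose predicted green-probability is closest
    to 0.5, i.e. where the silver-trained model is least confident."""
    uncertainty = 1.0 - 2.0 * np.abs(green_probs - 0.5)  # 1.0 at p=0.5, 0.0 at p=0 or 1
    return np.argsort(-uncertainty)[:k]

# Toy pool probabilities; the real pool has 10k claims and k=100
pool_probs = np.array([0.95, 0.52, 0.10, 0.49, 0.80])
print(select_most_uncertain(pool_probs, 2))  # indices of the two claims nearest 0.5
```

The selected claims are then sent to the labeler (here, GPT-4o-mini) and the gold labels folded back into training.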

Evaluation (eval_silver, 5,000 claims)

Metric      Value
F1          0.8099
Precision   0.8207
Recall      0.7994
Accuracy    0.8126

Baseline (frozen PatentSBERTa + Logistic Regression): F1=0.7696
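The baseline can be reproduced in outline: freeze PatentSBERTa, encode the claims, and fit a logistic regression on the embeddings. The sketch below substitutes synthetic 32-dimensional vectors for the real 768-dimensional MPNet embeddings (which would come from something like `SentenceTransformer("AI-Growth-Lab/PatentSBERTa").encode(claims)`); all data here is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Stand-ins for frozen PatentSBERTa embeddings (really 768-dim MPNet vectors)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 32))
y_train = (X_train[:, 0] > 0).astype(int)  # toy linearly separable labels
X_eval = rng.normal(size=(50, 32))
y_eval = (X_eval[:, 0] > 0).astype(int)

# Frozen-encoder baseline: only the linear classifier is trained
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(round(f1_score(y_eval, clf.predict(X_eval)), 2))
```

Fine-tuning the full encoder improves F1 over this frozen-embedding baseline by roughly 4 points on eval_silver.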

Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Tokenizer comes from the base model; the fine-tuned checkpoint carries the classifier head
tokenizer = AutoTokenizer.from_pretrained("AI-Growth-Lab/PatentSBERTa", use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained("Peter512/patentsbert-green-a2")
model.eval()

text = "A photovoltaic cell comprising a perovskite absorber layer..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
label = logits.argmax().item()  # 0 = not_green, 1 = green
```
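For downstream filtering you may want a probability rather than a hard label; the logits can be softmaxed and thresholded. A minimal sketch with placeholder logit values (not actual model output):

```python
import torch
import torch.nn.functional as F

# Placeholder logits standing in for model(**inputs).logits
logits = torch.tensor([[0.3, 1.9]])
green_prob = F.softmax(logits, dim=-1)[0, 1].item()
print(green_prob > 0.5)  # raise the threshold for higher precision
```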

Dataset

  • Source: AI-Growth-Lab/patents_claims_1.5m_traim_test
  • Silver labels: CPC Y02* codes (is_green_silver)
  • Splits: train_silver (35k), eval_silver (5k), pool_unlabeled (10k)
  • Balance: 50/50 green/not-green
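Silver labeling reduces to a CPC-prefix check on each claim's codes; a minimal sketch (the helper name is an assumption):

```python
def is_green_silver(cpc_codes: list[str]) -> int:
    """1 if any CPC code falls under Y02 (climate change mitigation), else 0."""
    return int(any(code.startswith("Y02") for code in cpc_codes))

print(is_green_silver(["Y02E 10/50", "H01L 31/00"]))  # 1: has a Y02 code
print(is_green_silver(["H04W 72/04"]))                # 0: no Y02 code
```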