---
language: en
license: apache-2.0
tags:
  - patent
  - green-technology
  - text-classification
  - patentsbert
  - active-learning
  - hitl
datasets:
  - Peter512/patents-50k-green
base_model: AI-Growth-Lab/PatentSBERTa
---

# PatentSBERTa Green Patent Classifier — Assignment 2

A binary classifier for green patent detection (Y02 CPC codes), fine-tuned from AI-Growth-Lab/PatentSBERTa using active learning plus GPT-4o-mini human-in-the-loop (HITL) gold labels.

## Training

- **Base model:** AI-Growth-Lab/PatentSBERTa (MPNet-based)
- **Task:** binary classification, `is_green` (Y02 CPC codes)
- **Training data:** 35,000 silver labels (CPC Y02*) + 100 gold labels (GPT-4o-mini HITL)
- **HITL process:** uncertainty sampling selected the 100 most uncertain pool claims; GPT-4o-mini labeled each with a confidence score and rationale; a simulated human review agreed with 90% of the labels and overrode 10 low-confidence cases
- **Fine-tuning:** 1 epoch, lr=2e-5, max_length=256, batch_size=16, fp16
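The uncertainty-sampling step above can be sketched as ranking the pool by predictive entropy and taking the top-k. This is a minimal illustration on synthetic class probabilities; the function name is illustrative, not from the training code:

```python
import numpy as np

def select_most_uncertain(probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k pool examples with the highest predictive entropy."""
    eps = 1e-12  # avoid log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(entropy)[::-1][:k]  # most uncertain first

# Toy pool: three fairly confident predictions and one near 50/50.
pool_probs = np.array([
    [0.95, 0.05],
    [0.10, 0.90],
    [0.52, 0.48],  # highest entropy -> selected first
    [0.80, 0.20],
])
print(select_most_uncertain(pool_probs, k=2))  # -> [2 3]
```

In the real loop these probabilities would come from the current model's softmax outputs over the 10k-claim unlabeled pool.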

## Evaluation (eval_silver, 5,000 claims)

| Metric    | Value  |
|-----------|--------|
| F1        | 0.8099 |
| Precision | 0.8207 |
| Recall    | 0.7994 |
| Accuracy  | 0.8126 |

Baseline (frozen PatentSBERTa + Logistic Regression): F1=0.7696
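The baseline pipeline (frozen encoder embeddings fed to a logistic regression) can be sketched as follows. Random vectors with a weak class signal stand in for the real frozen PatentSBERTa embeddings, so the resulting score is not the reported 0.7696:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Stand-in features: in the real baseline these are frozen 768-dim
# PatentSBERTa sentence embeddings, one vector per claim.
n, dim = 1000, 768
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, dim)) + y[:, None] * 0.1  # weak synthetic class signal

# Train the linear head on frozen features; the encoder is never updated.
clf = LogisticRegression(max_iter=1000)
clf.fit(X[:800], y[:800])
pred = clf.predict(X[800:])
f1 = f1_score(y[800:], pred)
print(f"baseline F1 on held-out split: {f1:.3f}")
```

Because only the linear head is trained, this baseline is cheap to run and serves as the floor that fine-tuning improves on.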

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Tokenizer comes from the base model; weights from the fine-tuned checkpoint.
tokenizer = AutoTokenizer.from_pretrained("AI-Growth-Lab/PatentSBERTa", use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained("Peter512/patentsbert-green-a2")
model.eval()

text = "A photovoltaic cell comprising a perovskite absorber layer..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
label = logits.argmax().item()  # 0 = not_green, 1 = green
```
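To report a confidence alongside the hard label, apply a softmax to the logits. A minimal sketch on toy logits with the same `(1, 2)` shape the model returns:

```python
import torch

# Toy logits standing in for model(**inputs).logits, shape (1, 2).
logits = torch.tensor([[0.3, 1.2]])
probs = torch.softmax(logits, dim=-1)   # normalize to class probabilities
label = int(probs.argmax(dim=-1))       # 0 = not_green, 1 = green
confidence = probs[0, label].item()
print(label, round(confidence, 3))
```

Note that these softmax scores are not calibrated probabilities; treat them as a relative confidence signal, as the HITL step above does.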

## Dataset

- **Source:** AI-Growth-Lab/patents_claims_1.5m_traim_test
- **Silver labels:** CPC Y02* codes (`is_green_silver`)
- **Splits:** train_silver (35k), eval_silver (5k), pool_unlabeled (10k)
- **Balance:** 50/50 green / not-green
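The Y02* silver-labeling rule amounts to a prefix check over a claim's CPC codes. A minimal sketch (the function name is illustrative, not taken from the dataset-building code):

```python
def is_green_silver(cpc_codes: list[str]) -> bool:
    """Silver label: a claim counts as 'green' if any CPC code falls under Y02."""
    return any(code.startswith("Y02") for code in cpc_codes)

print(is_green_silver(["Y02E 10/50", "H01L 31/042"]))  # -> True
print(is_green_silver(["G06F 16/00"]))                 # -> False
```

Silver labels derived this way are noisy where CPC coverage is incomplete, which is why the HITL gold labels target the most uncertain pool claims.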