PatentSBERTa — Green Patent Classifier

A fine-tuned AI-Growth-Lab/PatentSBERTa model for binary classification of patent claims as green technology (1) or not green (0).

Developed as part of the Applied Deep Learning (AAU, Spring 2025) exam assignment on active learning, human-in-the-loop labelling, and multi-agent systems for patent classification.

Model Details

| Property | Value |
| --- | --- |
| Architecture | MPNetForSequenceClassification (12 layers, 768 hidden) |
| Parameters | 109.5 M (all trainable) |
| Base model | AI-Growth-Lab/PatentSBERTa |
| Max sequence length | 512 tokens |
| Labels | 0 — not green, 1 — green |
| Framework | Transformers 5.2.0, PyTorch |

Training

Pipeline overview

  1. Part A–B: Frozen PatentSBERTa baseline + uncertainty-based active-learning pool selection
  2. Part C: QLoRA-tuned Llama-3.1-8B powering a LangGraph Multi-Agent System (Advocate → Skeptic → Judge → Exception) to generate silver labels on the 15k most uncertain patents
  3. Part D: Human-in-the-loop review of 100 critical samples → gold labels, then final full-parameter fine-tuning of PatentSBERTa
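The uncertainty-based pool selection in Part A–B can be sketched as follows. This is a minimal illustration only: the actual selection criterion used in the pipeline is not specified here, so predictive entropy over the frozen baseline's class probabilities is an assumption.

```python
import math

def predictive_entropy(probs):
    """Entropy of a class-probability distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_most_uncertain(pool_probs, k):
    """Return indices of the k pool items with the highest predictive entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: predictive_entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Toy pool of (not-green, green) probabilities from a frozen baseline:
pool = [(0.99, 0.01), (0.55, 0.45), (0.80, 0.20), (0.50, 0.50)]
print(select_most_uncertain(pool, 2))  # the two near-0.5 claims: [3, 1]
```

In the real pipeline this ranking was applied over the full unlabelled pool to pick the 15k most uncertain patents for the multi-agent silver-labelling stage.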

Training data

| Split | Rows | Source |
| --- | --- | --- |
| train_silver | 25,000 | Silver labels from Parts A–C |
| gold_labels | 100 (× 25 upsampled = 2,500) | HITL-verified labels |
| Total training | 27,500 | Combined |
| eval_silver | 10,000 | Held-out balanced evaluation set |
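The 25× gold upsampling amounts to repeating each gold row 25 times before mixing with the silver set (a sketch; the field names are illustrative, not the actual dataset schema):

```python
# Illustrative rows standing in for the real silver/gold datasets.
silver = [{"text": f"silver claim {i}", "label": i % 2} for i in range(25_000)]
gold = [{"text": f"gold claim {i}", "label": i % 2} for i in range(100)]

UPSAMPLE = 25
train = silver + gold * UPSAMPLE  # each gold row repeated 25 times

print(len(train))  # 25,000 + 100 * 25 = 27,500
```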

Hyperparameters

| Parameter | Value |
| --- | --- |
| Learning rate | 2e-5 |
| Epochs | 5 |
| Effective batch size | 128 (4 GPUs × 16 per device × grad. accum. 2) |
| LR scheduler | Cosine with 6% warmup |
| Weight decay | 0.01 |
| Label smoothing | 0.05 |
| Gold upsample factor | 25× |
| Early stopping patience | 3 |
| Precision | bf16 |
| Seed | 42 |
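The "cosine with 6% warmup" schedule corresponds to the standard linear-warmup-then-cosine-decay rule. The function below is a sketch of that formula, not the exact Transformers implementation:

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.06):
    """Linear warmup to peak_lr over the first warmup_frac of steps,
    then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(lr_at(0, total))    # 0.0 at the start of warmup
print(lr_at(60, total))   # peak 2e-5 at the end of the 6% warmup
print(lr_at(1000, total)) # decays to ~0.0 at the final step
```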

Hardware

  • 4 × NVIDIA L4 (24 GB each), DDP via torchrun
  • AAU AI-Lab (SLURM cluster)
  • Wall-clock time: ~23 minutes

Evaluation

Evaluated on the held-out eval_silver split (10,000 samples, balanced).

| | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| not-green (0) | 0.8121 | 0.8058 | 0.8090 | 5,000 |
| green (1) | 0.8073 | 0.8136 | 0.8104 | 5,000 |
| Accuracy | | | 0.8097 | 10,000 |

Confusion Matrix

| | Pred not-green | Pred green |
| --- | --- | --- |
| Actual not-green | 4,029 | 971 |
| Actual green | 932 | 4,068 |
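The per-class scores above follow directly from the confusion matrix; recomputing them is a quick consistency check:

```python
# Confusion-matrix cells for the green (positive) class.
tn, fp = 4029, 971   # actual not-green row
fn, tp = 932, 4068   # actual green row

precision_green = tp / (tp + fp)  # 4068 / 5039
recall_green = tp / (tp + fn)     # 4068 / 5000
f1_green = 2 * precision_green * recall_green / (precision_green + recall_green)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision_green, 4))  # 0.8073
print(round(recall_green, 4))     # 0.8136
print(round(f1_green, 4))         # 0.8104
print(round(accuracy, 4))         # 0.8097
```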

Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "CTB2001/PatentSBERTa-green-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

claim = "A wind turbine blade comprising a spar cap formed from pultruded carbon strips..."
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits
    pred = torch.argmax(logits, dim=-1).item()

print("green" if pred == 1 else "not-green")
```

Intended Use

  • Primary: Classifying patent claims as green / not-green technology
  • Domain: Patent text (US/EP/WO first claims)
  • Not suitable for: General-purpose NLI, legal advice, or production patent screening without additional validation

Limitations

  • Trained and evaluated on silver labels (machine-generated); a small fraction may be noisy
  • Only 100 gold (human-verified) labels were available — upsampled 25× to amplify signal
  • Performance on out-of-domain patent offices or languages is unknown

Citation

@misc{trost-bertelsen2025patentsberta-green,
  author       = {Trøst-Bertelsen, Christian},
  title        = {PatentSBERTa Green Patent Classifier},
  year         = {2025},
  howpublished = {Hugging Face Model Hub},
  url          = {https://huggingface.co/CTB2001/PatentSBERTa-green-classifier}
}

Author

Christian Trøst-Bertelsen, Aalborg University (Student ID 20224083). Course: Applied Deep Learning, 8th semester, Spring 2025.
