---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- patent-classification
- green-technology
- text-classification
- mpnet
- sentence-transformers
- fine-tuned
datasets:
- CTB2001/patents-green-50k
metrics:
- f1
- accuracy
- precision
- recall
base_model: AI-Growth-Lab/PatentSBERTa
pipeline_tag: text-classification
model-index:
- name: PatentSBERTa-green-classifier
  results:
  - task:
      type: text-classification
      name: Green Patent Classification
    dataset:
      type: CTB2001/patents-green-50k
      name: patents-green-50k (eval split)
      split: eval
    metrics:
    - type: f1
      value: 0.8104
      name: Green-class F1
    - type: accuracy
      value: 0.8097
    - type: precision
      value: 0.8073
    - type: recall
      value: 0.8136
---
# PatentSBERTa — Green Patent Classifier

A fine-tuned `AI-Growth-Lab/PatentSBERTa` model for binary classification of patent claims as **green technology (1)** or **not green (0)**.
Developed as part of the Applied Deep Learning (AAU, Spring 2025) exam assignment on active learning, human-in-the-loop labelling, and multi-agent systems for patent classification.
## Model Details
| Property | Value |
|---|---|
| Architecture | MPNetForSequenceClassification (12 layers, 768 hidden) |
| Parameters | 109.5 M (all trainable) |
| Base model | AI-Growth-Lab/PatentSBERTa |
| Max sequence length | 512 tokens |
| Labels | 0 — not green, 1 — green |
| Framework | Transformers 5.2.0, PyTorch |
## Training
### Pipeline overview
- Part A–B: Frozen PatentSBERTa baseline + uncertainty-based active-learning pool selection
- Part C: QLoRA-tuned Llama-3.1-8B powering a LangGraph Multi-Agent System (Advocate → Skeptic → Judge → Exception) to generate silver labels on the 15k most uncertain patents
- Part D: Human-in-the-loop review of 100 critical samples → gold labels, then final full-parameter fine-tuning of PatentSBERTa
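The uncertainty-based pool selection in Parts A–B can be sketched as binary margin sampling: score every unlabelled patent with the frozen baseline and keep the examples whose two class probabilities are closest together. This is an illustrative sketch, not the project's actual code; `select_most_uncertain` and the toy logits are invented for the example.

```python
import torch

def select_most_uncertain(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices of the k pool samples whose class probabilities
    are closest to each other (binary margin sampling)."""
    probs = torch.softmax(logits, dim=-1)
    # Small margin between the two class probabilities = high uncertainty.
    margin = (probs[:, 1] - probs[:, 0]).abs()
    return torch.topk(margin, k, largest=False).indices

# Toy pool of 4 patents; pick the 2 most uncertain.
pool_logits = torch.tensor([[2.0, -2.0], [0.1, -0.1], [0.0, 0.05], [3.0, 1.0]])
idx = select_most_uncertain(pool_logits, k=2)
```

In the real pipeline the same ranking would be applied to the full unlabelled pool to pick the 15k patents routed to the multi-agent labeller.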
### Training data
| Split | Rows | Source |
|---|---|---|
| train_silver | 25,000 | Silver labels from Parts A–C |
| gold_labels | 100 (× 25 upsampled = 2,500) | HITL-verified labels |
| Total training | 27,500 | Combined |
| eval_silver | 10,000 | Held-out balanced evaluation set |
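The 25× gold upsampling amounts to repeating the 100 HITL-verified rows until they contribute 2,500 of the 27,500 training examples. A minimal sketch with scaled-down stand-in data (the row contents and column layout are invented for illustration):

```python
import random

# Stand-ins for the real splits (real card: 25,000 silver rows, 100 gold rows).
silver = [("claim a", 0), ("claim b", 1), ("claim c", 0)]
gold = [("claim g", 1)]  # HITL-verified

GOLD_UPSAMPLE = 25  # 100 gold rows -> 2,500 training examples
train = silver + gold * GOLD_UPSAMPLE
random.Random(42).shuffle(train)
```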
### Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Epochs | 5 |
| Effective batch size | 128 (4 GPUs × 16 per device × grad accum 2) |
| LR scheduler | Cosine with 6% warmup |
| Weight decay | 0.01 |
| Label smoothing | 0.05 |
| Gold upsample factor | 25× |
| Early stopping patience | 3 |
| Precision | bf16 |
| Seed | 42 |
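The table above maps onto a `transformers.TrainingArguments` configuration roughly as follows. This is a hedged sketch, not the project's training script: `output_dir`, the epoch-level eval/save strategies, and the `"f1"` best-model metric are assumptions.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="out",                  # placeholder path
    learning_rate=2e-5,
    num_train_epochs=5,
    per_device_train_batch_size=16,    # x 4 GPUs x grad accum 2 = effective 128
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    weight_decay=0.01,
    label_smoothing_factor=0.05,
    bf16=True,
    seed=42,
    eval_strategy="epoch",             # assumed; required for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",        # metric name is an assumption
)
# Patience 3 is supplied as a Trainer callback:
# trainer = Trainer(..., args=args,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
```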
### Hardware

- 4 × NVIDIA L4 (24 GB each), DDP via `torchrun`
- AAU AI-Lab (SLURM cluster)
- Wall-clock time: ~23 minutes
## Evaluation

Evaluated on the held-out `eval_silver` split (10,000 samples, class-balanced).
| | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| not-green (0) | 0.8121 | 0.8058 | 0.8090 | 5,000 |
| green (1) | 0.8073 | 0.8136 | 0.8104 | 5,000 |
| Accuracy | | | 0.8097 | 10,000 |
### Confusion Matrix
| | Pred not-green | Pred green |
|---|---|---|
| Actual not-green | 4,029 | 971 |
| Actual green | 932 | 4,068 |
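As a sanity check, the reported green-class metrics follow directly from the matrix:

```python
# Confusion-matrix counts (green = positive class).
tn, fp = 4029, 971   # actual not-green row
fn, tp = 932, 4068   # actual green row

precision_green = tp / (tp + fp)   # 4068 / 5039 ≈ 0.8073
recall_green = tp / (tp + fn)      # 4068 / 5000 = 0.8136
f1_green = 2 * precision_green * recall_green / (precision_green + recall_green)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 8097 / 10000 = 0.8097
```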
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "CTB2001/PatentSBERTa-green-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

claim = "A wind turbine blade comprising a spar cap formed from pultruded carbon strips..."
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

pred = torch.argmax(logits, dim=-1).item()
print("green" if pred == 1 else "not-green")
```
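If you also want a confidence score alongside the label, apply a softmax to the logits. The sketch below stands alone so it runs without downloading the model; `example_logits` is a made-up stand-in for `model(**inputs).logits` from the snippet above.

```python
import torch

# Made-up logits for one claim, standing in for model(**inputs).logits.
example_logits = torch.tensor([[-1.2, 0.9]])

probs = torch.softmax(example_logits, dim=-1)
pred = int(torch.argmax(probs, dim=-1))
label = "green" if pred == 1 else "not-green"
confidence = float(probs[0, pred])  # probability of the predicted class
```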
## Intended Use
- Primary: Classifying patent claims as green / not-green technology
- Domain: Patent text (US/EP/WO first claims)
- Not suitable for: General-purpose NLI, legal advice, or production patent screening without additional validation
## Limitations
- Trained and evaluated on silver labels (machine-generated); a small fraction may be noisy
- Only 100 gold (human-verified) labels were available — upsampled 25× to amplify signal
- Performance on out-of-domain patent offices or languages is unknown
## Citation

```bibtex
@misc{trost-bertelsen2025patentsberta-green,
  author       = {Trøst-Bertelsen, Christian},
  title        = {PatentSBERTa Green Patent Classifier},
  year         = {2025},
  howpublished = {Hugging Face Model Hub},
  url          = {https://huggingface.co/CTB2001/PatentSBERTa-green-classifier}
}
```
## Author

Christian Trøst-Bertelsen — Aalborg University (Student ID 20224083)

Course: Applied Deep Learning, 8th semester, Spring 2025