---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- patent-classification
- green-technology
- text-classification
- mpnet
- sentence-transformers
- fine-tuned
datasets:
- CTB2001/patents-green-50k
metrics:
- f1
- accuracy
- precision
- recall
base_model: AI-Growth-Lab/PatentSBERTa
pipeline_tag: text-classification
model-index:
- name: PatentSBERTa-green-classifier
  results:
  - task:
      type: text-classification
      name: Green Patent Classification
    dataset:
      type: CTB2001/patents-green-50k
      name: patents-green-50k (eval split)
      split: eval
    metrics:
    - type: f1
      value: 0.8104
      name: Green-class F1
    - type: accuracy
      value: 0.8097
    - type: precision
      value: 0.8073
    - type: recall
      value: 0.8136
---
# PatentSBERTa — Green Patent Classifier

A fine-tuned `AI-Growth-Lab/PatentSBERTa` model for binary classification of patent claims as **green technology (1)** or **not green (0)**.
Developed as part of the Applied Deep Learning (AAU, Spring 2025) exam assignment on active learning, human-in-the-loop labelling, and multi-agent systems for patent classification.
## Model Details
| Property | Value |
|---|---|
| Architecture | MPNetForSequenceClassification (12 layers, 768 hidden) |
| Parameters | 109.5 M (all trainable) |
| Base model | AI-Growth-Lab/PatentSBERTa |
| Max sequence length | 512 tokens |
| Labels | 0 — not green, 1 — green |
| Framework | Transformers 5.2.0, PyTorch |
## Training
### Pipeline overview
- Part A–B: Frozen PatentSBERTa baseline + uncertainty-based active-learning pool selection
- Part C: QLoRA-tuned Llama-3.1-8B powering a LangGraph Multi-Agent System (Advocate → Skeptic → Judge → Exception) to generate silver labels on the 15k most uncertain patents
- Part D: Human-in-the-loop review of 100 critical samples → gold labels, then final full-parameter fine-tuning of PatentSBERTa
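The uncertainty-based pool selection in Parts A–B can be sketched as binary margin sampling: score every unlabelled patent with the frozen baseline and keep the examples whose two class probabilities are closest together. This is an illustrative sketch, not the project's actual code; `select_most_uncertain` and the toy logits are invented for the example.

```python
import torch

def select_most_uncertain(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices of the k pool samples whose class probabilities
    are closest to each other (binary margin sampling)."""
    probs = torch.softmax(logits, dim=-1)
    # Small margin between the two class probabilities = high uncertainty.
    margin = (probs[:, 1] - probs[:, 0]).abs()
    return torch.topk(margin, k, largest=False).indices

# Toy pool of 4 patents; pick the 2 most uncertain.
pool_logits = torch.tensor([[2.0, -2.0], [0.1, -0.1], [0.0, 0.05], [3.0, 1.0]])
idx = select_most_uncertain(pool_logits, k=2)
```

In the real pipeline the same ranking would be applied to the full unlabelled pool to pick the 15k patents routed to the multi-agent labeller.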
### Training data
| Split | Rows | Source |
|---|---|---|
| train_silver | 25,000 | Silver labels from Parts A–C |
| gold_labels | 100 (× 25 upsampled = 2,500) | HITL-verified labels |
| Total training | 27,500 | Combined |
| eval_silver | 10,000 | Held-out balanced evaluation set |
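The 25× gold upsampling amounts to repeating the 100 HITL-verified rows until they contribute 2,500 of the 27,500 training examples. A minimal sketch with scaled-down stand-in data (the row contents and column layout are invented for illustration):

```python
import random

# Stand-ins for the real splits (real card: 25,000 silver rows, 100 gold rows).
silver = [("claim a", 0), ("claim b", 1), ("claim c", 0)]
gold = [("claim g", 1)]  # HITL-verified

GOLD_UPSAMPLE = 25  # 100 gold rows -> 2,500 training examples
train = silver + gold * GOLD_UPSAMPLE
random.Random(42).shuffle(train)
```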
### Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Epochs | 5 |
| Effective batch size | 128 (4 GPUs × 16 per device × grad accum 2) |
| LR scheduler | Cosine with 6% warmup |
| Weight decay | 0.01 |
| Label smoothing | 0.05 |
| Gold upsample factor | 25× |
| Early stopping patience | 3 |
| Precision | bf16 |
| Seed | 42 |
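The table above maps onto a `transformers.TrainingArguments` configuration roughly as follows. This is a hedged sketch, not the project's training script: `output_dir`, the epoch-level eval/save strategies, and the `"f1"` best-model metric are assumptions.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="out",                  # placeholder path
    learning_rate=2e-5,
    num_train_epochs=5,
    per_device_train_batch_size=16,    # x 4 GPUs x grad accum 2 = effective 128
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    weight_decay=0.01,
    label_smoothing_factor=0.05,
    bf16=True,
    seed=42,
    eval_strategy="epoch",             # assumed; required for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",        # metric name is an assumption
)
# Patience 3 is supplied as a Trainer callback:
# trainer = Trainer(..., args=args,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
```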
### Hardware

- 4 × NVIDIA L4 (24 GB each), DDP via `torchrun`
- AAU AI-Lab (SLURM cluster)
- Wall-clock time: ~23 minutes
## Evaluation

Evaluated on the held-out `eval_silver` split (10,000 samples, class-balanced).
| | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| not-green (0) | 0.8121 | 0.8058 | 0.8090 | 5,000 |
| green (1) | 0.8073 | 0.8136 | 0.8104 | 5,000 |
| Accuracy | | | 0.8097 | 10,000 |
### Confusion Matrix
| | Pred not-green | Pred green |
|---|---|---|
| Actual not-green | 4,029 | 971 |
| Actual green | 932 | 4,068 |
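As a sanity check, the reported green-class metrics follow directly from the matrix:

```python
# Confusion-matrix counts (green = positive class).
tn, fp = 4029, 971   # actual not-green row
fn, tp = 932, 4068   # actual green row

precision_green = tp / (tp + fp)   # 4068 / 5039 ≈ 0.8073
recall_green = tp / (tp + fn)      # 4068 / 5000 = 0.8136
f1_green = 2 * precision_green * recall_green / (precision_green + recall_green)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 8097 / 10000 = 0.8097
```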
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "CTB2001/PatentSBERTa-green-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

claim = "A wind turbine blade comprising a spar cap formed from pultruded carbon strips..."
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

pred = torch.argmax(logits, dim=-1).item()
print("green" if pred == 1 else "not-green")
```
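If you also want a confidence score alongside the label, apply a softmax to the logits. The sketch below stands alone so it runs without downloading the model; `example_logits` is a made-up stand-in for `model(**inputs).logits` from the snippet above.

```python
import torch

# Made-up logits for one claim, standing in for model(**inputs).logits.
example_logits = torch.tensor([[-1.2, 0.9]])

probs = torch.softmax(example_logits, dim=-1)
pred = int(torch.argmax(probs, dim=-1))
label = "green" if pred == 1 else "not-green"
confidence = float(probs[0, pred])  # probability of the predicted class
```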
## Intended Use
- Primary: Classifying patent claims as green / not-green technology
- Domain: Patent text (US/EP/WO first claims)
- Not suitable for: General-purpose NLI, legal advice, or production patent screening without additional validation
## Limitations
- Trained and evaluated on silver labels (machine-generated); a small fraction may be noisy
- Only 100 gold (human-verified) labels were available — upsampled 25× to amplify signal
- Performance on out-of-domain patent offices or languages is unknown
## Citation

```bibtex
@misc{trost-bertelsen2025patentsberta-green,
  author       = {Trøst-Bertelsen, Christian},
  title        = {PatentSBERTa Green Patent Classifier},
  year         = {2025},
  howpublished = {Hugging Face Model Hub},
  url          = {https://huggingface.co/CTB2001/PatentSBERTa-green-classifier}
}
```
## Author

Christian Trøst-Bertelsen — Aalborg University (Student ID 20224083)

Course: Applied Deep Learning, 8th semester, Spring 2025