Model Card: PatentSBERTa Fine-Tuned on Green Patent Claims (Assignment 2)

Model Summary

This model is a fine-tuned version of AI-Growth-Lab/PatentSBERTa for binary classification of patent claims as green technology (Y02) or not. It was developed as part of a course assignment in Applied Deep Learning and AI at Aalborg University, exploring Human-in-the-Loop (HITL) data labeling pipelines for patent classification.


Model Details

  • Developed by: Anders Sønderbý (as58zr@student.aau.dk)
  • Model type: Sentence Transformer with classification head (binary)
  • Base model: AI-Growth-Lab/PatentSBERTa
  • Language: English
  • License: MIT
  • Task: Binary text classification — Green Technology (Y02) vs. Not Green

What This Model Does

Given the text of a patent claim, the model predicts whether the claim relates to green technology as defined by the CPC Y02 classification system. The output is a binary label:

  • 1 — Green technology (Y02)
  • 0 — Not green technology
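
A minimal inference sketch, assuming the checkpoint is published as a standard transformers sequence-classification model under the repo id Anders-sonderby/patentsbert-finetuned; adjust the id and loading code if the actual checkpoint uses a sentence-transformers format:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed repo id; replace with the actual checkpoint path if it differs.
model_id = "Anders-sonderby/patentsbert-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

claim = "A photovoltaic module comprising a plurality of interconnected solar cells."
inputs = tokenizer(claim, truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Index 1 is assumed to be the green (Y02) class.
p_green = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"p_green={p_green:.3f} -> label={int(p_green >= 0.5)}")
```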

Training Pipeline Overview

This model was produced through a 4-stage pipeline:

Stage 1 — Baseline (Frozen Embeddings)

A baseline classifier was trained using frozen PatentSBERTa embeddings with a Logistic Regression head on the train_silver split. This baseline was used to compute uncertainty scores for active learning.
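
A minimal sketch of this stage; the file name and the text column are assumptions, while is_green_silver is the silver-label column named under Training Data below:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

train_silver = pd.read_parquet("train_silver.parquet")  # assumed file name

# Frozen embeddings: PatentSBERTa is used purely as a feature extractor.
encoder = SentenceTransformer("AI-Growth-Lab/PatentSBERTa")
X = encoder.encode(train_silver["text"].tolist(), batch_size=64, show_progress_bar=True)

# Lightweight classification head on top of the frozen embeddings.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, train_silver["is_green_silver"])
```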

Stage 2 — Uncertainty Sampling

The baseline model computed p_green (predicted probability of green) for all examples in pool_unlabeled. An uncertainty score was computed as:

u = 1 − 2 · |p_green − 0.5|

The 100 highest-uncertainty claims were exported as hitl_green_100.csv for human review.
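
Continuing the baseline sketch above, a sketch of the sampling step (file and column names are assumptions):

```python
import numpy as np
import pandas as pd

pool = pd.read_parquet("pool_unlabeled.parquet")  # assumed file name
X_pool = encoder.encode(pool["text"].tolist(), batch_size=64)

pool["p_green"] = clf.predict_proba(X_pool)[:, 1]
pool["uncertainty"] = 1.0 - 2.0 * np.abs(pool["p_green"] - 0.5)  # u = 1 at p_green = 0.5

# Export the 100 most uncertain claims for human review.
pool.sort_values("uncertainty", ascending=False).head(100).to_csv(
    "hitl_green_100.csv", index=False
)
```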

Stage 3 — LLM → Human HITL Labeling

For each of the 100 high-uncertainty claims, an LLM first suggested a label (llm_green_suggested), a confidence score (llm_confidence), and a rationale (llm_rationale). A human reviewer then assigned the final gold label (is_green_human), overriding the LLM where necessary. Only the claim text was shown during labeling; no CPC codes or metadata were used.
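
For illustration, one hypothetical row of the resulting labeling file; every value below is invented, only the column names come from the pipeline description:

```python
import pandas as pd

row = {
    "text": "A heat pump system configured to recover waste heat from exhaust air.",
    "llm_green_suggested": 1,     # label proposed by the LLM
    "llm_confidence": 0.62,       # LLM's self-reported confidence
    "llm_rationale": "Waste-heat recovery is characteristic of Y02 technologies.",
    "is_green_human": 1,          # final gold label set by the human reviewer
}
pd.DataFrame([row]).to_csv("hitl_final.csv", index=False)
```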

Stage 4 — Fine-Tuning

PatentSBERTa was fine-tuned for binary classification using the combined train_silver + gold_100 dataset, where gold labels override silver labels for the 100 HITL-reviewed claims.
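
A sketch of the label merge, assuming hypothetical file and column names; the gold rows carry the human label, the silver rows keep their CPC-derived label:

```python
import pandas as pd

train_silver = pd.read_parquet("train_silver.parquet")  # assumed file name
gold = pd.read_csv("hitl_final.csv")

train_silver["label"] = train_silver["is_green_silver"]
gold["label"] = gold["is_green_human"]  # human gold label wins over silver

# Combined fine-tuning set: ~40,000 silver rows + 100 gold rows.
train = pd.concat(
    [train_silver[["text", "label"]], gold[["text", "label"]]],
    ignore_index=True,
)
```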


Training Data

  • Dataset: Derived from AI-Growth-Lab/patents_claims_1.5m_traim_test
  • Working file: patents_50k_green.parquet — a balanced 50k sample (25,000 green, 25,000 not green)
  • Silver label source: CPC Y02* classification codes (is_green_silver)
  • Gold labels: 100 human-reviewed claims from uncertainty sampling (is_green_gold)

Dataset Splits

| Split | Size | Description |
| --- | --- | --- |
| train_silver | ~40,000 | Silver-labeled training set (CPC-derived) |
| eval_silver | ~5,000 | Silver-labeled evaluation set |
| pool_unlabeled | ~5,000 | Unlabeled pool used for uncertainty sampling |
| gold_100 | 100 | Human-reviewed high-uncertainty claims |
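
A sketch of how splits with these proportions could be produced; the stratified 80/10/10 split and the seed are assumptions, not the assignment's exact procedure:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet("patents_50k_green.parquet")

# 80/10/10 split, stratified on the silver label to preserve class balance.
train_silver, rest = train_test_split(
    df, test_size=0.2, stratify=df["is_green_silver"], random_state=42
)
eval_silver, pool_unlabeled = train_test_split(
    rest, test_size=0.5, stratify=rest["is_green_silver"], random_state=42
)
```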

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Base model | AI-Growth-Lab/PatentSBERTa |
| Max sequence length | 256 |
| Epochs | 1 |
| Learning rate | 2e-5 |
| Training set size | ~40,100 (train_silver + gold_100) |
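
A sketch of a matching fine-tuning setup with the Hugging Face Trainer; the batch size is an assumption (it is not reported above), and `train` is the combined DataFrame from Stage 4:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("AI-Growth-Lab/PatentSBERTa")
model = AutoModelForSequenceClassification.from_pretrained(
    "AI-Growth-Lab/PatentSBERTa", num_labels=2  # fresh binary classification head
)

ds = Dataset.from_pandas(train[["text", "label"]])
ds = ds.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=256
    ),
    batched=True,
)

args = TrainingArguments(
    output_dir="patentsbert-finetuned",
    num_train_epochs=1,              # single epoch, per the table above
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # assumed; not reported in the card
)

trainer = Trainer(model=model, args=args, train_dataset=ds)
trainer.train()
```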

Evaluation Results

| Evaluation Set | F1 Score | Notes |
| --- | --- | --- |
| eval_silver (5,000) | 0.818 | Primary evaluation metric |
| gold_100 (100) | 0.667 | Human-reviewed high-uncertainty claims |
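
A sketch of how the F1 scores can be computed for a tokenized split, reusing the trainer from the fine-tuning sketch above (the helper name is hypothetical):

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_on_split(trainer, dataset):
    """Binary F1 for a tokenized split that carries a 0/1 `label` column."""
    output = trainer.predict(dataset)
    preds = np.argmax(output.predictions, axis=-1)
    return f1_score(output.label_ids, preds)
```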

HITL Reporting

As required by the assignment, the human reviewer assessed all 100 high-uncertainty claims. The LLM suggestion and human final label were recorded for each claim. Disagreements between the LLM and human reviewer were documented in the labeling file (hitl_final.csv).
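
A sketch of the disagreement check on that file, using the column names introduced in Stage 3:

```python
import pandas as pd

hitl = pd.read_csv("hitl_final.csv")

# Claims where the human reviewer overrode the LLM suggestion.
disagreements = hitl[hitl["llm_green_suggested"] != hitl["is_green_human"]]
print(f"Overridden: {len(disagreements)}/{len(hitl)} "
      f"(agreement {1 - len(disagreements) / len(hitl):.1%})")
```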


Intended Use

  • Primary use: Academic research and coursework in patent classification
  • Intended users: Course instructors and students at Aalborg University
  • Out-of-scope: Production patent classification systems, legal patent assessment, or any commercial use

Limitations

  • Trained on a balanced 50k sample — performance may differ on the full unbalanced patent corpus
  • Silver labels are derived from CPC codes, which may contain noise
  • The model classifies based on claim text only and does not use metadata, citations, or CPC codes at inference time
  • Fine-tuned for 1 epoch only due to compute constraints

Repository

The full code, notebooks, and data files for this assignment are available in the course GitHub repository.
