Model Card: PatentSBERTa Fine-Tuned on Green Patent Claims (Assignment 2)

Model Summary

This model is a fine-tuned version of AI-Growth-Lab/PatentSBERTa for binary classification of patent claims as green technology (Y02) or not. It was developed as part of a course assignment in Applied Deep Learning and AI at Aalborg University, exploring Human-in-the-Loop (HITL) data labeling pipelines for patent classification.


Model Details

  • Developed by: Anders Sønderbý (as58zr@student.aau.dk)
  • Model type: Sentence Transformer with classification head (binary)
  • Base model: AI-Growth-Lab/PatentSBERTa
  • Language: English
  • License: MIT
  • Task: Binary text classification — Green Technology (Y02) vs. Not Green

What This Model Does

Given the text of a patent claim, the model predicts whether the claim relates to green technology as defined by the CPC Y02 classification system. The output is a binary label:

  • 1 — Green technology (Y02)
  • 0 — Not green technology
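
A minimal inference sketch, assuming the checkpoint is published as a standard transformers sequence-classification model under the repo id Anders-sonderby/patentsbert-finetuned; adjust the id and loading code if the actual checkpoint uses a sentence-transformers format:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed repo id; replace with the actual checkpoint path if it differs.
model_id = "Anders-sonderby/patentsbert-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

claim = "A photovoltaic module comprising a plurality of interconnected solar cells."
inputs = tokenizer(claim, truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Index 1 is assumed to be the green (Y02) class.
p_green = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"p_green={p_green:.3f} -> label={int(p_green >= 0.5)}")
```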

Training Pipeline Overview

This model was produced through a 4-stage pipeline:

Stage 1 — Baseline (Frozen Embeddings)

A baseline classifier was trained using frozen PatentSBERTa embeddings with a Logistic Regression head on the train_silver split. This baseline was used to compute uncertainty scores for active learning.
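
A minimal sketch of this stage; the file name and the text column are assumptions, while is_green_silver is the silver-label column named under Training Data below:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

train_silver = pd.read_parquet("train_silver.parquet")  # assumed file name

# Frozen embeddings: PatentSBERTa is used purely as a feature extractor.
encoder = SentenceTransformer("AI-Growth-Lab/PatentSBERTa")
X = encoder.encode(train_silver["text"].tolist(), batch_size=64, show_progress_bar=True)

# Lightweight classification head on top of the frozen embeddings.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, train_silver["is_green_silver"])
```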

Stage 2 — Uncertainty Sampling

The baseline model computed p_green (predicted probability of green) for all examples in pool_unlabeled. An uncertainty score was computed as:

u = 1 − 2 · |p_green − 0.5|

The 100 highest-uncertainty claims were exported as hitl_green_100.csv for human review.
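
Continuing the baseline sketch above, a sketch of the sampling step (file and column names are assumptions):

```python
import numpy as np
import pandas as pd

pool = pd.read_parquet("pool_unlabeled.parquet")  # assumed file name
X_pool = encoder.encode(pool["text"].tolist(), batch_size=64)

pool["p_green"] = clf.predict_proba(X_pool)[:, 1]
pool["uncertainty"] = 1.0 - 2.0 * np.abs(pool["p_green"] - 0.5)  # u = 1 at p_green = 0.5

# Export the 100 most uncertain claims for human review.
pool.sort_values("uncertainty", ascending=False).head(100).to_csv(
    "hitl_green_100.csv", index=False
)
```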

Stage 3 — LLM → Human HITL Labeling

For each of the 100 high-uncertainty claims, an LLM first suggested a label (llm_green_suggested), a confidence score (llm_confidence), and a rationale (llm_rationale). A human reviewer then assigned the final gold label (is_green_human), overriding the LLM where necessary. Only the claim text was shown during labeling; no CPC codes or metadata were used.
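
For illustration, one hypothetical row of the resulting labeling file; every value below is invented, only the column names come from the pipeline description:

```python
import pandas as pd

row = {
    "text": "A heat pump system configured to recover waste heat from exhaust air.",
    "llm_green_suggested": 1,     # label proposed by the LLM
    "llm_confidence": 0.62,       # LLM's self-reported confidence
    "llm_rationale": "Waste-heat recovery is characteristic of Y02 technologies.",
    "is_green_human": 1,          # final gold label set by the human reviewer
}
pd.DataFrame([row]).to_csv("hitl_final.csv", index=False)
```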

Stage 4 — Fine-Tuning

PatentSBERTa was fine-tuned for binary classification using the combined train_silver + gold_100 dataset, where gold labels override silver labels for the 100 HITL-reviewed claims.
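
A sketch of the label merge, assuming hypothetical file and column names; the gold rows carry the human label, the silver rows keep their CPC-derived label:

```python
import pandas as pd

train_silver = pd.read_parquet("train_silver.parquet")  # assumed file name
gold = pd.read_csv("hitl_final.csv")

train_silver["label"] = train_silver["is_green_silver"]
gold["label"] = gold["is_green_human"]  # human gold label wins over silver

# Combined fine-tuning set: ~40,000 silver rows + 100 gold rows.
train = pd.concat(
    [train_silver[["text", "label"]], gold[["text", "label"]]],
    ignore_index=True,
)
```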


Training Data

  • Dataset: Derived from AI-Growth-Lab/patents_claims_1.5m_traim_test
  • Working file: patents_50k_green.parquet — a balanced 50k sample (25,000 green, 25,000 not green)
  • Silver label source: CPC Y02* classification codes (is_green_silver)
  • Gold labels: 100 human-reviewed claims from uncertainty sampling (is_green_gold)

Dataset Splits

| Split | Size | Description |
| --- | --- | --- |
| train_silver | ~40,000 | Silver-labeled training set (CPC-derived) |
| eval_silver | ~5,000 | Silver-labeled evaluation set |
| pool_unlabeled | ~5,000 | Unlabeled pool used for uncertainty sampling |
| gold_100 | 100 | Human-reviewed high-uncertainty claims |
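
A sketch of how splits with these proportions could be produced; the stratified 80/10/10 split and the seed are assumptions, not the assignment's exact procedure:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet("patents_50k_green.parquet")

# 80/10/10 split, stratified on the silver label to preserve class balance.
train_silver, rest = train_test_split(
    df, test_size=0.2, stratify=df["is_green_silver"], random_state=42
)
eval_silver, pool_unlabeled = train_test_split(
    rest, test_size=0.5, stratify=rest["is_green_silver"], random_state=42
)
```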

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Base model | AI-Growth-Lab/PatentSBERTa |
| Max sequence length | 256 |
| Epochs | 1 |
| Learning rate | 2e-5 |
| Training set size | ~40,100 (train_silver + gold_100) |
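
A sketch of a matching fine-tuning setup with the Hugging Face Trainer; the batch size is an assumption (it is not reported above), and `train` is the combined DataFrame from Stage 4:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("AI-Growth-Lab/PatentSBERTa")
model = AutoModelForSequenceClassification.from_pretrained(
    "AI-Growth-Lab/PatentSBERTa", num_labels=2  # fresh binary classification head
)

ds = Dataset.from_pandas(train[["text", "label"]])
ds = ds.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=256
    ),
    batched=True,
)

args = TrainingArguments(
    output_dir="patentsbert-finetuned",
    num_train_epochs=1,              # single epoch, per the table above
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # assumed; not reported in the card
)

trainer = Trainer(model=model, args=args, train_dataset=ds)
trainer.train()
```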

Evaluation Results

| Evaluation Set | F1 Score | Notes |
| --- | --- | --- |
| eval_silver (5,000) | 0.818 | Primary evaluation metric |
| gold_100 (100) | 0.667 | Human-reviewed high-uncertainty claims |
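
A sketch of how the F1 scores can be computed for a tokenized split, reusing the trainer from the fine-tuning sketch above (the helper name is hypothetical):

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_on_split(trainer, dataset):
    """Binary F1 for a tokenized split that carries a 0/1 `label` column."""
    output = trainer.predict(dataset)
    preds = np.argmax(output.predictions, axis=-1)
    return f1_score(output.label_ids, preds)
```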

HITL Reporting

As required by the assignment, the human reviewer assessed all 100 high-uncertainty claims. The LLM suggestion and human final label were recorded for each claim. Disagreements between the LLM and human reviewer were documented in the labeling file (hitl_final.csv).
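
A sketch of the disagreement check on that file, using the column names introduced in Stage 3:

```python
import pandas as pd

hitl = pd.read_csv("hitl_final.csv")

# Claims where the human reviewer overrode the LLM suggestion.
disagreements = hitl[hitl["llm_green_suggested"] != hitl["is_green_human"]]
print(f"Overridden: {len(disagreements)}/{len(hitl)} "
      f"(agreement {1 - len(disagreements) / len(hitl):.1%})")
```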


Intended Use

  • Primary use: Academic research and coursework in patent classification
  • Intended users: Course instructors and students at Aalborg University
  • Out-of-scope: Production patent classification systems, legal patent assessment, or any commercial use

Limitations

  • Trained on a balanced 50k sample — performance may differ on the full unbalanced patent corpus
  • Silver labels are derived from CPC codes, which may contain noise
  • The model classifies based on claim text only and does not use metadata, citations, or CPC codes at inference time
  • Fine-tuned for 1 epoch only due to compute constraints

Repository

The full code, notebooks, and data files for this assignment are available in the course GitHub repository.
