PatentSBERTa Green Patent Classifier (Silver + Gold + MAS + HITL)

This repository contains the final fine-tuned PatentSBERTa model developed for the Advanced Agentic Workflow with QLoRA final project.

The model classifies patent claims as green vs non-green technologies, focusing on climate mitigation technologies aligned with CPC Y02 classifications.

The training pipeline combines silver labels, agent debate labeling, and targeted human review to improve classification quality on difficult claims.


Model Overview

Base model: AI-Growth-Lab/PatentSBERTa
Task: Binary classification
Labels:

| Label | Meaning |
|-------|---------|
| 0 | Non-green technology |
| 1 | Green technology (climate mitigation related) |

The model is fine-tuned using HuggingFace AutoModelForSequenceClassification.


Training Data

The training dataset is based on a balanced 50k patent claim dataset derived from:

AI-Growth-Lab/patents_claims_1.5m_train_test

Dataset composition:

| Source | Description |
|--------|-------------|
| Silver Labels | Automatically derived from CPC Y02 indicators |
| Gold Labels | 100 high-uncertainty claims reviewed using MAS + Human-in-the-Loop |

Final training set:

train_silver + gold_100

The gold dataset overrides the silver labels for those claims to improve supervision on ambiguous cases.
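The override step amounts to a dictionary merge in which gold labels take precedence. A minimal sketch (the `merge_labels` helper and the claim-ID mapping are illustrative, not part of the released code):

```python
def merge_labels(silver: dict, gold: dict) -> dict:
    """Merge label sets, with gold labels taking precedence.

    Both arguments map claim IDs to labels (0 = non-green, 1 = green).
    """
    merged = dict(silver)   # start from the silver labels
    merged.update(gold)     # gold overrides silver for reviewed claims
    return merged

# Claim "c2" was reviewed and flipped from 1 (silver) to 0 (gold)
labels = merge_labels({"c1": 0, "c2": 1}, {"c2": 0})
print(labels)  # {'c1': 0, 'c2': 0}
```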


Pipeline Architecture

The full system used in the project consists of several stages:

  1. Baseline Model
  • Frozen PatentSBERTa embeddings
  • Logistic Regression classifier
  2. Uncertainty Sampling
  • Identifies the 100 claims with the highest prediction uncertainty
  3. QLoRA Domain Adaptation
  • An LLM is fine-tuned to better understand patent language
  4. Multi-Agent System (MAS)
  • Advocate agent: argues the claim is green
  • Skeptic agent: argues the claim is not green
  • Judge agent: decides the final classification
  5. Targeted Human Review
  • A human reviews only the cases where MAS confidence is low or the agents disagree
  6. Final Model Training
  • PatentSBERTa fine-tuned on silver data + gold labels
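As a sketch of the uncertainty-sampling stage: with a binary classifier, the least confident claims are the ones whose predicted green-probability sits closest to 0.5. The helper below is illustrative only (the project's actual selection code is not shown in this card):

```python
import numpy as np

def select_most_uncertain(probs, k):
    """Return the indices of the k claims whose predicted probability
    of being green is closest to 0.5 (i.e. highest uncertainty)."""
    margin = np.abs(np.asarray(probs, dtype=float) - 0.5)
    return np.argsort(margin)[:k]

# Toy scores for five claims; claims 1 and 3 are the most uncertain
probs = [0.95, 0.51, 0.08, 0.47, 0.88]
print(select_most_uncertain(probs, k=2))  # [1 3]
```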

Evaluation

The model is evaluated on the eval_silver split of the dataset.

Primary metric:

F1 score

Additional metrics reported:

  • Precision
  • Recall
  • Accuracy
  • Confusion Matrix

The evaluation script also exports prediction probabilities for analysis.
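All of the reported metrics can be derived from the binary confusion counts. A minimal, dependency-free sketch, equivalent to scikit-learn's binary metrics:

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and accuracy from confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Toy example: one TP, one FN, one TN, one FP
m = binary_metrics([1, 1, 0, 0], [1, 0, 0, 1])
print(m)  # every metric is 0.5 on this example
```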


Usage

Example inference using HuggingFace Transformers:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YOUR_USERNAME/BDS_M4_exam_final_model"

# Load the fine-tuned classifier and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "A system for capturing carbon emissions using advanced filtration..."

# Truncate long claims to 256 tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the two logits; index 1 is the "green" class
prob_green = torch.softmax(outputs.logits, dim=-1)[0, 1].item()

print("Probability green:", prob_green)

Limitations

  • Silver labels are derived from CPC Y02 classifications and may contain noise.
  • Only 100 claims were manually reviewed, meaning supervision improvements are limited to high-uncertainty cases.
  • Patent claims can be extremely technical and ambiguous, which may impact classification accuracy.

Project Context

This model was developed as part of the M4 Advanced AI Systems final assignment.
The project explores agentic workflows for data labeling, combining:

  • QLoRA fine-tuning
  • Multi-Agent Systems
  • Human-in-the-Loop review
  • Transformer fine-tuning

Citation

If referencing this model in academic work:

Green Patent Detection with Agentic Workflows.
M4 Advanced AI Systems Final Project.

Authors

Student project submission by Daniel Hjerresen.
