PatentSBERTa Green Patent Classifier (Silver + Gold + MAS + HITL)

This repository contains the final fine-tuned PatentSBERTa model developed for the Advanced Agentic Workflow with QLoRA final project.

The model classifies patent claims as green vs non-green technologies, focusing on climate mitigation technologies aligned with CPC Y02 classifications.

The training pipeline combines silver labels, agent debate labeling, and targeted human review to improve classification quality on difficult claims.


Model Overview

Base model: AI-Growth-Lab/PatentSBERTa
Task: Binary classification
Labels:

| Label | Meaning |
|-------|---------|
| 0 | Non-green technology |
| 1 | Green technology (climate mitigation related) |

The model is fine-tuned using HuggingFace AutoModelForSequenceClassification.


Training Data

The training dataset is based on a balanced 50k patent claim dataset derived from:

AI-Growth-Lab/patents_claims_1.5m_train_test

Dataset composition:

| Source | Description |
|--------|-------------|
| Silver Labels | Automatically derived from CPC Y02 indicators |
| Gold Labels | 100 high-uncertainty claims reviewed using MAS + Human-in-the-Loop |

Final training set:

train_silver + gold_100

The gold dataset overrides the silver labels for those claims to improve supervision on ambiguous cases.
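The override step amounts to a dictionary merge in which gold labels take precedence. A minimal sketch (the `merge_labels` helper and the claim-ID mapping are illustrative, not part of the released code):

```python
def merge_labels(silver: dict, gold: dict) -> dict:
    """Merge label sets, with gold labels taking precedence.

    Both arguments map claim IDs to labels (0 = non-green, 1 = green).
    """
    merged = dict(silver)   # start from the silver labels
    merged.update(gold)     # gold overrides silver for reviewed claims
    return merged

# Claim "c2" was reviewed and flipped from 1 (silver) to 0 (gold)
labels = merge_labels({"c1": 0, "c2": 1}, {"c2": 0})
print(labels)  # {'c1': 0, 'c2': 0}
```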


Pipeline Architecture

The full system used in the project consists of several stages:

  1. Baseline Model
  • Frozen PatentSBERTa embeddings
  • Logistic Regression classifier
  2. Uncertainty Sampling
  • Identifies the 100 claims with the highest prediction uncertainty
  3. QLoRA Domain Adaptation
  • An LLM is fine-tuned to better understand patent language
  4. Multi-Agent System (MAS)
  • Advocate agent: argues the claim is green
  • Skeptic agent: argues the claim is not green
  • Judge agent: decides the final classification
  5. Targeted Human Review
  • A human reviews only the cases where MAS confidence is low or the agents disagree
  6. Final Model Training
  • PatentSBERTa fine-tuned on silver data + gold labels
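As a sketch of the uncertainty-sampling stage: with a binary classifier, the least confident claims are the ones whose predicted green-probability sits closest to 0.5. The helper below is illustrative only (the project's actual selection code is not shown in this card):

```python
import numpy as np

def select_most_uncertain(probs, k):
    """Return the indices of the k claims whose predicted probability
    of being green is closest to 0.5 (i.e. highest uncertainty)."""
    margin = np.abs(np.asarray(probs, dtype=float) - 0.5)
    return np.argsort(margin)[:k]

# Toy scores for five claims; claims 1 and 3 are the most uncertain
probs = [0.95, 0.51, 0.08, 0.47, 0.88]
print(select_most_uncertain(probs, k=2))  # [1 3]
```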

Evaluation

The model is evaluated on the eval_silver split of the dataset.

Primary metric:

F1 score

Additional metrics reported:

  • Precision
  • Recall
  • Accuracy
  • Confusion Matrix

The evaluation script also exports prediction probabilities for analysis.
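All of the reported metrics can be derived from the binary confusion counts. A minimal, dependency-free sketch, equivalent to scikit-learn's binary metrics:

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and accuracy from confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Toy example: one TP, one FN, one TN, one FP
m = binary_metrics([1, 1, 0, 0], [1, 0, 0, 1])
print(m)  # every metric is 0.5 on this example
```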


Usage

Example inference using HuggingFace Transformers:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YOUR_USERNAME/BDS_M4_exam_final_model"

# Load the fine-tuned classifier and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "A system for capturing carbon emissions using advanced filtration..."

# Truncate long claims to 256 tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the two logits; index 1 is the "green" class
prob_green = torch.softmax(outputs.logits, dim=-1)[0, 1].item()

print("Probability green:", prob_green)

Limitations

  • Silver labels are derived from CPC Y02 classifications and may contain noise.
  • Only 100 claims were manually reviewed, meaning supervision improvements are limited to high-uncertainty cases.
  • Patent claims can be extremely technical and ambiguous, which may impact classification accuracy.

Project Context

This model was developed as part of the M4 Advanced AI Systems final assignment.
The project explores agentic workflows for data labeling, combining:

  • QLoRA fine-tuning
  • Multi-Agent Systems
  • Human-in-the-Loop review
  • Transformer fine-tuning

Citation

If referencing this model in academic work:

Green Patent Detection with Agentic Workflows.
M4 Advanced AI Systems Final Project.

Authors

Student project submission by Daniel Hjerresen.
