---
license: mit
language:
  - en
library_name: transformers
pipeline_tag: text-classification
tags:
  - patents
  - climate-tech
  - green-patents
  - patentsberta
  - classification
  - academic-project
---

# PatentSBERTa Green Patent Classifier (Silver + Gold + MAS + HITL)

This repository contains the **final fine-tuned PatentSBERTa model** developed for the *Advanced Agentic Workflow with QLoRA* final project.

The model classifies **patent claims as green vs non-green technologies**, focusing on climate mitigation technologies aligned with **CPC Y02 classifications**.

The training pipeline combines **silver labels, agent debate labeling, and targeted human review** to improve classification quality on difficult claims.

---

## Model Overview

**Base model:** `AI-Growth-Lab/PatentSBERTa`  
**Task:** Binary classification  
**Labels:**

| Label | Meaning |
| --- | --- |
| 0 | Non-green technology |
| 1 | Green technology (climate mitigation related) |

The model is fine-tuned with Hugging Face's `AutoModelForSequenceClassification`.

---

## Training Data

The training dataset is based on a **balanced 50k patent claim dataset** derived from:

`AI-Growth-Lab/patents_claims_1.5m_train_test`

Dataset composition:

| Source | Description |
| --- | --- |
| Silver Labels | Automatically derived from CPC Y02 indicators |
| Gold Labels | 100 high-uncertainty claims reviewed using MAS + Human-in-the-Loop |

Final training set:

`train_silver + gold_100`

The **gold dataset overrides the silver labels** for those claims to improve supervision on ambiguous cases.
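The override step can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the field names (`claim_id`, `text`, `label`) and the toy examples are assumptions.

```python
def merge_silver_gold(silver, gold):
    """Return training examples where gold labels override silver ones.

    silver, gold: lists of dicts with "claim_id", "text", "label".
    (Field names are illustrative, not the project's actual schema.)
    """
    gold_by_id = {ex["claim_id"]: ex["label"] for ex in gold}
    merged = []
    for ex in silver:
        # Prefer the human-reviewed gold label when one exists for this claim
        label = gold_by_id.get(ex["claim_id"], ex["label"])
        merged.append({**ex, "label": label})
    return merged

silver = [
    {"claim_id": 1, "text": "solar panel mount", "label": 1},
    {"claim_id": 2, "text": "combustion engine", "label": 1},  # noisy silver label
]
gold = [{"claim_id": 2, "text": "combustion engine", "label": 0}]  # human-reviewed

merged = merge_silver_gold(silver, gold)
print([ex["label"] for ex in merged])  # [1, 0]
```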

---

## Pipeline Architecture

The full system used in the project consists of several stages:

1. **Baseline Model**
   - Frozen PatentSBERTa embeddings
   - Logistic Regression classifier

2. **Uncertainty Sampling**
   - Identifies the 100 claims with the highest prediction uncertainty

3. **QLoRA Domain Adaptation**
   - LLM fine-tuned to better understand patent language

4. **Multi-Agent System (MAS)**
   - Advocate agent: argues the claim is green
   - Skeptic agent: argues the claim is not green
   - Judge agent: decides the final classification

5. **Targeted Human Review**
   - A human only reviews cases where MAS confidence is low or the agents disagree

6. **Final Model Training**
   - PatentSBERTa fine-tuned on silver data + gold labels
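Steps 2 and 5 above can be sketched as simple selection rules. The function names, the 0.5 uncertainty pivot, and the 0.7 confidence threshold below are illustrative assumptions, not the project's exact code.

```python
def most_uncertain(probs, n):
    """Step 2 sketch: indices of the n predictions closest to 0.5,
    i.e. the claims the baseline model is least sure about."""
    return sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))[:n]

def needs_human_review(advocate_verdict, skeptic_verdict, judge_confidence,
                       min_confidence=0.7):
    """Step 5 sketch: escalate to a human when the agents' final verdicts
    disagree or the judge's confidence falls below the threshold."""
    return advocate_verdict != skeptic_verdict or judge_confidence < min_confidence

probs = [0.95, 0.51, 0.10, 0.48, 0.70]
print(most_uncertain(probs, 2))       # [1, 3]
print(needs_human_review(1, 1, 0.9))  # False
```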

---

## Evaluation

The model is evaluated on the **eval_silver split** of the dataset.

Primary metric:

**F1 score**

Additional metrics reported:

- Precision
- Recall
- Accuracy
- Confusion Matrix

The evaluation script also exports prediction probabilities for analysis.
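For reference, the reported metrics can be computed from scratch for a binary task as below; the toy labels are illustrative only and do not come from the project's evaluation.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, accuracy, and confusion matrix for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    # Confusion matrix laid out as [[tn, fp], [fn, tp]]
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "confusion": [[tn, fp], [fn, tp]]}

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
m = binary_metrics(y_true, y_pred)
print(round(m["f1"], 3))  # 0.667
```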

---

## Usage

Example inference with Hugging Face Transformers:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YOUR_USERNAME/BDS_M4_exam_final_model"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

text = "A system for capturing carbon emissions using advanced filtration..."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# Index 1 corresponds to the "green technology" label
prob_green = torch.softmax(outputs.logits, dim=-1)[0, 1].item()

print("Probability green:", prob_green)
```

## Limitations

- Silver labels are derived from CPC Y02 classifications and may contain noise.
- Only 100 claims were manually reviewed, meaning supervision improvements are limited to high-uncertainty cases.
- Patent claims can be extremely technical and ambiguous, which may impact classification accuracy.

## Project Context

This model was developed as part of the M4 Advanced AI Systems final assignment.  
The project explores agentic workflows for data labeling, combining:

- QLoRA fine-tuning
- Multi-Agent Systems
- Human-in-the-Loop review
- Transformer fine-tuning

## Citation

If referencing this model in academic work:

**Green Patent Detection with Agentic Workflows.**  
**M4 Advanced AI Systems Final Project.**

## Authors

Student project submission by Daniel Hjerresen.