---
license: mit
language:
  - en
library_name: transformers
pipeline_tag: text-classification
tags:
  - patents
  - climate-tech
  - green-patents
  - patentsberta
  - classification
  - academic-project
---

# PatentSBERTa Green Patent Classifier (Silver + Gold + MAS + HITL)

This repository contains the **final fine-tuned PatentSBERTa model** developed for the *Advanced Agentic Workflow with QLoRA* final project.

The model classifies **patent claims as green vs non-green technologies**, focusing on climate mitigation technologies aligned with **CPC Y02 classifications**.

The training pipeline combines **silver labels, agent debate labeling, and targeted human review** to improve classification quality on difficult claims.

---

## Model Overview

**Base model:** `AI-Growth-Lab/PatentSBERTa`  
**Task:** Binary classification  
**Labels:**

| Label | Meaning |
| --- | --- |
| 0 | Non-green technology |
| 1 | Green technology (climate mitigation related) |

The model is fine-tuned with Hugging Face's `AutoModelForSequenceClassification`.

---

## Training Data

The training dataset is based on a **balanced 50k patent claim dataset** derived from:

`AI-Growth-Lab/patents_claims_1.5m_train_test`

Dataset composition:

| Source | Description |
| --- | --- |
| Silver Labels | Automatically derived from CPC Y02 indicators |
| Gold Labels | 100 high-uncertainty claims reviewed using MAS + Human-in-the-Loop |

Final training set:

`train_silver + gold_100`

The **gold dataset overrides the silver labels** for those claims to improve supervision on ambiguous cases.
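The override step can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the field names (`claim_id`, `text`, `label`) and the toy examples are assumptions.

```python
def merge_silver_gold(silver, gold):
    """Return training examples where gold labels override silver ones.

    silver, gold: lists of dicts with "claim_id", "text", "label".
    (Field names are illustrative, not the project's actual schema.)
    """
    gold_by_id = {ex["claim_id"]: ex["label"] for ex in gold}
    merged = []
    for ex in silver:
        # Prefer the human-reviewed gold label when one exists for this claim
        label = gold_by_id.get(ex["claim_id"], ex["label"])
        merged.append({**ex, "label": label})
    return merged

silver = [
    {"claim_id": 1, "text": "solar panel mount", "label": 1},
    {"claim_id": 2, "text": "combustion engine", "label": 1},  # noisy silver label
]
gold = [{"claim_id": 2, "text": "combustion engine", "label": 0}]  # human-reviewed

merged = merge_silver_gold(silver, gold)
print([ex["label"] for ex in merged])  # [1, 0]
```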

---

## Pipeline Architecture

The full system used in the project consists of several stages:

1. **Baseline Model**
   - Frozen PatentSBERTa embeddings
   - Logistic Regression classifier

2. **Uncertainty Sampling**
   - Identifies the 100 claims with the highest prediction uncertainty

3. **QLoRA Domain Adaptation**
   - LLM fine-tuned to better understand patent language

4. **Multi-Agent System (MAS)**
   - Advocate agent: argues the claim is green
   - Skeptic agent: argues the claim is not green
   - Judge agent: decides the final classification

5. **Targeted Human Review**
   - A human only reviews cases where MAS confidence is low or the agents disagree

6. **Final Model Training**
   - PatentSBERTa fine-tuned on silver data + gold labels
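Steps 2 and 5 above can be sketched as simple selection rules. The function names, the 0.5 uncertainty pivot, and the 0.7 confidence threshold below are illustrative assumptions, not the project's exact code.

```python
def most_uncertain(probs, n):
    """Step 2 sketch: indices of the n predictions closest to 0.5,
    i.e. the claims the baseline model is least sure about."""
    return sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))[:n]

def needs_human_review(advocate_verdict, skeptic_verdict, judge_confidence,
                       min_confidence=0.7):
    """Step 5 sketch: escalate to a human when the agents' final verdicts
    disagree or the judge's confidence falls below the threshold."""
    return advocate_verdict != skeptic_verdict or judge_confidence < min_confidence

probs = [0.95, 0.51, 0.10, 0.48, 0.70]
print(most_uncertain(probs, 2))       # [1, 3]
print(needs_human_review(1, 1, 0.9))  # False
```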

---

## Evaluation

The model is evaluated on the **eval_silver split** of the dataset.

Primary metric:

**F1 score**

Additional metrics reported:

- Precision
- Recall
- Accuracy
- Confusion Matrix

The evaluation script also exports prediction probabilities for analysis.
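For reference, the reported metrics can be computed from scratch for a binary task as below; the toy labels are illustrative only and do not come from the project's evaluation.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, accuracy, and confusion matrix for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    # Confusion matrix laid out as [[tn, fp], [fn, tp]]
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "confusion": [[tn, fp], [fn, tp]]}

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
m = binary_metrics(y_true, y_pred)
print(round(m["f1"], 3))  # 0.667
```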

---

## Usage

Example inference with Hugging Face Transformers:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YOUR_USERNAME/BDS_M4_exam_final_model"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

text = "A system for capturing carbon emissions using advanced filtration..."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# Index 1 corresponds to the "green technology" label
prob_green = torch.softmax(outputs.logits, dim=-1)[0, 1].item()

print("Probability green:", prob_green)
```

## Limitations

- Silver labels are derived from CPC Y02 classifications and may contain noise.
- Only 100 claims were manually reviewed, meaning supervision improvements are limited to high-uncertainty cases.
- Patent claims can be extremely technical and ambiguous, which may impact classification accuracy.

## Project Context

This model was developed as part of the M4 Advanced AI Systems final assignment.  
The project explores agentic workflows for data labeling, combining:

- QLoRA fine-tuning
- Multi-Agent Systems
- Human-in-the-Loop review
- Transformer fine-tuning

## Citation

If referencing this model in academic work:

**Green Patent Detection with Agentic Workflows.**  
**M4 Advanced AI Systems Final Project.**

## Authors

Student project submission by Daniel Hjerresen.