---
license: mit
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- patents
- climate-tech
- green-patents
- patentsberta
- classification
- academic-project
---
# PatentSBERTa Green Patent Classifier (Silver + Gold + MAS + HITL)
This repository contains the **final fine-tuned PatentSBERTa model** developed for the *Advanced Agentic Workflow with QLoRA* final project.
The model classifies **patent claims as green vs non-green technologies**, focusing on climate mitigation technologies aligned with **CPC Y02 classifications**.
The training pipeline combines **silver labels, agent debate labeling, and targeted human review** to improve classification quality on difficult claims.
---
## Model Overview
**Base model:** `AI-Growth-Lab/PatentSBERTa`
**Task:** Binary classification
**Labels:**
| Label | Meaning |
| --- | --- |
| 0 | Non-green technology |
| 1 | Green technology (climate mitigation related) |
The model is fine-tuned through Hugging Face's `AutoModelForSequenceClassification` and outputs two logits, one per label.
---
## Training Data
The training dataset is based on a **balanced 50k patent claim dataset** derived from:
`AI-Growth-Lab/patents_claims_1.5m_train_test`
Dataset composition:
| Source | Description |
| --- | --- |
| Silver Labels | Automatically derived from CPC Y02 indicators |
| Gold Labels | 100 high-uncertainty claims reviewed using MAS + Human-in-the-Loop |
Final training set:
`train_silver + gold_100`
The **gold dataset overrides the silver labels** for those claims to improve supervision on ambiguous cases.
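The override can be pictured as a keyed merge in which a gold row replaces the silver row for the same claim. The sketch below is only an illustration and not the project's actual code: the `claim_id`, `text`, `label`, and `source` fields, the function name `apply_gold_overrides`, and the toy rows are all assumptions made for the example.

```python
def apply_gold_overrides(silver_rows, gold_rows):
    """Combine silver and gold rows; a gold row wins when claim ids collide.

    Both inputs are lists of dicts with the hypothetical keys
    "claim_id", "text" and "label".
    """
    # Start from the silver rows, keyed by claim id (insertion order is kept).
    merged = {row["claim_id"]: {**row, "source": "silver"} for row in silver_rows}
    # Gold rows replace matching silver rows; unseen gold rows are appended.
    for row in gold_rows:
        merged[row["claim_id"]] = {**row, "source": "gold"}
    return list(merged.values())


# Toy data, invented for illustration only.
silver = [
    {"claim_id": 1, "text": "A wind turbine blade", "label": 1},
    {"claim_id": 2, "text": "A combustion engine valve", "label": 1},
    {"claim_id": 3, "text": "A keyboard housing", "label": 0},
]
gold = [{"claim_id": 2, "text": "A combustion engine valve", "label": 0}]

train = apply_gold_overrides(silver, gold)
for row in train:
    print(row["claim_id"], row["label"], row["source"])
```

Keeping a `source` field makes it easy to check afterwards how many training rows carry a reviewed label.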
---
## Pipeline Architecture
The full system used in the project consists of several stages:
1. **Baseline Model**
- Frozen PatentSBERTa embeddings
- Logistic Regression classifier
2. **Uncertainty Sampling**
- Identifies the 100 claims with the highest prediction uncertainty
3. **QLoRA Domain Adaptation**
- LLM fine-tuned to better understand patent language
4. **Multi-Agent System (MAS)**
- Advocate agent: argues claim is green
- Skeptic agent: argues claim is not green
- Judge agent: decides final classification
5. **Targeted Human Review**
- A human reviews only the cases where MAS confidence is low or the agents disagree
6. **Final Model Training**
- PatentSBERTa fine-tuned on silver data + gold labels
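
The uncertainty sampling in step 2 can be sketched as follows. The source does not say which uncertainty measure the baseline used, so this example assumes a common choice for binary classifiers: distance of the predicted green probability from 0.5. The function name `select_most_uncertain` and the sample probabilities are hypothetical.

```python
def select_most_uncertain(probabilities, k=100):
    """Return the indices of the k claims whose P(green) is closest to 0.5.

    Assumption: uncertainty is measured as |p - 0.5|; the project may have
    used a different criterion (for example, entropy).
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Invented baseline probabilities for five claims.
probs = [0.97, 0.52, 0.10, 0.45, 0.61]
uncertain_idx = select_most_uncertain(probs, k=2)
print(uncertain_idx)
```

With a baseline like the one in step 1, the probabilities would come from the logistic regression's predicted probability for the green class.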
---
## Evaluation
The model is evaluated on the **eval_silver split** of the dataset, so the reported metrics measure agreement with the silver (CPC Y02-derived) labels.
Primary metric:
**F1 score**
Additional metrics reported:
- Precision
- Recall
- Accuracy
- Confusion Matrix
The evaluation script also exports prediction probabilities for analysis.
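For reference, the listed metrics all follow from the four confusion-matrix counts. The sketch below is not the project's evaluation script; it is a dependency-free illustration that assumes label 1 (green) is the positive class, and the function name `binary_metrics` and the toy labels are invented for the example.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1, accuracy and the confusion matrix.

    Assumption: label 1 (green) is treated as the positive class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        # Rows are true labels (0, 1); columns are predicted labels (0, 1).
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The confusion-matrix layout (rows for true labels, columns for predicted labels) matches the convention used by scikit-learn's `confusion_matrix`, which provides the same quantities together with `f1_score`, `precision_score`, `recall_score`, and `accuracy_score`.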
---
## Usage
Example inference using HuggingFace Transformers:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace YOUR_USERNAME with the account that hosts the model
model_name = "YOUR_USERNAME/BDS_M4_exam_final_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

text = "A system for capturing carbon emissions using advanced filtration..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# No gradients are needed for inference
with torch.no_grad():
    outputs = model(**inputs)

# Index 1 of the softmax output is the "green" label
prob_green = torch.softmax(outputs.logits, dim=-1)[0, 1].item()
print("Probability green:", prob_green)
```
## Limitations
- Silver labels are derived from CPC Y02 classifications and may contain noise.
- Only 100 claims went through the MAS and human review stage, so the improvement in supervision is limited to those high-uncertainty cases.
- Patent claims can be extremely technical and ambiguous, which may impact classification accuracy.
## Project Context
This model was developed as part of the M4 Advanced AI Systems final assignment.
The project explores agentic workflows for data labeling, combining:
- QLoRA fine-tuning
- Multi-Agent Systems
- Human-in-the-Loop review
- Transformer fine-tuning
## Citation
If referencing this model in academic work:
**Green Patent Detection with Agentic Workflows.**
**M4 Advanced AI Systems Final Project.**
## Authors
Student project submission by Daniel Hjerresen.