---
license: mit
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- patents
- climate-tech
- green-patents
- patentsberta
- classification
- academic-project
---

# PatentSBERTa Green Patent Classifier (Silver + Gold + MAS + HITL)

This repository contains the **final fine-tuned PatentSBERTa model** developed for the *Advanced Agentic Workflow with QLoRA* final project. The model classifies **patent claims as green vs. non-green technologies**, focusing on climate mitigation technologies aligned with **CPC Y02 classifications**.

The training pipeline combines **silver labels, agent-debate labeling, and targeted human review** to improve classification quality on difficult claims.

---

## Model Overview

**Base model:** `AI-Growth-Lab/PatentSBERTa`
**Task:** Binary classification
**Labels:**

| Label | Meaning |
| --- | --- |
| 0 | Non-green technology |
| 1 | Green technology (climate mitigation related) |

The model is fine-tuned using the Hugging Face `AutoModelForSequenceClassification` class.

---

## Training Data

The training dataset is a **balanced 50k patent claim dataset** derived from `AI-Growth-Lab/patents_claims_1.5m_train_test`.

Dataset composition:

| Source | Description |
| --- | --- |
| Silver labels | Automatically derived from CPC Y02 indicators |
| Gold labels | 100 high-uncertainty claims reviewed using MAS + Human-in-the-Loop |

Final training set: `train_silver + gold_100`

The **gold labels override the silver labels** for those claims, improving supervision on ambiguous cases.

---

## Pipeline Architecture

The full system used in the project consists of several stages:

1. **Baseline Model**
   - Frozen PatentSBERTa embeddings
   - Logistic Regression classifier
2. **Uncertainty Sampling**
   - Identifies the 100 claims with the highest prediction uncertainty
3. **QLoRA Domain Adaptation**
   - LLM fine-tuned to better understand patent language
4.
   **Multi-Agent System (MAS)**
   - Advocate agent: argues that the claim is green
   - Skeptic agent: argues that the claim is not green
   - Judge agent: decides the final classification
5. **Targeted Human Review**
   - A human reviews only the cases where MAS confidence is low or the agents disagree
6. **Final Model Training**
   - PatentSBERTa fine-tuned on the silver data + gold labels

---

## Evaluation

The model is evaluated on the **eval_silver split** of the dataset.

Primary metric: **F1 score**

Additional metrics reported:

- Precision
- Recall
- Accuracy
- Confusion matrix

The evaluation script also exports prediction probabilities for analysis.

---

## Usage

Example inference using Hugging Face Transformers:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YOUR_USERNAME/BDS_M4_exam_final_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

text = "A system for capturing carbon emissions using advanced filtration..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

prob_green = torch.softmax(outputs.logits, dim=-1)[0, 1].item()
print("Probability green:", prob_green)
```

## Limitations

- Silver labels are derived from CPC Y02 classifications and may contain noise.
- Only 100 claims were manually reviewed, so supervision improvements are limited to high-uncertainty cases.
- Patent claims can be extremely technical and ambiguous, which may reduce classification accuracy.

## Project Context

This model was developed as part of the M4 Advanced AI Systems final assignment.
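As an illustration, the uncertainty-sampling stage of the pipeline can be sketched as follows. This is a minimal sketch, not the project's actual code: the margin-based selection rule, the function name, and the toy probabilities are all assumptions.

```python
import numpy as np

def select_most_uncertain(probs: np.ndarray, k: int = 100) -> np.ndarray:
    """Return indices of the k claims whose predicted probability of
    'green' is closest to 0.5, i.e. where the classifier is least sure."""
    uncertainty = -np.abs(probs - 0.5)  # larger value = more uncertain
    return np.argsort(uncertainty)[-k:]

# Toy example: predicted P(green) for five claims from the baseline model.
probs = np.array([0.95, 0.52, 0.10, 0.49, 0.80])
print(select_most_uncertain(probs, k=2))  # selects indices 1 and 3 (0.52 and 0.49)
```

In the actual pipeline, the 100 claims selected this way are the ones routed to the MAS debate and, where needed, to human review.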
The project explores agentic workflows for data labeling, combining:

- QLoRA fine-tuning
- Multi-Agent Systems
- Human-in-the-Loop review
- Transformer fine-tuning

## Citation

If referencing this model in academic work:

**Green Patent Detection with Agentic Workflows.**
**M4 Advanced AI Systems Final Project.**

## Authors

Student project submission by Daniel Hjerresen.