Green Patent Detection via Iterative Label Refinement

This repository contains the implementation for "Green Patent Detection via Iterative Label Refinement with Agentic Reasoning and Targeted HITL" — a project investigating whether progressively improving training labels can enhance green patent classification performance.

📋 Abstract

We present an iterative pipeline for binary classification of patent claims into green versus non-green categories (operationalized through Y02-related relevance). The system evolves from frozen embeddings to a sophisticated multi-agent architecture with targeted human-in-the-loop (HITL) validation, achieving an F1 score of 0.8113 on the final iteration.

πŸ—οΈ Pipeline Architecture

The system is developed across four iterative assignments, maintaining a consistent PatentSBERTa backbone while progressively improving label quality:

| Stage | Method | F1 Score | Gain |
|-------|--------|----------|------|
| A1 (Baseline) | Logistic Regression on Frozen Embeddings | 0.7693 | — |
| A2 | PatentSBERTa Fine-Tuning | 0.8056 | +0.0363 |
| A3 | Optimized Transformer + Uncertainty Sampling | 0.8076 | +0.0020 |
| A4 (Final) | MAS + Targeted HITL → Retrain | 0.8113 | +0.0037 |

Key Components

  • πŸ” Uncertainty Sampling: Identifies top-100 high-risk claims closest to the decision boundary for targeted refinement
  • πŸ€– Multi-Agent System (MAS): Three-agent workflow:
    • Advocate: Argues for green label classification
    • Skeptic: Argues against (including greenwashing detection)
    • Judge: Produces final label with confidence score and rationale
  • πŸ‘€ Targeted HITL: Human review focused only on low-confidence or structurally invalid decisions
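The uncertainty-sampling step can be sketched as follows. This is a minimal illustration; the helper name and the use of predicted probabilities as the uncertainty signal are assumptions, not taken from the repository:

```python
def select_high_risk(probs, k=100):
    """Return the indices of the k claims whose predicted green-probability
    lies closest to the 0.5 decision boundary, i.e. the most uncertain ones."""
    return sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))[:k]

# Toy usage: the middle claim sits nearest the boundary
high_risk = select_high_risk([0.95, 0.51, 0.10], k=1)
```

The selected indices are then fed to the MAS for re-labeling, keeping the expensive agentic reasoning focused on ambiguous cases only.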

📊 Results

The largest performance gain occurs from baseline to fine-tuned PatentSBERTa (A2), demonstrating the value of task-specific adaptation. Subsequent iterations show consistent incremental improvements from label-quality refinement, with the final system achieving the best overall F1 score of 0.8113.

Key findings:

  • Fine-tuning provides substantial gains over frozen embeddings
  • Uncertainty-driven selection focuses refinement efforts on ambiguous cases
  • Agentic reasoning with targeted HITL improves reliability in error-prone regions
  • Label quality improvements yield measurable performance gains beyond standard fine-tuning

🛠️ Technical Implementation

Model Details

  • Base Model: AI-Growth-Lab/PatentSBERTa
  • Fine-tuning: Full fine-tuning (A2-A3); QLoRA (A4)
  • Quantization: 4-bit loading for local deployment (Mistral-7B agents)
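The 4-bit loading and QLoRA setup could look roughly like the configuration sketch below, using the `transformers` and `peft` libraries. The specific hyperparameters (quantization type, LoRA rank, alpha, dropout) are illustrative assumptions, not values from the repository:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for loading the Mistral-7B agents
# on a GPU-constrained local machine (values are assumptions)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter config for the QLoRA fine-tuning pass in A4
# (rank/alpha/dropout are illustrative, not from the repository)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="SEQ_CLS",
)
```

Both configs are passed at model-loading time (`quantization_config=bnb_config`) and adapter-wrapping time (`get_peft_model(model, lora_config)`), respectively.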

Dataset

  • Size: 50k balanced benchmark dataset
  • Split: Fixed 35k/10k/5k train/test/eval
  • Task: Binary classification (green vs. non-green via Y02 relevance)
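A fixed split like the one above can be reproduced deterministically by shuffling indices once with a pinned seed. The helper name and seed value below are illustrative, not from the repository:

```python
import random

def fixed_split(n_examples=50_000, seed=42):
    """Shuffle indices once with a pinned seed so the 35k/10k/5k
    train/test/eval partition stays identical across iterations A1-A4."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    return idx[:35_000], idx[35_000:45_000], idx[45_000:]

train_idx, test_idx, eval_idx = fixed_split()
```

Because the shuffle is seeded, every iteration trains and evaluates on exactly the same examples, isolating label quality as the only changing variable.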

Engineering Challenges Addressed

  • Local Deployment: 4-bit quantization and memory optimization for GPU-constrained environments
  • Agent Orchestration: Structured JSON outputs for label, confidence, and rationale
  • Reproducibility: Consistent data splits and fixed high-risk sets across iterations
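The structured-output contract for the Judge agent can be enforced with a small validator; any reply that fails it is routed to human review, which is exactly the "structurally invalid" trigger for targeted HITL. The required keys follow the components listed above, while the exact label strings and thresholds are assumptions:

```python
import json

REQUIRED_KEYS = {"label", "confidence", "rationale"}

def parse_judge_output(raw):
    """Return the Judge's decision dict if structurally valid, else None
    so the claim falls through to targeted human review (HITL)."""
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(out, dict) or not REQUIRED_KEYS <= out.keys():
        return None
    if out["label"] not in ("green", "non-green"):  # assumed label strings
        return None
    conf = out["confidence"]
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        return None
    return out

valid = parse_judge_output(
    '{"label": "green", "confidence": 0.87, "rationale": "Claim targets PV efficiency."}'
)
invalid = parse_judge_output("not json at all")
```

Decisions that parse but carry low confidence can be routed to the same human-review queue, so both failure modes share one escalation path.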

🚀 Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned PatentSBERTa classifier (replace with the actual model ID)
model_name = "your-username/green-patent-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify a patent claim
claim = "A solar panel assembly with improved photovoltaic efficiency..."
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()  # index of the predicted class
```