# Green Patent Classifier – Multi-Agent HITL Pipeline
A complete pipeline for classifying patent claims as green technology (Y02 classification) using a combination of fine-tuned language models, multi-agent debate systems, and human-in-the-loop review.
## Project Overview
This project implements a 4-part pipeline to classify 50,000 patent claims as green or non-green technology:
| Part | Description | Key Output |
|---|---|---|
| Part A | Silver labeling with PatentSBERTa | `patents_50k_green.parquet` |
| Part B | QLoRA fine-tuning of Mistral-7B | `./qlora_patent_model/` |
| Part C | Multi-agent debate (CrewAI) on 100 uncertain claims | `mas_labeled_100.csv` |
| Part D | Human-in-the-loop review + final PatentSBERTa fine-tuning | `./patent_sberta_finetuned_partd/` |
## Architecture

```
50,000 Patent Claims
          │
          ▼
┌─────────────────────┐
│ Part A: Silver      │  PatentSBERTa zero-shot → silver labels
│ Labeling            │  Train/Eval/Pool split (40k/5k/5k)
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│ Part B: QLoRA       │  Mistral-7B-Instruct-v0.3 + QLoRA
│ Fine-Tuning         │  4-bit quantization, 5,000 examples
└─────────┬───────────┘
          │
          ▼
┌──────────────────────────────────────────────┐
│ Part C: Multi-Agent Debate (CrewAI 1.9.x)    │
│                                              │
│ 100 most uncertain claims from pool          │
│                                              │
│ ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│ │ ADVOCATE │   │ SKEPTIC  │   │ JUDGE    │   │
│ │ (QLoRA   │   │ (Groq    │   │ (Groq    │   │
│ │ Mistral) │   │ Llama-3) │   │ Llama-3) │   │
│ │          │   │          │   │          │   │
│ │ Argues   │   │ Argues   │   │ Weighs   │   │
│ │ FOR      │──►│ AGAINST  │──►│ & Decides│   │
│ │ green    │   │ green    │   │ JSON out │   │
│ └──────────┘   └──────────┘   └──────────┘   │
└─────────┬────────────────────────────────────┘
          │
          ▼
┌─────────────────────┐
│ Part D: HITL +      │  Exception-based human review
│ Final Fine-Tune     │  PatentSBERTa on silver + gold
└─────────────────────┘
```
## Models Used

### 1. QLoRA Fine-Tuned Mistral-7B (Advocate)

- Base: `mistralai/Mistral-7B-Instruct-v0.3`
- Method: QLoRA (4-bit NF4 quantization)
- Training: 5,000 patent claims, 1 epoch
- Trainable params: 41,943,040 / 7,289,966,592 (0.58%)
- Role: the Advocate agent in the multi-agent debate
| Metric | Value |
|---|---|
| Training Loss | 0.9651 |
| Training Runtime | 6,254 sec (~104 min) |
| Samples/sec | 0.799 |
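The reported trainable-parameter fraction and throughput follow directly from the run numbers above; a quick stdlib-only sanity check (not part of the pipeline):

```python
# Arithmetic check of the QLoRA run numbers reported above.
trainable_params = 41_943_040
total_params = 7_289_966_592
pct_trainable = 100 * trainable_params / total_params  # ~0.58%

samples = 5_000       # training examples (1 epoch)
runtime_s = 6_254     # reported training runtime in seconds
throughput = samples / runtime_s  # ~0.799 samples/sec

print(f"trainable: {pct_trainable:.2f}%  throughput: {throughput:.3f}/s")
```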
### 2. Groq Llama-3.1-8B-Instant (Skeptic & Judge)

- Model: `llama-3.1-8b-instant` via the Groq Cloud API
- Role: the Skeptic and Judge agents in the multi-agent debate
- Why: fast inference, reliable JSON output, strong argumentation
### 3. PatentSBERTa – Final Fine-Tuned Classifier

- Base: `AI-Growth-Lab/PatentSBERTa`
- Task: binary classification (green vs. non-green)
- Training data: 40,000 silver + 100 gold = 40,100 examples
- Epochs: 3 | LR: 2e-5 | Batch size: 16
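Translated into a Hugging Face `TrainingArguments` sketch (a plausible configuration matching the hyperparameters above, not necessarily the exact one used; model loading, datasets, `compute_metrics`, and the `Trainer` call are omitted):

```python
from transformers import TrainingArguments

# Part D hyperparameters from above. `eval_strategy` is named
# `evaluation_strategy` in older transformers releases.
args = TrainingArguments(
    output_dir="./patent_sberta_finetuned_partd",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,    # keep the epoch with the best F1
    metric_for_best_model="f1",
)
```

Selecting the best checkpoint by eval F1 is consistent with the "best model: epoch 2 (highest F1)" note in the results below.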
## Results

### Part C – Multi-Agent Debate Results
| Metric | Value |
|---|---|
| Total claims debated | 100 |
| CrewAI success rate | 100/100 |
| Judge confidence: High | 100 |
| Final Green labels | 1 |
| Final Non-Green labels | 99 |
| Required human intervention | 0 out of 100 |
| Human overrode Judge | 0 |
Agent Setup:
- Advocate (Local QLoRA Mistral-7B): Argued FOR green classification
- Skeptic (Groq Llama-3.1-8B): Argued AGAINST green classification
- Judge (Groq Llama-3.1-8B): Weighed both arguments, produced final verdict
All claims reached consensus, so no human intervention was required: all 100 claims were auto-accepted based on the Judge's high-confidence verdicts.
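The Judge's "JSON out" step implies parsing a structured verdict and routing low-confidence cases to a human; a minimal sketch, assuming a hypothetical `{"label": ..., "confidence": ...}` schema (the actual field names are not specified in this README):

```python
import json

def parse_verdict(raw: str) -> dict:
    """Parse a Judge verdict; flag it for human review unless confidence is high.

    The label/confidence schema is an assumption for illustration only.
    """
    verdict = json.loads(raw)
    label = verdict["label"].strip().lower()            # "green" / "non-green"
    confidence = verdict["confidence"].strip().lower()
    needs_human = confidence != "high"                  # exception-based HITL
    return {"label": label, "needs_human": needs_human}

result = parse_verdict('{"label": "Non-Green", "confidence": "High"}')
```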
### Part D – PatentSBERTa Fine-Tuning Results

#### Training Progress
| Epoch | Train Loss | Eval Loss | Eval Accuracy | Eval F1 |
|---|---|---|---|---|
| 1 | 0.4203 | 0.4431 | 0.7944 | 0.8049 |
| 2 | 0.3556 | 0.4359 | 0.8058 | 0.8172 |
| 3 | 0.2935 | 0.4577 | 0.8096 | 0.8093 |
Best model selected: Epoch 2 (highest F1)
#### Eval Silver Results (5,000 examples, balanced)
| Metric | Score |
|---|---|
| Accuracy | 0.8058 |
| Precision | 0.7720 |
| Recall | 0.8680 |
| F1 | 0.8172 |
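The reported F1 is consistent with the precision and recall above (F1 is their harmonic mean), which can be checked directly:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.7720, 0.8680
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8172
```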
#### Gold 100 Results (100 examples: 99 non-green, 1 green)
| Metric | Score |
|---|---|
| Accuracy | 0.5100 |
| Precision | 0.0200 |
| Recall | 1.0000 |
| F1 | 0.0392 |
**Note on Gold 100 metrics:** The low F1 (0.039) on gold_100 reflects the extreme class imbalance (99 non-green vs. 1 green) in the MAS-labeled gold set rather than poor model quality. The Skeptic's stronger argumentation via the cloud LLM led the Judge to classify most borderline claims as non-green. The eval_silver F1 of 0.8172 on the balanced 5,000-example evaluation set is a better indicator of the model's classification ability.
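The effect of the imbalance can be made concrete by reconstructing the confusion counts from the per-class recalls in the classification report (stdlib only; counts are inferred, not taken from the raw predictions):

```python
# Gold 100: 99 non-green, 1 green.
# Green recall 1.0 -> the single green claim is caught.
# Non-green recall 0.5051 -> 50 of 99 correctly rejected, 49 false positives.
tp, fn = 1, 0
tn, fp = 50, 49

precision = tp / (tp + fp)                    # 1/50 = 0.02
recall = tp / (tp + fn)                       # 1.0
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)    # 51/100 = 0.51

print(precision, round(f1, 4), accuracy)
```

With only one positive example, even a handful of false positives drives precision (and hence F1) toward zero, regardless of how well the model separates the classes.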
#### Classification Report (Gold 100)

```
               precision    recall  f1-score   support

Non-Green (0)     1.0000    0.5051    0.6711        99
    Green (1)     0.0200    1.0000    0.0392         1

     accuracy                         0.5100       100
    macro avg     0.5100    0.7525    0.3552       100
 weighted avg     0.9902    0.5100    0.6648       100
```
## Project Files

### Data Files

| File | Description |
|---|---|
| `patents_50k_green.parquet` | 50k patents with silver labels + splits |
| `hitl_green_100.csv` | 100 most uncertain claims (from pool) |
| `mas_labeled_100.csv` | 100 claims with agent arguments + Judge verdicts |
| `hitl_green_100_gold_partd.csv` | Final gold labels for the 100 claims |
| `patents_50k_green_with_gold_partd.parquet` | Merged dataset (silver + gold) |
### Model Directories

| Directory | Description |
|---|---|
| `./qlora_patent_model/` | QLoRA adapter for Mistral-7B |
| `./patent_sberta_finetuned_partd/` | Fine-tuned PatentSBERTa classifier |
## How to Run

### Prerequisites

```bash
pip install torch transformers peft bitsandbytes datasets
pip install crewai litellm flask pandas scikit-learn
pip install python-dotenv huggingface_hub
```
### Environment Variables

```bash
# .env file
GROQ_API_KEY=your_groq_api_key_here
```
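Part C needs the Groq key at runtime; a minimal fail-fast check (stdlib only; the hypothetical `require_env` helper is illustrative, and in this project `python-dotenv`'s `load_dotenv()` would populate the environment from `.env` first):

```python
import os

def require_env(name: str, env=os.environ) -> str:
    """Return a required variable from the environment, failing fast if absent."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set - add it to your .env file")
    return value

# Illustrative call against an explicit mapping instead of the real environment:
demo_key = require_env("GROQ_API_KEY", env={"GROQ_API_KEY": "sk-demo"})
```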
Execution Order
# Part A: Silver labeling
python Part_A.py
# Part B: QLoRA fine-tuning
python qlora_finetune.py
# Part C: Multi-agent debate
python multi_agent_labeling.py
# Part D: HITL review + final fine-tuning
python prepare_gold.py
python part_d_hitl_finetune.py
# Upload to HuggingFace
python upload_final_model.py
## Key Design Decisions

- **Heterogeneous agent LLMs:** the Advocate uses a local fine-tuned model (domain expertise), while the Skeptic and Judge use Groq cloud models (stronger reasoning, reliable JSON output).
- **Uncertainty sampling:** the 100 claims for multi-agent debate were selected by highest prediction uncertainty (`p_green` closest to 0.5), targeting the most informative examples for labeling.
- **Exception-based HITL:** instead of reviewing all 100 claims, human intervention is triggered only for low-confidence or error cases. All 100 claims reached high-confidence consensus, so 0 human interventions were needed.
- **Gold override:** when merging labels, gold labels from the multi-agent system override silver labels, improving label quality on the most uncertain predictions.
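The uncertainty-sampling and gold-override steps above can be sketched in plain Python (the `p_green` probabilities and dict-based label stores are illustrative; the pipeline itself works on parquet/CSV files):

```python
def most_uncertain(probs: dict[str, float], k: int = 100) -> list[str]:
    """Pick the k claim IDs whose p_green is closest to 0.5."""
    return sorted(probs, key=lambda cid: abs(probs[cid] - 0.5))[:k]

def merge_labels(silver: dict[str, int], gold: dict[str, int]) -> dict[str, int]:
    """Gold (MAS/HITL) labels override silver labels where both exist."""
    return {**silver, **gold}

probs = {"a": 0.97, "b": 0.51, "c": 0.08, "d": 0.46}
picked = most_uncertain(probs, k=2)            # b and d sit closest to 0.5
merged = merge_labels({"a": 1, "b": 1}, {"b": 0})  # gold wins for "b"
```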
## License
This project was developed as part of a university assignment at Aalborg University (AAU).
## Acknowledgments

- AI-Growth-Lab/PatentSBERTa – base patent embedding model
- Mistral AI – base LLM for QLoRA
- Groq – fast cloud inference API
- CrewAI – multi-agent orchestration framework