# Green Patent Classifier – Multi-Agent HITL Pipeline
A complete pipeline for classifying patent claims as green technology (Y02 classification) using a combination of fine-tuned language models, multi-agent debate systems, and human-in-the-loop review.
## Project Overview
This project implements a 4-part pipeline to classify 50,000 patent claims as green or non-green technology:
| Part | Description | Key Output |
|---|---|---|
| Part A | Silver labeling with PatentSBERTa | `patents_50k_green.parquet` |
| Part B | QLoRA fine-tuning of Mistral-7B | `./qlora_patent_model/` |
| Part C | Multi-agent debate (CrewAI) on 100 uncertain claims | `mas_labeled_100.csv` |
| Part D | Human-in-the-loop review + final PatentSBERTa fine-tuning | `./patent_sberta_finetuned_partd/` |
## Architecture

```
50,000 Patent Claims
          │
          ▼
┌─────────────────────┐
│ Part A: Silver      │  PatentSBERTa zero-shot → silver labels
│ Labeling            │  Train/Eval/Pool split (40k/5k/5k)
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│ Part B: QLoRA       │  Mistral-7B-Instruct-v0.3 + QLoRA
│ Fine-Tuning         │  4-bit quantization, 5,000 examples
└─────────┬───────────┘
          │
          ▼
┌──────────────────────────────────────────────┐
│ Part C: Multi-Agent Debate (CrewAI 1.9.x)    │
│                                              │
│ 100 most uncertain claims from pool          │
│                                              │
│ ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│ │ ADVOCATE │   │ SKEPTIC  │   │ JUDGE    │   │
│ │ (QLoRA   │   │ (Groq    │   │ (Groq    │   │
│ │ Mistral) │   │ Llama-3) │   │ Llama-3) │   │
│ │          │   │          │   │          │   │
│ │ Argues   │   │ Argues   │   │ Weighs   │   │
│ │ FOR      │──►│ AGAINST  │──►│ & Decides│   │
│ │ green    │   │ green    │   │ JSON out │   │
│ └──────────┘   └──────────┘   └──────────┘   │
└─────────┬────────────────────────────────────┘
          │
          ▼
┌─────────────────────┐
│ Part D: HITL +      │  Exception-based human review
│ Final Fine-Tune     │  PatentSBERTa on silver + gold
└─────────────────────┘
```
## Models Used

### 1. QLoRA Fine-Tuned Mistral-7B (Advocate)

- Base: `mistralai/Mistral-7B-Instruct-v0.3`
- Method: QLoRA (4-bit NF4 quantization)
- Training: 5,000 patent claims, 1 epoch
- Trainable params: 41,943,040 / 7,289,966,592 (0.58%)
- Role: the Advocate agent in the multi-agent debate
| Metric | Value |
|---|---|
| Training Loss | 0.9651 |
| Training Runtime | 6,254 sec (~104 min) |
| Samples/sec | 0.799 |
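The reported trainable-parameter fraction and throughput follow directly from the run numbers above; a quick stdlib-only sanity check (not part of the pipeline):

```python
# Arithmetic check of the QLoRA run numbers reported above.
trainable_params = 41_943_040
total_params = 7_289_966_592
pct_trainable = 100 * trainable_params / total_params  # ~0.58%

samples = 5_000       # training examples (1 epoch)
runtime_s = 6_254     # reported training runtime in seconds
throughput = samples / runtime_s  # ~0.799 samples/sec

print(f"trainable: {pct_trainable:.2f}%  throughput: {throughput:.3f}/s")
```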
### 2. Groq Llama-3.1-8B-Instant (Skeptic & Judge)

- Model: `llama-3.1-8b-instant` via the Groq Cloud API
- Role: the Skeptic and Judge agents in the multi-agent debate
- Why: fast inference, reliable JSON output, strong argumentation
### 3. PatentSBERTa – Final Fine-Tuned Classifier

- Base: `AI-Growth-Lab/PatentSBERTa`
- Task: binary classification (green vs. non-green)
- Training data: 40,000 silver + 100 gold = 40,100 examples
- Epochs: 3 | LR: 2e-5 | Batch size: 16
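Translated into a Hugging Face `TrainingArguments` sketch (a plausible configuration matching the hyperparameters above, not necessarily the exact one used; model loading, datasets, `compute_metrics`, and the `Trainer` call are omitted):

```python
from transformers import TrainingArguments

# Part D hyperparameters from above. `eval_strategy` is named
# `evaluation_strategy` in older transformers releases.
args = TrainingArguments(
    output_dir="./patent_sberta_finetuned_partd",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,    # keep the epoch with the best F1
    metric_for_best_model="f1",
)
```

Selecting the best checkpoint by eval F1 is consistent with the "best model: epoch 2 (highest F1)" note in the results below.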
## Results

### Part C – Multi-Agent Debate Results
| Metric | Value |
|---|---|
| Total claims debated | 100 |
| CrewAI success rate | 100/100 |
| Judge confidence: High | 100 |
| Final Green labels | 1 |
| Final Non-Green labels | 99 |
| Required human intervention | 0 out of 100 |
| Human overrode Judge | 0 |
Agent Setup:
- Advocate (Local QLoRA Mistral-7B): Argued FOR green classification
- Skeptic (Groq Llama-3.1-8B): Argued AGAINST green classification
- Judge (Groq Llama-3.1-8B): Weighed both arguments, produced final verdict
All claims reached consensus, so no human intervention was required: all 100 claims were auto-accepted based on the Judge's high-confidence verdicts.
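The Judge's "JSON out" step implies parsing a structured verdict and routing low-confidence cases to a human; a minimal sketch, assuming a hypothetical `{"label": ..., "confidence": ...}` schema (the actual field names are not specified in this README):

```python
import json

def parse_verdict(raw: str) -> dict:
    """Parse a Judge verdict; flag it for human review unless confidence is high.

    The label/confidence schema is an assumption for illustration only.
    """
    verdict = json.loads(raw)
    label = verdict["label"].strip().lower()            # "green" / "non-green"
    confidence = verdict["confidence"].strip().lower()
    needs_human = confidence != "high"                  # exception-based HITL
    return {"label": label, "needs_human": needs_human}

result = parse_verdict('{"label": "Non-Green", "confidence": "High"}')
```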
### Part D – PatentSBERTa Fine-Tuning Results

#### Training Progress
| Epoch | Train Loss | Eval Loss | Eval Accuracy | Eval F1 |
|---|---|---|---|---|
| 1 | 0.4203 | 0.4431 | 0.7944 | 0.8049 |
| 2 | 0.3556 | 0.4359 | 0.8058 | 0.8172 |
| 3 | 0.2935 | 0.4577 | 0.8096 | 0.8093 |
Best model selected: Epoch 2 (highest F1)
#### Eval Silver Results (5,000 examples, balanced)
| Metric | Score |
|---|---|
| Accuracy | 0.8058 |
| Precision | 0.7720 |
| Recall | 0.8680 |
| F1 | 0.8172 |
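The reported F1 is consistent with the precision and recall above (F1 is their harmonic mean), which can be checked directly:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.7720, 0.8680
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8172
```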
#### Gold 100 Results (100 examples: 99 non-green, 1 green)
| Metric | Score |
|---|---|
| Accuracy | 0.5100 |
| Precision | 0.0200 |
| Recall | 1.0000 |
| F1 | 0.0392 |
**Note on Gold 100 metrics:** The low F1 (0.039) on gold_100 reflects the extreme class imbalance (99 non-green vs. 1 green) in the MAS-labeled gold set rather than poor model quality. The Skeptic's stronger argumentation via the cloud LLM led the Judge to classify most borderline claims as non-green. The eval_silver F1 of 0.8172 on the balanced 5,000-example evaluation set is a better indicator of the model's classification ability.
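The effect of the imbalance can be made concrete by reconstructing the confusion counts from the per-class recalls in the classification report (stdlib only; counts are inferred, not taken from the raw predictions):

```python
# Gold 100: 99 non-green, 1 green.
# Green recall 1.0 -> the single green claim is caught.
# Non-green recall 0.5051 -> 50 of 99 correctly rejected, 49 false positives.
tp, fn = 1, 0
tn, fp = 50, 49

precision = tp / (tp + fp)                    # 1/50 = 0.02
recall = tp / (tp + fn)                       # 1.0
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)    # 51/100 = 0.51

print(precision, round(f1, 4), accuracy)
```

With only one positive example, even a handful of false positives drives precision (and hence F1) toward zero, regardless of how well the model separates the classes.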
#### Classification Report (Gold 100)

```
               precision    recall  f1-score   support

Non-Green (0)     1.0000    0.5051    0.6711        99
    Green (1)     0.0200    1.0000    0.0392         1

     accuracy                         0.5100       100
    macro avg     0.5100    0.7525    0.3552       100
 weighted avg     0.9902    0.5100    0.6648       100
```
## Project Files

### Data Files

| File | Description |
|---|---|
| `patents_50k_green.parquet` | 50k patents with silver labels + splits |
| `hitl_green_100.csv` | 100 most uncertain claims (from pool) |
| `mas_labeled_100.csv` | 100 claims with agent arguments + Judge verdicts |
| `hitl_green_100_gold_partd.csv` | Final gold labels for the 100 claims |
| `patents_50k_green_with_gold_partd.parquet` | Merged dataset (silver + gold) |
### Model Directories

| Directory | Description |
|---|---|
| `./qlora_patent_model/` | QLoRA adapter for Mistral-7B |
| `./patent_sberta_finetuned_partd/` | Fine-tuned PatentSBERTa classifier |
## How to Run

### Prerequisites

```bash
pip install torch transformers peft bitsandbytes datasets
pip install crewai litellm flask pandas scikit-learn
pip install python-dotenv huggingface_hub
```
### Environment Variables

```bash
# .env file
GROQ_API_KEY=your_groq_api_key_here
```
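Part C needs the Groq key at runtime; a minimal fail-fast check (stdlib only; the hypothetical `require_env` helper is illustrative, and in this project `python-dotenv`'s `load_dotenv()` would populate the environment from `.env` first):

```python
import os

def require_env(name: str, env=os.environ) -> str:
    """Return a required variable from the environment, failing fast if absent."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set - add it to your .env file")
    return value

# Illustrative call against an explicit mapping instead of the real environment:
demo_key = require_env("GROQ_API_KEY", env={"GROQ_API_KEY": "sk-demo"})
```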
Execution Order
# Part A: Silver labeling
python Part_A.py
# Part B: QLoRA fine-tuning
python qlora_finetune.py
# Part C: Multi-agent debate
python multi_agent_labeling.py
# Part D: HITL review + final fine-tuning
python prepare_gold.py
python part_d_hitl_finetune.py
# Upload to HuggingFace
python upload_final_model.py
## Key Design Decisions

- **Heterogeneous agent LLMs:** the Advocate uses a local fine-tuned model (domain expertise), while the Skeptic and Judge use Groq cloud models (stronger reasoning, reliable JSON output).
- **Uncertainty sampling:** the 100 claims for multi-agent debate were selected by highest prediction uncertainty (`p_green` closest to 0.5), targeting the most informative examples for labeling.
- **Exception-based HITL:** instead of reviewing all 100 claims, human intervention is triggered only for low-confidence or error cases. All 100 claims reached high-confidence consensus, so 0 human interventions were needed.
- **Gold override:** when merging labels, gold labels from the multi-agent system override silver labels, improving label quality on the most uncertain predictions.
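The uncertainty-sampling and gold-override steps above can be sketched in plain Python (the `p_green` probabilities and dict-based label stores are illustrative; the pipeline itself works on parquet/CSV files):

```python
def most_uncertain(probs: dict[str, float], k: int = 100) -> list[str]:
    """Pick the k claim IDs whose p_green is closest to 0.5."""
    return sorted(probs, key=lambda cid: abs(probs[cid] - 0.5))[:k]

def merge_labels(silver: dict[str, int], gold: dict[str, int]) -> dict[str, int]:
    """Gold (MAS/HITL) labels override silver labels where both exist."""
    return {**silver, **gold}

probs = {"a": 0.97, "b": 0.51, "c": 0.08, "d": 0.46}
picked = most_uncertain(probs, k=2)            # b and d sit closest to 0.5
merged = merge_labels({"a": 1, "b": 1}, {"b": 0})  # gold wins for "b"
```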
## License
This project was developed as part of a university assignment at Aalborg University (AAU).
## Acknowledgments

- AI-Growth-Lab/PatentSBERTa – base patent embedding model
- Mistral AI – base LLM for QLoRA
- Groq – fast cloud inference API
- CrewAI – multi-agent orchestration framework