
🌿 Green Patent Classifier – Multi-Agent HITL Pipeline

A complete pipeline for classifying patent claims as green technology (CPC class Y02) using a combination of fine-tuned language models, a multi-agent debate system, and human-in-the-loop review.


📋 Project Overview

This project implements a 4-part pipeline to classify 50,000 patent claims as green or non-green technology:

| Part | Description | Key Output |
|------|-------------|------------|
| Part A | Silver labeling with PatentSBERTa | patents_50k_green.parquet |
| Part B | QLoRA fine-tuning of Mistral-7B | ./qlora_patent_model/ |
| Part C | Multi-agent debate (CrewAI) on 100 uncertain claims | mas_labeled_100.csv |
| Part D | Human-in-the-loop review + final PatentSBERTa fine-tuning | ./patent_sberta_finetuned_partd/ |

πŸ—οΈ Architecture

```
50,000 Patent Claims
        │
        ▼
┌─────────────────────┐
│  Part A: Silver     │  PatentSBERTa zero-shot → silver labels
│  Labeling           │  Train/Eval/Pool split (40k/5k/5k)
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Part B: QLoRA      │  Mistral-7B-Instruct-v0.3 + QLoRA
│  Fine-Tuning        │  4-bit quantization, 5,000 examples
└─────────┬───────────┘
          │
          ▼
┌─────────────────────────────────────────────┐
│  Part C: Multi-Agent Debate (CrewAI 1.9.x)  │
│                                             │
│  100 most uncertain claims from pool        │
│                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ ADVOCATE │  │ SKEPTIC  │  │  JUDGE   │   │
│  │ (QLoRA   │  │ (Groq    │  │ (Groq    │   │
│  │ Mistral) │  │ Llama-3) │  │ Llama-3) │   │
│  │          │  │          │  │          │   │
│  │ Argues   │  │ Argues   │  │ Weighs   │   │
│  │ FOR      │→ │ AGAINST  │→ │ & Decides│   │
│  │ green    │  │ green    │  │ JSON out │   │
│  └──────────┘  └──────────┘  └──────────┘   │
└─────────┬───────────────────────────────────┘
          │
          ▼
┌─────────────────────┐
│  Part D: HITL +     │  Exception-based human review
│  Final Fine-Tune    │  PatentSBERTa on silver + gold
└─────────────────────┘
```
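Part A's zero-shot silver labeling can be pictured as embedding-similarity thresholding. The sketch below is a toy pure-Python illustration, not the Part_A.py implementation: in the real pipeline the embeddings come from AI-Growth-Lab/PatentSBERTa (via sentence-transformers), and the prototype texts, margin scaling, and 0.5 threshold here are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def silver_label(claim_emb, green_proto, non_green_proto):
    """Zero-shot silver label: compare a claim embedding against two
    prototype embeddings (e.g. of 'green technology' vs. generic text)
    and squash the similarity margin into a pseudo-probability p_green.
    The margin scaling (x10) is an illustrative choice."""
    margin = cosine(claim_emb, green_proto) - cosine(claim_emb, non_green_proto)
    p_green = 1.0 / (1.0 + math.exp(-10.0 * margin))
    return ("green" if p_green >= 0.5 else "non-green"), p_green
```

The p_green values produced this way are what later drives uncertainty sampling: claims with p_green near 0.5 are the ones routed to the multi-agent debate.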

🤖 Models Used

1. QLoRA Fine-Tuned Mistral-7B (Advocate)

  • Base: mistralai/Mistral-7B-Instruct-v0.3
  • Method: QLoRA (4-bit NF4 quantization)
  • Training: 5,000 patent claims, 1 epoch
  • Trainable params: 41,943,040 / 7,289,966,592 (0.58%)
  • Role: The Advocate agent in multi-agent debate

| Metric | Value |
|--------|-------|
| Training Loss | 0.9651 |
| Training Runtime | 6,254 sec (~104 min) |
| Samples/sec | 0.799 |
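The card doesn't include the training script, but the stated setup maps onto a standard peft + bitsandbytes configuration. In the sketch below, the base model and 4-bit NF4 quantization come from the card; r = 16 over all attention and MLP projections is an assumption, chosen because it reproduces exactly 41,943,040 trainable parameters on Mistral-7B, while lora_alpha and lora_dropout are placeholder values.

```python
# Sketch of the Part B setup: Mistral-7B in 4-bit NF4 with a LoRA adapter.
# r=16 and the target-module list are assumptions consistent with the
# card's 41,943,040 / 7,289,966,592 (0.58%) trainable-parameter count.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 quantization, as in the card
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,   # alpha/dropout: assumptions
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~42M of ~7.3B params (~0.58%)
```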

2. Groq Llama-3.1-8B-Instant (Skeptic & Judge)

  • Model: llama-3.1-8b-instant via Groq Cloud API
  • Role: The Skeptic and Judge agents in multi-agent debate
  • Why: Fast inference, reliable JSON output, strong argumentation

3. PatentSBERTa – Final Fine-Tuned Classifier

  • Base: AI-Growth-Lab/PatentSBERTa
  • Task: Binary classification (green vs non-green)
  • Training data: 40,000 (silver) + 100 (gold) = 40,100 examples
  • Epochs: 3 | LR: 2e-05 | Batch: 16

📊 Results

Part C – Multi-Agent Debate Results

| Metric | Value |
|--------|-------|
| Total claims debated | 100 |
| CrewAI success rate | 100/100 |
| Judge confidence: High | 100 |
| Final Green labels | 1 |
| Final Non-Green labels | 99 |
| Required human intervention | 0 out of 100 |
| Human overrode Judge | 0 |

Agent Setup:

  • Advocate (Local QLoRA Mistral-7B): Argued FOR green classification
  • Skeptic (Groq Llama-3.1-8B): Argued AGAINST green classification
  • Judge (Groq Llama-3.1-8B): Weighed both arguments, produced final verdict

All claims reached consensus; no human intervention was required. All 100 claims were auto-accepted on the strength of the Judge's high-confidence verdicts.
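The auto-accept decision can be sketched as a small triage function over the Judge's JSON output. This is an illustration, not the multi_agent_labeling.py code; the field names ("label", "confidence") and the confidence scale are assumptions about the Judge prompt's output schema.

```python
import json

def triage(judge_output: str, accept_confidence: str = "High") -> dict:
    """Exception-based HITL gate: auto-accept well-formed, high-confidence
    Judge verdicts; route everything else (malformed JSON, missing fields,
    lower confidence) to a human reviewer."""
    try:
        verdict = json.loads(judge_output)
        label = verdict["label"]
        confidence = verdict["confidence"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"needs_human": True, "reason": "unparseable verdict"}
    if confidence != accept_confidence:
        return {"needs_human": True, "reason": f"confidence={confidence}"}
    return {"needs_human": False, "label": label}
```

On this run, all 100 verdicts parsed cleanly with confidence "High", so the gate routed 0 claims to a human.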

Part D – PatentSBERTa Fine-Tuning Results

Training Progress

| Epoch | Train Loss | Eval Loss | Eval Accuracy | Eval F1 |
|-------|------------|-----------|---------------|---------|
| 1 | 0.4203 | 0.4431 | 0.7944 | 0.8049 |
| 2 | 0.3556 | 0.4359 | 0.8058 | 0.8172 |
| 3 | 0.2935 | 0.4577 | 0.8096 | 0.8093 |

Best model selected: Epoch 2 (highest F1)

Eval Silver Results (5,000 examples – balanced)

| Metric | Score |
|--------|-------|
| Accuracy | 0.8058 |
| Precision | 0.7720 |
| Recall | 0.8680 |
| F1 | 0.8172 |

Gold 100 Results (100 examples – 99 non-green, 1 green)

| Metric | Score |
|--------|-------|
| Accuracy | 0.5100 |
| Precision | 0.0200 |
| Recall | 1.0000 |
| F1 | 0.0392 |

Note on Gold 100 metrics: The low F1 (0.039) on gold_100 reflects the extreme class imbalance (99 non-green vs 1 green) in the MAS-labeled gold set rather than poor model quality. The Skeptic's stronger argumentation via the cloud LLM led the Judge to classify most borderline claims as non-green, so the gold set contains almost no positives. The eval_silver F1 of 0.8172 on the balanced 5,000-example evaluation set is a better indicator of the model's classification ability.

Classification Report (Gold 100)

```
               precision    recall  f1-score   support

Non-Green (0)     1.0000    0.5051    0.6711        99
    Green (1)     0.0200    1.0000    0.0392         1

     accuracy                         0.5100       100
    macro avg     0.5100    0.7525    0.3552       100
 weighted avg     0.9902    0.5100    0.6648       100
```
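The Gold-100 numbers are internally consistent, and the implied confusion matrix can be recovered from them: recall 1.0 over a single positive means TP = 1 and FN = 0; precision 0.02 then forces 50 green predictions (FP = 49), leaving TN = 50 and accuracy (1 + 50)/100 = 0.51. A quick check:

```python
def prf(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Counts implied by the Gold-100 table: the single true green claim is
# caught (recall 1.0), but 49 of the 99 non-green claims are also
# predicted green.
p, r, f1, acc = prf(tp=1, fp=49, fn=0, tn=50)
print(round(p, 4), round(r, 4), round(f1, 4), round(acc, 4))
# 0.02 1.0 0.0392 0.51
```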

πŸ“ Project Files

Data Files

| File | Description |
|------|-------------|
| patents_50k_green.parquet | 50k patents with silver labels + splits |
| hitl_green_100.csv | 100 most uncertain claims (from pool) |
| mas_labeled_100.csv | 100 claims with agent arguments + Judge verdict |
| hitl_green_100_gold_partd.csv | Final gold labels for 100 claims |
| patents_50k_green_with_gold_partd.parquet | Merged dataset (silver + gold) |

Model Directories

| Directory | Description |
|-----------|-------------|
| ./qlora_patent_model/ | QLoRA adapter for Mistral-7B |
| ./patent_sberta_finetuned_partd/ | Fine-tuned PatentSBERTa classifier |

🚀 How to Run

Prerequisites

```bash
pip install torch transformers peft bitsandbytes datasets
pip install crewai litellm flask pandas scikit-learn
pip install python-dotenv huggingface_hub
```

Environment Variables

```bash
# .env file
GROQ_API_KEY=your_groq_api_key_here
```

Execution Order

```bash
# Part A: Silver labeling
python Part_A.py

# Part B: QLoRA fine-tuning
python qlora_finetune.py

# Part C: Multi-agent debate
python multi_agent_labeling.py

# Part D: HITL review + final fine-tuning
python prepare_gold.py
python part_d_hitl_finetune.py

# Upload to HuggingFace
python upload_final_model.py
```

🔑 Key Design Decisions

  1. Heterogeneous Agent LLMs: The Advocate uses a local fine-tuned model (domain expertise) while the Skeptic and Judge use Groq cloud (stronger reasoning, reliable JSON output).

  2. Uncertainty Sampling: The 100 claims for multi-agent debate were selected based on highest prediction uncertainty (p_green closest to 0.5), targeting the most informative examples for labeling.

  3. Exception-Based HITL: Instead of reviewing all 100 claims, human intervention is only triggered for low-confidence or error cases. All 100 claims reached high-confidence consensus, requiring 0 human interventions.

  4. Gold Override: When merging labels, gold labels from the multi-agent system override silver labels, improving label quality for the most uncertain predictions.
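Decisions 2 and 4 are simple to state precisely. A pure-Python sketch (the real pipeline presumably does this with pandas over the parquet file; the function and field names here are illustrative):

```python
def most_uncertain(claims, k=100):
    """Uncertainty sampling: pick the k claims whose silver p_green is
    closest to 0.5, i.e. where the silver labeler is least decisive."""
    return sorted(claims, key=lambda c: abs(c["p_green"] - 0.5))[:k]

def merge_labels(silver, gold):
    """Gold override: wherever a gold (MAS/HITL) label exists, it
    replaces the silver label; silver labels are kept elsewhere."""
    return {cid: gold.get(cid, label) for cid, label in silver.items()}
```

Together these implement the active-learning loop: debate effort is spent only on the 100 most ambiguous claims, and their higher-quality gold labels then overwrite the silver labels in the merged training set.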


πŸ“ License

This project was developed as part of a university assignment at Aalborg University (AAU).


πŸ™ Acknowledgments

Downloads last month
39
Safetensors
Model size
0.1B params
Tensor type
F32
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support