Model Card: PatentSBERTa Final Model — QLoRA MAS + Targeted HITL

Model Summary

This is the final fine-tuned version of AI-Growth-Lab/PatentSBERTa for binary classification of patent claims as green technology (Y02) or not. It was developed as the culmination of the Final Assignment in the Applied Deep Learning and AI course at Aalborg University.

Compared to previous iterations, this model uses the most advanced labeling pipeline: a QLoRA fine-tuned Mistral-7B model acting as the Judge in a three-agent Multi-Agent System (MAS), combined with exception-based Human-in-the-Loop (HITL) review targeting only the lowest-confidence claims.


Model Details

  • Developed by: Anders Sønderby (as58zr@student.aau.dk)
  • Model type: Sentence Transformer with classification head (binary)
  • Base model: AI-Growth-Lab/PatentSBERTa
  • Language: English
  • License: MIT
  • Task: Binary text classification — Green Technology (Y02) vs. Not Green

What This Model Does

Given the text of a patent claim, the model predicts whether the claim relates to green technology as defined by the CPC Y02 classification system:

  • 1 — Green technology (Y02)
  • 0 — Not green technology

Full Pipeline Overview

This model is the downstream product of a four-stage pipeline:

Stage 1 — QLoRA Fine-Tuning (Mistral-7B)

Mistral-7B-v0.1 was fine-tuned on 40,000 silver-labeled patent claims using QLoRA (4-bit NF4 quantization + LoRA adapters, r=16). The resulting adapter was pushed to Anders-sonderby/mistral-7b-patent-qlora. This domain-adapted model served as the Judge agent in the MAS.
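The quantization and adapter setup described above can be sketched as a configuration fragment. The 4-bit NF4 quantization and LoRA rank r=16 come from the text; `lora_alpha`, the target modules, dropout, and compute dtype are assumptions and may differ from the actual training run.

```python
# Sketch of the QLoRA setup: 4-bit NF4 quantization plus LoRA adapters (r=16).
# Values not stated in the model card are marked as assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization, as stated
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
    bnb_4bit_use_double_quant=True,         # assumed
)

lora_config = LoraConfig(
    r=16,                                   # LoRA rank, as stated
    lora_alpha=32,                          # assumed value
    target_modules=["q_proj", "v_proj"],    # assumed attention projections
    lora_dropout=0.05,                      # assumed
    task_type="CAUSAL_LM",
)

# Frozen 4-bit base model with a trainable LoRA adapter on top
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(base, lora_config)
```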

Stage 2 — Multi-Agent System (MAS)

A three-agent debate pipeline classified 100 high-risk patent claims:

| Agent | Model | Role |
|---|---|---|
| Advocate | microsoft/Phi-3-mini-4k-instruct (4-bit) | Argues for Y02 green classification |
| Skeptic | microsoft/Phi-3-mini-4k-instruct (4-bit) | Challenges the classification, identifies greenwashing |
| Judge | Anders-sonderby/mistral-7b-patent-qlora | Weighs both arguments, produces label + confidence + rationale |

Both models (Phi-3-mini and the QLoRA Judge) were loaded simultaneously on a single NVIDIA L4 GPU (24 GB VRAM) using 4-bit quantization and sequential CUDA cache clearing.
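The Advocate → Skeptic → Judge control flow can be sketched with stub agents. In the real system each function prompts Phi-3-mini or the QLoRA Judge; here the agents are hypothetical stand-ins so the flow is runnable without GPUs.

```python
# Minimal sketch of the three-agent debate flow. The generate logic inside
# each agent is a stand-in, not the actual LLM prompting code.
from dataclasses import dataclass

@dataclass
class Verdict:
    label: int        # 1 = green (Y02), 0 = not green
    confidence: float
    rationale: str

def advocate(claim: str) -> str:
    # Real system: prompt Phi-3-mini to argue FOR a Y02 classification
    return f"Pro-Y02 argument for: {claim[:40]}"

def skeptic(claim: str) -> str:
    # Real system: prompt Phi-3-mini to challenge the classification
    return f"Greenwashing challenge for: {claim[:40]}"

def judge(claim: str, pro: str, con: str) -> Verdict:
    # Real system: the QLoRA Mistral Judge weighs both arguments and
    # emits label + confidence + rationale. Keyword check is a stub.
    is_green = "solar" in claim.lower()
    return Verdict(int(is_green), 0.9 if is_green else 0.55, f"{pro} | {con}")

def classify(claim: str) -> Verdict:
    pro = advocate(claim)
    con = skeptic(claim)
    return judge(claim, pro, con)
```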

Stage 3 — Exception-Based HITL

Rather than reviewing all 100 claims manually, only claims where the Judge's confidence fell below 0.60 were flagged for human review. This resulted in 48 of 100 claims being reviewed by a human; the Judge's decision was accepted automatically for the remaining 52. Of the 48 reviewed claims, some Judge decisions were upheld and some were overridden by the human reviewer.
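The routing rule above reduces to a single confidence threshold. A minimal sketch, assuming verdicts are `(claim_id, label, confidence)` tuples (a hypothetical schema):

```python
# Exception-based HITL routing: verdicts below the 0.60 confidence
# threshold go to a human reviewer; the rest are auto-accepted.
THRESHOLD = 0.60

def route(verdicts):
    """Split (claim_id, label, confidence) records by confidence."""
    needs_review = [v for v in verdicts if v[2] < THRESHOLD]
    auto_accept = [v for v in verdicts if v[2] >= THRESHOLD]
    return needs_review, auto_accept

verdicts = [("c1", 1, 0.92), ("c2", 0, 0.41), ("c3", 1, 0.58)]
flagged, accepted = route(verdicts)
# flagged holds c2 and c3; accepted holds c1
```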

Stage 4 — Final PatentSBERTa Fine-Tuning

PatentSBERTa was fine-tuned for binary classification using the combined train_silver + hitl_gold_final dataset, where the 100 gold labels override silver labels for the high-risk claims.
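The "gold overrides silver" merge can be sketched as a simple dictionary update; the `claim_id → label` mapping is an assumed schema, not the actual dataset format.

```python
# Sketch of building the final training labels: silver labels everywhere,
# with the 100 gold labels taking precedence on overlapping claim ids.
def merge_labels(silver: dict, gold: dict) -> dict:
    merged = dict(silver)
    merged.update(gold)   # gold wins wherever both exist
    return merged

silver = {"c1": 0, "c2": 1, "c3": 0}
gold = {"c2": 0}          # MAS + HITL review flipped this claim
final = merge_labels(silver, gold)
```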


Training Data

  • Base dataset: Derived from AI-Growth-Lab/patents_claims_1.5m_traim_test
  • Working file: patents_50k_green.parquet — balanced 50k sample (25,000 green, 25,000 not green)
  • Silver labels: CPC Y02* classification codes (is_green_silver)
  • Gold labels: 100 claims labeled via QLoRA MAS + targeted human review (hitl_gold_final.csv)

Dataset Splits Used for Training

| Split | Size | Description |
|---|---|---|
| train_silver | ~40,000 | Silver-labeled training set |
| hitl_gold_final | 100 | Gold-labeled high-risk claims (MAS + HITL) |
| eval_silver | ~5,000 | Silver-labeled evaluation set (not used in training) |

Training Hyperparameters

| Parameter | Value |
|---|---|
| Base model | AI-Growth-Lab/PatentSBERTa |
| Max sequence length | 256 |
| Epochs | 1 |
| Learning rate | 2e-5 |
| Total training examples | ~40,100 |

Evaluation Results

| Metric | Value |
|---|---|
| F1 Score (eval_silver) | 0.821 |
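The reported metric is the standard binary F1 on the positive (green) class, i.e. the harmonic mean of precision and recall. A from-scratch sketch:

```python
# Binary F1 on the positive class: harmonic mean of precision and recall.
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0                       # no true positives: F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```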

Comparative Performance Across All Iterations

| Model Version | Training Data Source | F1 Score |
|---|---|---|
| 1. Baseline | Frozen embeddings (no fine-tuning) | 0.780 |
| 2. Assignment 2 | Fine-tuned on Silver + Gold (simple LLM) | 0.818 |
| 3. Assignment 3 | Fine-tuned on Silver + Gold (MAS, CrewAI) | 0.824 |
| 4. Final Model | Fine-tuned on Silver + Gold (QLoRA MAS + targeted HITL) | 0.821 |

The final model achieves an F1 of 0.821, improving on the simple-LLM pipeline from Assignment 2 (+0.003) but falling slightly below the Assignment 3 MAS result (−0.003). This suggests that the quality of the 100 gold labels, rather than the sophistication of the labeling pipeline, is the primary bottleneck: the marginal differences between iterations indicate that expanding the gold label set beyond 100 claims would likely yield larger gains than further architectural complexity.


HITL Summary

| Metric | Value |
|---|---|
| Total claims reviewed by MAS | 100 |
| Claims flagged for human review (confidence < 0.60) | 48 |
| Claims auto-accepted (confidence ≥ 0.60) | 52 |
| Human intervention rate | 48% |

The high intervention rate of 48% reflects the adversarial debate structure of the MAS — when both Advocate and Skeptic argued effectively, the Judge defaulted to uncertainty rather than making a confident decision. This is a desirable property for a targeted HITL system, as it correctly identifies the genuinely ambiguous claims for human review.


Engineering Notes

  • The QLoRA Judge was fine-tuned on ### Answer: YES/NO format, creating a tension when prompting it to output structured JSON in the MAS. A JSON completion forcing technique (ending the prompt with {) and a robust fallback parser were used to handle malformed outputs
  • Two models (Phi-3-mini + Mistral QLoRA) were loaded simultaneously on a single L4 GPU using 4-bit quantization and max_memory capping
  • The QLoRA adapter must be loaded via PeftModel.from_pretrained() on top of the frozen base model — it cannot be used as a standalone model

Related Models & Resources

| Resource | Link |
|---|---|
| Mistral-7B QLoRA Adapter | Anders-sonderby/mistral-7b-patent-qlora |
| PatentSBERTa Assignment 2 | Anders-sonderby/patentsbert-assignment2 |
| PatentSBERTa Assignment 3 | Anders-sonderby/patentsbert-assignment3 |
| Green Patent Dataset | Anders-sonderby/green-patent-dataset |
| GitHub Repository | Anders-sonderby/M4_assignments |

Intended Use

  • Primary use: Academic research and coursework in patent classification
  • Intended users: Course instructors and students at Aalborg University
  • Out-of-scope: Production patent classification, legal patent assessment, or any commercial use

Limitations

  • Trained on a balanced 50k sample — performance may differ on the full unbalanced patent corpus
  • Gold labels are based on 100 claims only — a larger gold set would likely produce more significant F1 improvements
  • The marginal F1 differences between iterations suggest the 100-claim gold set is a bottleneck regardless of labeling method sophistication