Model Card: PatentSBERTa Final Model — QLoRA MAS + Targeted HITL

Model Summary

This is the final fine-tuned version of AI-Growth-Lab/PatentSBERTa for binary classification of patent claims as green technology (Y02) or not. It was developed as the culmination of the Final Assignment in the Applied Deep Learning and AI course at Aalborg University.

Compared to previous iterations, this model uses the most advanced labeling pipeline: a QLoRA fine-tuned Mistral-7B model acting as the Judge in a three-agent Multi-Agent System (MAS), combined with exception-based Human-in-the-Loop (HITL) review targeting only the lowest-confidence claims.


Model Details

  • Developed by: Anders Sønderby (as58zr@student.aau.dk)
  • Model type: Sentence Transformer with classification head (binary)
  • Base model: AI-Growth-Lab/PatentSBERTa
  • Language: English
  • License: MIT
  • Task: Binary text classification — Green Technology (Y02) vs. Not Green

What This Model Does

Given the text of a patent claim, the model predicts whether the claim relates to green technology as defined by the CPC Y02 classification system:

  • 1 — Green technology (Y02)
  • 0 — Not green technology

Full Pipeline Overview

This model is the downstream product of a four-stage pipeline:

Stage 1 — QLoRA Fine-Tuning (Mistral-7B)

Mistral-7B-v0.1 was fine-tuned on 40,000 silver-labeled patent claims using QLoRA (4-bit NF4 quantization + LoRA adapters, r=16). The resulting adapter was pushed to Anders-sonderby/mistral-7b-patent-qlora. This domain-adapted model served as the Judge agent in the MAS.
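The quantization and adapter setup described above can be sketched as a configuration fragment. The 4-bit NF4 quantization and LoRA rank r=16 come from the text; `lora_alpha`, the target modules, dropout, and compute dtype are assumptions and may differ from the actual training run.

```python
# Sketch of the QLoRA setup: 4-bit NF4 quantization plus LoRA adapters (r=16).
# Values not stated in the model card are marked as assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization, as stated
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
    bnb_4bit_use_double_quant=True,         # assumed
)

lora_config = LoraConfig(
    r=16,                                   # LoRA rank, as stated
    lora_alpha=32,                          # assumed value
    target_modules=["q_proj", "v_proj"],    # assumed attention projections
    lora_dropout=0.05,                      # assumed
    task_type="CAUSAL_LM",
)

# Frozen 4-bit base model with a trainable LoRA adapter on top
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(base, lora_config)
```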

Stage 2 — Multi-Agent System (MAS)

A three-agent debate pipeline classified 100 high-risk patent claims:

| Agent | Model | Role |
|---|---|---|
| Advocate | microsoft/Phi-3-mini-4k-instruct (4-bit) | Argues for Y02 green classification |
| Skeptic | microsoft/Phi-3-mini-4k-instruct (4-bit) | Challenges the classification, identifies greenwashing |
| Judge | Anders-sonderby/mistral-7b-patent-qlora | Weighs both arguments, produces label + confidence + rationale |

Both models (Phi-3-mini and the QLoRA Judge) were loaded simultaneously on a single NVIDIA L4 GPU (24 GB VRAM) using 4-bit quantization and sequential CUDA cache clearing.
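The Advocate → Skeptic → Judge control flow can be sketched with stub agents. In the real system each function prompts Phi-3-mini or the QLoRA Judge; here the agents are hypothetical stand-ins so the flow is runnable without GPUs.

```python
# Minimal sketch of the three-agent debate flow. The generate logic inside
# each agent is a stand-in, not the actual LLM prompting code.
from dataclasses import dataclass

@dataclass
class Verdict:
    label: int        # 1 = green (Y02), 0 = not green
    confidence: float
    rationale: str

def advocate(claim: str) -> str:
    # Real system: prompt Phi-3-mini to argue FOR a Y02 classification
    return f"Pro-Y02 argument for: {claim[:40]}"

def skeptic(claim: str) -> str:
    # Real system: prompt Phi-3-mini to challenge the classification
    return f"Greenwashing challenge for: {claim[:40]}"

def judge(claim: str, pro: str, con: str) -> Verdict:
    # Real system: the QLoRA Mistral Judge weighs both arguments and
    # emits label + confidence + rationale. Keyword check is a stub.
    is_green = "solar" in claim.lower()
    return Verdict(int(is_green), 0.9 if is_green else 0.55, f"{pro} | {con}")

def classify(claim: str) -> Verdict:
    pro = advocate(claim)
    con = skeptic(claim)
    return judge(claim, pro, con)
```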

Stage 3 — Exception-Based HITL

Rather than reviewing all 100 claims manually, only claims where the Judge's confidence fell below 0.60 were flagged for human review. This resulted in 48 of 100 claims being reviewed by a human; the Judge's decision was accepted automatically for the remaining 52. Of the 48 reviewed claims, some Judge decisions were upheld and some were overridden by the human reviewer.
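The routing rule above reduces to a single confidence threshold. A minimal sketch, assuming verdicts are `(claim_id, label, confidence)` tuples (a hypothetical schema):

```python
# Exception-based HITL routing: verdicts below the 0.60 confidence
# threshold go to a human reviewer; the rest are auto-accepted.
THRESHOLD = 0.60

def route(verdicts):
    """Split (claim_id, label, confidence) records by confidence."""
    needs_review = [v for v in verdicts if v[2] < THRESHOLD]
    auto_accept = [v for v in verdicts if v[2] >= THRESHOLD]
    return needs_review, auto_accept

verdicts = [("c1", 1, 0.92), ("c2", 0, 0.41), ("c3", 1, 0.58)]
flagged, accepted = route(verdicts)
# flagged holds c2 and c3; accepted holds c1
```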

Stage 4 — Final PatentSBERTa Fine-Tuning

PatentSBERTa was fine-tuned for binary classification using the combined train_silver + hitl_gold_final dataset, where the 100 gold labels override silver labels for the high-risk claims.
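The "gold overrides silver" merge can be sketched as a simple dictionary update; the `claim_id → label` mapping is an assumed schema, not the actual dataset format.

```python
# Sketch of building the final training labels: silver labels everywhere,
# with the 100 gold labels taking precedence on overlapping claim ids.
def merge_labels(silver: dict, gold: dict) -> dict:
    merged = dict(silver)
    merged.update(gold)   # gold wins wherever both exist
    return merged

silver = {"c1": 0, "c2": 1, "c3": 0}
gold = {"c2": 0}          # MAS + HITL review flipped this claim
final = merge_labels(silver, gold)
```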


Training Data

  • Base dataset: Derived from AI-Growth-Lab/patents_claims_1.5m_traim_test
  • Working file: patents_50k_green.parquet — balanced 50k sample (25,000 green, 25,000 not green)
  • Silver labels: CPC Y02* classification codes (is_green_silver)
  • Gold labels: 100 claims labeled via QLoRA MAS + targeted human review (hitl_gold_final.csv)

Dataset Splits Used for Training

| Split | Size | Description |
|---|---|---|
| train_silver | ~40,000 | Silver-labeled training set |
| hitl_gold_final | 100 | Gold-labeled high-risk claims (MAS + HITL) |
| eval_silver | ~5,000 | Silver-labeled evaluation set (not used in training) |

Training Hyperparameters

| Parameter | Value |
|---|---|
| Base model | AI-Growth-Lab/PatentSBERTa |
| Max sequence length | 256 |
| Epochs | 1 |
| Learning rate | 2e-5 |
| Total training examples | ~40,100 |

Evaluation Results

| Metric | Value |
|---|---|
| F1 Score (eval_silver) | 0.821 |
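The reported metric is the standard binary F1 on the positive (green) class, i.e. the harmonic mean of precision and recall. A from-scratch sketch:

```python
# Binary F1 on the positive class: harmonic mean of precision and recall.
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0                       # no true positives: F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```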

Comparative Performance Across All Iterations

| Model Version | Training Data Source | F1 Score |
|---|---|---|
| 1. Baseline | Frozen embeddings (no fine-tuning) | 0.780 |
| 2. Assignment 2 | Fine-tuned on Silver + Gold (simple LLM) | 0.818 |
| 3. Assignment 3 | Fine-tuned on Silver + Gold (MAS, CrewAI) | 0.824 |
| 4. Final Model | Fine-tuned on Silver + Gold (QLoRA MAS + targeted HITL) | 0.821 |

The final model achieves an F1 of 0.821, improving on the simple-LLM pipeline from Assignment 2 (+0.003) but falling slightly below the Assignment 3 MAS result (−0.003). This suggests that the quality of the 100 gold labels, rather than the sophistication of the labeling pipeline, is the primary bottleneck: the marginal differences between iterations indicate that expanding the gold label set beyond 100 claims would likely yield larger gains than further architectural complexity.


HITL Summary

| Metric | Value |
|---|---|
| Total claims reviewed by MAS | 100 |
| Claims flagged for human review (confidence < 0.60) | 48 |
| Claims auto-accepted (confidence ≥ 0.60) | 52 |
| Human intervention rate | 48% |

The high intervention rate of 48% reflects the adversarial debate structure of the MAS — when both Advocate and Skeptic argued effectively, the Judge defaulted to uncertainty rather than making a confident decision. This is a desirable property for a targeted HITL system, as it correctly identifies the genuinely ambiguous claims for human review.


Engineering Notes

  • The QLoRA Judge was fine-tuned on ### Answer: YES/NO format, creating a tension when prompting it to output structured JSON in the MAS. A JSON completion forcing technique (ending the prompt with {) and a robust fallback parser were used to handle malformed outputs
  • Two models (Phi-3-mini + Mistral QLoRA) were loaded simultaneously on a single L4 GPU using 4-bit quantization and max_memory capping
  • The QLoRA adapter must be loaded via PeftModel.from_pretrained() on top of the frozen base model — it cannot be used as a standalone model

Related Models & Resources

| Resource | Link |
|---|---|
| Mistral-7B QLoRA Adapter | Anders-sonderby/mistral-7b-patent-qlora |
| PatentSBERTa Assignment 2 | Anders-sonderby/patentsbert-assignment2 |
| PatentSBERTa Assignment 3 | Anders-sonderby/patentsbert-assignment3 |
| Green Patent Dataset | Anders-sonderby/green-patent-dataset |
| GitHub Repository | Anders-sonderby/M4_assignments |

Intended Use

  • Primary use: Academic research and coursework in patent classification
  • Intended users: Course instructors and students at Aalborg University
  • Out-of-scope: Production patent classification, legal patent assessment, or any commercial use

Limitations

  • Trained on a balanced 50k sample — performance may differ on the full unbalanced patent corpus
  • Gold labels are based on 100 claims only — a larger gold set would likely produce more significant F1 improvements
  • The marginal F1 differences between iterations suggest the 100-claim gold set is a bottleneck regardless of labeling method sophistication