# Model Card: PatentSBERTa Final Model — QLoRA MAS + Targeted HITL

## Model Summary
This is the final fine-tuned version of AI-Growth-Lab/PatentSBERTa for binary classification of patent claims as green technology (Y02) or not. It was developed as the culmination of the Final Assignment in the Applied Deep Learning and AI course at Aalborg University.
Compared to previous iterations, this model uses the most advanced labeling pipeline: a QLoRA fine-tuned Mistral-7B model acting as the Judge in a three-agent Multi-Agent System (MAS), combined with exception-based Human-in-the-Loop (HITL) review targeting only the lowest-confidence claims.
## Model Details
- Developed by: Anders Sønderbý (as58zr@student.aau.dk)
- Model type: Sentence Transformer with classification head (binary)
- Base model: AI-Growth-Lab/PatentSBERTa
- Language: English
- License: MIT
- Task: Binary text classification — Green Technology (Y02) vs. Not Green
## What This Model Does
Given the text of a patent claim, the model predicts whether the claim relates to green technology as defined by the CPC Y02 classification system:
- 1 — Green technology (Y02)
- 0 — Not green technology
## Full Pipeline Overview
This model is the downstream product of a four-stage pipeline:
### Stage 1 — QLoRA Fine-Tuning (Mistral-7B)
Mistral-7B-v0.1 was fine-tuned on 40,000 silver-labeled patent claims using QLoRA (4-bit NF4 quantization + LoRA adapters, r=16). The resulting adapter was pushed to Anders-sonderby/mistral-7b-patent-qlora. This domain-adapted model served as the Judge agent in the MAS.
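The Stage 1 setup can be sketched with the two configuration objects that drive QLoRA: 4-bit NF4 quantization and a LoRA adapter. Only NF4 and r=16 come from this card; `lora_alpha`, `lora_dropout`, and the target modules are common defaults assumed for illustration.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization config, as described for Stage 1.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter config; r=16 is stated in the card, the remaining
# values are assumed defaults for illustration only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```

These configs would be passed to `AutoModelForCausalLM.from_pretrained(...)` and `get_peft_model(...)` respectively before training on the 40,000 silver-labeled claims.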
### Stage 2 — Multi-Agent System (MAS)
A three-agent debate pipeline classified 100 high-risk patent claims:
| Agent | Model | Role |
|---|---|---|
| Advocate | microsoft/Phi-3-mini-4k-instruct (4-bit) | Argues for Y02 green classification |
| Skeptic | microsoft/Phi-3-mini-4k-instruct (4-bit) | Challenges the classification, identifies greenwashing |
| Judge | Anders-sonderby/mistral-7b-patent-qlora | Weighs both arguments, produces label + confidence + rationale |
Both underlying models (Phi-3-mini and the QLoRA Mistral Judge) were loaded simultaneously on a single NVIDIA L4 GPU (24 GB VRAM) using 4-bit quantization and sequential CUDA cache clearing.
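The debate flow in the table above can be sketched as a minimal orchestration loop. The agent functions here are hypothetical stand-ins for the real Phi-3 and QLoRA Mistral generation calls; only the Advocate → Skeptic → Judge structure reflects the actual system.

```python
# Minimal sketch of the three-agent debate flow. The agent bodies are
# stubs standing in for Phi-3 (Advocate/Skeptic) and QLoRA Mistral (Judge).
from dataclasses import dataclass

@dataclass
class Verdict:
    label: int        # 1 = green (Y02), 0 = not green
    confidence: float
    rationale: str

def advocate(claim: str) -> str:
    # Real system: prompt Phi-3-mini to argue FOR a Y02 classification.
    return f"Pro-Y02 argument for: {claim[:40]}"

def skeptic(claim: str) -> str:
    # Real system: prompt Phi-3-mini to challenge it and flag greenwashing.
    return f"Skeptical rebuttal for: {claim[:40]}"

def judge(claim: str, pro: str, con: str) -> Verdict:
    # Real system: prompt the QLoRA Mistral Judge with both arguments.
    return Verdict(label=1, confidence=0.55, rationale="stub")

def classify_claim(claim: str) -> Verdict:
    pro = advocate(claim)
    con = skeptic(claim)
    return judge(claim, pro, con)
```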
### Stage 3 — Exception-Based HITL
Rather than reviewing all 100 claims manually, only claims where the Judge's confidence fell below 0.60 were flagged for human review. This resulted in 48 out of 100 claims being reviewed by a human. For the remaining 52 claims the Judge's decision was accepted automatically. Of the 48 reviewed claims, some Judge decisions were accepted and some were overridden based on human judgment.
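The routing rule above is simple to state in code: a minimal sketch of the exception-based split, with the 0.60 threshold from the card (confidence ≥ 0.60 is auto-accepted, below it goes to a human).

```python
# Exception-based HITL routing: only low-confidence Judge verdicts
# are escalated to a human reviewer.
def route_claims(judged, threshold=0.60):
    """judged: iterable of (claim_id, label, confidence) tuples."""
    needs_review, auto_accepted = [], []
    for claim_id, label, confidence in judged:
        if confidence < threshold:
            needs_review.append(claim_id)          # human reviews these
        else:
            auto_accepted.append((claim_id, label))  # Judge's label stands
    return needs_review, auto_accepted
```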
### Stage 4 — Final PatentSBERTa Fine-Tuning
PatentSBERTa was fine-tuned for binary classification using the combined train_silver + hitl_gold_final dataset, where the 100 gold labels override silver labels for the high-risk claims.
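The "gold overrides silver" merge described above amounts to a dictionary update; a minimal sketch, assuming labels are keyed by claim ID:

```python
# Merge silver and gold labels: gold HITL labels override silver labels
# for the 100 high-risk claims; all other claims keep their silver label.
def merge_labels(silver: dict, gold: dict) -> dict:
    """silver/gold: claim_id -> 0/1 label. Gold wins on overlap."""
    merged = dict(silver)
    merged.update(gold)
    return merged
```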
## Training Data
- Base dataset: Derived from AI-Growth-Lab/patents_claims_1.5m_traim_test
- Working file: `patents_50k_green.parquet` — balanced 50k sample (25,000 green, 25,000 not green)
- Silver labels: CPC Y02* classification codes (`is_green_silver`)
- Gold labels: 100 claims labeled via QLoRA MAS + targeted human review (`hitl_gold_final.csv`)
### Dataset Splits Used for Training
| Split | Size | Description |
|---|---|---|
| train_silver | ~40,000 | Silver-labeled training set |
| hitl_gold_final | 100 | Gold-labeled high-risk claims (MAS + HITL) |
| eval_silver | ~5,000 | Silver-labeled evaluation set (not used in training) |
## Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | AI-Growth-Lab/PatentSBERTa |
| Max sequence length | 256 |
| Epochs | 1 |
| Learning rate | 2e-5 |
| Total training examples | ~40,100 |
## Evaluation Results
| Metric | Value |
|---|---|
| F1 Score (eval_silver) | 0.821 |
### Comparative Performance Across All Iterations
| Model Version | Training Data Source | F1 Score |
|---|---|---|
| 1. Baseline | Frozen Embeddings (No Fine-tuning) | 0.780 |
| 2. Assignment 2 | Fine-tuned on Silver + Gold (Simple LLM) | 0.818 |
| 3. Assignment 3 | Fine-tuned on Silver + Gold (MAS - CrewAI) | 0.824 |
| 4. Final Model | Fine-tuned on Silver + Gold (QLoRA MAS + Targeted HITL) | 0.821 |
The final model achieves an F1 of 0.821: an improvement over the Assignment 2 simple-LLM result (+0.003) but slightly below the Assignment 3 MAS result (−0.003). This suggests that the quality of the 100 gold labels, rather than the sophistication of the labeling pipeline, is the primary bottleneck; the marginal differences between iterations indicate that expanding the gold label set beyond 100 claims would likely yield larger gains than further architectural complexity.
## HITL Summary
| Metric | Value |
|---|---|
| Total claims reviewed by MAS | 100 |
| Claims flagged for human review (confidence < 0.60) | 48 |
| Claims auto-accepted (confidence ≥ 0.60) | 52 |
| Human intervention rate | 48% |
The high intervention rate of 48% reflects the adversarial debate structure of the MAS — when both Advocate and Skeptic argued effectively, the Judge defaulted to uncertainty rather than making a confident decision. This is a desirable property for a targeted HITL system, as it correctly identifies the genuinely ambiguous claims for human review.
## Engineering Notes
- The QLoRA Judge was fine-tuned on an `### Answer: YES/NO` format, creating a tension when prompting it to output structured JSON in the MAS. A JSON completion-forcing technique (ending the prompt with `{`) and a robust fallback parser were used to handle malformed outputs.
- Two models (Phi-3-mini + Mistral QLoRA) were loaded simultaneously on a single L4 GPU using 4-bit quantization and `max_memory` capping.
- The QLoRA adapter must be loaded via `PeftModel.from_pretrained()` on top of the frozen base model; it cannot be used as a standalone model.
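The completion-forcing and fallback-parsing trick can be sketched as follows. This is a minimal illustration, not the project's actual parser: it re-prepends the `{` the prompt ended with, tries `json.loads`, and falls back to regex scraping on malformed output; the field names `label` and `confidence` are assumed from the Judge's output format.

```python
# Sketch of JSON completion forcing + fallback parsing: the prompt ends
# with "{" so the model continues a JSON object; if the completion is
# malformed, scrape the label and confidence out with regexes.
import json
import re

def parse_judge_output(completion: str) -> dict:
    text = "{" + completion  # the prompt already emitted the opening brace
    end = text.find("}")
    if end != -1:
        try:
            return json.loads(text[: end + 1])
        except json.JSONDecodeError:
            pass
    # Fallback for malformed output: best-effort field extraction.
    label = 1 if re.search(r'"?label"?\s*:?\s*"?(1|yes)', text, re.I) else 0
    conf = re.search(r"(0?\.\d+)", text)
    return {"label": label,
            "confidence": float(conf.group(1)) if conf else 0.5}
```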
## Related Models & Resources
| Resource | Link |
|---|---|
| Mistral-7B QLoRA Adapter | Anders-sonderby/mistral-7b-patent-qlora |
| PatentSBERTa Assignment 2 | Anders-sonderby/patentsbert-assignment2 |
| PatentSBERTa Assignment 3 | Anders-sonderby/patentsbert-assignment3 |
| Green Patent Dataset | Anders-sonderby/green-patent-dataset |
| GitHub Repository | Anders-sonderby/M4_assignments |
## Intended Use
- Primary use: Academic research and coursework in patent classification
- Intended users: Course instructors and students at Aalborg University
- Out-of-scope: Production patent classification, legal patent assessment, or any commercial use
## Limitations
- Trained on a balanced 50k sample — performance may differ on the full unbalanced patent corpus
- Gold labels are based on 100 claims only — a larger gold set would likely produce more significant F1 improvements
- The marginal F1 differences between iterations suggest the 100-claim gold set is a bottleneck regardless of labeling method sophistication