# Assignment 3 – Green Patent Detection (Agents)
## Goal
Compare whether an advanced architecture produces better training labels than the simple LLM approach used in Assignment 2.
Chosen path: Multi-Agent System (MAS).
## Method (What we did)
- Setup (Parts A & B): Loaded `patents_50k_green.parquet` and prepared the same 100 high-risk claims (including `u` and `p_green`).
- Agentic labeling (Part C): Ran a 3-agent workflow in which an Advocate and a Skeptic debated each claim and a Judge produced a final suggested label (`agents_is_green_suggested`).
- Human-in-the-loop (Part D): Reviewed all 100 claims manually and created the final human gold-label column `is_green_human_gold`.
- Final integration + fine-tuning (Part D): Merged the 100 gold samples into the silver training split and fine-tuned PatentSBERTa again on Silver + Gold (100) using the AI Lab GPU.
- Evaluation + comparison (Part E): Evaluated the Assignment 3 model and compared its F1 against the Baseline and the Assignment 2 model.
## Data
- `patents_50k_green.parquet` (balanced 50k sample)
- The 100 high-risk claims (with uncertainty score `u` and probability `p_green`)
- Human review column: `is_green_human_gold`
- Agent suggestion column: `agents_is_green_suggested`
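Selecting the high-risk subset can be reproduced with a short pandas sketch. The column names `u` and `p_green` come from the list above; the ranking rule (highest uncertainty first) is an assumption about how the 100 claims were chosen.

```python
import pandas as pd

def select_high_risk(df: pd.DataFrame, n: int = 100) -> pd.DataFrame:
    """Keep the n claims whose silver label is least trustworthy.

    Assumption: a higher uncertainty score `u` means a riskier label.
    """
    return df.sort_values("u", ascending=False).head(n).copy()

# Usage (file name from the Data section):
# df = pd.read_parquet("patents_50k_green.parquet")
# high_risk = select_high_risk(df, n=100)
```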
## Agentic Labeling (Part C)
We used three roles:
- Advocate: argues the claim is green (Y02)
- Skeptic: argues against it (greenwashing check)
- Judge: final decision and label suggestion
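The three roles above can be sketched as a minimal debate loop. This is an illustrative sketch, not the exact prompts used: `call_llm` is a hypothetical helper that sends one prompt to the LLM and returns its text reply.

```python
# Minimal sketch of the 3-agent workflow (Advocate -> Skeptic -> Judge).
# `call_llm` is a hypothetical prompt-in, text-out helper.

def label_claim(claim: str, call_llm) -> dict:
    # Advocate argues the claim is green (Y02).
    advocate = call_llm(f"Argue that this patent claim is green (Y02):\n{claim}")
    # Skeptic argues against it (greenwashing check), seeing the Advocate's case.
    skeptic = call_llm(
        f"Argue that this claim is NOT green (greenwashing check):\n{claim}\n"
        f"Advocate said: {advocate}"
    )
    # Judge weighs both arguments and emits a final YES/NO verdict.
    verdict = call_llm(
        "You are the judge. Given both arguments, answer exactly YES or NO: "
        f"is the claim green?\nAdvocate: {advocate}\nSkeptic: {skeptic}"
    )
    suggested = 1 if verdict.strip().upper().startswith("YES") else 0
    return {
        "advocate": advocate,
        "skeptic": skeptic,
        "agents_is_green_suggested": suggested,
    }
```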
## Human-in-the-Loop + Agreement (Part D)
Agreement is reported against the human gold label:
- Assignment 2 (Simple LLM): 100%
- Assignment 3 (Agents):
  - Agreement (ALL samples; missing AI output counted as mismatch): 62.00% (100 rows)
  - Agreement (SUCCESS only; AI output non-null): 91.18% (68 rows)
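The two agreement figures differ only in how missing agent outputs are handled; a small pandas helper makes the distinction explicit (column names from Parts C and D):

```python
import pandas as pd

def agreement_rates(df: pd.DataFrame) -> tuple[float, float]:
    """Agreement against the human gold label, computed two ways.

    - ALL samples: a missing agent suggestion counts as a mismatch.
    - SUCCESS only: restrict to rows where the agent produced a label.
    """
    gold = df["is_green_human_gold"]
    ai = df["agents_is_green_suggested"]
    all_rate = (ai == gold).mean()       # NaN never equals gold -> mismatch
    ok = ai.notna()
    success_rate = (ai[ok] == gold[ok]).mean()
    return float(all_rate), float(success_rate)
```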
## Fine-tuning PatentSBERTa (Part D)
We fine-tuned PatentSBERTa using:
- Silver training data + 100 human-reviewed high-risk examples (gold)
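Merging the gold labels into the silver split can be sketched as below. The `patent_id` key and `label` column are assumptions about the training schema; `is_green_human_gold` is the review column from Part D.

```python
import pandas as pd

def merge_gold_into_silver(silver: pd.DataFrame, gold: pd.DataFrame) -> pd.DataFrame:
    """Replace silver labels with human gold labels for the reviewed claims.

    Assumption: both frames share a `patent_id` key, and the gold frame
    carries the reviewed label in `is_green_human_gold`.
    """
    gold_map = gold.set_index("patent_id")["is_green_human_gold"]
    out = silver.copy()
    reviewed = out["patent_id"].isin(gold_map.index)
    out.loc[reviewed, "label"] = out.loc[reviewed, "patent_id"].map(gold_map)
    return out
```

The merged frame then feeds the same fine-tuning setup as Assignment 2, so any F1 difference is attributable to the labels rather than the training recipe.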
## Comparative Analysis (Part E)
| Model Version | Training Data Source | F1 Score |
|---|---|---|
| 1. Baseline | Frozen Embeddings (No fine-tuning) | 0.811 |
| 2. Assignment 2 Model | Fine-tuned on Silver + Gold (Simple LLM) | 0.811 |
| 3. Assignment 3 Model | Fine-tuned on Silver + Gold (Agents) | 0.82385 |
## Reflection
The agentic workflow improved downstream performance slightly: F1 increased from 0.811 (Assignment 2) to 0.82385 (Assignment 3).
Although the overall agreement drops when counting missing agent outputs as mismatches, the successful agent runs match human labels 91.18% of the time, suggesting that agentic debate can generate high-quality labels when it completes successfully.
## Link to video