
Assignment 3: Green Patent Detection (Agents)

Goal

Compare whether a more advanced labeling architecture produces better training labels than the simple single-LLM approach used in Assignment 2.
Chosen path: a Multi-Agent System (MAS).

Method (What we did)

  1. Setup (Part A & B): Loaded patents_50k_green.parquet and prepared the same 100 high-risk claims (including the uncertainty score u and probability p_green).
  2. Agentic labeling (Part C): Ran a 3-agent workflow where Advocate and Skeptic debated each claim and a Judge produced a final suggested label (agents_is_green_suggested).
  3. Human-in-the-loop (Part D): Reviewed all 100 claims manually and created the final human gold label column is_green_human_gold.
  4. Final integration + fine-tuning (Part D): Merged the 100 gold samples into the silver training split and fine-tuned PatentSBERTa again on Silver + Gold (100) using an AI Lab GPU.
  5. Evaluation + comparison (Part E): Evaluated the Assignment 3 model and compared F1 against Baseline and Assignment 2.

Data

  • patents_50k_green.parquet (balanced 50k sample)
  • The 100 high-risk claims (with uncertainty score u and probability p_green)
  • Human review column: is_green_human_gold
  • Agent suggestion column: agents_is_green_suggested

Agentic Labeling (Part C)

We used three roles:

  • Advocate: argues the claim is green (Y02)
  • Skeptic: argues against it (greenwashing check)
  • Judge: final decision and label suggestion
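The three-role debate can be sketched as below. This is not the course's actual implementation: `call_llm` is a hypothetical stand-in for whatever chat API was used, stubbed here so the control flow is runnable.

```python
# Minimal sketch of the Advocate -> Skeptic -> Judge workflow for one claim.
def call_llm(role: str, prompt: str) -> str:
    # Stub: a real implementation would call an LLM with a role-specific
    # system prompt. The judge stub always returns a fixed verdict.
    return "YES" if role == "judge" else f"[{role} argument about: {prompt[:30]}]"

def debate_claim(claim: str) -> int:
    advocate = call_llm("advocate", f"Argue this claim is green (Y02): {claim}")
    skeptic = call_llm("skeptic", f"Argue this claim is NOT green (greenwashing check): {claim}")
    verdict = call_llm(
        "judge",
        f"Claim: {claim}\nFor: {advocate}\nAgainst: {skeptic}\nAnswer YES or NO.",
    )
    # Returned value plays the role of agents_is_green_suggested (1 = green).
    return 1 if verdict.strip().upper().startswith("YES") else 0

print(debate_claim("A solar-powered desalination membrane"))  # -> 1 with this stub
```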

Human-in-the-Loop + Agreement (Part D)

Agreement is reported against the human gold label:

  • Assignment 2 (Simple LLM): 100%
  • Assignment 3 (Agents):
    • Agreement (ALL samples; missing AI counted as mismatch): 62.00% (100 rows)
    • Agreement (SUCCESS only; AI non-null): 91.18% (68 rows)
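The two agreement definitions can be computed as sketched below on toy data; the column names follow this card, the values do not reproduce the real 62.00% / 91.18% figures.

```python
import pandas as pd

# Toy example of the two agreement metrics (None = failed agent run).
df = pd.DataFrame({
    "is_green_human_gold":       [1, 0, 1, 1, 0],
    "agents_is_green_suggested": [1, 0, None, 0, None],
})

match = df["agents_is_green_suggested"] == df["is_green_human_gold"]

# ALL samples: a missing agent label counts as a mismatch.
agreement_all = match.mean()

# SUCCESS only: restrict to rows where the agent produced a label.
ok = df["agents_is_green_suggested"].notna()
agreement_success = match[ok].mean()

print(f"ALL: {agreement_all:.0%}, SUCCESS only: {agreement_success:.0%}")
```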

Fine-tuning PatentSBERTa (Part D)

We fine-tuned PatentSBERTa using:

  • Silver training data + 100 human-reviewed high-risk examples (gold)
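The merge of the 100 gold labels into the silver split might look like the sketch below; the DataFrame shapes and the rule "gold overrides silver where available" are assumptions, not the verified assignment code.

```python
import pandas as pd

# Toy silver split with LLM-generated labels.
silver = pd.DataFrame({
    "claim": ["a", "b", "c"],
    "label": [1, 0, 0],
})

# Toy gold set: human-reviewed high-risk examples (is_green_human_gold).
gold = pd.DataFrame({
    "claim": ["b"],
    "is_green_human_gold": [1],
})

# Left-join gold onto silver, then let the human label override where present.
merged = silver.merge(gold, on="claim", how="left")
merged["label"] = merged["is_green_human_gold"].fillna(merged["label"]).astype(int)
train = merged[["claim", "label"]]

print(train["label"].tolist())  # -> [1, 1, 0]; gold overrides silver for "b"
```

The resulting `train` frame is what would be fed to PatentSBERTa fine-tuning.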

Comparative Analysis (Part E)

Model Version          | Training Data Source                      | F1 Score
1. Baseline            | Frozen embeddings (no fine-tuning)        | 0.811
2. Assignment 2 Model  | Fine-tuned on Silver + Gold (Simple LLM)  | 0.811
3. Assignment 3 Model  | Fine-tuned on Silver + Gold (Agents)      | 0.82385
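For reference, the F1 score used in the table above is the standard harmonic mean of precision and recall on the positive (green) class, written out here from scratch on toy labels (not the real test split):

```python
# F1 for the positive class, made explicit (toy data, not the real eval set).
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(round(f1_score(y_true, y_pred), 3))  # -> 0.667
```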

Reflection

The agentic workflow improved downstream performance slightly: F1 increased from 0.811 (Assignment 2) to 0.82385 (Assignment 3).
Although overall agreement drops when missing agent outputs are counted as mismatches, the successful agent runs match the human labels 91.18% of the time, suggesting that agentic debate can generate high-quality labels when it completes successfully.

Link to video

https://aaudk-my.sharepoint.com/:v:/r/personal/ox26dw_student_aau_dk/Documents/Optagelser/Meeting%20in%20Group%20AAE-20260302_120454-Meeting%20Recording.mp4?csf=1&web=1&e=MtFNSf&nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJTdHJlYW1XZWJBcHAiLCJyZWZlcnJhbFZpZXciOiJTaGFyZURpYWxvZy1MaW5rIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXcifX0%3D
