# Assignment 3 – Green Patent Detection (Agents)
## Goal
Compare whether an advanced architecture produces better training labels than the simple LLM approach used in Assignment 2.
Chosen path: Multi-Agent System (MAS).
## Method (What we did)
- Setup (Parts A & B): Loaded `patents_50k_green.parquet` and prepared the same 100 high-risk claims (including `u` and `p_green`).
- Agentic labeling (Part C): Ran a 3-agent workflow in which an Advocate and a Skeptic debated each claim and a Judge produced a final suggested label (`agents_is_green_suggested`).
- Human-in-the-loop (Part D): Reviewed all 100 claims manually and created the final human gold-label column `is_green_human_gold`.
- Final integration + fine-tuning (Part D): Merged the 100 gold samples into the silver training split and fine-tuned PatentSBERTa again on Silver + Gold (100) using the AI Lab GPU.
- Evaluation + comparison (Part E): Evaluated the Assignment 3 model and compared its F1 against the Baseline and the Assignment 2 model.
## Data
- `patents_50k_green.parquet` (balanced 50k sample)
- The 100 high-risk claims (with uncertainty score `u` and probability `p_green`)
- Human review column: `is_green_human_gold`
- Agent suggestion column: `agents_is_green_suggested`
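Selecting the high-risk subset can be reproduced with a short pandas sketch. The column names `u` and `p_green` come from the list above; the ranking rule (highest uncertainty first) is an assumption about how the 100 claims were chosen.

```python
import pandas as pd

def select_high_risk(df: pd.DataFrame, n: int = 100) -> pd.DataFrame:
    """Keep the n claims whose silver label is least trustworthy.

    Assumption: a higher uncertainty score `u` means a riskier label.
    """
    return df.sort_values("u", ascending=False).head(n).copy()

# Usage (file name from the Data section):
# df = pd.read_parquet("patents_50k_green.parquet")
# high_risk = select_high_risk(df, n=100)
```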
## Agentic Labeling (Part C)
We used three roles:
- Advocate: argues the claim is green (Y02)
- Skeptic: argues against it (greenwashing check)
- Judge: final decision and label suggestion
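The three roles above can be sketched as a minimal debate loop. This is an illustrative sketch, not the exact prompts used: `call_llm` is a hypothetical helper that sends one prompt to the LLM and returns its text reply.

```python
# Minimal sketch of the 3-agent workflow (Advocate -> Skeptic -> Judge).
# `call_llm` is a hypothetical prompt-in, text-out helper.

def label_claim(claim: str, call_llm) -> dict:
    # Advocate argues the claim is green (Y02).
    advocate = call_llm(f"Argue that this patent claim is green (Y02):\n{claim}")
    # Skeptic argues against it (greenwashing check), seeing the Advocate's case.
    skeptic = call_llm(
        f"Argue that this claim is NOT green (greenwashing check):\n{claim}\n"
        f"Advocate said: {advocate}"
    )
    # Judge weighs both arguments and emits a final YES/NO verdict.
    verdict = call_llm(
        "You are the judge. Given both arguments, answer exactly YES or NO: "
        f"is the claim green?\nAdvocate: {advocate}\nSkeptic: {skeptic}"
    )
    suggested = 1 if verdict.strip().upper().startswith("YES") else 0
    return {
        "advocate": advocate,
        "skeptic": skeptic,
        "agents_is_green_suggested": suggested,
    }
```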
## Human-in-the-Loop + Agreement (Part D)
Agreement is reported against the human gold label:
- Assignment 2 (Simple LLM): 100%
- Assignment 3 (Agents):
  - Agreement (ALL samples; missing AI output counted as mismatch): 62.00% (100 rows)
  - Agreement (SUCCESS only; AI output non-null): 91.18% (68 rows)
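The two agreement figures differ only in how missing agent outputs are handled; a small pandas helper makes the distinction explicit (column names from Parts C and D):

```python
import pandas as pd

def agreement_rates(df: pd.DataFrame) -> tuple[float, float]:
    """Agreement against the human gold label, computed two ways.

    - ALL samples: a missing agent suggestion counts as a mismatch.
    - SUCCESS only: restrict to rows where the agent produced a label.
    """
    gold = df["is_green_human_gold"]
    ai = df["agents_is_green_suggested"]
    all_rate = (ai == gold).mean()       # NaN never equals gold -> mismatch
    ok = ai.notna()
    success_rate = (ai[ok] == gold[ok]).mean()
    return float(all_rate), float(success_rate)
```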
## Fine-tuning PatentSBERTa (Part D)
We fine-tuned PatentSBERTa using:
- Silver training data + 100 human-reviewed high-risk examples (gold)
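Merging the gold labels into the silver split can be sketched as below. The `patent_id` key and `label` column are assumptions about the training schema; `is_green_human_gold` is the review column from Part D.

```python
import pandas as pd

def merge_gold_into_silver(silver: pd.DataFrame, gold: pd.DataFrame) -> pd.DataFrame:
    """Replace silver labels with human gold labels for the reviewed claims.

    Assumption: both frames share a `patent_id` key, and the gold frame
    carries the reviewed label in `is_green_human_gold`.
    """
    gold_map = gold.set_index("patent_id")["is_green_human_gold"]
    out = silver.copy()
    reviewed = out["patent_id"].isin(gold_map.index)
    out.loc[reviewed, "label"] = out.loc[reviewed, "patent_id"].map(gold_map)
    return out
```

The merged frame then feeds the same fine-tuning setup as Assignment 2, so any F1 difference is attributable to the labels rather than the training recipe.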
## Comparative Analysis (Part E)
| Model Version | Training Data Source | F1 Score |
|---|---|---|
| 1. Baseline | Frozen Embeddings (No fine-tuning) | 0.811 |
| 2. Assignment 2 Model | Fine-tuned on Silver + Gold (Simple LLM) | 0.811 |
| 3. Assignment 3 Model | Fine-tuned on Silver + Gold (Agents) | 0.82385 |
## Reflection
The agentic workflow improved downstream performance slightly: F1 increased from 0.811 (Assignment 2) to 0.82385 (Assignment 3).
Although the overall agreement drops when counting missing agent outputs as mismatches, the successful agent runs match human labels 91.18% of the time, suggesting that agentic debate can generate high-quality labels when it completes successfully.
## Link to video