Agentic Human-in-the-Loop (HITL) Setup

For Assignment 3, I implemented an agentic Human-in-the-Loop (HITL) framework to classify patent claims as green technology.

Agentic Classification

Each patent claim is evaluated by three specialized agents:

  • Advocate – argues for green classification by identifying technical features related to energy efficiency, emissions reduction, or sustainability.

  • Skeptic – argues against green classification, focusing on greenwashing, vague language, or lack of explicit environmental impact.

  • Judge – weighs both arguments and produces a final classification (llm_green_suggested), confidence level, and rationale in structured JSON.

This setup encourages explicit reasoning and exposes disagreements between the pro-green and anti-green interpretations of each claim.
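The three-agent flow above can be sketched roughly as follows. This is a minimal illustration, not the prompts or client code actually used in the assignment: the prompt wording, the `debate` function, the `llm` callable, and the `confidence` and `rationale` JSON keys are all assumptions. Only the role names and the `llm_green_suggested` field come from the description above.

```python
import json
from typing import Callable

# Hypothetical role prompts; the prompts used in the assignment are not shown here.
ADVOCATE_PROMPT = "Argue that this patent claim describes green technology:\n{claim}"
SKEPTIC_PROMPT = "Argue that this patent claim is NOT green technology:\n{claim}"
JUDGE_PROMPT = (
    "Claim:\n{claim}\n\nAdvocate:\n{advocate}\n\nSkeptic:\n{skeptic}\n\n"
    'Reply with JSON: {{"llm_green_suggested": 0 or 1, '
    '"confidence": "low|medium|high", "rationale": "..."}}'
)


def debate(claim: str, llm: Callable[[str], str]) -> dict:
    """Run Advocate, Skeptic, then Judge on one claim and parse the Judge's JSON."""
    advocate = llm(ADVOCATE_PROMPT.format(claim=claim))
    skeptic = llm(SKEPTIC_PROMPT.format(claim=claim))
    raw = llm(JUDGE_PROMPT.format(claim=claim, advocate=advocate, skeptic=skeptic))
    verdict = json.loads(raw)  # raises if the Judge did not return valid JSON
    if verdict.get("llm_green_suggested") not in (0, 1):
        raise ValueError("Judge must return llm_green_suggested as 0 or 1")
    # Keep both arguments next to the verdict so a reviewer can inspect them.
    return {"claim": claim, "advocate": advocate, "skeptic": skeptic, **verdict}


# Stand-in for a real model call, used only to show the data flow.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Claim:"):
        return '{"llm_green_suggested": 1, "confidence": "medium", "rationale": "stub"}'
    return "stub argument"


result = debate("A heat exchanger that recovers waste heat.", fake_llm)
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; a real client call would replace `fake_llm`.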

Human-in-the-Loop Review

After the agentic decision, a human reviewer inspects:

  • the full patent claim,

  • the Advocate and Skeptic arguments,

  • the Judge’s final decision and rationale.

The human then assigns a final label (is_green_human) and optional notes. These human gold labels serve both as high-quality training data for model fine-tuning and as a reference for evaluation.
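One way to store the outcome of a review is to attach the human label and notes to the row that holds the agent outputs. The sketch below assumes rows are plain dictionaries; the `record_review` helper and the `notes` key are hypothetical names, while `is_green_human` and `llm_green_suggested` follow the field names used above.

```python
from typing import Optional


def record_review(row: dict, is_green_human: int, notes: Optional[str] = None) -> dict:
    """Return a copy of the row with the reviewer's final label and notes attached."""
    if is_green_human not in (0, 1):
        raise ValueError("is_green_human must be 0 or 1")
    reviewed = dict(row)  # copy so the original agent output stays unchanged
    reviewed["is_green_human"] = is_green_human
    reviewed["notes"] = notes or ""
    return reviewed


# Toy row for illustration only, not data from the assignment.
row = {"claim": "A coated turbine blade.", "llm_green_suggested": 0}
reviewed = record_review(row, 1, notes="Coating reduces fuel use.")
```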

Agreement Reporting (Human vs AI)

I measured agreement as the percentage of claims where my final human label (is_green_human) matches the AI-suggested label (llm_green_suggested).

  • Assignment 2: 100% agreement (I largely accepted the AI suggestions due to limited domain certainty on many claims)
  • Assignment 3: 86% agreement (The agentic setup and different model resulted in a more optimistic labeling approach, which increased both the number of 1 labels and disagreements with the AI suggestion)

How computed:
Agreement = mean( is_green_human == llm_green_suggested ) over all rows.
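The agreement computation can be written in a few lines. This sketch assumes each row is a dictionary carrying both labels; the `agreement` function name and the toy rows are illustrative and are not the assignment data.

```python
def agreement(rows: list) -> float:
    """Fraction of rows where the human label equals the AI-suggested label."""
    if not rows:
        raise ValueError("agreement is undefined for an empty set of rows")
    matches = sum(
        1 for r in rows if r["is_green_human"] == r["llm_green_suggested"]
    )
    return matches / len(rows)


# Toy rows for illustration only: three of the four labels match.
toy_rows = [
    {"is_green_human": 1, "llm_green_suggested": 1},
    {"is_green_human": 0, "llm_green_suggested": 0},
    {"is_green_human": 1, "llm_green_suggested": 0},
    {"is_green_human": 0, "llm_green_suggested": 0},
]
toy_agreement = agreement(toy_rows)  # 0.75
```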

Comparative Analysis

Model Performance Comparison

The table below compares model performance across the three stages of the project using the same silver evaluation set.

| Model Version | Training Data Source | F1 Score (Eval Set) |
|---|---|---|
| Baseline | Frozen PatentSBERTa embeddings (no fine-tuning) | 0.7719 |
| Assignment 2 Model | Fine-tuned on Silver + Gold (Simple LLM labels + HITL) | 0.8030 |
| Assignment 3 Model | Fine-tuned on Silver + Gold (Agentic LLM + HITL) | 0.8051 |
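For reference, an F1 score can be computed from predictions as sketched below. The table does not state which F1 variant was reported, so this assumes the binary F1 for the positive (green = 1) class; the `f1_binary` name and the toy labels are illustrative only.

```python
def f1_binary(y_true: list, y_pred: list) -> float:
    """F1 for the positive class (label 1), computed from TP, FP, and FN counts."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only: TP = 2, FP = 1, FN = 1, so F1 = 2/3.
toy_f1 = f1_binary([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```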

Reflection

Fine-tuning PatentSBERTa improved F1 over the frozen-embedding baseline from 0.7719 to 0.8030, confirming the value of task-specific fine-tuning. The agentic labeling workflow in Assignment 3 gave a further small improvement over the simpler LLM approach used in Assignment 2 (0.8051 vs. 0.8030), suggesting that multi-agent reasoning can marginally improve label quality and downstream model performance. Although the gain is modest, it indicates that the added complexity of the agentic setup can provide incremental benefits.

The limited size of the improvement is expected given the already strong performance of the Assignment 2 model and the relatively small size of the gold dataset (100 samples).
