CTB2001 committed on
Commit 24a0beb · verified · 1 Parent(s): 91a0cab

Update README.md

Files changed (1): README.md +8 -0
README.md CHANGED
@@ -69,6 +69,14 @@ Compared to Assignment 2, the Assignment 3 QLoRA workflow produced a small impro
  This indicates that the advanced data-generation approach provided a measurable but modest downstream gain over the simpler Assignment 2 setup in this run.
  However, both fine-tuned pipelines remained substantially below the frozen-embedding baseline, suggesting that data quality and labeling strategy still dominate final performance.

+ ## Extended Reflection on Part E Results
+
+ The observed F1 scores show that downstream fine-tuning underperformed the frozen-embedding baseline in this run: baseline macro F1 = **0.7727**, Assignment 2 = **0.4975**, and Assignment 3 = **0.5006** on the eval set. Although the advanced QLoRA workflow in Assignment 3 improved slightly over Assignment 2 (+0.0031), both fine-tuned models remained far below the baseline, indicating that additional training did not translate into better generalization here.
+
+ One plausible explanation is label quality in the high-risk set. In Assignment 3, the 100 uncertain examples were finalized under an **auto-accept policy** (no independent human correction), so labeling errors in the most ambiguous cases may have been passed directly into training. Because these examples are deliberately selected near the decision boundary, they are highly influential; if their labels are noisy, they can destabilize the class boundary and reduce macro F1 on the eval data.
+
+ Another interpretation is that the fine-tuning stage is more sensitive to supervision quality and distribution mismatch than the linear baseline. A strong frozen-embedding + logistic model can remain robust when labels are imperfect, whereas full downstream fine-tuning may overfit to noisy or weakly validated labels. Overall, the results suggest that the **quality of gold labels on uncertain samples** is a critical bottleneck, and that genuine human adjudication of high-risk claims is likely necessary to realize the intended gains from advanced workflows such as QLoRA.
+
  ## Intended Use
  - Educational/research use for green patent classification experiments.
  - Binary label output: non-green (0) vs green (1).
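The macro F1 figures compared in the added section are unweighted means of the per-class F1 scores, so a noisy decision boundary on either class drags the metric down directly. A minimal self-contained sketch of the computation for the binary case (the labels below are illustrative, not taken from this repo's eval set):

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Macro F1: unweighted mean of per-class F1 over the given labels."""
    scores = []
    for c in labels:
        # Per-class counts, treating class c as the positive class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    # Unweighted average: both classes count equally regardless of support.
    return sum(scores) / len(scores)

# Illustrative eval labels with two boundary mistakes (one per class).
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(round(macro_f1(y_true, y_pred), 4))  # → 0.6667
```

In practice this matches `sklearn.metrics.f1_score(..., average="macro")`; the pure-Python version is shown only to make the averaging explicit.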