CTB2001 committed 9c66b9f (verified, parent ed4fc0c): Create README.md
---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- patents
- green-tech
- qlora
- peft
- sequence-classification
model-index:
- name: assignment3-patentsberta-qlora-gold100
  results:
  - task:
      type: text-classification
      name: Green patent detection
    dataset:
      name: patents_50k_green (eval_silver)
      type: custom
    metrics:
    - type: f1
      value: 0.5006382068
      name: Macro F1
    - type: accuracy
      value: 0.5008
      name: Accuracy
---

# Assignment 3 Model — Green Patent Detection (QLoRA + PatentSBERTa)

## Model Summary
This repository contains the **final downstream Assignment 3 classifier** for green patent detection.
Workflow:
1. Baseline uncertainty sampling on patent claims.
2. QLoRA-based labeling/rationale generation for the top-100 high-risk examples.
3. Final PatentSBERTa fine-tuning on `train_silver + 100 gold high-risk`.

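Step 1 above can be implemented as entropy-based uncertainty sampling: score each claim by the entropy of the classifier's predicted distribution and keep the highest-entropy examples. A minimal, self-contained sketch (the probabilities and batch size are illustrative, not from the actual run):

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_k_uncertain(prob_rows, k):
    """Indices of the k rows whose predicted distribution has the highest
    entropy, i.e. where the classifier is least certain."""
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: entropy(prob_rows[i]),
                    reverse=True)
    return ranked[:k]

# Illustrative predicted probabilities [p_not_green, p_green] for 4 claims.
probs = [[0.99, 0.01], [0.55, 0.45], [0.50, 0.50], [0.80, 0.20]]
print(top_k_uncertain(probs, 2))  # → [2, 1]
```

The same ranking over the full pool, with k = 100, yields the high-risk set sent to labeling in step 2.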
## Base Model
- `AI-Growth-Lab/PatentSBERTa` (sequence-classification head, 2 labels)

## Training Setup
- Seed: 42
- Train rows (augmented): 20,100
- Eval rows: 5,000
- Gold rows: 100
- Hardware: NVIDIA L4
- Frameworks: `transformers`, `datasets`, `torch`, `peft`, `bitsandbytes`

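A QLoRA setup with the frameworks listed above can be sketched as follows. This is a configuration sketch, not the exact recipe of this run: `r`, `lora_alpha`, `lora_dropout`, and `target_modules` are illustrative defaults, and the module names assume a BERT-style encoder.

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForSequenceClassification.from_pretrained(
    "AI-Growth-Lab/PatentSBERTa",
    num_labels=2,
    quantization_config=bnb_config,
)
base = prepare_model_for_kbit_training(base)

# Low-rank adapters on the attention projections; only these (and the
# classification head) are trained. Hyperparameters are illustrative.
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```

The resulting model trains with a standard `transformers` `Trainer` loop; only the adapter and head parameters receive gradients.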
## Results

### Assignment 3 final model (this repo)
- Eval accuracy: **0.5008**
- Eval macro F1: **0.5006382068**
- Gold100 accuracy: **0.53**
- Gold100 macro F1: **0.5037482842**

### Comparison table (assignment requirement)
| Model Version | Training Data Source | F1 Score (Eval Set) |
|---|---|---:|
| 1. Baseline | Frozen embeddings (no fine-tuning) | 0.7727474956 |
| 2. Assignment 2 Model | Fine-tuned on Silver + Gold (simple LLM) | 0.4975369710 |
| 3. Assignment 3 Model | Fine-tuned on Silver + Gold (QLoRA) | 0.5006382068 |

## Intended Use
- Educational/research use for green-patent classification experiments.
- Binary label output: non-green (0) vs. green (1).

## Limitations
- The dataset and labels are project-specific and may not generalize broadly.
- Part C used an automated acceptance policy for gold labels in this run (no manual overrides).
- The model should not be used for legal or commercial patent decisions without human review.

## Files in this Repository
- `config.json`
- `model.safetensors`
- tokenizer files
- optional: training/evaluation summaries

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo_id = "<your-username>/assignment3-patentsberta-qlora-gold100"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()

text = "A method for reducing CO2 emissions in industrial heat recovery systems..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()
print({"label": int(pred)})  # 0=not_green, 1=green
```
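To report a confidence score alongside the hard label, apply a softmax to the logits. A minimal sketch with synthetic logits standing in for `model(**inputs).logits` (the numbers are illustrative, not real model output):

```python
import math

def label_with_confidence(logits):
    """Softmax over raw logits; return (argmax label, its probability)."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]

# Synthetic logits for one input; index 0 = not_green, index 1 = green.
label, confidence = label_with_confidence([0.3, 1.1])
print({"label": label, "confidence": round(confidence, 3)})
# → {'label': 1, 'confidence': 0.69}
```

With real outputs, replace the synthetic list with `logits.squeeze(0).tolist()` from the example above.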