---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- patents
- green-tech
- qlora
- peft
- sequence-classification
model-index:
- name: assignment3-patentsberta-qlora-gold100
  results:
  - task:
      type: text-classification
      name: Green patent detection
    dataset:
      name: patents_50k_green (eval_silver)
      type: custom
    metrics:
    - type: f1
      value: 0.5006382068
      name: Macro F1
    - type: accuracy
      value: 0.5008
      name: Accuracy
---

# Assignment 3 Model — Green Patent Detection (QLoRA + PatentSBERTa)

## Model Summary
This repository contains the **final downstream Assignment 3 classifier** for green patent detection.  
Workflow:
1. Uncertainty sampling on patent claims using the baseline classifier.
2. QLoRA-based LLM labeling and rationale generation for the top-100 high-risk examples.
3. Final PatentSBERTa fine-tuning on `train_silver` plus the 100 gold high-risk examples.
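The exact sampling criterion for step 1 is not specified in this card; a minimal sketch, assuming margin-based uncertainty sampling (the function name and toy probabilities are illustrative, not from this project):

```python
# Sketch of step 1: margin-based uncertainty sampling.
# A small margin between the two class probabilities means the
# classifier is least certain; the top-k most uncertain claims are
# forwarded for LLM labeling (k=100 in this project).
import numpy as np

def select_high_risk(probs: np.ndarray, k: int = 100) -> np.ndarray:
    """Return indices of the k examples with the smallest margin
    between class probabilities, most uncertain first."""
    margins = np.abs(probs[:, 1] - probs[:, 0])
    return np.argsort(margins)[:k]

# Toy demonstration with fake probabilities for 5 claims.
probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]])
print(select_high_risk(probs, k=2))  # → [3 1]
```

Entropy-based sampling is an equivalent alternative for the binary case, since both rank examples by how close the predicted distribution is to uniform.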

## Base Model
- `AI-Growth-Lab/PatentSBERTa` (sequence classification head, 2 labels)

## Training Setup
- Seed: 42
- Train rows (augmented): 20,100
- Eval rows: 5,000
- Gold rows: 100
- Hardware used: NVIDIA L4
- Frameworks: `transformers`, `datasets`, `torch`, `peft`, `bitsandbytes`
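The QLoRA hyperparameters are not recorded in this card; a minimal configuration sketch under common defaults (4-bit NF4 quantization via `bitsandbytes` plus a LoRA adapter via `peft`; the `r`, `lora_alpha`, dropout, and target-module values below are illustrative assumptions, not values from this run):

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig
import torch

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter trained on top of the quantized model; values are
# illustrative, not the ones used for this checkpoint.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "key", "value"],
    task_type="SEQ_CLS",
)
```

Both configs would be passed to `AutoModelForSequenceClassification.from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively.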

## Results

### Assignment 3 final model (this repo)
- Eval accuracy: **0.5008**
- Eval macro F1: **0.5006382068**
- Gold100 accuracy: **0.53**
- Gold100 macro F1: **0.5037482842**

### Comparison table (Assignment requirement)
| Model Version | Training Data Source | F1 Score (Eval Set) |
|---|---|---:|
| 1. Baseline | Frozen Embeddings (No Fine-tuning) | 0.7727474956 |
| 2. Assignment 2 Model | Fine-tuned on Silver + Gold (Simple LLM) | 0.4975369710 |
| 3. Assignment 3 Model | Fine-tuned on Silver + Gold (QLoRA) | 0.5006382068 |

## Intended Use
- Educational/research use for green patent classification experiments.
- Binary label output: non-green (0) vs green (1).

## Limitations
- Dataset and labels are project-specific and may not generalize broadly.
- Part C used an automated acceptance policy for gold labels in this run (no manual overrides were applied).
- Model should not be used for legal/commercial patent decisions without human review.

## Files in this Repository
- `config.json`
- `model.safetensors`
- tokenizer files
- optional: training/evaluation summaries

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo_id = "<your-username>/assignment3-patentsberta-qlora-gold100"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

text = "A method for reducing CO2 emissions in industrial heat recovery systems..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()
print({"label": int(pred)})  # 0=not_green, 1=green
```