---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- patents
- green-tech
- qlora
- peft
- sequence-classification
model-index:
- name: assignment3-patentsberta-qlora-gold100
  results:
  - task:
      type: text-classification
      name: Green patent detection
    dataset:
      name: patents_50k_green (eval_silver)
      type: custom
    metrics:
    - type: f1
      value: 0.5006382068
      name: Macro F1
    - type: accuracy
      value: 0.5008
      name: Accuracy
---

# Assignment 3 Model: Green Patent Detection (QLoRA + PatentSBERTa)

## Model Summary
This repository contains the **final downstream Assignment 3 classifier** for green patent detection.

Workflow:
1. Baseline uncertainty sampling on patent claims (see the sketch after this list).
2. QLoRA-based labeling/rationale generation on the top-100 high-risk examples.
3. Final PatentSBERTa fine-tuning on `train_silver + 100 gold high-risk`.

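The selection step (step 1) can be sketched as follows. This is a minimal, hypothetical sketch rather than the project's exact code: the placeholder `claims` list, the single-batch scoring, and the use of closeness to the 0.5 decision boundary as the "high-risk" criterion are all assumptions.

```python
# Hypothetical sketch of step 1: uncertainty sampling with the baseline encoder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

baseline_ckpt = "AI-Growth-Lab/PatentSBERTa"  # baseline encoder with a fresh 2-label head
tokenizer = AutoTokenizer.from_pretrained(baseline_ckpt)
model = AutoModelForSequenceClassification.from_pretrained(baseline_ckpt, num_labels=2)
model.eval()

# Placeholder claims; the real pool is the unlabeled patent-claim corpus.
claims = ["A wind turbine blade comprising ...", "A beverage container lid ..."]

with torch.no_grad():
    enc = tokenizer(claims, return_tensors="pt", padding=True, truncation=True, max_length=256)
    green_prob = torch.softmax(model(**enc).logits, dim=-1)[:, 1]

# "High-risk" = most uncertain: predicted probability closest to the 0.5 boundary.
uncertainty = (green_prob - 0.5).abs()
top100_idx = uncertainty.argsort()[:100].tolist()  # candidates sent to QLoRA labeling (step 2)
```
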
## Base Model
- `AI-Growth-Lab/PatentSBERTa` (sequence classification head, 2 labels)

## Training Setup
- Seed: 42
- Train rows (augmented): 20,100
- Eval rows: 5,000
- Gold rows: 100
- Hardware used: NVIDIA L4
- Frameworks: `transformers`, `datasets`, `torch`, `peft`, `bitsandbytes`

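The setup above corresponds to a standard `transformers` fine-tuning loop. The sketch below is illustrative only: the seed (42) and the base checkpoint come from this card, while the placeholder datasets, column names, epoch count, and batch size are assumptions.

```python
# Hypothetical sketch of the final fine-tuning run (PatentSBERTa on silver + gold rows).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments, set_seed)

set_seed(42)  # seed reported above

base_ckpt = "AI-Growth-Lab/PatentSBERTa"
tokenizer = AutoTokenizer.from_pretrained(base_ckpt)
model = AutoModelForSequenceClassification.from_pretrained(base_ckpt, num_labels=2)

# Placeholder rows; the real run used 20,100 train rows (silver + 100 gold) and 5,000 eval rows.
train_ds = Dataset.from_dict({"text": ["claim a ...", "claim b ..."], "label": [1, 0]})
eval_ds = Dataset.from_dict({"text": ["claim c ..."], "label": [0]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="assignment3-patentsberta-qlora-gold100",
    num_train_epochs=3,              # illustrative, not reported in this card
    per_device_train_batch_size=16,  # illustrative
    seed=42,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=eval_ds, tokenizer=tokenizer)
trainer.train()
```
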
## Results

### Assignment 3 final model (this repo)
- Eval accuracy: **0.5008**
- Eval macro F1: **0.5006382068**
- Gold100 accuracy: **0.53**
- Gold100 macro F1: **0.5037482842**

### Comparison table (Assignment requirement)
| Model Version | Training Data Source | F1 Score (Eval Set) |
|---|---|---:|
| 1. Baseline | Frozen Embeddings (No Fine-tuning) | 0.7727474956 |
| 2. Assignment 2 Model | Fine-tuned on Silver + Gold (Simple LLM) | 0.4975369710 |
| 3. Assignment 3 Model | Fine-tuned on Silver + Gold (QLoRA) | 0.5006382068 |

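The eval-set numbers above can be recomputed with a simple scoring loop. The snippet below is a sketch: `eval_texts` / `eval_labels` are placeholders for the 5,000 eval_silver rows, and `repo_id` keeps the placeholder username.

```python
# Sketch of how accuracy and macro F1 on the eval set can be recomputed.
import torch
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "<your-username>/assignment3-patentsberta-qlora-gold100"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()

eval_texts = ["claim text ...", "another claim ..."]  # placeholders for the 5,000 eval rows
eval_labels = [1, 0]

preds = []
with torch.no_grad():
    for text in eval_texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
        preds.append(int(model(**enc).logits.argmax(dim=-1)))

print("accuracy:", accuracy_score(eval_labels, preds))
print("macro F1:", f1_score(eval_labels, preds, average="macro"))
```
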
## Intended Use
- Educational/research use for green patent classification experiments.
- Binary label output: non-green (0) vs green (1).

## Limitations
- Dataset and labels are project-specific and may not generalize broadly.
- Part C used an automated acceptance policy for gold labels in this run (no manual overrides).
- The model should not be used for legal or commercial patent decisions without human review.

## Files in this Repository
- `config.json`
- `model.safetensors`
- tokenizer files
- optional: training/evaluation summaries

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo_id = "<your-username>/assignment3-patentsberta-qlora-gold100"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()  # inference mode

text = "A method for reducing CO2 emissions in industrial heat recovery systems..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()
print({"label": pred})  # 0 = not_green, 1 = green
```
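
As a convenience, the same checkpoint can also be used through the `text-classification` pipeline. This is an optional alternative to the snippet above; the generic `LABEL_0`/`LABEL_1` names in the comment assume the config does not define a custom `id2label` mapping.

```python
# Optional: pipeline-based inference with the same checkpoint.
from transformers import pipeline

repo_id = "<your-username>/assignment3-patentsberta-qlora-gold100"
clf = pipeline("text-classification", model=repo_id)
print(clf("A solar-assisted heat pump for residential heating."))
# e.g. [{'label': 'LABEL_1', 'score': ...}] -> LABEL_1 corresponds to green (1)
```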
|