---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- patents
- green-tech
- qlora
- peft
- sequence-classification
model-index:
- name: assignment3-patentsberta-qlora-gold100
results:
- task:
type: text-classification
name: Green patent detection
dataset:
name: patents_50k_green (eval_silver)
type: custom
metrics:
- type: f1
value: 0.5006382068
name: Macro F1
- type: accuracy
value: 0.5008
name: Accuracy
---
# Assignment 3 Model — Green Patent Detection (QLoRA + PatentSBERTa)
## Model Summary
This repository contains the **final downstream Assignment 3 classifier** for green patent detection.
Workflow:
1. Baseline uncertainty sampling on patent claims.
2. QLoRA-based labeling/rationale generation on top-100 high-risk examples.
3. Final PatentSBERTa fine-tuning on `train_silver + 100 gold high-risk`.
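Step 1 above (baseline uncertainty sampling) can be sketched in plain Python. This is an illustrative entropy-based selector, not the project's actual sampling code; the function names and toy probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution; higher = more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(predictions, k):
    """Return indices of the k examples whose predicted distribution
    has the highest entropy (i.e. closest to a 50/50 split)."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:k]

# Toy predicted probabilities [p(not_green), p(green)] for four claims
preds = [[0.95, 0.05], [0.55, 0.45], [0.50, 0.50], [0.80, 0.20]]
print(select_most_uncertain(preds, 2))  # → [2, 1]
```

Selecting the top-k highest-entropy examples is what routes the "high-risk" claims to the QLoRA labeling step.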
## Base Model
- `AI-Growth-Lab/PatentSBERTa` (sequence classification head, 2 labels)
## Training Setup
- Seed: 42
- Train rows (augmented): 20,100
- Eval rows: 5,000
- Gold rows: 100
- Hardware used: NVIDIA L4
- Frameworks: `transformers`, `datasets`, `torch`, `peft`, `bitsandbytes`
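The `peft` + `bitsandbytes` setup for the QLoRA stage typically looks like the following configuration fragment. The exact hyperparameters used in this run are not recorded in this card; the values below are common QLoRA defaults and should be treated as illustrative.

```python
# Illustrative QLoRA configuration; r, alpha, dropout and target modules
# are typical defaults, not the values verified for this run.
from transformers import BitsAndBytesConfig
from peft import LoraConfig, TaskType

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",         # NF4 quantization, standard for QLoRA
    bnb_4bit_compute_dtype="bfloat16", # compute in bf16 on the L4 GPU
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,        # sequence-classification head
    r=16,                              # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "key", "value"],
)
```

Both configs are passed to `from_pretrained(..., quantization_config=bnb_config)` and `get_peft_model(model, lora_config)` respectively.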
## Results
### Assignment 3 final model (this repo)
- Eval accuracy: **0.5008**
- Eval macro F1: **0.5006382068**
- Gold100 accuracy: **0.53**
- Gold100 macro F1: **0.5037482842**
### Comparison table (Assignment requirement)
| Model Version | Training Data Source | F1 Score (Eval Set) |
|---|---|---:|
| 1. Baseline | Frozen Embeddings (No Fine-tuning) | 0.7727474956 |
| 2. Assignment 2 Model | Fine-tuned on Silver + Gold (Simple LLM) | 0.4975369710 |
| 3. Assignment 3 Model | Fine-tuned on Silver + Gold (QLoRA) | 0.5006382068 |
## Intended Use
- Educational/research use for green patent classification experiments.
- Binary label output: non-green (0) vs green (1).
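Mapping the two raw logits to the binary label plus a confidence score can be done with a softmax; this is a minimal pure-Python sketch with hypothetical logit values, independent of the model code below.

```python
import math

ID2LABEL = {0: "not_green", 1: "green"}

def classify(logits):
    """Map raw two-class logits to (label, confidence) via a stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]        # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    pred = probs.index(max(probs))
    return ID2LABEL[pred], probs[pred]

label, confidence = classify([-1.2, 0.8])           # hypothetical logits
print(label, round(confidence, 3))
```

Thresholding on the confidence (rather than the argmax alone) is a simple way to flag borderline patents for the human review recommended under Limitations.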
## Limitations
- Dataset and labels are project-specific and may not generalize broadly.
- Part C used an automated acceptance policy for gold labels in this run (no manual overrides).
- Model should not be used for legal/commercial patent decisions without human review.
## Files in this Repository
- `config.json`
- `model.safetensors`
- tokenizer files
- optional: training/evaluation summaries
## Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo_id = "<your-username>/assignment3-patentsberta-qlora-gold100"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()

text = "A method for reducing CO2 emissions in industrial heat recovery systems..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()
print({"label": int(pred)})  # 0 = not_green, 1 = green
```