---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- patents
- green-tech
- qlora
- peft
- sequence-classification
model-index:
- name: assignment3-patentsberta-qlora-gold100
  results:
  - task:
      type: text-classification
      name: Green patent detection
    dataset:
      name: patents_50k_green (eval_silver)
      type: custom
    metrics:
    - type: f1
      value: 0.5006382068
      name: Macro F1
    - type: accuracy
      value: 0.5008
      name: Accuracy
---

# Assignment 3 Model — Green Patent Detection (QLoRA + PatentSBERTa)
## Model Summary

This repository contains the **final downstream Assignment 3 classifier** for green patent detection.

Workflow:
1. Baseline uncertainty sampling on patent claims.
2. QLoRA-based labeling/rationale generation on the top 100 high-risk examples.
3. Final PatentSBERTa fine-tuning on `train_silver` plus the 100 gold high-risk examples.
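Step 1 of the workflow above can be sketched as a least-confidence selection over model probabilities. This is an illustrative sketch only; the function name and scoring rule are assumptions, and the actual sampling code is not part of this repository.

```python
# Minimal least-confidence uncertainty sampling sketch (hypothetical helper).
import numpy as np

def top_k_uncertain(probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k rows where the model is least confident,
    i.e. where the top predicted class probability is lowest."""
    uncertainty = 1.0 - np.max(probs, axis=1)  # least-confidence score
    return np.argsort(uncertainty)[::-1][:k]   # highest uncertainty first

# Toy example: binary class probabilities for 4 patent claims.
probs = np.array([[0.99, 0.01],
                  [0.55, 0.45],
                  [0.60, 0.40],
                  [0.90, 0.10]])
print(top_k_uncertain(probs, k=2))  # -> [1 2]
```

In the actual workflow, the 100 highest-uncertainty ("high-risk") examples were routed to QLoRA-based labeling.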
## Base Model

- `AI-Growth-Lab/PatentSBERTa` with a sequence-classification head (2 labels)
## Training Setup

- Seed: 42
- Train rows (augmented): 20,100
- Eval rows: 5,000
- Gold rows: 100
- Hardware: NVIDIA L4
- Frameworks: `transformers`, `datasets`, `torch`, `peft`, `bitsandbytes`
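A QLoRA setup with the frameworks listed above typically looks like the following configuration fragment. The LoRA hyperparameters (`r`, `lora_alpha`, `lora_dropout`, `target_modules`) are illustrative assumptions, not the values used in this run.

```python
# Illustrative QLoRA configuration sketch (hyperparameters are assumed).
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "AI-Growth-Lab/PatentSBERTa",
    num_labels=2,
    quantization_config=bnb_config,
)

# LoRA adapters on the attention projections of the BERT-style backbone.
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Fine-tuning then trains only the LoRA adapters and the classification head on top of the frozen 4-bit base weights.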
## Results

### Assignment 3 final model (this repo)

- Eval accuracy: **0.5008**
- Eval macro F1: **0.5006382068**
- Gold100 accuracy: **0.53**
- Gold100 macro F1: **0.5037482842**
### Comparison table (assignment requirement)

| Model Version | Training Data Source | F1 Score (Eval Set) |
|---|---|---:|
| 1. Baseline | Frozen Embeddings (No Fine-tuning) | 0.7727474956 |
| 2. Assignment 2 Model | Fine-tuned on Silver + Gold (Simple LLM) | 0.4975369710 |
| 3. Assignment 3 Model | Fine-tuned on Silver + Gold (QLoRA) | 0.5006382068 |
## Intended Use

- Educational/research use for green patent classification experiments.
- Binary label output: non-green (0) vs. green (1).
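The integer output can be mapped to a readable label with a softmax confidence score. This is a hypothetical convenience helper, not part of the repository; the label names are taken from this card, not from `config.json`.

```python
# Illustrative helper: logits -> (label name, confidence).
import torch

id2label = {0: "not_green", 1: "green"}  # names assumed from this card

def label_with_confidence(logits: torch.Tensor) -> tuple[str, float]:
    probs = torch.softmax(logits, dim=-1)
    conf, pred = torch.max(probs, dim=-1)
    return id2label[int(pred)], float(conf)

# Toy logits favoring class 1 ("green").
label, conf = label_with_confidence(torch.tensor([-1.0, 2.0]))
print(label, round(conf, 3))  # -> green 0.953
```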
## Limitations

- The dataset and labels are project-specific and may not generalize broadly.
- Part C used an automated acceptance policy for gold labels in this run (no manual overrides).
- The model should not be used for legal or commercial patent decisions without human review.
## Files in this Repository

- `config.json`
- `model.safetensors`
- tokenizer files
- optional: training/evaluation summaries
## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo_id = "<your-username>/assignment3-patentsberta-qlora-gold100"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

text = "A method for reducing CO2 emissions in industrial heat recovery systems..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()
print({"label": int(pred)})  # 0 = not_green, 1 = green
```