---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- patents
- green-tech
- qlora
- peft
- sequence-classification
model-index:
- name: assignment3-patentsberta-qlora-gold100
  results:
  - task:
      type: text-classification
      name: Green patent detection
    dataset:
      name: patents_50k_green (eval_silver)
      type: custom
    metrics:
    - type: f1
      value: 0.5006382068
      name: Macro F1
    - type: accuracy
      value: 0.5008
      name: Accuracy
---

# Assignment 3 Model — Green Patent Detection (QLoRA + PatentSBERTa)
## Model Summary

This repository contains the **final downstream Assignment 3 classifier** for green patent detection.

Workflow:
1. Baseline uncertainty sampling on patent claims.
2. QLoRA-based labeling/rationale generation on the top 100 high-risk examples.
3. Final PatentSBERTa fine-tuning on `train_silver` plus the 100 gold high-risk examples.
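Step 1 of the workflow above can be sketched as a least-confidence selection over model probabilities. This is an illustrative sketch only; the function name and scoring rule are assumptions, and the actual sampling code is not part of this repository.

```python
# Minimal least-confidence uncertainty sampling sketch (hypothetical helper).
import numpy as np

def top_k_uncertain(probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k rows where the model is least confident,
    i.e. where the top predicted class probability is lowest."""
    uncertainty = 1.0 - np.max(probs, axis=1)  # least-confidence score
    return np.argsort(uncertainty)[::-1][:k]   # highest uncertainty first

# Toy example: binary class probabilities for 4 patent claims.
probs = np.array([[0.99, 0.01],
                  [0.55, 0.45],
                  [0.60, 0.40],
                  [0.90, 0.10]])
print(top_k_uncertain(probs, k=2))  # -> [1 2]
```

In the actual workflow, the 100 highest-uncertainty ("high-risk") examples were routed to QLoRA-based labeling.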
## Base Model

- `AI-Growth-Lab/PatentSBERTa` with a sequence-classification head (2 labels)
## Training Setup

- Seed: 42
- Train rows (augmented): 20,100
- Eval rows: 5,000
- Gold rows: 100
- Hardware: NVIDIA L4
- Frameworks: `transformers`, `datasets`, `torch`, `peft`, `bitsandbytes`
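A QLoRA setup with the frameworks listed above typically looks like the following configuration fragment. The LoRA hyperparameters (`r`, `lora_alpha`, `lora_dropout`, `target_modules`) are illustrative assumptions, not the values used in this run.

```python
# Illustrative QLoRA configuration sketch (hyperparameters are assumed).
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "AI-Growth-Lab/PatentSBERTa",
    num_labels=2,
    quantization_config=bnb_config,
)

# LoRA adapters on the attention projections of the BERT-style backbone.
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Fine-tuning then trains only the LoRA adapters and the classification head on top of the frozen 4-bit base weights.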
## Results

### Assignment 3 final model (this repo)

- Eval accuracy: **0.5008**
- Eval macro F1: **0.5006382068**
- Gold100 accuracy: **0.53**
- Gold100 macro F1: **0.5037482842**
### Comparison table (assignment requirement)

| Model Version | Training Data Source | F1 Score (Eval Set) |
|---|---|---:|
| 1. Baseline | Frozen Embeddings (No Fine-tuning) | 0.7727474956 |
| 2. Assignment 2 Model | Fine-tuned on Silver + Gold (Simple LLM) | 0.4975369710 |
| 3. Assignment 3 Model | Fine-tuned on Silver + Gold (QLoRA) | 0.5006382068 |
## Intended Use

- Educational/research use for green patent classification experiments.
- Binary label output: non-green (0) vs. green (1).
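The integer output can be mapped to a readable label with a softmax confidence score. This is a hypothetical convenience helper, not part of the repository; the label names are taken from this card, not from `config.json`.

```python
# Illustrative helper: logits -> (label name, confidence).
import torch

id2label = {0: "not_green", 1: "green"}  # names assumed from this card

def label_with_confidence(logits: torch.Tensor) -> tuple[str, float]:
    probs = torch.softmax(logits, dim=-1)
    conf, pred = torch.max(probs, dim=-1)
    return id2label[int(pred)], float(conf)

# Toy logits favoring class 1 ("green").
label, conf = label_with_confidence(torch.tensor([-1.0, 2.0]))
print(label, round(conf, 3))  # -> green 0.953
```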
## Limitations

- The dataset and labels are project-specific and may not generalize broadly.
- Part C used an automated acceptance policy for gold labels in this run (no manual overrides).
- The model should not be used for legal or commercial patent decisions without human review.
## Files in this Repository

- `config.json`
- `model.safetensors`
- tokenizer files
- optional: training/evaluation summaries
## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo_id = "<your-username>/assignment3-patentsberta-qlora-gold100"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

text = "A method for reducing CO2 emissions in industrial heat recovery systems..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()
print({"label": int(pred)})  # 0 = not_green, 1 = green
```