---
language:
  - en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
  - patents
  - green-tech
  - qlora
  - peft
  - sequence-classification
model-index:
  - name: assignment3-patentsberta-qlora-gold100
    results:
      - task:
          type: text-classification
          name: Green patent detection
        dataset:
          name: patents_50k_green (eval_silver)
          type: custom
        metrics:
          - type: f1
            value: 0.5006382068
            name: Macro F1
          - type: accuracy
            value: 0.5008
            name: Accuracy
---

# Assignment 3 Model — Green Patent Detection (QLoRA + PatentSBERTa)

## Model Summary

This repository contains the final downstream Assignment 3 classifier for green patent detection.
The training workflow was:

  1. Baseline uncertainty sampling on patent claims.
  2. QLoRA-based labeling and rationale generation on the top-100 high-risk examples.
  3. Final PatentSBERTa fine-tuning on train_silver plus the 100 gold high-risk examples.
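The uncertainty-sampling step in (1) can be sketched as follows. This is a minimal illustration assuming entropy-based scoring over the model's predicted green-patent probabilities; the function names and the top-k cutoff are assumptions, not the project's actual code.

```python
import math

def entropy(p_green: float) -> float:
    """Binary entropy of the predicted green-patent probability."""
    p = min(max(p_green, 1e-12), 1 - 1e-12)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def top_k_uncertain(probs, k=100):
    """Return indices of the k most uncertain predictions (highest entropy)."""
    ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
    return ranked[:k]

# Probabilities near 0.5 are selected first:
probs = [0.02, 0.48, 0.97, 0.55, 0.10]
print(top_k_uncertain(probs, k=2))  # -> [1, 3]
```

Predictions near 0.5 carry the highest entropy, so the examples the baseline is least sure about are the ones sent on for QLoRA-assisted labeling in step (2).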

## Base Model

  • AI-Growth-Lab/PatentSBERTa (sequence classification head, 2 labels)

## Training Setup

  • Seed: 42
  • Train rows (augmented): 20,100
  • Eval rows: 5,000
  • Gold rows: 100
  • Hardware used: NVIDIA L4
  • Frameworks: transformers, datasets, torch, peft, bitsandbytes
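The card does not record the exact QLoRA hyperparameters. The following is a plausible configuration sketch using the listed frameworks (peft and bitsandbytes); the quantization settings, LoRA rank/alpha, and target modules are illustrative assumptions, not the values used for this run.

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig, TaskType

# 4-bit NF4 quantization of the base model (the standard QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",
)

# LoRA adapters for sequence classification; r/alpha/targets are assumed values
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "key", "value"],
)
```

With this setup, only the LoRA adapter weights (and the classification head) are trained while the quantized base model stays frozen, which is what makes fine-tuning feasible on a single NVIDIA L4.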

## Results

### Assignment 3 final model (this repo)

  • Eval accuracy: 0.5008
  • Eval macro F1: 0.5006382068
  • Gold100 accuracy: 0.53
  • Gold100 macro F1: 0.5037482842

### Comparison table (Assignment requirement)

| Model Version | Training Data Source | F1 Score (Eval Set) |
| --- | --- | --- |
| 1. Baseline | Frozen Embeddings (No Fine-tuning) | 0.7727474956 |
| 2. Assignment 2 Model | Fine-tuned on Silver + Gold (Simple LLM) | 0.4975369710 |
| 3. Assignment 3 Model | Fine-tuned on Silver + Gold (QLoRA) | 0.5006382068 |

## Intended Use

  • Educational/research use for green patent classification experiments.
  • Binary label output: non-green (0) vs green (1).

## Limitations

  • Dataset and labels are project-specific and may not generalize broadly.
  • Part C used an automated acceptance policy for gold labels in this run (no manual overrides).
  • Model should not be used for legal/commercial patent decisions without human review.

## Files in this Repository

  • config.json
  • model.safetensors
  • tokenizer files
  • optional: training/evaluation summaries

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo_id = "<your-username>/assignment3-patentsberta-qlora-gold100"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

text = "A method for reducing CO2 emissions in industrial heat recovery systems..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()
print({"label": int(pred)})  # 0 = not_green, 1 = green
```
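If class probabilities are needed rather than a hard label, the raw logits can be passed through a softmax. This sketch uses illustrative logits rather than output from the model itself:

```python
import torch

def logits_to_probs(logits: torch.Tensor) -> torch.Tensor:
    """Convert raw classifier logits to class probabilities via softmax."""
    return torch.softmax(logits, dim=-1)

# Illustrative logits for [not_green, green]
logits = torch.tensor([[0.2, 1.4]])
probs = logits_to_probs(logits)
print(round(probs[0, 1].item(), 3))  # green-class probability
```

This is useful when downstream filtering wants a confidence threshold (e.g. only accept predictions with probability above 0.9) instead of the argmax label.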