---
language:
  - en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
  - patents
  - green-tech
  - qlora
  - peft
  - sequence-classification
model-index:
  - name: assignment3-patentsberta-qlora-gold100
    results:
      - task:
          type: text-classification
          name: Green patent detection
        dataset:
          name: patents_50k_green (eval_silver)
          type: custom
        metrics:
          - type: f1
            value: 0.5006382068
            name: Macro F1
          - type: accuracy
            value: 0.5008
            name: Accuracy
---

# Assignment 3 Model — Green Patent Detection (QLoRA + PatentSBERTa)

## Model Summary

This repository contains the final downstream Assignment 3 classifier for green patent detection.
The training workflow was:

  1. Baseline uncertainty sampling on patent claims.
  2. QLoRA-based labeling and rationale generation on the top-100 high-risk examples.
  3. Final PatentSBERTa fine-tuning on train_silver plus the 100 gold high-risk examples.
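The uncertainty-sampling step in (1) can be sketched as follows. This is a minimal illustration assuming entropy-based scoring over the model's predicted green-patent probabilities; the function names and the top-k cutoff are assumptions, not the project's actual code.

```python
import math

def entropy(p_green: float) -> float:
    """Binary entropy of the predicted green-patent probability."""
    p = min(max(p_green, 1e-12), 1 - 1e-12)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def top_k_uncertain(probs, k=100):
    """Return indices of the k most uncertain predictions (highest entropy)."""
    ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
    return ranked[:k]

# Probabilities near 0.5 are selected first:
probs = [0.02, 0.48, 0.97, 0.55, 0.10]
print(top_k_uncertain(probs, k=2))  # -> [1, 3]
```

Predictions near 0.5 carry the highest entropy, so the examples the baseline is least sure about are the ones sent on for QLoRA-assisted labeling in step (2).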

## Base Model

  • AI-Growth-Lab/PatentSBERTa (sequence classification head, 2 labels)

## Training Setup

  • Seed: 42
  • Train rows (augmented): 20,100
  • Eval rows: 5,000
  • Gold rows: 100
  • Hardware used: NVIDIA L4
  • Frameworks: transformers, datasets, torch, peft, bitsandbytes
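The card does not record the exact QLoRA hyperparameters. The following is a plausible configuration sketch using the listed frameworks (peft and bitsandbytes); the quantization settings, LoRA rank/alpha, and target modules are illustrative assumptions, not the values used for this run.

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig, TaskType

# 4-bit NF4 quantization of the base model (the standard QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",
)

# LoRA adapters for sequence classification; r/alpha/targets are assumed values
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "key", "value"],
)
```

With this setup, only the LoRA adapter weights (and the classification head) are trained while the quantized base model stays frozen, which is what makes fine-tuning feasible on a single NVIDIA L4.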

## Results

### Assignment 3 final model (this repo)

  • Eval accuracy: 0.5008
  • Eval macro F1: 0.5006382068
  • Gold100 accuracy: 0.53
  • Gold100 macro F1: 0.5037482842

### Comparison table (Assignment requirement)

| Model Version | Training Data Source | F1 Score (Eval Set) |
| --- | --- | --- |
| 1. Baseline | Frozen Embeddings (No Fine-tuning) | 0.7727474956 |
| 2. Assignment 2 Model | Fine-tuned on Silver + Gold (Simple LLM) | 0.4975369710 |
| 3. Assignment 3 Model | Fine-tuned on Silver + Gold (QLoRA) | 0.5006382068 |

## Intended Use

  • Educational/research use for green patent classification experiments.
  • Binary label output: non-green (0) vs green (1).

## Limitations

  • Dataset and labels are project-specific and may not generalize broadly.
  • Part C used an automated acceptance policy for gold labels in this run (no manual overrides).
  • Model should not be used for legal/commercial patent decisions without human review.

## Files in this Repository

  • config.json
  • model.safetensors
  • tokenizer files
  • optional: training/evaluation summaries

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo_id = "<your-username>/assignment3-patentsberta-qlora-gold100"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

text = "A method for reducing CO2 emissions in industrial heat recovery systems..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()
print({"label": int(pred)})  # 0 = not_green, 1 = green
```
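If class probabilities are needed rather than a hard label, the raw logits can be passed through a softmax. This sketch uses illustrative logits rather than output from the model itself:

```python
import torch

def logits_to_probs(logits: torch.Tensor) -> torch.Tensor:
    """Convert raw classifier logits to class probabilities via softmax."""
    return torch.softmax(logits, dim=-1)

# Illustrative logits for [not_green, green]
logits = torch.tensor([[0.2, 1.4]])
probs = logits_to_probs(logits)
print(round(probs[0, 1].item(), 3))  # green-class probability
```

This is useful when downstream filtering wants a confidence threshold (e.g. only accept predictions with probability above 0.9) instead of the argmax label.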