	---
	library_name: transformers
	tags:
	- text-classification
	- distilbert
	- sentiment-analysis
	- new-closed-neutral
	- colab
	---

	# 📌 Model Card: distil-bert-classifier

	This is a DistilBERT model fine-tuned for sequence classification. It labels short text snippets about a place (e.g., a restaurant or other business) as NEW, CLOSED, or NEUTRAL.

	---

	## 🧠 Model Details

	### Model Description

	- Base Model: `distilbert-base-uncased`
	- Task: Sequence Classification
	- Classes: `NEW`, `CLOSED`, `NEUTRAL`
	- Language: English
	- License: MIT (confirm if needed)
	- Developer: virustechhacks

	This model helps extract signals about business status from textual data such as reviews, posts, or headlines.

	---

	## 🔗 Model Sources

	- Repository: https://huggingface.co/virustechhacks/distil-bert-classifier

	---

	## 🚀 Uses

	### ✅ Direct Use

	Classify short text snippets into:
	- `NEW` → Newly opened places
	- `CLOSED` → Shut down or no longer operating
	- `NEUTRAL` → No clear status signal

	### 🔄 Downstream Use

	Outputs can be aggregated into features like:
	- `closed_signal_ratio`
	- `new_signal_ratio`
	- `mention_count`

	These can feed into larger ML pipelines (e.g., XGBoost models).
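As a rough illustration of the aggregation described above, the sketch below turns the predicted labels for one place into the three features. The function name `aggregate_signals` and the definition of each ratio (label count divided by total mentions) are assumptions for illustration; the model card does not specify how the features are computed.

```python
from collections import Counter


def aggregate_signals(labels):
    """Aggregate per-snippet labels for one place into simple features.

    Assumption: each ratio is the share of snippets carrying that label.
    """
    counts = Counter(labels)
    mention_count = len(labels)
    if mention_count == 0:
        # No mentions: return zeroed features instead of dividing by zero.
        return {
            "closed_signal_ratio": 0.0,
            "new_signal_ratio": 0.0,
            "mention_count": 0,
        }
    return {
        "closed_signal_ratio": counts["CLOSED"] / mention_count,
        "new_signal_ratio": counts["NEW"] / mention_count,
        "mention_count": mention_count,
    }


# Hypothetical labels predicted for four snippets about the same place.
features = aggregate_signals(["CLOSED", "NEUTRAL", "CLOSED", "NEW"])
print(features)
```

The resulting dictionary can be used as one row of a feature table for a downstream model.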

	### ⚠️ Out-of-Scope

	- General sentiment analysis beyond defined labels
	- Non-English text
	- Long documents (>128 tokens)
	- High-stakes decision-making systems

	---

	## ⚠️ Bias, Risks, and Limitations

	- Synthetic Data Bias: Trained on rule-based synthetic data, so it may not generalize well to real-world language.
	- No Time Awareness: Cannot distinguish recent signals from outdated ones.
	- Token Limit: Inputs longer than 128 tokens are truncated.
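Because truncation happens silently, it can help to flag inputs that exceed the limit before classifying them. The helper below is a minimal sketch; the name `exceeds_token_limit` is hypothetical, and it assumes `tokenizer` is a Hugging Face tokenizer (or any callable that returns a mapping with an `input_ids` list).

```python
def exceeds_token_limit(text, tokenizer, max_length=128):
    """Return True if `text` would be truncated at `max_length` tokens."""
    # Tokenize without truncation so the full token count is visible.
    # With a Hugging Face tokenizer the count includes special tokens,
    # which also count toward max_length during inference.
    input_ids = tokenizer(text, truncation=False)["input_ids"]
    return len(input_ids) > max_length
```

With a tokenizer loaded via `AutoTokenizer.from_pretrained`, a call such as `exceeds_token_limit(text, tokenizer)` could be used to route long inputs to a chunking step or to log them for review.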

	---

	## 💡 Recommendations

	For production use:
	- Fine-tune on real-world datasets
	- Add timestamp-based features
	- Evaluate thoroughly on live data
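One possible way to add a timestamp-based feature is to weight each mention by its age, so that recent signals count more than old ones. The sketch below is an assumption-laden example, not part of the model: the function name `recency_weighted_ratio`, the exponential decay, and the 90-day half-life are all hypothetical choices.

```python
from datetime import datetime


def recency_weighted_ratio(mentions, target_label, now, half_life_days=90.0):
    """Recency-weighted share of mentions carrying `target_label`.

    mentions: list of (label, timestamp) pairs.
    Assumption: each mention has weight 0.5 ** (age_in_days / half_life_days).
    """
    total = 0.0
    matched = 0.0
    for label, ts in mentions:
        # Clamp negative ages (timestamps in the future) to zero.
        age_days = max((now - ts).total_seconds() / 86400.0, 0.0)
        weight = 0.5 ** (age_days / half_life_days)
        total += weight
        if label == target_label:
            matched += weight
    return matched / total if total > 0 else 0.0


# Hypothetical mentions: an old NEW signal and a recent CLOSED signal.
now = datetime(2024, 6, 1)
mentions = [
    ("NEW", datetime(2023, 6, 1)),
    ("CLOSED", datetime(2024, 5, 25)),
]
score = recency_weighted_ratio(mentions, "CLOSED", now)
print(round(score, 3))
```

Here the recent CLOSED mention dominates the year-old NEW mention, which is the behavior a plain count-based ratio cannot express.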

	---

	## 🛠️ How to Use

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch
	import torch.nn.functional as F

	repo_name = "virustechhacks/distil-bert-classifier"

	tokenizer = AutoTokenizer.from_pretrained(repo_name)
	model = AutoModelForSequenceClassification.from_pretrained(repo_name)

	id_to_label = {0: "NEW", 1: "CLOSED", 2: "NEUTRAL"}

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model.to(device)

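	def predict_status(text):
	    """Return (label, confidence) for a single text snippet."""
	    # Tokenize, truncating to the 128-token limit.
	    inputs = tokenizer(
	        text,
	        truncation=True,
	        padding="max_length",
	        max_length=128,
	        return_tensors="pt"
	    )
	    # Move input tensors to the same device as the model.
	    inputs = {k: v.to(device) for k, v in inputs.items()}

	    with torch.no_grad():
	        outputs = model(**inputs)

	    # Convert logits to probabilities and pick the most likely class.
	    probs = F.softmax(outputs.logits, dim=-1)
	    confidence, pred = torch.max(probs, dim=1)

	    return id_to_label[pred.item()], confidence.item()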

	# Example
	print(predict_status("Grand opening this weekend!"))
	print(predict_status("The store ceased operations."))