---
library_name: transformers
tags:
- text-classification
- distilbert
- sentiment-analysis
- new-closed-neutral
- colab
---
# 📌 Model Card: distil-bert-classifier
This model is a fine-tuned DistilBERT model for sequence classification, designed to identify whether a place (e.g., restaurants, businesses) is **NEW**, **CLOSED**, or **NEUTRAL** based on short text snippets.
---
## 🧠 Model Details
### Model Description
- **Base Model:** `distilbert-base-uncased`
- **Task:** Sequence Classification
- **Classes:** `NEW`, `CLOSED`, `NEUTRAL`
- **Language:** English
- **License:** MIT *(confirm if needed)*
- **Developer:** virustechhacks
This model helps extract signals about business status from textual data such as reviews, posts, or headlines.
---
## 🔗 Model Sources
- **Repository:** https://huggingface.co/virustechhacks/distil-bert-classifier
---
## 🚀 Uses
### ✅ Direct Use
Classify short text snippets into:
- `NEW` → Newly opened places
- `CLOSED` → Shut down or no longer operating
- `NEUTRAL` → No clear status signal
### 🔄 Downstream Use
Outputs can be aggregated into features like:
- `closed_signal_ratio`
- `new_signal_ratio`
- `mention_count`
These can feed into larger ML pipelines (e.g., XGBoost models).
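As a rough illustration of that aggregation step, here is a minimal sketch. The function name `aggregate_signals` is hypothetical, and the ratio definitions (share of a place's mentions carrying each label) are an assumption, since this card does not define them.

```python
from collections import Counter


def aggregate_signals(labels):
    """Summarize the predicted labels for one place into features.

    `labels` is a list of strings such as "NEW", "CLOSED", "NEUTRAL".
    Assumption: each ratio is the fraction of mentions with that label.
    """
    counts = Counter(labels)
    mention_count = len(labels)
    if mention_count == 0:
        # No mentions: report zero ratios instead of dividing by zero.
        return {
            "mention_count": 0,
            "new_signal_ratio": 0.0,
            "closed_signal_ratio": 0.0,
        }
    return {
        "mention_count": mention_count,
        "new_signal_ratio": counts["NEW"] / mention_count,
        "closed_signal_ratio": counts["CLOSED"] / mention_count,
    }
```

The resulting dictionary can be used as one row of a feature table, keyed by place, for a downstream model.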
### ⚠️ Out-of-Scope
- General sentiment analysis beyond defined labels
- Non-English text
- Long documents (>128 tokens)
- High-stakes decision-making systems
---
## ⚠️ Bias, Risks, and Limitations
- **Synthetic Data Bias:**
Trained on rule-based synthetic data, so it may not generalize well to real-world language.
- **No Time Awareness:**
Cannot distinguish *recent vs outdated* signals.
- **Token Limit:**
Inputs >128 tokens are truncated.
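To see whether a given input would hit the token limit, one option is to count its tokens before classifying it. The sketch below assumes a Hugging Face tokenizer (such as the one loaded in the usage section) is passed in; the helper name `exceeds_token_limit` is hypothetical.

```python
def exceeds_token_limit(text, tokenizer, max_length=128):
    """Return True if `text` would be truncated at `max_length` tokens."""
    # Calling the tokenizer without truncation returns the full list of
    # token ids, including any special tokens the tokenizer adds.
    n_tokens = len(tokenizer(text)["input_ids"])
    return n_tokens > max_length
```

Inputs flagged this way could be shortened or split before classification, since anything past the limit is otherwise dropped.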
---
## 💡 Recommendations
For production use:
- Fine-tune on real-world datasets
- Add timestamp-based features
- Evaluate thoroughly on live data
---
## 🛠️ How to Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

repo_name = "virustechhacks/distil-bert-classifier"
tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = AutoModelForSequenceClassification.from_pretrained(repo_name)

# Map class indices to labels
id_to_label = {0: "NEW", 1: "CLOSED", 2: "NEUTRAL"}

# Run on GPU when available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def predict_status(text):
    """Return (label, confidence) for a single text snippet."""
    inputs = tokenizer(
        text,
        truncation=True,  # inputs longer than 128 tokens are cut off
        padding="max_length",
        max_length=128,
        return_tensors="pt",
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    # Convert logits to probabilities and take the most likely class
    probs = F.softmax(outputs.logits, dim=-1)
    confidence, pred = torch.max(probs, dim=1)
    return id_to_label[pred.item()], confidence.item()


# Example
print(predict_status("Grand opening this weekend!"))
print(predict_status("The store ceased operations."))