# Roberta-Base Quantized Model for Toxic-Comment-Classification
This repository hosts a quantized version of the RoBERTa model, fine-tuned for toxic comment classification. The model has been optimized using FP16 quantization for efficient deployment without significant accuracy loss.
## Model Details
- **Model Architecture:** RoBERTa Base (`roberta-base`)
- **Task:** Binary Toxic Comment Classification (Toxic/Non-Toxic)
- **Dataset:** Classified_comments
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Installation
```bash
pip install transformers datasets scikit-learn
```
---
## Loading the Model
```python
import re

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Use a GPU when one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model.
# NOTE: "roberta-base" is the base checkpoint; point both calls at this
# repository's fine-tuned weights to reproduce the reported metrics.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2).to(device)

# Define test sentences
new_comments = [
    "I hate you so much, you are disgusting.",
    "What a terrible idea. Just awful.",
    "You are looking beautiful today",
]


# Tokenize and predict
def predict_comments(texts, model, tokenizer):
    # If a single string is passed, convert to list
    if isinstance(texts, str):
        texts = [texts]

    # Preprocess (same as training)
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # strip URLs
        text = re.sub(r'\@\w+|\#', '', text)  # strip @mentions and '#' signs
        text = re.sub(r"[^a-zA-Z0-9\s.,!?']", '', text)  # drop other symbols
        text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
        return text

    cleaned_texts = [preprocess(text) for text in texts]

    # Tokenize and move the inputs to the model's device (CPU/GPU)
    inputs = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1).tolist()

    # Map predictions
    label_map = {0: "Non-Toxic", 1: "Toxic"}
    return [label_map[pred] for pred in predictions]


# Classify the test sentences
print(predict_comments(new_comments, model, tokenizer))
```
---
## Performance Metrics
- **Accuracy:** 0.979737
- **Precision:** 0.976084
- **Recall:** 0.984133
- **F1 Score:** 0.980092
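This card does not say how these figures were computed, so the following is only a minimal sketch of how the four metrics relate for a binary task. It assumes `Toxic` (label `1`) is the positive class. The `binary_metrics` helper and the toy labels are hypothetical. The last lines check that the reported F1 score agrees with the reported precision and recall.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (not taken from the real evaluation set)
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)

# Consistency check on the reported numbers: F1 is the harmonic mean
# of precision and recall.
reported_precision = 0.976084
reported_recall = 0.984133
derived_f1 = 2 * reported_precision * reported_recall / (reported_precision + reported_recall)
print(round(derived_f1, 6))
```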
---
## Fine-Tuning Details
### Dataset
The dataset is sourced from Kaggle (`Classified_comment.csv`). It contains 140,000 labeled comments (Toxic or Non-Toxic).
The original training and testing sets were merged, shuffled, and re-split using an 80/20 ratio.
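The merge, shuffle, and re-split step can be sketched roughly as follows. This is an illustration, not the original preprocessing script: the `merge_shuffle_split` helper, the fixed seed, and the `(text, label)` row layout are all assumptions.

```python
import random


def merge_shuffle_split(train_rows, test_rows, test_ratio=0.2, seed=42):
    """Merge two row lists, shuffle them, and split into new train/test sets."""
    rows = list(train_rows) + list(test_rows)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(rows)
    n_test = round(len(rows) * test_ratio)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


# Hypothetical rows in (text, label) form, standing in for the CSV contents
original_train = [(f"comment {i}", i % 2) for i in range(70)]
original_test = [(f"comment {i}", i % 2) for i in range(70, 100)]

new_train, new_test = merge_shuffle_split(original_train, original_test)
print(len(new_train), len(new_test))
```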
### Training
- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 2e-5
- **Evaluation strategy:** `epoch`
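As a rough guide to what these settings imply, the sketch below estimates the number of optimizer steps. It assumes that the 140,000 comments are the merged total before the 80/20 split, that training ran on a single device, and that no gradient accumulation was used. None of these is confirmed by this card.

```python
import math

total_comments = 140_000   # assumed to be the merged dataset size
train_fraction = 0.8       # from the 80/20 re-split
batch_size = 8
epochs = 3

train_examples = round(total_comments * train_fraction)
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(train_examples, steps_per_epoch, total_steps)
```

With evaluation strategy `epoch`, evaluation would run once after each block of `steps_per_epoch` steps.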
---
## Quantization
Post-training quantization was applied by casting the model weights to half precision (FP16) with PyTorch's `half()` method, which reduces model size and inference time.
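To make the size saving concrete, here is a small sketch that uses only Python's standard `struct` module, whose `e` format is IEEE 754 half precision. The weight values are made up for illustration. On a real model, `model.half()` performs the cast on every parameter.

```python
import struct

# Made-up values standing in for a small slice of model weights
weights = [0.1, -0.25, 0.003, 1.5, -0.75, 0.042]
count = len(weights)

fp32_blob = struct.pack(f"<{count}f", *weights)  # 4 bytes per value
fp16_blob = struct.pack(f"<{count}e", *weights)  # 2 bytes per value

print(len(fp32_blob), "bytes as FP32")
print(len(fp16_blob), "bytes as FP16")

# Reading the FP16 bytes back shows the rounding introduced by the cast
restored = struct.unpack(f"<{count}e", fp16_blob)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("largest round-trip error:", max_error)
```

Storage is halved, and each weight in this sample shifts by a small rounding error.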
---
## Repository Structure
```text
.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                      # Model documentation
```
---
## Limitations
- The model is trained specifically for binary toxic comment classification (Toxic/Non-Toxic).
- FP16 quantization may result in slight numerical instability in edge cases.
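The edge cases mentioned above come from the narrow range of half precision. The sketch below uses the standard `struct` module to show what happens at the extremes. The `to_fp16` helper is hypothetical. Python raises an error on overflow, whereas tensor libraries typically produce `inf` instead.

```python
import struct


def to_fp16(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


FP16_MAX = 65504.0  # largest finite half-precision value

largest = to_fp16(FP16_MAX)  # representable exactly
tiny = to_fp16(1e-8)         # below the smallest subnormal, flushes to zero

try:
    to_fp16(70000.0)         # beyond the FP16 range
    overflowed = False
except OverflowError:
    overflowed = True

print(largest, tiny, overflowed)
```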
---
## Contributing
Feel free to open issues or submit pull requests to improve the model or documentation.