# RoBERTa-Base Quantized Model for Toxic Comment Classification

This repository hosts a quantized version of the RoBERTa model, fine-tuned for toxic-comment classification. The model has been optimized using FP16 quantization for efficient deployment without significant accuracy loss.

## Model Details

- **Model Architecture:** RoBERTa Base
- **Task:** Binary Toxic-Comment Classification (Toxic/Non-Toxic)
- **Dataset:** Classified_comments
- **Quantization:** Float16 (FP16)
- **Fine-tuning Framework:** Hugging Face Transformers

---

## Installation

```bash
pip install transformers datasets scikit-learn
```

---

## Loading the Model

```python
import re

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Select a device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model
# (replace "roberta-base" with the path to the fine-tuned quantized checkpoint)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2).to(device)

# Define test sentences
new_comments = [
    "I hate you so much, you are disgusting.",
    "What a terrible idea. Just awful.",
    "You are looking beautiful today",
]

# Tokenize and predict
def predict_comments(texts, model, tokenizer):
    # If a single string is passed, convert it to a list
    if isinstance(texts, str):
        texts = [texts]

    # Preprocess (same as training)
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)
        text = re.sub(r'\@\w+|\#', '', text)
        text = re.sub(r"[^a-zA-Z0-9\s.,!?']", '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    cleaned_texts = [preprocess(text) for text in texts]

    # Tokenize and move tensors to the model's device (CPU/GPU)
    inputs = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1).tolist()

    # Map predictions to labels
    label_map = {0: "Non-Toxic", 1: "Toxic"}
    return [label_map[pred] for pred in predictions]

# Run inference on the test sentences
print(predict_comments(new_comments, model, tokenizer))
```

---

## Performance Metrics

- **Accuracy:** 0.979737
- **Precision:** 0.976084
- **Recall:** 0.984133
- **F1 Score:** 0.980092

---

## Fine-Tuning Details

### Dataset

The dataset is sourced from Kaggle (`Classified_comment.csv`). It contains 140,000 labeled comments (Toxic or Non-Toxic). The original training and testing sets were merged, shuffled, and re-split using an 80/20 ratio.

### Training

- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 2e-5
- **Evaluation strategy:** `epoch`

---

## Quantization

Post-training quantization was applied using PyTorch's `half()` precision (FP16) to reduce model size and inference time.

---

## Repository Structure

```
.
├── quantized-model/           # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
├── README.md                  # Model documentation
```

---

## Limitations

- The model is trained specifically for binary toxic-comment classification.
- FP16 quantization may result in slight numerical instability in edge cases.
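As a reference, the reported metrics can be reproduced with scikit-learn once predictions are collected on the 20% test split. A minimal sketch, using toy labels rather than the real test set (1 = Toxic is the positive class, matching the label map above):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels and predictions stand in for the real held-out test split
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

# By default the positive class is label 1, i.e. "Toxic"
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.4f}  Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
```

On the real data, `y_pred` would come from `predict_comments` (mapped back to 0/1) over the test set.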
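The FP16 quantization step is a single cast with PyTorch's `half()`. A minimal sketch, using a tiny randomly initialized RoBERTa as a stand-in for the fine-tuned checkpoint (which would normally be loaded with `from_pretrained`); the output directory name follows the repository layout above:

```python
import torch
from transformers import RobertaConfig, RobertaForSequenceClassification

# Tiny random model for illustration; the real workflow loads the
# fine-tuned FP32 checkpoint instead
config = RobertaConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,
)
model = RobertaForSequenceClassification(config)

# Post-training FP16 quantization: cast all floating-point weights to half precision
model = model.half()

# Persist the quantized weights alongside config.json
model.save_pretrained("quantized-model")
```

Halving the parameter width roughly halves the checkpoint size; inference should then be run on hardware with native FP16 support to realize the speedup.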
---

## Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.