# RoBERTa-Base Quantized Model for Toxic Comment Classification

This repository hosts a quantized version of the RoBERTa model, fine-tuned for toxic-comment classification. The model has been optimized using FP16 quantization for efficient deployment without significant accuracy loss.

## Model Details

- **Model Architecture:** RoBERTa Base
- **Task:** Binary Toxic-Comment Classification (Toxic/Non-Toxic)
- **Dataset:** Classified_comments
- **Quantization:** Float16 (FP16)
- **Fine-tuning Framework:** Hugging Face Transformers

---

## Installation

```bash
pip install transformers datasets scikit-learn
```

---

## Loading the Model

```python
import re

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Select a device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model
# (replace "roberta-base" with the path to the fine-tuned quantized checkpoint)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2).to(device)

# Define test sentences
new_comments = [
    "I hate you so much, you are disgusting.",
    "What a terrible idea. Just awful.",
    "You are looking beautiful today",
]

# Tokenize and predict
def predict_comments(texts, model, tokenizer):
    # If a single string is passed, convert it to a list
    if isinstance(texts, str):
        texts = [texts]

    # Preprocess (same as training)
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)
        text = re.sub(r'\@\w+|\#', '', text)
        text = re.sub(r"[^a-zA-Z0-9\s.,!?']", '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    cleaned_texts = [preprocess(text) for text in texts]

    # Tokenize and move tensors to the model's device (CPU/GPU)
    inputs = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1).tolist()

    # Map predictions to labels
    label_map = {0: "Non-Toxic", 1: "Toxic"}
    return [label_map[pred] for pred in predictions]

# Run inference on the test sentences
print(predict_comments(new_comments, model, tokenizer))
```

---

## Performance Metrics

- **Accuracy:** 0.979737
- **Precision:** 0.976084
- **Recall:** 0.984133
- **F1 Score:** 0.980092

---

## Fine-Tuning Details

### Dataset

The dataset is sourced from Kaggle (`Classified_comment.csv`). It contains 140,000 labeled comments (Toxic or Non-Toxic). The original training and testing sets were merged, shuffled, and re-split using an 80/20 ratio.

### Training

- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 2e-5
- **Evaluation strategy:** `epoch`

---

## Quantization

Post-training quantization was applied using PyTorch's `half()` precision (FP16) to reduce model size and inference time.

---

## Repository Structure

```
.
├── quantized-model/           # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
├── README.md                  # Model documentation
```

---

## Limitations

- The model is trained specifically for binary toxic-comment classification.
- FP16 quantization may result in slight numerical instability in edge cases.
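As a reference, the reported metrics can be reproduced with scikit-learn once predictions are collected on the 20% test split. A minimal sketch, using toy labels rather than the real test set (1 = Toxic is the positive class, matching the label map above):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels and predictions stand in for the real held-out test split
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

# By default the positive class is label 1, i.e. "Toxic"
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.4f}  Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
```

On the real data, `y_pred` would come from `predict_comments` (mapped back to 0/1) over the test set.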
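The FP16 quantization step is a single cast with PyTorch's `half()`. A minimal sketch, using a tiny randomly initialized RoBERTa as a stand-in for the fine-tuned checkpoint (which would normally be loaded with `from_pretrained`); the output directory name follows the repository layout above:

```python
import torch
from transformers import RobertaConfig, RobertaForSequenceClassification

# Tiny random model for illustration; the real workflow loads the
# fine-tuned FP32 checkpoint instead
config = RobertaConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,
)
model = RobertaForSequenceClassification(config)

# Post-training FP16 quantization: cast all floating-point weights to half precision
model = model.half()

# Persist the quantized weights alongside config.json
model.save_pretrained("quantized-model")
```

Halving the parameter width roughly halves the checkpoint size; inference should then be run on hardware with native FP16 support to realize the speedup.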
---

## Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.