# RoBERTa-Base Quantized Model for Toxic Comment Classification
This repository hosts a quantized version of the RoBERTa model, fine-tuned for toxic-comment classification. The model has been optimized using FP16 quantization for efficient deployment without significant accuracy loss.
## Model Details
- **Model Architecture:** RoBERTa Base (`roberta-base`)
- **Task:** Binary Toxic-Comment Classification (Toxic/Non-Toxic)
- **Dataset:** Classified_comments
- **Quantization:** Float16 (FP16)
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Installation
```bash
pip install torch transformers datasets scikit-learn
```
---
## Loading the Model
```python
import re
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Select GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and the fine-tuned, quantized model
# (the "quantized-model" directory from the repository structure below)
tokenizer = RobertaTokenizer.from_pretrained("quantized-model")
model = RobertaForSequenceClassification.from_pretrained("quantized-model").to(device)

# Define test sentences
new_comments = [
    "I hate you so much, you are disgusting.",
    "What a terrible idea. Just awful.",
    "You are looking beautiful today"
]

# Tokenize and predict
def predict_comments(texts, model, tokenizer):
    # If a single string is passed, convert to list
    if isinstance(texts, str):
        texts = [texts]

    # Preprocess (same as training)
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)   # strip URLs
        text = re.sub(r'\@\w+|\#', '', text)                  # strip @mentions and '#'
        text = re.sub(r"[^a-zA-Z0-9\s.,!?']", '', text)       # keep letters, digits, basic punctuation
        text = re.sub(r'\s+', ' ', text).strip()              # collapse whitespace
        return text

    cleaned_texts = [preprocess(text) for text in texts]

    # Tokenize and move the input tensors to the model's device (CPU/GPU)
    inputs = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    # Run inference without tracking gradients
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=1).tolist()

    # Map class indices to human-readable labels
    label_map = {0: "Non-Toxic", 1: "Toxic"}
    return [label_map[pred] for pred in predictions]
```
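A quick usage example, classifying the test sentences defined above:
```python
# Run the helper on the sample comments and print one result per line
results = predict_comments(new_comments, model, tokenizer)
for comment, label in zip(new_comments, results):
    print(f"{label}: {comment}")
```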
---
## Performance Metrics
- **Accuracy:** 0.979737
- **Precision:** 0.976084
- **Recall:** 0.984133
- **F1 Score:** 0.980092
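These metrics can be recomputed on the held-out test split with scikit-learn. A minimal sketch, assuming `y_true` and `y_pred` hold the true and predicted labels for the test set (both names are illustrative placeholders):
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true / y_pred: lists of 0 (Non-Toxic) / 1 (Toxic) labels for the test split
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 Score: ", f1_score(y_true, y_pred))
```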
---
## Fine-Tuning Details
### Dataset
The dataset is sourced from Kaggle (`Classified_comment.csv`). It contains 140,000 labeled comments (Toxic or Non-Toxic).
The original training and testing sets were merged, shuffled, and re-split using an 80/20 ratio.
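A minimal sketch of that re-split, assuming the merged data sits in a pandas DataFrame loaded from the CSV (the `random_state` value is illustrative):
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the merged, labeled comments
df = pd.read_csv("Classified_comment.csv")

# Shuffle and re-split 80/20
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)
```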
### Training
- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 2e-5
- **Evaluation strategy:** `epoch`
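For reference, these hyperparameters map onto Hugging Face `TrainingArguments` roughly as below. This is a sketch rather than the exact training script: `output_dir`, `train_dataset`, and `eval_dataset` are placeholders, and in recent Transformers releases the `evaluation_strategy` argument is spelled `eval_strategy`.
```python
from transformers import TrainingArguments, Trainer

# Hyperparameters from the list above
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,                  # the RobertaForSequenceClassification instance
    args=training_args,
    train_dataset=train_dataset,  # tokenized 80% train split (placeholder)
    eval_dataset=eval_dataset,    # tokenized 20% test split (placeholder)
)
trainer.train()
```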
---
## Quantization
Post-training quantization was applied using PyTorch's `half()` precision (FP16) to reduce model size and inference time.
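A minimal sketch of that step, assuming the fine-tuned full-precision model and tokenizer are already loaded (the output directory matches this repository's layout):
```python
# Convert all model weights to FP16 and save alongside the tokenizer
model = model.half()
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```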
---
## Repository Structure
```
.
├── quantized-model/        # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md               # Model documentation
```
---
## Limitations
- The model is trained specifically for binary toxic-comment classification and may not generalize well to other tasks or domains.
- FP16 quantization may result in slight numerical instability in edge cases.
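If FP16 inference proves unstable in your setting (or you are running on a CPU, where FP16 kernels are slow or poorly supported), the quantized weights can be upcast back to FP32 after loading; a sketch:
```python
from transformers import RobertaForSequenceClassification

# Load the FP16 checkpoint, then upcast all weights to FP32 for more stable inference
model = RobertaForSequenceClassification.from_pretrained("quantized-model")
model = model.float()
```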
---
## Contributing
Feel free to open issues or submit pull requests to improve the model or documentation.