
# RoBERTa-Base Quantized Model for Toxic Comment Classification

This repository hosts a quantized version of the RoBERTa-Base model, fine-tuned for toxic comment classification. The model has been optimized with FP16 quantization for efficient deployment without significant accuracy loss.

## Model Details

- **Model Architecture:** RoBERTa Base
- **Task:** Binary Toxic Comment Classification (Toxic/Non-Toxic)
- **Dataset:** Classified_comments
- **Quantization:** Float16 (FP16)
- **Fine-tuning Framework:** Hugging Face Transformers

## Installation

```bash
pip install torch transformers datasets scikit-learn
```

## Loading the Model

```python
import re

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Select a device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model
# (swap "roberta-base" for the fine-tuned checkpoint, e.g. the quantized-model/ directory in this repo)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2).to(device)

# Define test sentences
new_comments = [
    "I hate you so much, you are disgusting.",
    "What a terrible idea. Just awful.",
    "You are looking beautiful today"
]


# Tokenize and predict
def predict_comments(texts, model, tokenizer):
    # If a single string is passed, convert to list
    if isinstance(texts, str):
        texts = [texts]
    
    # Preprocess (same as training)
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)
        text = re.sub(r'\@\w+|\#','', text)
        text = re.sub(r"[^a-zA-Z0-9\s.,!?']", '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    cleaned_texts = [preprocess(text) for text in texts]

    # Tokenize and move tensors to the model's device (CPU/GPU)
    inputs = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    # Run inference without tracking gradients
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=1).tolist()

    # Map class indices to human-readable labels
    label_map = {0: "Non-Toxic", 1: "Toxic"}
    return [label_map[pred] for pred in predictions]
```
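
A quick sanity check on the sample comments above (the predicted labels depend on the fine-tuned weights you load):

```python
results = predict_comments(new_comments, model, tokenizer)
for comment, label in zip(new_comments, results):
    print(f"{label}: {comment}")
```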

## Performance Metrics

- **Accuracy:** 0.979737
- **Precision:** 0.976084
- **Recall:** 0.984133
- **F1 Score:** 0.980092
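
These scores can be reproduced on the held-out test split with scikit-learn; a minimal sketch, assuming `y_true` and `y_pred` hold the test labels and the model's predictions encoded as 0/1:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical placeholder arrays; in practice, use the test-split labels
# and the 0/1 predictions produced by the inference code above.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 Score: ", f1_score(y_true, y_pred))
```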

## Fine-Tuning Details

### Dataset

The dataset is sourced from Kaggle (Classified_comment.csv). It contains 140,000 labeled comments (Toxic or Non-Toxic).
The original training and testing sets were merged, shuffled, and re-split using an 80/20 ratio, as sketched below.
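
A minimal sketch of that re-split, assuming the merged data sits in a pandas DataFrame with a `label` column (the column name is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical merged dataset: all comments and labels in one file
df = pd.read_csv("Classified_comment.csv")

# Shuffle and re-split 80/20, preserving the class balance
train_df, test_df = train_test_split(
    df, test_size=0.2, shuffle=True, stratify=df["label"], random_state=42
)
```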

### Training

- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 2e-5
- **Evaluation strategy:** epoch
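
These hyperparameters map onto a Hugging Face `TrainingArguments` configuration roughly as follows (a sketch; the output directory and the tokenized dataset variables are assumptions):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # assumed output directory
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",     # named eval_strategy in newer transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # assumed tokenized train split
    eval_dataset=eval_dataset,       # assumed tokenized test split
)
trainer.train()
```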

## Quantization

Post-training quantization was applied using PyTorch's `half()` method, converting the fine-tuned weights to FP16 to reduce model size and inference time.
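
The conversion itself is a one-liner; a sketch of quantizing and saving the fine-tuned model (the output path follows the repository layout below):

```python
# Convert all floating-point weights to FP16, then save alongside the tokenizer
model = model.half()
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```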


## Repository Structure

```
.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                      # Model documentation
```

## Limitations

- The model is trained specifically for binary toxic comment classification.
- FP16 quantization may result in slight numerical instability in edge cases.

## Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.
