# Sarcasm Detection with BERT

This repository contains a fine-tuned BERT model for detecting sarcasm in headlines and text. The model reaches 93.86% accuracy on a held-out evaluation split when distinguishing sarcastic from non-sarcastic content.

---

## Model Details

- **Model Name:** BERT-Base-Uncased Fine-tuned for Sarcasm Detection
- **Model Architecture:** BERT Base (110M parameters)
- **Task:** Binary Classification (Sarcastic vs. Non-Sarcastic)
- **Dataset:** Sarcasm Headlines Dataset
- **Quantization:** Float16 (for optimized deployment)
- **Fine-tuning Framework:** Hugging Face Transformers

---

## Dataset

The model was trained on the **Sarcasm Headlines Dataset**, which contains:

- **Total Samples:** 26,709 headlines
- **Features:**
  - `headline`: The text content to classify
  - `is_sarcastic`: Binary label (1 for sarcastic, 0 for non-sarcastic)
- **Train/Test Split:** 90% training, 10% evaluation (see the loading sketch below)
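A minimal sketch of how the dataset can be loaded and split with the `datasets` library. The filename `Sarcasm_Headlines_Dataset.json` (the Kaggle distribution ships as JSON lines) and the fixed seed are assumptions; adjust them to your local copy.

```python
from datasets import load_dataset

# Load the JSON-lines file (filename assumed; point this at your local copy)
dataset = load_dataset("json", data_files="Sarcasm_Headlines_Dataset.json", split="train")

# Keep only the columns used for training
drop = [c for c in dataset.column_names if c not in ("headline", "is_sarcastic")]
dataset = dataset.remove_columns(drop)

# 90% train / 10% evaluation split (seed is an assumption, for reproducibility)
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
print(train_ds)
print(eval_ds)
```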
---

## Performance Metrics

| Epoch | Training Loss | Validation Loss | Accuracy   |
|-------|---------------|-----------------|------------|
| 1     | 0.2048        | 0.1821          | 92.96%     |
| 2     | 0.1138        | 0.2792          | 91.01%     |
| 3     | 0.0586        | 0.2372          | **93.86%** |

**Final Model Performance:**

- **Best Accuracy:** 93.86%
- **Final Training Loss:** 0.146

---

## Installation

```bash
pip install transformers datasets evaluate scikit-learn torch
```

---

## Usage

### Quick Start

```python
from transformers import pipeline

# Load the trained model
classifier = pipeline(
    "text-classification",
    model="./sarcasm_model",
    tokenizer="./sarcasm_model",
)

# Test examples
test_inputs = [
    "I'm absolutely thrilled to be stuck in traffic again.",
    "The weather is nice and sunny today.",
    "Oh great, another email from the boss with more tasks.",
]

for sentence in test_inputs:
    result = classifier(sentence)[0]
    label = "Sarcastic" if result["label"] == "LABEL_1" else "Not Sarcastic"
    print(f"'{sentence}' → {label} (Confidence: {result['score']:.2f})")
```

### Manual Model Loading

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("./sarcasm_model")
tokenizer = AutoTokenizer.from_pretrained("./sarcasm_model")

# Tokenize input
text = "Oh wonderful, another Monday morning!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = outputs.logits.argmax(dim=1).item()

label_mapping = {0: "Not Sarcastic", 1: "Sarcastic"}
confidence = predictions[0][predicted_class].item()
print(f"Prediction: {label_mapping[predicted_class]} (Confidence: {confidence:.2f})")
```

---

## Training Configuration

### Model Parameters

- **Base Model:** `bert-base-uncased`
- **Number of Labels:** 2 (binary classification)
- **Max Sequence Length:** 128 tokens
- **Tokenization:** WordPiece with padding and truncation

### Training Arguments

- **Learning Rate:** 2e-5
- **Batch Size:** 16 (training), 32 (evaluation)
- **Epochs:** 3
- **Weight Decay:** 0.01
- **Evaluation Strategy:** Every epoch
- **Optimizer:** AdamW (default)

These arguments map onto the Hugging Face `Trainer` API as shown in the sketch below.
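A sketch of how this configuration might be wired up, assuming the `train_ds`/`eval_ds` splits from the dataset sketch above. The output directory, seed-free metric wiring, and variable names are illustrative, not the repository's exact notebook code.

```python
import numpy as np
import evaluate
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # WordPiece tokenization with padding/truncation to 128 tokens
    return tokenizer(batch["headline"], padding="max_length", truncation=True, max_length=128)

# train_ds / eval_ds come from the dataset sketch above
train_ds = train_ds.map(tokenize, batched=True).rename_column("is_sarcastic", "labels")
eval_ds = eval_ds.map(tokenize, batched=True).rename_column("is_sarcastic", "labels")

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1), references=labels)

args = TrainingArguments(
    output_dir="./sarcasm_model",    # matches the file layout below
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",           # "evaluation_strategy" on older transformers versions
)

trainer = Trainer(
    model=model,                     # AdamW is the Trainer default optimizer
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("./sarcasm_model")
tokenizer.save_pretrained("./sarcasm_model")
```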
### Hardware Requirements

- **GPU:** NVIDIA Tesla T4 (or equivalent)
- **Memory:** ~4 GB GPU memory for training
- **Training Time:** ~18 minutes for 3 epochs

---

## Model Architecture

The model uses BERT's transformer architecture with:

- **Encoder Layers:** 12
- **Attention Heads:** 12
- **Hidden Size:** 768
- **Vocabulary Size:** 30,522
- **Classification Head:** Linear layer (768 → 2)

---

## File Structure

```
sarcasm-detection/
├── sarcasm_model/              # Main fine-tuned model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── vocab.txt
│   └── tokenizer.json
├── quantized-model/            # Float16 quantized version
│   ├── config.json
│   ├── model.safetensors
│   └── tokenizer files...
├── logs/                       # Training logs
├── sarcasm-detection.ipynb     # Training notebook
└── README.md                   # This file
```

---

## Quantization

A quantized version of the model is available for deployment optimization:

```python
from transformers import AutoModelForSequenceClassification
import torch

# Load the quantized checkpoint and run it in Float16
quantized_model = AutoModelForSequenceClassification.from_pretrained("./quantized-model")
quantized_model = quantized_model.to(dtype=torch.float16)
```

**Benefits of Quantization:**

- **Reduced Memory Usage:** ~50% smaller model size
- **Faster Inference:** Improved speed on compatible hardware
- **Minimal Accuracy Loss:** Maintains classification performance

---

## Limitations

- **Domain Specificity:** Trained primarily on headlines; may not generalize perfectly to other text types
- **Context Dependency:** Sarcasm detection can be highly context-dependent and subjective
- **Cultural Nuances:** May not capture sarcasm patterns from different cultural contexts
- **Short Text Focus:** Optimized for headline-length text (typically under 128 tokens)

---

## Potential Improvements

- **Data Augmentation:** Include more diverse sarcasm examples
- **Ensemble Methods:** Combine multiple models for better accuracy
- **Context Integration:** Incorporate additional context beyond the headline
- **Multi-language Support:** Extend to other languages
- **Real-time Processing:** Optimize for streaming applications

---

## Applications

- **Social Media Monitoring:** Detect sarcastic comments and posts
- **Content Moderation:** Identify potentially misleading sarcastic content
- **Sentiment Analysis Enhancement:** Improve sentiment classification accuracy
- **News Analysis:** Analyze editorial tone and bias in headlines
- **Customer Feedback:** Better understand customer sentiment in reviews

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{sarcasm_detection_bert,
  title={BERT-based Sarcasm Detection for Headlines},
  author={Your Name},
  year={2025},
  note={Fine-tuned BERT model for binary sarcasm classification}
}
```

---

## Contributing

Contributions are welcome! Please feel free to:

- Report bugs or issues
- Suggest improvements
- Add new features
- Improve documentation

---

## License

This project is licensed under the MIT License. The underlying BERT model follows Google's Apache 2.0 license.

---

## Acknowledgments

- **Hugging Face** for the Transformers library
- **Google Research** for the original BERT model
- **Kaggle** for providing the Sarcasm Headlines Dataset
- **PyTorch** for the deep learning framework