	# Sarcasm Detection with BERT

	This repository contains a BERT model fine-tuned to detect sarcasm in news headlines and other short text. It reaches 93.86% accuracy on the held-out evaluation split when classifying content as sarcastic or non-sarcastic.

	---

	## Model Details

	- Model Name: BERT-Base-Uncased Fine-tuned for Sarcasm Detection
	- Model Architecture: BERT Base (110M parameters)
	- Task: Binary Classification (Sarcastic vs Non-Sarcastic)
	- Dataset: Sarcasm Headlines Dataset
	- Quantization: Float16 (for optimized deployment)
	- Fine-tuning Framework: Hugging Face Transformers

	---

	## Dataset

	The model was trained on the Sarcasm Headlines Dataset, which contains:
	- Total Samples: 26,709 headlines
	- Features:
	  - `headline`: The text content to classify
	  - `is_sarcastic`: Binary label (1 for sarcastic, 0 for non-sarcastic)
	- Train/Test Split: 90% training, 10% evaluation
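
A minimal sketch of the 90/10 split, using only the standard library. It assumes the dataset is stored as JSON Lines (one object per line) with the `headline` and `is_sarcastic` fields listed above. The helper names, the seed, and the shuffling strategy are illustrative and not taken from the training notebook.

```python
import json
import random


def load_headlines(lines):
    """Parse JSON Lines records, keeping only the two fields used for training."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        records.append({"headline": row["headline"], "label": int(row["is_sarcastic"])})
    return records


def split_train_eval(records, eval_fraction=0.1, seed=42):
    """Shuffle a copy of the records and hold out `eval_fraction` for evaluation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Tiny in-memory sample standing in for the real file (hypothetical headlines).
sample_lines = [
    json.dumps({"headline": f"example headline {i}", "is_sarcastic": i % 2})
    for i in range(20)
]
records = load_headlines(sample_lines)
train_records, eval_records = split_train_eval(records)
print(len(train_records), len(eval_records))  # 18 2
```

Applied to 26,709 records, this sketch holds out 2,671 for evaluation and keeps 24,038 for training.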

	---

	## Performance Metrics

	| Epoch | Training Loss | Validation Loss | Accuracy |
	|-------|---------------|-----------------|----------|
	| 1 | 0.2048 | 0.1821 | 92.96% |
	| 2 | 0.1138 | 0.2792 | 91.01% |
	| 3 | 0.0586 | 0.2372 | 93.86% |

	Final Model Performance:
	- Best Accuracy: 93.86%
	- Final Training Loss: 0.146
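
The accuracy column in the table above can be reproduced from raw model outputs with a metric function along these lines. This is a dependency-free sketch: the function name and the toy logits are assumptions for illustration, and the notebook may instead rely on the `evaluate` or `scikit-learn` packages listed under Installation.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the reference label."""
    if not labels or len(logits) != len(labels):
        raise ValueError("logits and labels must be non-empty and of equal length")
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += int(predicted == label)
    return correct / len(labels)


# Toy batch: two logits per example (index 0 = not sarcastic, 1 = sarcastic).
toy_logits = [[2.1, -1.3], [-0.4, 1.7], [0.2, 0.9], [1.5, 0.3]]
toy_labels = [0, 1, 0, 0]
print(f"{accuracy_from_logits(toy_logits, toy_labels):.2%}")  # 75.00%
```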

	---

	## Installation

	```bash
	pip install transformers datasets evaluate scikit-learn torch
	```

	---

	## Usage

	### Quick Start

	```python
	from transformers import pipeline

	# Load the fine-tuned model and its tokenizer from the local directory
	classifier = pipeline(
	    "text-classification",
	    model="./sarcasm_model",
	    tokenizer="./sarcasm_model",
	)

	# Test examples
	test_inputs = [
	    "I'm absolutely thrilled to be stuck in traffic again.",
	    "The weather is nice and sunny today.",
	    "Oh great, another email from the boss with more tasks.",
	]

	for sentence in test_inputs:
	    result = classifier(sentence)[0]
	    # LABEL_1 is the default name for class index 1 (sarcastic)
	    label = "Sarcastic" if result["label"] == "LABEL_1" else "Not Sarcastic"
	    print(f"'{sentence}' → {label} (Confidence: {result['score']:.2f})")
	```

	### Manual Model Loading

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load model and tokenizer
	model = AutoModelForSequenceClassification.from_pretrained("./sarcasm_model")
	tokenizer = AutoTokenizer.from_pretrained("./sarcasm_model")

	# Tokenize input
	text = "Oh wonderful, another Monday morning!"
	inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)

	# Inference
	with torch.no_grad():
	    outputs = model(**inputs)
	    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
	    predicted_class = outputs.logits.argmax(dim=1).item()

	label_mapping = {0: "Not Sarcastic", 1: "Sarcastic"}
	confidence = predictions[0][predicted_class].item()
	print(f"Prediction: {label_mapping[predicted_class]} (Confidence: {confidence:.2f})")
	```
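
The confidence printed above is the softmax probability of the predicted class. As a worked illustration, the snippet below applies the same computation to a made-up pair of logits in plain Python. The logit values are hypothetical, not actual model outputs.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


label_mapping = {0: "Not Sarcastic", 1: "Sarcastic"}
example_logits = [-1.0, 2.0]  # hypothetical scores for classes 0 and 1

probs = softmax(example_logits)
predicted_class = probs.index(max(probs))
print(f"{label_mapping[predicted_class]} (Confidence: {probs[predicted_class]:.2f})")
# Sarcastic (Confidence: 0.95)
```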

	---

	## Training Configuration

	### Model Parameters
	- Base Model: `bert-base-uncased`
	- Number of Labels: 2 (binary classification)
	- Max Sequence Length: 128 tokens
	- Tokenization: WordPiece with padding and truncation
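
To make the padding and truncation settings concrete, here is a simplified sketch of what happens to a sequence of token IDs at a fixed length. It assumes the standard `bert-base-uncased` special-token IDs (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102). The function name and the example IDs are illustrative, and the real tokenizer handles all of this internally.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # bert-base-uncased special-token IDs


def encode_fixed_length(wordpiece_ids, max_length=128):
    """Add [CLS]/[SEP], truncate to max_length, then right-pad with [PAD]."""
    body = wordpiece_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 marks real tokens
    padding = max_length - len(input_ids)
    return input_ids + [PAD_ID] * padding, attention_mask + [0] * padding


# Arbitrary placeholder IDs, with max_length shortened to 8 for readability.
ids, mask = encode_fixed_length([2000, 2001, 2002], max_length=8)
print(ids)   # [101, 2000, 2001, 2002, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

This corresponds to padding every input to the maximum length. With `padding=True`, as in the Usage examples, the tokenizer pads only up to the longest sequence in the batch.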

	### Training Arguments
	- Learning Rate: 2e-5
	- Batch Size: 16 (training), 32 (evaluation)
	- Epochs: 3
	- Weight Decay: 0.01
	- Evaluation Strategy: Every epoch
	- Optimizer: AdamW (default)
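
The settings above map onto Hugging Face `TrainingArguments` keywords roughly as follows. This is a sketch rather than the notebook's actual configuration: the `output_dir` value is a placeholder, the training-set size of 24,038 assumes a 90% split of the 26,709 headlines, and the step estimate assumes a single device without gradient accumulation.

```python
import math

# Keyword names follow transformers.TrainingArguments; older releases
# spell `eval_strategy` as `evaluation_strategy`.
training_kwargs = {
    "output_dir": "./results",  # placeholder path (assumption)
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "eval_strategy": "epoch",
}
# With transformers installed: TrainingArguments(**training_kwargs)

train_examples = 24_038  # assumed size of the 90% training split
steps_per_epoch = math.ceil(
    train_examples / training_kwargs["per_device_train_batch_size"]
)
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]
print(steps_per_epoch, total_steps)  # 1503 4509
```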

	### Hardware Requirements
	- GPU: NVIDIA Tesla T4 (or equivalent)
	- Memory: ~4GB GPU memory for training
	- Training Time: ~18 minutes for 3 epochs

	---

	## Model Architecture

	The model uses BERT's transformer architecture with:
	- Encoder Layers: 12
	- Attention Heads: 12
	- Hidden Size: 768
	- Vocabulary Size: 30,522
	- Classification Head: Linear layer (768 → 2)
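
As a sanity check on the 110M figure, the parameter count can be derived from the dimensions listed above. The sketch assumes the remaining `bert-base-uncased` defaults that this README does not list: 512 position embeddings, 2 token-type embeddings, a feed-forward size of 3,072, and a pooler layer before the classification head.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def bert_classifier_params(vocab=30_522, hidden=768, layers=12, ffn=3_072,
                           max_positions=512, type_vocab=2, num_labels=2):
    """Count trainable parameters of a BERT encoder with a linear classifier."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * linear_params(hidden, hidden) + layer_norm  # Q, K, V, output
    feed_forward = linear_params(hidden, ffn) + linear_params(ffn, hidden) + layer_norm
    encoder = layers * (attention + feed_forward)
    pooler = linear_params(hidden, hidden)
    head = linear_params(hidden, num_labels)
    return embeddings + encoder + pooler + head


print(f"{bert_classifier_params():,}")  # 109,483,778
```

The number of attention heads does not appear in the calculation because the heads partition the 768-dimensional projections rather than adding weights.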

	---

	## File Structure

	```
	sarcasm-detection/
	├── sarcasm_model/ # Main fine-tuned model
	│ ├── config.json
	│ ├── model.safetensors
	│ ├── tokenizer_config.json
	│ ├── special_tokens_map.json
	│ ├── vocab.txt
	│ └── tokenizer.json
	├── quantized-model/ # Float16 quantized version
	│ ├── config.json
	│ ├── model.safetensors
	│ └── tokenizer files...
	├── logs/ # Training logs
	├── sarcasm-detection.ipynb # Training notebook
	└── README.md # This file
	```

	---

	## Quantization

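	A float16 (half-precision) copy of the model is available in `quantized-model/` for lighter deployment. Strictly speaking, this is a reduced-precision cast of the weights rather than integer quantization: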

	```python
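	import torch
	from transformers import AutoModelForSequenceClassification

	# Load the quantized model and make sure its weights are in float16
	quantized_model = AutoModelForSequenceClassification.from_pretrained("./quantized-model")
	quantized_model = quantized_model.to(dtype=torch.float16)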
	```

	Benefits of Quantization:
	- Reduced Memory Usage: ~50% smaller model size
	- Faster Inference: Improved speed on hardware with native float16 support (for example, recent NVIDIA GPUs)
	- Minimal Accuracy Loss: Maintains classification performance
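
The "~50%" figure follows from storage per parameter: float32 uses 4 bytes and float16 uses 2. A rough estimate, taking the rounded 110M parameter count from Model Details and ignoring file-format overhead (both simplifications):

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_size_mb(num_params, dtype):
    """Approximate size of the raw weights in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


num_params = 110_000_000  # rounded figure from Model Details
fp32_mb = weights_size_mb(num_params, "float32")
fp16_mb = weights_size_mb(num_params, "float16")
print(f"float32: {fp32_mb:.0f} MB, float16: {fp16_mb:.0f} MB")
# float32: 440 MB, float16: 220 MB
```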

	---

	## Limitations

	- Domain Specificity: Trained primarily on headlines; may not generalize perfectly to other text types
	- Context Dependency: Sarcasm detection can be highly context-dependent and subjective
	- Cultural Nuances: May not capture sarcasm patterns from different cultural contexts
	- Short Text Focus: Optimized for headline-length text; inputs longer than 128 tokens are truncated

	---

	## Potential Improvements

	- Data Augmentation: Include more diverse sarcasm examples
	- Ensemble Methods: Combine multiple models for better accuracy
	- Context Integration: Incorporate additional context beyond the headline
	- Multi-language Support: Extend to other languages
	- Real-time Processing: Optimize for streaming applications

	---

	## Applications

	- Social Media Monitoring: Detect sarcastic comments and posts
	- Content Moderation: Identify potentially misleading sarcastic content
	- Sentiment Analysis Enhancement: Improve sentiment classification accuracy
	- News Analysis: Analyze editorial tone and bias in headlines
	- Customer Feedback: Better understand customer sentiment in reviews

	---

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@misc{sarcasm_detection_bert,
	title={BERT-based Sarcasm Detection for Headlines},
	author={Your Name},
	year={2025},
	note={Fine-tuned BERT model for binary sarcasm classification}
	}
	```

	---

	## Contributing

	Contributions are welcome! Please feel free to:
	- Report bugs or issues
	- Suggest improvements
	- Add new features
	- Improve documentation

	---

	## License

	This project is licensed under the MIT License. The underlying BERT model follows Google's Apache 2.0 license.

	---

	## Acknowledgments

	- Hugging Face for the Transformers library
	- Google Research for the original BERT model
	- Kaggle for providing the Sarcasm Headlines Dataset
	- PyTorch for the deep learning framework