Email Classifier

A fine-tuned DistilBERT model for binary classification of emails as productive or unproductive. This model is designed to automatically categorize emails to help prioritize important communications.

Model Details

Model Description

  • Model Type: Text Classification (Binary)
  • Base Model: distilbert-base-uncased
  • Task: Email productivity classification
  • Language: Portuguese and English (multilingual)
  • Labels:
    • 0: Unproductive (emails that don't require action)
    • 1: Productive (emails that require action or response)

Model Architecture

  • Architecture: DistilBERT (Distilled BERT)
  • Max Sequence Length: 512 tokens
  • Number of Labels: 2
  • Output: Binary classification with confidence scores

Intended Use

Primary Use Cases

  • Email Prioritization: Automatically identify emails that require immediate attention
  • Productivity Tools: Integrate into email management systems to filter and organize messages
  • Auto-Reply Systems: Determine which emails should trigger automated responses
  • Email Analytics: Analyze email patterns and productivity metrics

Out-of-Scope Use Cases

  • Spam detection (this model focuses on productivity, not spam)
  • Sentiment analysis (positive/negative emotions)
  • Topic classification (specific email topics)
  • Language detection (assumes input language is known)

Training Details

Training Data

The model was trained on a synthetic dataset of ~6,000 emails (balanced between productive and unproductive) generated using templates that simulate real-world email scenarios. The training data includes:

  • Productive Emails: Technical support requests, meeting requests, information requests, urgent problems, project discussions, etc.
  • Unproductive Emails: Thank you messages, congratulations, holiday greetings, status updates without action required, confirmations, etc.
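The template-based generation described above can be sketched roughly as follows. The templates, placeholders, and filler values here are invented for illustration; the actual generation code and templates are not published:

```python
import random

# Hypothetical templates; the real generation templates are not published.
PRODUCTIVE_TEMPLATES = [
    "Hi, I need {urgency} support: {problem}.",
    "Could we schedule a meeting to discuss {topic}?",
]
UNPRODUCTIVE_TEMPLATES = [
    "Thank you for the {adjective} work!",
    "Happy holidays to the whole {team} team!",
]
FILLERS = {
    "urgency": ["urgent", "immediate"],
    "problem": ["the system is down", "login is failing"],
    "topic": ["the project timeline", "the Q3 budget"],
    "adjective": ["excellent", "great"],
    "team": ["support", "engineering"],
}

def generate(template: str) -> str:
    """Fill each placeholder with a randomly chosen variant."""
    out = template
    for key, options in FILLERS.items():
        out = out.replace("{" + key + "}", random.choice(options))
    return out

def make_dataset(n_per_class: int):
    """Return a shuffled, balanced list of (text, label) pairs.

    Labels follow the model's convention: 1 = productive, 0 = unproductive.
    """
    data = []
    for _ in range(n_per_class):
        data.append((generate(random.choice(PRODUCTIVE_TEMPLATES)), 1))
        data.append((generate(random.choice(UNPRODUCTIVE_TEMPLATES)), 0))
    random.shuffle(data)
    return data
```

Balancing the two classes at generation time, as above, is what lets the model card report a clean 80/20 train/test split without additional rebalancing.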

Training Procedure

  • Training Framework: Hugging Face Transformers
  • Optimizer: AdamW
  • Learning Rate: 2e-5
  • Batch Size: 8
  • Epochs: 5 (with early stopping)
  • Early Stopping Patience: 3 epochs
  • Evaluation Metric: F1 score
  • Train/Test Split: 80/20
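Under the hyperparameters listed above, a comparable fine-tuning setup with the Hugging Face Trainer might look like the following config sketch. Dataset loading and metric computation are omitted, and everything beyond the listed hyperparameters (output directory, strategy names) is an assumption, not the author's actual training script:

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="email-classifier",
    learning_rate=2e-5,               # AdamW is the Trainer's default optimizer
    per_device_train_batch_size=8,
    num_train_epochs=5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # required for early stopping
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    # train_dataset=..., eval_dataset=..., compute_metrics=... go here
)
```

Note that `EarlyStoppingCallback` requires `load_best_model_at_end=True` and a `metric_for_best_model`, which is consistent with the F1-based early stopping described above.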

Training Features

  • Data Augmentation: Template-based generation with variations
  • Anti-Overfitting Techniques:
    • Context shuffling (gratitude before/after requests)
    • Negation injection
    • Order inversion
    • Noise injection
  • Multilingual Support: Portuguese and English emails in training data
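The anti-overfitting techniques above can be illustrated with small string transforms. These are simplified illustrations of each technique, not the actual augmentation code:

```python
import random

def shuffle_context(request: str, gratitude: str) -> str:
    """Context shuffling: place the gratitude line before or after the request."""
    parts = [request, gratitude]
    random.shuffle(parts)
    return " ".join(parts)

def inject_negation(text: str) -> str:
    """Negation injection (illustrative): flip a simple modal phrase."""
    return text.replace("I need", "I do not need")

def invert_order(text: str) -> str:
    """Order inversion: reverse the sentence order within the email."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(reversed(sentences)) + "."

def inject_noise(text: str, rate: float = 0.05) -> str:
    """Noise injection: randomly drop a small fraction of characters."""
    return "".join(ch for ch in text if random.random() > rate)
```

The point of transforms like these is to break spurious correlations (e.g. "thanks" always appearing last in unproductive emails) so the classifier attends to the request content itself.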

Evaluation

Metrics

The model was evaluated on a held-out test set with the following metrics:

  • Accuracy: ~0.95
  • F1 Score: ~0.95
  • Precision: ~0.95
  • Recall: ~0.95

Note: Exact metrics vary between training runs; the values above are approximate.
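For reference, all four metrics can be computed from raw binary predictions without any extra libraries:

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (0/1).

    Precision and recall treat class 1 (productive) as the positive class.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

On a balanced test set like this one, accuracy and F1 track each other closely, which is why the reported numbers are similar across all four metrics.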

How to Use

Installation

pip install transformers torch

Basic Usage

Using Pipeline

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="MiguelJeronimoOliveira/email-classifier"
)

# Classify an email
result = classifier("Hi, I need urgent technical support. The system is down.")
print(result)
# [{'label': 'LABEL_1', 'score': 0.98}]

result = classifier("Thank you for the excellent work!")
print(result)
# [{'label': 'LABEL_0', 'score': 0.95}]

Using Model Directly

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "MiguelJeronimoOliveira/email-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare input
email_text = "Hi, I would like to schedule a meeting to discuss the project timeline."
inputs = tokenizer(
    email_text,
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = predictions.argmax(dim=-1).item()
    confidence = predictions[0][predicted_class].item()

# Interpret result
label = "productive" if predicted_class == 1 else "unproductive"
print(f"Classification: {label} (confidence: {confidence:.2f})")

Label Mapping

  • LABEL_0 or 0: Unproductive
  • LABEL_1 or 1: Productive
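Since the hub config may expose only the generic LABEL_0/LABEL_1 names, a small helper can translate pipeline output into readable labels. The helper and the readable names below are illustrative, matching the mapping above:

```python
# Mapping from the model card above; LABEL_0 = unproductive, LABEL_1 = productive.
ID2LABEL = {0: "unproductive", 1: "productive"}

def readable(pipeline_result: dict) -> str:
    """Convert e.g. {'label': 'LABEL_1', 'score': 0.98} to 'productive (0.98)'."""
    idx = int(pipeline_result["label"].split("_")[-1])
    return f"{ID2LABEL[idx]} ({pipeline_result['score']:.2f})"
```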

Limitations and Bias

Known Limitations

  1. Language Coverage: While trained on Portuguese and English, performance may vary for other languages
  2. Domain Specificity: Model is optimized for business/professional emails; may not perform well on personal emails
  3. Context Dependency: Classification is based on email content only; doesn't consider sender, subject line, or metadata
  4. Synthetic Training Data: Model was trained on synthetic data, which may not capture all real-world email patterns

Potential Biases

  • The model may have biases based on the training data distribution
  • Cultural and linguistic nuances may affect classification accuracy
  • Technical terminology may be over-represented in productive emails

Recommendations

  • Fine-tune on your specific email domain for best results
  • Consider combining with other signals (sender, subject, metadata)
  • Regularly evaluate and retrain with new data
  • Use confidence thresholds to filter uncertain predictions
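The confidence-threshold recommendation can be implemented as a simple post-processing step over pipeline results. This is a sketch; the 0.8 threshold is an arbitrary example and should be tuned on your own data:

```python
def filter_confident(results, threshold: float = 0.8):
    """Split pipeline results into confident predictions and ones needing review."""
    confident, uncertain = [], []
    for r in results:
        (confident if r["score"] >= threshold else uncertain).append(r)
    return confident, uncertain
```

Routing the `uncertain` bucket to manual review also gives you a natural stream of labeled corrections for the periodic retraining recommended above.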

Ethical Considerations

Privacy

  • This model processes email content; ensure compliance with privacy regulations (GDPR, etc.)
  • Consider data anonymization before processing
  • Be transparent about automated email classification to users
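A minimal anonymization pass before classification might look like the following. The regex patterns are illustrative only and are not a complete PII scrub or a substitute for GDPR compliance:

```python
import re

# Illustrative patterns; real deployments need broader PII coverage
# (names, addresses, account numbers, etc.).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def anonymize(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Placeholder tokens like `[EMAIL]` keep the sentence structure intact, so classification accuracy is less affected than if the spans were deleted outright.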

Fairness

  • Monitor for potential biases in classification across different email types
  • Ensure the model doesn't systematically misclassify emails from certain groups or domains
  • Provide mechanisms for users to correct misclassifications

Citation

If you use this model in your research or application, please cite:

@misc{email-classifier-2024,
  title={Email Classifier: A Fine-tuned DistilBERT for Productivity Classification},
  author={Miguel Jeronimo Oliveira},
  year={2024},
  howpublished={\url{https://huggingface.co/MiguelJeronimoOliveira/email-classifier}}
}

Model Card Contact

For questions, issues, or contributions, please contact the author.

License

This model is licensed under the Apache 2.0 License. See the LICENSE file for more details.

Acknowledgments

  • Built on top of DistilBERT by Hugging Face
  • Training infrastructure supported by Hugging Face Transformers
  • Part of the AutoU Case email management system

Model Version: 1.0.0
Last Updated: 2024
Base Model: distilbert-base-uncased
Framework: PyTorch
