---
license: apache-2.0
---

# Email Classifier

This project implements an email classification model that assigns each email to a specific category. Emails are embedded with SBERT `all-MiniLM-L6-v2`, and a small sequential neural network performs the final classification.

## Model Description
- **Architecture:** `SBERT (384‑d) → Dense(256, ReLU) → Dropout(0.4) → Dense(128, ReLU) → Dropout(0.4) → Softmax(5)`
- **Frameworks:** TensorFlow 2.17, sentence-transformers
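The architecture above can be sketched in Keras. This is a minimal reconstruction from the description; the optimizer and loss are assumptions, not confirmed training settings:

```python
import tensorflow as tf

# Classification head as described: 384-d SBERT embeddings in, 5 class probabilities out.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(384,)),              # all-MiniLM-L6-v2 embedding size
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(5, activation="softmax"),   # one unit per category
])
# Assumed compile settings (typical for integer-labeled multi-class training):
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```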

## Training Data & Preprocessing
- **Emails:** 4,954 college emails, manually labeled into `[Academics, Clubs, Internships, Others, Talks]`
- **Split:** 80% train / 20% test
- **Embedding & Labeling:**
  1. Each email was embedded with `all-MiniLM-L6-v2` (SBERT).
  2. A small "prototype" set of example sentences was written for each category.
  3. For every email, cosine similarities were computed between its SBERT embedding and each prototype embedding.
  4. The email was assigned to the category whose prototype had the **highest** cosine score, subject to a minimum threshold of 0.4.
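The prototype-labeling step can be sketched with plain NumPy. Toy 4-d vectors stand in for real 384-d SBERT embeddings, and the fallback to `Others` when no score clears the threshold is an assumption, not documented behavior:

```python
import numpy as np

labels = ["Academics", "Clubs", "Internships", "Others", "Talks"]

def assign_label(email_emb, prototype_embs, threshold=0.4):
    """Pick the category whose prototype has the highest cosine similarity;
    fall back to 'Others' when no score clears the threshold (assumed)."""
    email_emb = email_emb / np.linalg.norm(email_emb)
    protos = prototype_embs / np.linalg.norm(prototype_embs, axis=1, keepdims=True)
    sims = protos @ email_emb                 # cosine similarity per category
    best = int(np.argmax(sims))
    return labels[best] if sims[best] >= threshold else "Others"

# Toy example: one stand-in prototype embedding per category.
prototypes = np.eye(5, 4) + 0.1               # shape (5, 4)
```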

## Evaluation

The model was tested on **991** college‑email samples. Below are the per‑class precision, recall, F1‑score and support:

| Class | Label       | Support | Precision | Recall | F1‑Score |
|:-----:|-------------|--------:|----------:|-------:|---------:|
| 0     | Academics   |     200 |     0.92  |  0.97  |   0.94   |
| 1     | Clubs       |     236 |     0.94  |  0.96  |   0.95   |
| 2     | Internships |     143 |     0.95  |  0.98  |   0.97   |
| 3     | Others      |     200 |     0.95  |  0.83  |   0.89   |
| 4     | Talks       |     212 |     0.93  |  0.94  |   0.93   |

**Aggregate metrics**

| Metric       | Accuracy | Precision | Recall | F1‑Score |
|:-------------|---------:|----------:|-------:|---------:|
| Overall      |     0.94 |       —   |    —   |      —   |
| Macro avg    |      —   |     0.94  |  0.94  |   0.94   |
| Weighted avg |      —   |     0.94  |  0.94  |   0.93   |
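The aggregate rows follow directly from the per-class table: macro averages are unweighted means over classes, while weighted averages weight each class by its support. A quick NumPy check:

```python
import numpy as np

# Per-class values from the table above.
support = np.array([200, 236, 143, 200, 212])
f1      = np.array([0.94, 0.95, 0.97, 0.89, 0.93])

macro_f1    = f1.mean()                              # unweighted mean over classes
weighted_f1 = (f1 * support).sum() / support.sum()   # mean weighted by class support

print(round(macro_f1, 2), round(weighted_f1, 2))     # prints: 0.94 0.93
```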

### Confusion Matrix

![Confusion Matrix](cm.png)

## Usage

### 1. Install dependencies
```bash
pip install tensorflow sentence-transformers huggingface_hub
```
### 2. Load the model & embedder
```python
from sentence_transformers import SentenceTransformer
import tensorflow as tf
from huggingface_hub import hf_hub_download

# 1) Load the SBERT embedder
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 2) Load the fine-tuned classifier
model_file = hf_hub_download(
    repo_id="skgezhil2005/email_classifier",
    filename="model_v2.keras",  # replace with your model file
)
model = tf.keras.models.load_model(model_file)

# 3) Define label names (in the same order used during training)
labels = ["Academics", "Clubs", "Internships", "Others", "Talks"]
```
### 3. Inference Helper

```python
import numpy as np

def classify_email(text: str) -> str:
    # Compute a 1x384 SBERT embedding
    emb = embedder.encode(text, convert_to_tensor=False)
    emb = emb.reshape(1, -1)
    # Predict probabilities and pick the highest-scoring class
    prediction = model.predict(emb)
    pred_idx = int(np.argmax(prediction[0]))
    return labels[pred_idx]

# Example usage
print(classify_email("Reminder: the robotics club meets Thursday at 5 pm."))
```