---
language: en
tags:
  - text-classification
  - onnx
  - job-classification
  - it
license: mit
base_model: intfloat/e5-base-v2
---

# IT vs Non-IT Job Title Classifier

Binary classifier that predicts whether a job title describes an IT/tech role. It pairs [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) embeddings with a logistic regression head. The head is exported to ONNX, so the classification step itself needs only ONNX Runtime at inference time.

## Repository contents

| File | Description |
|---|---|
| `e5_it_classifier.onnx` | Logistic regression classifier head (ONNX) |

The encoder (`intfloat/e5-base-v2`) is loaded separately at inference time — it is not bundled here since it is a public model.

## How it works

1. The job title is prefixed with `"query: "` — required by the e5-v2 instruction format
2. The prefixed title is encoded by `intfloat/e5-base-v2` with mean pooling and L2 normalization, producing a 768-dim embedding
3. The embedding is passed through the logistic regression ONNX model
4. The model returns a `probabilities` output of shape `[1, 2]`: index `0` is the Non-IT probability and index `1` is the IT probability
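
The arithmetic behind steps 2 to 4 can be sketched in plain Python. This is only an illustration on a toy 3-dimensional example: the token vectors, `weights`, and `bias` below are made-up values, not the trained parameters, and the real pipeline uses 768-dimensional vectors produced by the encoder.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vec):
    """Scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def predict_proba(embedding, weights, bias):
    """Binary logistic regression: returns [P(Non-IT), P(IT)]."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    p_it = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p_it, p_it]

# Toy input: two real tokens and one padding token (hypothetical values).
tokens = [[1.0, 2.0, 2.0], [3.0, 2.0, 2.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)      # padding row is ignored
embedding = l2_normalize(pooled)      # unit-length vector
probs = predict_proba(embedding, weights=[0.5, -0.25, 1.0], bias=0.0)
print(pooled, probs)
```

In practice `sentence-transformers` and the ONNX model perform these steps; the sketch only shows what each one computes.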

## Training

- **Encoder:** `intfloat/e5-base-v2` via `sentence-transformers`, embeddings L2-normalized
- **Classifier:** `sklearn.linear_model.LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced')`
- **Input:** job title only
- **Labels:** `1` = IT role, `0` = Non-IT role
- **Class balancing:** enabled via `class_weight='balanced'` to handle uneven label distribution
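
To make the effect of `class_weight='balanced'` concrete, the sketch below reproduces the weighting formula documented by scikit-learn, `n_samples / (n_classes * count_per_class)`, in plain Python. The 20/80 label split is a hypothetical example, not the actual training distribution.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weights following scikit-learn's 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical label distribution: 20 IT titles, 80 Non-IT titles.
labels = [1] * 20 + [0] * 80
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so errors on it cost more during fitting.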

## Inference

### Python

```python
from sentence_transformers import SentenceTransformer
import onnxruntime as ort
import numpy as np

encoder = SentenceTransformer("intfloat/e5-base-v2")
sess = ort.InferenceSession("e5_it_classifier.onnx")

def classify(title: str) -> dict:
    # E5 expects the "query: " prefix; embeddings are L2-normalized as in training
    emb = encoder.encode(["query: " + title], normalize_embeddings=True)
    # probs has shape (1, 2): column 0 is Non-IT, column 1 is IT
    probs = sess.run(["probabilities"], {"input": emb.astype(np.float32)})[0]
    return {
        "label": "IT" if probs[0][1] > probs[0][0] else "Non-IT",
        "it_probability": float(probs[0][1]),
    }

print(classify("Senior Software Engineer"))  # IT
print(classify("Regional Sales Manager"))    # Non-IT
```

### JavaScript / TypeScript (Bun or Node)

```typescript
import { pipeline } from "@huggingface/transformers";
import * as ort from "onnxruntime-node";

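// Transformers.js v3 (@huggingface/transformers) selects precision with `dtype`; "fp32" loads the unquantized weights
const extractor = await pipeline("feature-extraction", "intfloat/e5-base-v2", { dtype: "fp32" });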
const session = await ort.InferenceSession.create("./e5_it_classifier.onnx");

async function classify(title: string) {
  const output = await extractor("query: " + title, { pooling: "mean", normalize: true });

  const results = await session.run({
    input: new ort.Tensor("float32", output.data as Float32Array, [1, 768]),
  });

  const probs = results.probabilities.data as Float32Array;
  return {
    label: probs[1] > probs[0] ? "IT" : "Non-IT",
    it_probability: probs[1],
  };
}

console.log(await classify("Senior Software Engineer")); // IT
console.log(await classify("Regional Sales Manager"));   // Non-IT
```

```bash
bun add @huggingface/transformers onnxruntime-node
# or
npm install @huggingface/transformers onnxruntime-node
```

## Intended use

Designed for automated job pipeline filtering: quickly classifying job titles as IT or non-IT before downstream enrichment or processing steps. Because it needs only the job title and no description, it works well as a lightweight pre-filter.
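
A pre-filter step of this kind might look like the sketch below. The `prefilter_it_jobs` helper, the job record layout, and the `fake_classify` stand-in are all hypothetical. In a real pipeline you would pass the `classify` function from the Inference section instead.

```python
def prefilter_it_jobs(jobs, classify, threshold=0.5):
    """Keep only jobs whose title scores at or above the IT threshold."""
    kept = []
    for job in jobs:
        result = classify(job["title"])
        if result["it_probability"] >= threshold:
            # Carry the score along for downstream steps.
            kept.append({**job, "it_probability": result["it_probability"]})
    return kept

def fake_classify(title):
    """Stand-in for the real classifier so the sketch runs on its own."""
    score = 0.9 if "engineer" in title.lower() else 0.1
    return {"label": "IT" if score > 0.5 else "Non-IT", "it_probability": score}

jobs = [
    {"id": 1, "title": "Senior Software Engineer"},
    {"id": 2, "title": "Regional Sales Manager"},
]
it_jobs = prefilter_it_jobs(jobs, fake_classify)
print(it_jobs)
```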

## Limitations

- Trained and evaluated on job title text only — unusual or highly abbreviated titles may score less reliably
- English job titles only
- Edge cases like hybrid roles (e.g. "IT Sales Manager") may produce probabilities close to 0.5
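
One way to handle scores near 0.5 is to route them to manual review instead of forcing a label. The sketch below is an assumption about how a caller might do this; the `decide` helper and the `margin` of 0.15 are illustrative and not part of the model.

```python
def decide(it_probability, margin=0.15):
    """Return 'IT', 'Non-IT', or 'review' for scores within margin of 0.5."""
    if it_probability >= 0.5 + margin:
        return "IT"
    if it_probability <= 0.5 - margin:
        return "Non-IT"
    return "review"

# Hypothetical scores for a clear IT title, a clear Non-IT title, and a hybrid role.
for score in (0.93, 0.08, 0.52):
    print(score, decide(score))
```

The right margin depends on how costly a wrong label is compared with a manual check, so it is worth tuning on a labeled sample.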