---
language: en
tags:
- text-classification
- onnx
- job-classification
- it
license: mit
base_model: intfloat/e5-base-v2
---
# IT vs Non-IT Job Title Classifier
Binary classifier that determines whether a job title belongs to an IT/tech role. Built on [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) embeddings with a logistic regression head. The head is exported to ONNX, so the classification step runs on ONNX Runtime and does not need scikit-learn at inference time.
## Repository contents
| File | Description |
|---|---|
| `e5_it_classifier.onnx` | Logistic regression classifier head (ONNX) |

The encoder (`intfloat/e5-base-v2`) is not bundled here. It is a public model and is loaded separately at inference time.
## How it works
1. The job title is prefixed with `"query: "`, as required by the e5-v2 input format.
2. The prefixed title is encoded by `intfloat/e5-base-v2` with mean pooling and L2 normalization, producing a 768-dim embedding.
3. The embedding is passed through the logistic regression ONNX model.
4. The model outputs one probability per class: class `1` (IT) and class `0` (Non-IT).
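
Step 2 above can be sketched without any ML library to show what mean pooling and L2 normalization do to the token vectors. This is an illustration on toy 2-dim vectors, not the encoder's actual code; the helper names `mean_pool` and `l2_normalize` are made up for this example, and it assumes padding tokens are excluded through an attention mask.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in summed]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


# Toy input: three token vectors of dimension 2; the last one is padding.
tokens = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)    # mean of the two real tokens
embedding = l2_normalize(pooled)    # unit-length vector fed to the classifier
print(pooled, embedding)
```

In the real pipeline the same two operations run over 768-dim token vectors, and `normalize_embeddings=True` in `sentence-transformers` handles the normalization.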
## Training
- **Encoder:** `intfloat/e5-base-v2` via `sentence-transformers`, embeddings L2-normalized
- **Classifier:** `sklearn.linear_model.LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced')`
- **Input:** job title only
- **Labels:** `1` = IT role, `0` = Non-IT role
- **Class balancing:** enabled via `class_weight='balanced'` to handle uneven label distribution
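
To make the effect of `class_weight='balanced'` concrete, the sketch below reproduces the weighting rule scikit-learn documents for that setting, `n_samples / (n_classes * count_of_class)`. The label counts are invented for illustration and say nothing about the actual training set; `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weight: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical distribution: 2 IT titles (label 1) and 8 non-IT titles (label 0).
labels = [1] * 2 + [0] * 8
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so errors on it cost more during fitting.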
## Inference
### Python
```python
from sentence_transformers import SentenceTransformer
import onnxruntime as ort
import numpy as np

# The encoder is downloaded from the Hub; the ONNX file is the classifier head from this repo.
encoder = SentenceTransformer("intfloat/e5-base-v2")
sess = ort.InferenceSession("e5_it_classifier.onnx")


def classify(title: str) -> dict:
    # Add the "query: " prefix and L2-normalize, matching the training setup.
    emb = encoder.encode(["query: " + title], normalize_embeddings=True)
    # probs[0][0] is the Non-IT probability, probs[0][1] the IT probability.
    probs = sess.run(["probabilities"], {"input": emb.astype(np.float32)})[0]
    return {
        "label": "IT" if probs[0][1] > probs[0][0] else "Non-IT",
        "it_probability": float(probs[0][1]),
    }


print(classify("Senior Software Engineer"))  # IT
print(classify("Regional Sales Manager"))  # Non-IT
```
### JavaScript / TypeScript (Bun or Node)
```typescript
import { pipeline } from "@huggingface/transformers";
import * as ort from "onnxruntime-node";
// `dtype: "fp32"` requests unquantized weights (replaces the older `quantized: false` option).
const extractor = await pipeline("feature-extraction", "intfloat/e5-base-v2", { dtype: "fp32" });
const session = await ort.InferenceSession.create("./e5_it_classifier.onnx");
async function classify(title: string) {
  // Mean pooling and L2 normalization match the training-time embedding setup.
  const output = await extractor("query: " + title, { pooling: "mean", normalize: true });
  const results = await session.run({
    input: new ort.Tensor("float32", output.data as Float32Array, [1, 768]),
  });
  // probs[0] is the Non-IT probability, probs[1] the IT probability.
  const probs = results.probabilities.data as Float32Array;
  return {
    label: probs[1] > probs[0] ? "IT" : "Non-IT",
    it_probability: probs[1],
  };
}
console.log(await classify("Senior Software Engineer")); // IT
console.log(await classify("Regional Sales Manager")); // Non-IT
```
Install the dependencies with Bun or npm:
```bash
bun add @huggingface/transformers onnxruntime-node
# or
npm install @huggingface/transformers onnxruntime-node
```
## Intended use
Designed for automated job pipeline filtering: classifying job titles as IT or non-IT before downstream enrichment or processing steps. Because it needs only the job title and no description, it can serve as a lightweight pre-filter.
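
A pre-filter of this kind might look like the sketch below. It takes the classifier as a parameter, so it works with the `classify` function from the Python inference section; the `keep_it_titles` name, the `min_probability` cut-off, and the stand-in `fake_classify` are assumptions made for this example only.

```python
from typing import Callable, Iterable, Iterator


def keep_it_titles(
    titles: Iterable[str],
    classify: Callable[[str], dict],
    min_probability: float = 0.5,  # hypothetical cut-off, tune on your own data
) -> Iterator[str]:
    """Yield only the titles whose IT probability reaches the cut-off."""
    for title in titles:
        if classify(title)["it_probability"] >= min_probability:
            yield title


# Stand-in classifier for demonstration only; pass the real classify() in practice.
def fake_classify(title: str) -> dict:
    score = 0.9 if "engineer" in title.lower() else 0.1
    return {"label": "IT" if score > 0.5 else "Non-IT", "it_probability": score}


titles = ["Senior Software Engineer", "Regional Sales Manager"]
kept = list(keep_it_titles(titles, fake_classify))
print(kept)
```

Passing the classifier in keeps the filter independent of how the model is loaded, which also makes it easy to test without downloading the encoder.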
## Limitations
- Trained and evaluated on job title text only; unusual or heavily abbreviated titles may be classified less reliably
- English job titles only
- Edge cases like hybrid roles (e.g. "IT Sales Manager") may produce probabilities close to 0.5
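
One way to handle probabilities close to 0.5 is to route them to a separate bucket instead of forcing a label. The sketch below assumes a symmetric band around 0.5; the `decide` name, the `"review"` label, and the `margin` value of 0.15 are illustrative choices, not part of the model.

```python
def decide(it_probability: float, margin: float = 0.15) -> str:
    """Map an IT probability to "IT", "Non-IT", or "review" for borderline scores.

    `margin` is a hypothetical half-width of the uncertain band around 0.5.
    """
    if not 0.0 <= it_probability <= 1.0:
        raise ValueError("it_probability must be between 0 and 1")
    if abs(it_probability - 0.5) < margin:
        return "review"
    return "IT" if it_probability > 0.5 else "Non-IT"


for p in (0.93, 0.52, 0.10):
    print(p, decide(p))
```

A wider margin sends more titles to review and fewer to automatic labels; the right trade-off depends on how costly a wrong label is downstream.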