| --- |
| license: mit |
| --- |
| --- |
| language: en |
| tags: |
| - text-classification |
| - onnx |
| - job-classification |
| - it |
| license: mit |
| base_model: intfloat/e5-base-v2 |
| --- |
| |
| # IT vs Non-IT Job Title Classifier |
| |
| Binary classifier that determines whether a job title belongs to an IT/tech role or not. Built on top of [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) embeddings with a logistic regression head, exported to ONNX for fast, lightweight inference with no heavy ML dependencies at runtime. |
| |
| ## Repository contents |
| |
| | File | Description | |
| |---|---| |
| | `e5_it_classifier.onnx` | Logistic regression classifier head (ONNX) | |
| |
| The encoder (`intfloat/e5-base-v2`) is loaded separately at inference time — it is not bundled here since it is a public model. |
| |
| ## How it works |
| |
| 1. The job title is prefixed with `"query: "` — required by the e5-v2 instruction format |
| 2. The prefixed title is encoded by `intfloat/e5-base-v2` with mean pooling and L2 normalization, producing a 768-dim embedding |
| 3. The embedding is passed through the logistic regression ONNX model |
| 4. The output is a probability for class `1` (IT) and class `0` (Non-IT) |
| |
| ## Training |
| |
| - **Encoder:** `intfloat/e5-base-v2` via `sentence-transformers`, embeddings L2-normalized |
| - **Classifier:** `sklearn.linear_model.LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced')` |
| - **Input:** job title only |
| - **Labels:** `1` = IT role, `0` = Non-IT role |
| - **Class balancing:** enabled via `class_weight='balanced'` to handle uneven label distribution |
| |
| ## Inference |
| |
| ### Python |
| |
| ```python |
| from sentence_transformers import SentenceTransformer |
| import onnxruntime as ort |
| import numpy as np |
| |
| encoder = SentenceTransformer("intfloat/e5-base-v2") |
| sess = ort.InferenceSession("e5_it_classifier.onnx") |
| |
| def classify(title: str) -> dict: |
| emb = encoder.encode(["query: " + title], normalize_embeddings=True) |
| probs = sess.run(["probabilities"], {"input": emb.astype(np.float32)})[0] |
| return { |
| "label": "IT" if probs[0][1] > probs[0][0] else "Non-IT", |
| "it_probability": float(probs[0][1]), |
| } |
| |
| print(classify("Senior Software Engineer")) # IT |
| print(classify("Regional Sales Manager")) # Non-IT |
| ``` |
| |
| ### JavaScript / TypeScript (Bun or Node) |
| |
| ```typescript |
| import { pipeline } from "@huggingface/transformers"; |
| import * as ort from "onnxruntime-node"; |
| |
| const extractor = await pipeline("feature-extraction", "intfloat/e5-base-v2", { quantized: false }); |
| const session = await ort.InferenceSession.create("./e5_it_classifier.onnx"); |
| |
| async function classify(title: string) { |
| const output = await extractor("query: " + title, { pooling: "mean", normalize: true }); |
| |
| const results = await session.run({ |
| input: new ort.Tensor("float32", output.data as Float32Array, [1, 768]), |
| }); |
| |
| const probs = results.probabilities.data as Float32Array; |
| return { |
| label: probs[1] > probs[0] ? "IT" : "Non-IT", |
| it_probability: probs[1], |
| }; |
| } |
| |
| console.log(await classify("Senior Software Engineer")); // IT |
| console.log(await classify("Regional Sales Manager")); // Non-IT |
| ``` |
| |
| ```bash |
| bun add @huggingface/transformers onnxruntime-node |
| # or |
| npm install @huggingface/transformers onnxruntime-node |
| ``` |
| |
| ## Intended use |
| |
| Designed for automated job pipeline filtering — quickly classifying job titles as IT or non-IT before downstream enrichment or processing steps. Works well as a lightweight pre-filter given that it only requires a job title with no description needed. |
| |
| ## Limitations |
| |
| - Trained and evaluated on job title text only — unusual or highly abbreviated titles may score less reliably |
| - English job titles only |
| - Edge cases like hybrid roles (e.g. "IT Sales Manager") may produce probabilities close to 0.5 |