---
language: en
tags:
- text-classification
- onnx
- job-classification
- it
license: mit
base_model: intfloat/e5-base-v2
---

# IT vs Non-IT Job Title Classifier

Binary classifier that determines whether a job title belongs to an IT/tech role. It is built on [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) embeddings with a logistic regression head. The head is exported to ONNX for fast, lightweight inference, so the classification step itself needs only ONNX Runtime.

## Repository contents

| File | Description |
|---|---|
| `e5_it_classifier.onnx` | Logistic regression classifier head (ONNX) |

The encoder (`intfloat/e5-base-v2`) is loaded separately at inference time. It is not bundled here because it is a public model.

## How it works

1. The job title is prefixed with `"query: "`, as required by the e5-v2 input format.
2. The prefixed title is encoded by `intfloat/e5-base-v2` with mean pooling and L2 normalization, producing a 768-dim embedding.
3. The embedding is passed through the logistic regression ONNX model.
4. The output is a pair of probabilities, one for class `0` (Non-IT) and one for class `1` (IT).

## Training

- **Encoder:** `intfloat/e5-base-v2` via `sentence-transformers`, embeddings L2-normalized
- **Classifier:** `sklearn.linear_model.LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced')`
- **Input:** job title only
- **Labels:** `1` = IT role, `0` = Non-IT role
- **Class balancing:** enabled via `class_weight='balanced'` to handle uneven label distribution

## Inference

### Python

```python
from sentence_transformers import SentenceTransformer
import onnxruntime as ort
import numpy as np

# The encoder is downloaded from the Hugging Face Hub; the ONNX head is local.
encoder = SentenceTransformer("intfloat/e5-base-v2")
sess = ort.InferenceSession("e5_it_classifier.onnx")

def classify(title: str) -> dict:
    # e5-v2 expects the "query: " prefix; embeddings are L2-normalized as in training.
    emb = encoder.encode(["query: " + title], normalize_embeddings=True)
    probs = sess.run(["probabilities"], {"input": emb.astype(np.float32)})[0]
    # probs[0][0] is the Non-IT probability, probs[0][1] the IT probability.
    return {
        "label": "IT" if probs[0][1] > probs[0][0] else "Non-IT",
        "it_probability": float(probs[0][1]),
    }

print(classify("Senior Software Engineer"))  # IT
print(classify("Regional Sales Manager"))    # Non-IT
```

### JavaScript / TypeScript (Bun or Node)

```typescript
import { pipeline } from "@huggingface/transformers";
import * as ort from "onnxruntime-node";

const extractor = await pipeline("feature-extraction", "intfloat/e5-base-v2", { quantized: false });
const session = await ort.InferenceSession.create("./e5_it_classifier.onnx");

async function classify(title: string) {
  // Mean pooling and L2 normalization match the training setup.
  const output = await extractor("query: " + title, { pooling: "mean", normalize: true });
  const results = await session.run({
    input: new ort.Tensor("float32", output.data as Float32Array, [1, 768]),
  });
  // probs[0] is the Non-IT probability, probs[1] the IT probability.
  const probs = results.probabilities.data as Float32Array;
  return {
    label: probs[1] > probs[0] ? "IT" : "Non-IT",
    it_probability: probs[1],
  };
}

console.log(await classify("Senior Software Engineer")); // IT
console.log(await classify("Regional Sales Manager"));   // Non-IT
```

Install the dependencies with Bun or npm:

```bash
bun add @huggingface/transformers onnxruntime-node
# or
npm install @huggingface/transformers onnxruntime-node
```

## Intended use

Designed for automated job pipeline filtering: quickly classifying job titles as IT or non-IT before downstream enrichment or processing steps. Because it needs only a job title and no description, it works well as a lightweight pre-filter.

## Limitations

- Trained and evaluated on job title text only, so unusual or highly abbreviated titles may score less reliably
- English job titles only
- Edge cases like hybrid roles (e.g. "IT Sales Manager") may produce probabilities close to 0.5