	---
	language: en
	tags:
	- text-classification
	- onnx
	- job-classification
	- it
	license: mit
	base_model: intfloat/e5-base-v2
	---

	# IT vs Non-IT Job Title Classifier

	Binary classifier that determines whether a job title belongs to an IT/tech role. It uses [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) embeddings with a logistic regression head. The head is exported to ONNX, so classification itself is fast and needs only ONNX Runtime, not scikit-learn, at inference time.

	## Repository contents

	| File | Description |
	|---|---|
	| `e5_it_classifier.onnx` | Logistic regression classifier head (ONNX) |

	The encoder (`intfloat/e5-base-v2`) is loaded separately at inference time — it is not bundled here since it is a public model.

	## How it works

	1. The job title is prefixed with `"query: "` — required by the e5-v2 instruction format
	2. The prefixed title is encoded by `intfloat/e5-base-v2` with mean pooling and L2 normalization, producing a 768-dim embedding
	3. The embedding is passed through the logistic regression ONNX model
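	4. The model returns a `probabilities` output with two values per title: index `0` is the probability of class `0` (Non-IT) and index `1` is the probability of class `1` (IT)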
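
To make steps 2 to 4 concrete, here is a minimal sketch in plain Python of what mean pooling, L2 normalization, and a binary logistic regression head compute. The toy 2-dimensional vectors and the `weights` and `bias` values are made up for illustration; the real embeddings have 768 dimensions and the real coefficients live inside `e5_it_classifier.onnx`.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def head_probabilities(embedding, weights, bias):
    """Binary logistic regression: returns [P(Non-IT), P(IT)]."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    p_it = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p_it, p_it]

# Toy example: two real tokens and one padding token (hypothetical values).
tokens = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = l2_normalize(mean_pool(tokens, mask))

# Hypothetical coefficients, not the trained ones.
probs = head_probabilities(embedding, weights=[1.0, -1.0], bias=0.0)
```

In practice the encoder library performs the pooling and normalization, and ONNX Runtime evaluates the head; the sketch only shows the arithmetic involved.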

	## Training

	- Encoder: `intfloat/e5-base-v2` via `sentence-transformers`, embeddings L2-normalized
	- Classifier: `sklearn.linear_model.LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced')`
	- Input: job title only
	- Labels: `1` = IT role, `0` = Non-IT role
	- Class balancing: enabled via `class_weight='balanced'` to handle uneven label distribution
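
As a rough illustration of what `class_weight='balanced'` does, scikit-learn weights each class by `n_samples / (n_classes * count_of_class)`, so the rarer class counts for more during fitting. The sketch below reproduces that formula in plain Python; the label distribution is hypothetical and is not the actual training data.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical distribution: 2 IT titles (label 1), 8 Non-IT titles (label 0).
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
weights = balanced_class_weights(labels)
```

With this made-up split, each IT example carries four times the weight of a Non-IT example, which offsets the 1:4 imbalance.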

	## Inference

	### Python

	```python
	from sentence_transformers import SentenceTransformer
	import onnxruntime as ort
	import numpy as np

	encoder = SentenceTransformer("intfloat/e5-base-v2")
	sess = ort.InferenceSession("e5_it_classifier.onnx")

	def classify(title: str) -> dict:
	    emb = encoder.encode(["query: " + title], normalize_embeddings=True)
	    probs = sess.run(["probabilities"], {"input": emb.astype(np.float32)})[0]
	    return {
	        "label": "IT" if probs[0][1] > probs[0][0] else "Non-IT",
	        "it_probability": float(probs[0][1]),
	    }

	print(classify("Senior Software Engineer")) # IT
	print(classify("Regional Sales Manager")) # Non-IT
	```

	### JavaScript / TypeScript (Bun or Node)

	```typescript
	import { pipeline } from "@huggingface/transformers";
	import * as ort from "onnxruntime-node";

	const extractor = await pipeline("feature-extraction", "intfloat/e5-base-v2", { dtype: "fp32" });
	const session = await ort.InferenceSession.create("./e5_it_classifier.onnx");

	async function classify(title: string) {
	  const output = await extractor("query: " + title, { pooling: "mean", normalize: true });

	  const results = await session.run({
	    input: new ort.Tensor("float32", output.data as Float32Array, [1, 768]),
	  });

	  const probs = results.probabilities.data as Float32Array;
	  return {
	    label: probs[1] > probs[0] ? "IT" : "Non-IT",
	    it_probability: probs[1],
	  };
	}

	console.log(await classify("Senior Software Engineer")); // IT
	console.log(await classify("Regional Sales Manager")); // Non-IT
	```

	```bash
	bun add @huggingface/transformers onnxruntime-node
	# or
	npm install @huggingface/transformers onnxruntime-node
	```

	## Intended use

	Designed for automated job pipeline filtering: quickly classifying job titles as IT or Non-IT before downstream enrichment or processing steps. Because it needs only the job title and no description, it works well as a lightweight pre-filter.
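
One way such a pre-filter might be wired up is sketched below. It assumes a `classify` callable that returns a dict with an `it_probability` key, as in the Python inference example; `prefilter`, `threshold`, and `fake_classify` are hypothetical names, and `fake_classify` is only a keyword stand-in so the sketch runs without the model.

```python
from typing import Callable, Iterable

def prefilter(
    titles: Iterable[str],
    classify: Callable[[str], dict],
    threshold: float = 0.5,
):
    """Split titles into (it_titles, other_titles) by IT probability."""
    it_titles, other_titles = [], []
    for title in titles:
        p = classify(title)["it_probability"]
        (it_titles if p > threshold else other_titles).append(title)
    return it_titles, other_titles

# Stand-in for the real model so the sketch is runnable (NOT the classifier).
def fake_classify(title: str) -> dict:
    return {"it_probability": 0.9 if "engineer" in title.lower() else 0.1}

it_titles, other_titles = prefilter(
    ["Senior Software Engineer", "Regional Sales Manager"], fake_classify
)
```

Only the titles in `it_titles` would then continue to the more expensive enrichment steps.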

	## Limitations

	- Trained and evaluated on job title text only — unusual or highly abbreviated titles may score less reliably
	- English job titles only
	- Edge cases like hybrid roles (e.g. "IT Sales Manager") may produce probabilities close to 0.5
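
For borderline scores, one possible mitigation is to treat probabilities near 0.5 as undecided and route them to manual review or a fallback rule. The sketch below is an assumption about how that could look, not part of the model; the `margin` of 0.1 is an arbitrary illustrative value that would need tuning on real data.

```python
def label_with_margin(it_probability: float, margin: float = 0.1) -> str:
    """Return "IT", "Non-IT", or "uncertain" when within `margin` of 0.5."""
    if abs(it_probability - 0.5) < margin:
        return "uncertain"
    return "IT" if it_probability > 0.5 else "Non-IT"

# Hypothetical probabilities for three titles.
labels = [label_with_margin(p) for p in (0.93, 0.55, 0.08)]
```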