IT vs Non-IT TinyBERT Classifier, R98
Binary vacancy classifier for the first-stage gate: IT role vs Non-IT role.
This version is tuned for target recall IT >= 0.98, with a lower learning rate
than the R99 baseline.
Model
- Base model: cointegrated/rubert-tiny2
- Architecture: BertForSequenceClassification
- Labels: 0: NonIT, 1: IT
- Input text: title + " . " + description
- Description truncation: 2000 characters
- Max sequence length: 384
Training Data
Dataset: data/labeled/it_nonit_train_28_05_v2.csv
Rows:
| Source | Rows |
|---|---|
| old gold labels | 7,085 |
| OpenAI-labeled TF-IDF candidates | 13,445 |
| OpenAI-labeled TF-IDF deferred non-IT | 6,617 |
| Total | 27,147 |
Binary balance:
| Class | Rows |
|---|---|
| IT | 9,836 |
| Non-IT | 17,311 |
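The class balance can be restated as shares and as a negatives-per-positive ratio. The sketch below only does that arithmetic on the counts from the table. Treating the ratio as the base for a positive-class weight scaled by `--pos-weight-mult` is an assumption about `train_it_bert.py`, whose source is not shown here.

```python
# Class counts taken from the "Binary balance" table.
counts = {"IT": 9_836, "Non-IT": 17_311}
total = sum(counts.values())  # 27,147 rows

# Share of each class in the training set.
shares = {label: n / total for label, n in counts.items()}

# Hypothetical base weight for the positive (IT) class: negatives per positive.
# Whether the training script derives its weight this way is an assumption.
base_pos_weight = counts["Non-IT"] / counts["IT"]
pos_weight_mult = 1.0  # value passed as --pos-weight-mult
pos_weight = base_pos_weight * pos_weight_mult

print(f"IT share: {shares['IT']:.1%}, Non-IT share: {shares['Non-IT']:.1%}")
print(f"negatives per positive: {base_pos_weight:.2f}")
```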
Training Setup
Command:
python3 classifier_agent/train_it_bert.py \
--input data/labeled/it_nonit_train_28_05_v2.csv \
--output-dir classifier_agent/it_vs_nonit_tiny_r98_lr2e5 \
--device mps \
--epochs 6 \
--patience 2 \
--batch-size 32 \
--eval-batch-size 64 \
--lr 2e-5 \
--max-len 384 \
--desc-limit 2000 \
--target-recall 0.98 \
--pos-weight-mult 1.0
Best checkpoint:
epoch 3
Validation Metrics
Threshold selected for target recall IT >= 0.98.
| Metric | Value |
|---|---|
| ROC-AUC | 0.9952 |
| Threshold | 0.2934 |
| Precision IT | 0.9089 |
| Recall IT | 0.9804 |
| F1 IT | 0.9433 |
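One common way to select a threshold for a recall target is to take the highest score cutoff that still keeps enough true IT examples. The sketch below illustrates that idea in plain Python. The function name `pick_threshold` is hypothetical, and the actual selection logic in `train_it_bert.py` may differ.

```python
import math


def pick_threshold(y_true, proba_it, target_recall=0.98):
    """Return the highest threshold whose IT recall is >= target_recall.

    y_true: iterable of 0/1 labels (1 = IT); proba_it: predicted P(IT).
    A sample is predicted IT when its probability is >= the threshold.
    """
    positives = sorted(
        (p for y, p in zip(y_true, proba_it) if y == 1), reverse=True
    )
    if not positives:
        raise ValueError("no positive examples")
    # Reaching recall r requires accepting at least ceil(r * n_pos) positives,
    # so the cutoff is the score of the k-th highest-scored positive.
    k = math.ceil(target_recall * len(positives))
    return positives[k - 1]


# Toy example: 4 IT items, target recall 0.75 -> keep the top 3 positives.
labels = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.1, 0.8, 0.2]
print(pick_threshold(labels, scores, target_recall=0.75))  # → 0.4
```

With the 1,476 IT examples in the validation split (29 + 1,447), a recall of at least 0.98 requires at least 1,447 true positives, which matches the reported recall of 0.9804.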
Confusion matrix at threshold 0.2934:
rows=true, cols=pred [NonIT, IT]
[[2452, 145],
[ 29, 1447]]
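The reported precision, recall, and F1 follow directly from this matrix. A minimal check, using only the four counts above:

```python
# Confusion matrix from the card: rows = true, cols = predicted, order [NonIT, IT].
cm = [[2452, 145],
      [29, 1447]]
tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)  # share of predicted IT that is truly IT
recall = tp / (tp + fn)     # share of true IT that is caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# → precision=0.9089 recall=0.9804 f1=0.9433
```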
Error Analysis
False negatives: 29.
Main FN categories:
| Category | Count |
|---|---|
| Project Manager | 6 |
| Product Manager | 6 |
| Support / Sysadmin | 5 |
| Data Analyst | 5 |
| HR / Recruiter | 2 |
| Mobile | 1 |
| InfoSec / Security | 1 |
| Systems Analyst | 1 |
| Designer | 1 |
| Embedded | 1 |
False positives: 145, all labeled Non-IT.
Compared with the R99 baseline, this version trades recall for precision:
| Model | Precision IT | Recall IT | F1 IT | FN | FP |
|---|---|---|---|---|---|
| it_vs_nonit_tiny R99 | 0.8739 | 0.9905 | 0.9285 | 14 | 211 |
| it_vs_nonit_tiny_r98_lr2e5 | 0.9089 | 0.9804 | 0.9433 | 29 | 145 |
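The trade-off can be expressed as error counts exchanged between the two models. The sketch below uses only the FN and FP columns of the table and assumes both rows were measured on the same validation split.

```python
# FN / FP counts copied from the comparison table.
r99 = {"fn": 14, "fp": 211}
r98 = {"fn": 29, "fp": 145}

extra_fn = r98["fn"] - r99["fn"]  # additional missed IT roles with R98
fewer_fp = r99["fp"] - r98["fp"]  # Non-IT roles no longer passed through

print(f"extra FN: {extra_fn}, fewer FP: {fewer_fp}")
print(f"FP removed per extra FN: {fewer_fp / extra_fn:.1f}")
# → extra FN: 15, fewer FP: 66
# → FP removed per extra FN: 4.4
```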
Usage
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "classifier_agent/it_vs_nonit_tiny_r98_lr2e5"
THRESHOLD = 0.2934192717075348  # selected on validation for recall IT >= 0.98

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR).eval()


def is_it(title: str, description: str = "") -> tuple[bool, float]:
    # Same input format as in training: title + " . " + description,
    # with whitespace collapsed and the description cut to 2000 characters.
    text = f"{title.strip()} . {' '.join(description.split())[:2000]}"
    enc = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Index 1 is the IT class (labels: 0 = NonIT, 1 = IT).
    proba_it = torch.softmax(logits, dim=-1)[0, 1].item()
    return proba_it >= THRESHOLD, proba_it
Recommendation
Use this version when reducing false positives is more important than catching the last 1% of weak or ambiguous IT roles. Keep the R99 model as a broader safety gate.