---
language:
- ru
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- tiny-bert
- rubert-tiny2
- binary-classification
- jobs
- developer-classification
- data-analyst-classification
- business-analyst-classification
- dev-plus-da-plus-ba
- r95
- v2
base_model: cointegrated/rubert-tiny2
metrics:
- precision
- recall
- roc_auc
model-index:
- name: dev_da_roles_1
  results:
  - task:
      type: text-classification
      name: Developer / Data Analyst / Business Analyst vs Other Binary Classification
    metrics:
    - type: roc_auc
      value: 0.9815
    - type: precision
      value: 0.9219
    - type: recall
      value: 0.9506
---

# dev_da_roles_1: Developer + Data Analyst + Business Analyst Classifier

Binary job-vacancy classifier: detects **developer, Data Analyst, or Business Analyst** roles (`tech`) versus **other** roles (`other`).

Built on top of [`cointegrated/rubert-tiny2`](https://huggingface.co/cointegrated/rubert-tiny2), a compact BERT model for Russian and English text.

> **v2**: extends v1 by adding Business Analyst to the positive class and using a longer input context (384 tokens / 2000 chars). Precision improved from 0.880 → 0.922.

## Task Definition

The positive class (`tech`) is defined as:

> `role_category in TECH_CLASSES AND team_lead == 0`

`TECH_CLASSES`:

- Backend
- Desktop / Systems
- Embedded
- Frontend
- Fullstack
- ML / AI / Data Scientist
- Mobile
- Data Analyst
- Бизнес аналитик (Business Analyst)

Team leads and management roles are intentionally excluded from the positive class.

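The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the function name `label_vacancy` is hypothetical, and it assumes that `role_category` holds exactly the strings listed under `TECH_CLASSES` and that `team_lead` is a 0/1 flag.

```python
# Category names copied from the TECH_CLASSES list in this card.
# Assumption: the dataset stores them as exactly these strings.
TECH_CLASSES = {
    "Backend",
    "Desktop / Systems",
    "Embedded",
    "Frontend",
    "Fullstack",
    "ML / AI / Data Scientist",
    "Mobile",
    "Data Analyst",
    "Бизнес аналитик",  # Business Analyst
}


def label_vacancy(role_category: str, team_lead: int) -> int:
    """Return 1 (`tech`) or 0 (`other`) for one labeled vacancy (hypothetical helper)."""
    # Positive only when the role is in TECH_CLASSES and it is not a team-lead position.
    return int(role_category in TECH_CLASSES and team_lead == 0)


if __name__ == "__main__":
    print(label_vacancy("Backend", 0))          # tech role, not a team lead
    print(label_vacancy("Backend", 1))          # team lead, so `other`
    print(label_vacancy("Project Manager", 0))  # not in TECH_CLASSES
```

Under this rule a Backend team lead maps to `other`, which matches the exclusion stated above.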
## Labels

| id | label |
|----|-------|
| 0 | other |
| 1 | tech |

## Validation Metrics

| Metric | Value |
|---|---:|
| ROC AUC | 0.9815 |
| Precision @ threshold | 0.9219 |
| Recall @ threshold | 0.9506 |
| Best threshold | 0.8791 |
| Target recall | 0.95 |
| Best epoch | 7 |

**Recall by key category (held-out test set):**

| Category | Recall |
|---|---:|
| Backend | 0.984 |
| Frontend | 1.000 |
| Mobile | 1.000 |
| ML / AI / Data Scientist | 0.976 |
| Data Analyst | 0.916 |
| Business Analyst | 0.895 |

## Inference Parameters

- `max_length`: **384** tokens
- Vacancy text: `title + " . " + description`, description truncated to **2000 characters**
- Decision threshold for class `tech`: **0.8791**

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "AndreiTolmachev/dev_da_roles_1"
THRESHOLD = 0.8791

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()


def is_tech_role(title: str, description: str = "") -> bool:
    text = f"{title.strip()} . {description[:2000].strip()}"
    enc = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    prob_tech = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_tech >= THRESHOLD


# Developer
print(is_tech_role("Backend Python Developer", "FastAPI, PostgreSQL, Docker, Kubernetes..."))

# Data Analyst
print(is_tech_role("Data Analyst", "SQL, Python, dashboards, product metrics, A/B tests..."))

# Business Analyst
print(is_tech_role("Бизнес аналитик", "Сбор требований, UML, BPMN, работа с командой разработки..."))

# Manager: should return False
print(is_tech_role("Project Manager", "Agile, управление командой, планирование спринтов..."))
```

## Architecture

- Model: `BertForSequenceClassification`
- Base model: `cointegrated/rubert-tiny2`
- Layers: 3, hidden size: 312, attention heads: 12
- Vocab size: 83,828
- Parameters: ~29M
- `max_position_embeddings`: 2048

## Training

- Dataset: internal job-vacancy dataset (`vacancies_labeled.csv`), labeled by an LLM pipeline
- Train/test split: 85% / 15%, stratified by role and team_lead flag
- Loss: weighted cross-entropy (`pos_weight` = 2.115)
- Optimizer: AdamW, `lr=2e-5`, linear warmup 10%, grad clip 1.0
- Early stopping: patience=3 on F1 at target recall ≥ 0.95
- Threshold selected to achieve target recall = **0.95**

## Limitations

- Trained primarily on Russian-language IT job vacancies; quality on other domains/languages is not guaranteed.
- Team lead and management roles are treated as `other` by design.
- Description is truncated to 2000 characters before tokenization.
- The model groups developers, Data Analysts, and Business Analysts into one positive class; it does not distinguish between them.
- Data Analyst recall is ~0.92: vacancies with heavy business/finance framing may be missed.

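The Training section states that the decision threshold was selected to reach a target recall of 0.95, but it does not describe the procedure. One common way to do this, shown here as a sketch and not necessarily the method used for this model, is to take the highest threshold at which at least the target share of positive validation examples is still recovered. The function name `pick_threshold` and its arguments are hypothetical.

```python
import math

import numpy as np


def pick_threshold(y_true, prob_tech, target_recall=0.95):
    """Highest threshold whose recall on (y_true, prob_tech) is >= target_recall.

    Hypothetical helper: an assumed procedure, not taken from the training code.
    Returns (threshold, precision, recall) measured on the same data.
    """
    y_true = np.asarray(y_true)
    prob_tech = np.asarray(prob_tech, dtype=float)

    # Scores of the true positives, highest first.
    pos_scores = np.sort(prob_tech[y_true == 1])[::-1]
    if pos_scores.size == 0:
        raise ValueError("need at least one positive example")

    # Number of positives that must score at or above the threshold.
    k = max(1, math.ceil(target_recall * pos_scores.size))
    threshold = float(pos_scores[k - 1])

    # Evaluate the chosen threshold with the same `>=` rule used at inference.
    pred = prob_tech >= threshold
    tp = int(np.sum(pred & (y_true == 1)))
    precision = tp / int(pred.sum())
    recall = tp / pos_scores.size
    return threshold, precision, recall


if __name__ == "__main__":
    # Toy data for illustration only; not from the model's validation set.
    labels = [1, 1, 1, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
    print(pick_threshold(labels, scores, target_recall=0.75))
```

Raising the threshold trades recall for precision, so choosing the highest threshold that still meets the recall target gives the best precision available under that constraint on the data used for selection.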
## Version

Hub tag: `v2.0-dev-da-ba-r95`

**Changelog vs v1:**

- Added Business Analyst (`Бизнес аналитик`) to positive class
- Input context extended: `max_length` 256→384, description 1200→2000 chars
- Precision improved: 0.880 → 0.922
- `lr` lowered to 2e-5, batch size 32→24 to accommodate longer sequences

## License

MIT.