---
language:
- ru
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- tiny-bert
- rubert-tiny2
- binary-classification
- jobs
- developer-classification
- data-analyst-classification
- business-analyst-classification
- dev-plus-da-plus-ba
- r95
- v2
base_model: cointegrated/rubert-tiny2
metrics:
- precision
- recall
- roc_auc
model-index:
- name: dev_da_roles_1
  results:
  - task:
      type: text-classification
      name: Developer / Data Analyst / Business Analyst vs Other Binary Classification
    metrics:
    - type: roc_auc
      value: 0.9815
    - type: precision
      value: 0.9219
    - type: recall
      value: 0.9506
---
# dev_da_roles_1 — Developer + Data Analyst + Business Analyst Classifier
Binary job-vacancy classifier: detects **developer, Data Analyst, or Business Analyst** roles (`tech`) versus **other** roles (`other`).
Built on top of [`cointegrated/rubert-tiny2`](https://huggingface.co/cointegrated/rubert-tiny2), a compact BERT model for Russian and English text.
> **v2** extends v1 by adding Business Analyst to the positive class and using a longer input context (`max_length` of 384 tokens, description truncated to 2000 characters). Precision improved from 0.880 to 0.922.
## Task Definition
The positive class (`tech`) is defined as:
> `role_category in TECH_CLASSES AND team_lead == 0`
`TECH_CLASSES`:
- Backend
- Desktop / Systems
- Embedded
- Frontend
- Fullstack
- ML / AI / Data Scientist
- Mobile
- Data Analyst
- Бизнес аналитик (Business Analyst)
Team leads and management roles are intentionally excluded from the positive class.
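
The labeling rule above can be sketched as a small function. This is an illustrative reconstruction, not the author's labeling code: the function name `make_label` is hypothetical, and the exact category strings stored in the dataset (in particular `"Бизнес аналитик"`) are assumed to match the list as written here.

```python
# Role categories that count as positive, copied from the TECH_CLASSES list.
# Assumption: the dataset stores them with exactly these spellings.
TECH_CLASSES = {
    "Backend",
    "Desktop / Systems",
    "Embedded",
    "Frontend",
    "Fullstack",
    "ML / AI / Data Scientist",
    "Mobile",
    "Data Analyst",
    "Бизнес аналитик",
}


def make_label(role_category: str, team_lead: int) -> int:
    """Return 1 (`tech`) or 0 (`other`) for one labeled vacancy."""
    return int(role_category in TECH_CLASSES and team_lead == 0)


print(make_label("Backend", 0))          # developer, not a team lead
print(make_label("Backend", 1))          # team leads are excluded
print(make_label("Project Manager", 0))  # hypothetical category outside TECH_CLASSES
```

Because the team-lead check is part of the label itself, a Backend team lead lands in `other` even though the role category is technical.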
## Labels
| id | label |
|----|-------|
| 0 | other |
| 1 | tech |
## Validation Metrics
| Metric | Value |
|---|---:|
| ROC AUC | 0.9815 |
| Precision @ threshold | 0.9219 |
| Recall @ threshold | 0.9506 |
| Best threshold | 0.8791 |
| Target recall | 0.95 |
| Best epoch | 7 |
**Recall by key category (held-out test set):**
| Category | Recall |
|---|---:|
| Backend | 0.984 |
| Frontend | 1.000 |
| Mobile | 1.000 |
| ML / AI / Data Scientist | 0.976 |
| Data Analyst | 0.916 |
| Business Analyst | 0.895 |
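
The relationship between "Best threshold", "Target recall", and the reported precision can be illustrated with a minimal sketch. It assumes the threshold is the highest score cut-off whose recall still meets the target; the card does not spell out the exact selection procedure, so treat this as one plausible reading. The function names and the toy scores are hypothetical.

```python
def precision_recall_at(probs, labels, threshold):
    """Precision and recall of the rule `prob >= threshold` for class 1."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(probs, labels, target_recall=0.95):
    """Highest candidate threshold whose recall still meets the target."""
    # Recall can only grow as the threshold drops, so the first hit is the highest.
    for t in sorted(set(probs), reverse=True):
        _, recall = precision_recall_at(probs, labels, t)
        if recall >= target_recall:
            return t
    return 0.0  # reached only if no candidate meets the target (e.g. no positives)


# Toy scores, not the model's real validation outputs.
probs = [0.99, 0.95, 0.90, 0.70, 0.60, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0]
t = pick_threshold(probs, labels, target_recall=0.75)
print(t, precision_recall_at(probs, labels, t))
```

Raising the target recall pushes the threshold down and usually costs precision, which is the trade-off the table summarizes at a target of 0.95.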
## Inference Parameters
- `max_length`: **384** tokens
- Vacancy text: `title + " . " + description`, description truncated to **2000 characters**
- Decision threshold for class `tech`: **0.8791**
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
MODEL_ID = "AndreiTolmachev/dev_da_roles_1"
THRESHOLD = 0.8791
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()
def is_tech_role(title: str, description: str = "") -> bool:
    # Build the input as described in "Inference Parameters".
    text = f"{title.strip()} . {description[:2000].strip()}"
    enc = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Index 1 is the `tech` class (see "Labels").
    prob_tech = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_tech >= THRESHOLD
# Developer
print(is_tech_role("Backend Python Developer", "FastAPI, PostgreSQL, Docker, Kubernetes..."))
# Data Analyst
print(is_tech_role("Data Analyst", "SQL, Python, dashboards, product metrics, A/B tests..."))
# Business Analyst
print(is_tech_role("Бизнес аналитик", "Сбор требований, UML, BPMN, работа с командой разработки..."))
# Manager — should return False
print(is_tech_role("Project Manager", "Agile, управление командой, планирование спринтов..."))
```
## Architecture
- Model: `BertForSequenceClassification`
- Base model: `cointegrated/rubert-tiny2`
- Layers: 3, hidden size: 312, attention heads: 12
- Vocab size: 83,828
- Parameters: ~29M
- `max_position_embeddings`: 2048
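
As a rough sanity check on the "~29M" figure, the parameter count can be estimated from the numbers above. The feed-forward width (600), the number of token types (2), and the presence of a pooler layer are not stated in this card; they are assumptions about the standard BERT layout of the base model, so the result is an estimate, not a measured value.

```python
# Figures taken from the Architecture list.
VOCAB, HIDDEN, LAYERS, MAX_POS = 83_828, 312, 3, 2048
# Assumptions not stated in this card: feed-forward width, token types, labels.
INTERMEDIATE, TYPE_VOCAB, NUM_LABELS = 600, 2, 2


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word, position, and token-type embedding tables plus their LayerNorm.
embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm
encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)      # query, key, value, attention output
    + linear(HIDDEN, INTERMEDIATE)  # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)  # feed-forward down-projection
    + 2 * layer_norm                # one after attention, one after feed-forward
)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

total = embeddings + LAYERS * encoder_layer + pooler + classifier
print(f"embeddings: {embeddings / 1e6:.1f}M")
print(f"encoder:    {LAYERS * encoder_layer / 1e6:.1f}M")
print(f"total:      {total / 1e6:.1f}M")
```

Under these assumptions the embedding tables account for most of the parameters, and the three transformer layers contribute only a small share. The number of attention heads does not change the count, because the heads split the same 312-dimensional projections.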
## Training
- Dataset: internal job-vacancy dataset (`vacancies_labeled.csv`), labeled by an LLM pipeline
- Train/test split: 85% / 15%, stratified by role and team_lead flag
- Loss: weighted cross-entropy (`pos_weight` = 2.115)
- Optimizer: AdamW, `lr=2e-5`, linear warmup 10%, grad clip 1.0
- Early stopping: patience=3 on F1 at target recall ≥ 0.95
- Threshold selected to achieve target recall = **0.95**
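
The weighted cross-entropy item can be illustrated with a dependency-free sketch. It assumes `pos_weight` is applied as a per-class weight of 2.115 on `tech` and 1.0 on `other`, and that the loss is averaged by the sum of applied weights, the way `torch.nn.CrossEntropyLoss(weight=...)` does by default. The card does not confirm how the weight enters the loss, so both points are assumptions, and `weighted_cross_entropy` is a hypothetical name.

```python
import math

POS_WEIGHT = 2.115                 # value from the Training list
CLASS_WEIGHTS = [1.0, POS_WEIGHT]  # assumed mapping: other=1.0, tech=pos_weight


def weighted_cross_entropy(logits, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of per-example softmax cross-entropy."""
    total_loss, total_weight = 0.0, 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]  # negative log-probability of the true class
        total_loss += weights[y] * nll
        total_weight += weights[y]
    return total_loss / total_weight


# Toy logits: the same confident mistake on a `tech` and on an `other` example.
missed_tech = weighted_cross_entropy([[3.0, -3.0], [3.0, -3.0]], [1, 0])
missed_other = weighted_cross_entropy([[-3.0, 3.0], [-3.0, 3.0]], [1, 0])
print(missed_tech, missed_other)
```

With this weighting, a missed `tech` vacancy contributes more to the loss than a false alarm of the same confidence, which fits a setup tuned toward high recall.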
## Limitations
- Trained primarily on Russian-language IT job vacancies; quality on other domains/languages is not guaranteed.
- Team lead and management roles are treated as `other` by design.
- Description is truncated to 2000 characters before tokenization.
- The model groups developers, Data Analysts, and Business Analysts into one positive class; it does not distinguish between them.
- Recall is lowest for Data Analyst (~0.92) and Business Analyst (~0.90); Data Analyst vacancies with heavy business/finance framing may be missed.
## Version
Hub tag: `v2.0-dev-da-ba-r95`
**Changelog vs v1:**
- Added Business Analyst (`Бизнес аналитик`) to positive class
- Input context extended: `max_length` 256→384, description 1200→2000 chars
- Precision improved: 0.880 → 0.922
- `lr` lowered to 2e-5, batch size 32→24 to accommodate longer sequences
## License
MIT.