---
language:
- ru
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- tiny-bert
- rubert-tiny2
- binary-classification
- jobs
- developer-classification
- data-analyst-classification
- business-analyst-classification
- dev-plus-da-plus-ba
- r95
- v2
base_model: cointegrated/rubert-tiny2
metrics:
- precision
- recall
- roc_auc
model-index:
- name: dev_da_roles_1
  results:
  - task:
      type: text-classification
      name: Developer / Data Analyst / Business Analyst vs Other Binary Classification
    metrics:
    - type: roc_auc
      value: 0.9815
    - type: precision
      value: 0.9219
    - type: recall
      value: 0.9506
---
# dev_da_roles_1 — Developer + Data Analyst + Business Analyst Classifier
Binary job-vacancy classifier: detects **developer, Data Analyst, or Business Analyst** roles (`tech`) versus **other** roles (`other`).
Built on top of [`cointegrated/rubert-tiny2`](https://huggingface.co/cointegrated/rubert-tiny2), a compact BERT model for Russian and English text.
> **v2**: extends v1 by adding Business Analyst to the positive class and lengthening the input context (384 tokens, description up to 2000 characters). Precision improved from 0.880 to 0.922.
## Task Definition
The positive class (`tech`) is defined as:
> `role_category in TECH_CLASSES AND team_lead == 0`

`TECH_CLASSES`:
- Backend
- Desktop / Systems
- Embedded
- Frontend
- Fullstack
- ML / AI / Data Scientist
- Mobile
- Data Analyst
- Бизнес аналитик (Business Analyst)

Team leads and management roles are intentionally excluded from the positive class.
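
As a minimal sketch, the rule above could be expressed as follows. The function name `is_positive` and the exact category strings are assumptions made for illustration; the labeling code itself is not part of this card.

```python
# Hypothetical helper: the name and the category strings are illustrative assumptions.
TECH_CLASSES = {
    "Backend",
    "Desktop / Systems",
    "Embedded",
    "Frontend",
    "Fullstack",
    "ML / AI / Data Scientist",
    "Mobile",
    "Data Analyst",
    "Бизнес аналитик",  # Business Analyst
}


def is_positive(role_category: str, team_lead: int) -> int:
    """Return 1 (`tech`) for a non-lead role in TECH_CLASSES, else 0 (`other`)."""
    return int(role_category in TECH_CLASSES and team_lead == 0)


print(is_positive("Backend", 0))  # 1
print(is_positive("Backend", 1))  # 0: team leads are excluded
print(is_positive("Sales", 0))    # 0: "Sales" is a made-up non-tech category
```
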
## Labels
| id | label |
|----|-------|
| 0 | other |
| 1 | tech |
## Validation Metrics
| Metric | Value |
|---|---:|
| ROC AUC | 0.9815 |
| Precision @ threshold | 0.9219 |
| Recall @ threshold | 0.9506 |
| Best threshold | 0.8791 |
| Target recall | 0.95 |
| Best epoch | 7 |
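
One way to arrive at a "Best threshold" for a given "Target recall" is sketched below: scan the distinct validation scores from highest to lowest and keep the first (strictest) threshold whose recall meets the target. The function name and the toy data are assumptions; this card does not describe the exact selection procedure that was used.

```python
def pick_threshold(scores, labels, target_recall=0.95):
    """Return the highest threshold t for which recall of `score >= t` meets the target.

    scores: predicted probabilities of the positive class
    labels: ground-truth labels (1 = tech, 0 = other)
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("no positive examples")
    # Candidate thresholds are the distinct scores, strictest first.
    for t in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        if true_pos / n_pos >= target_recall:
            return t
    raise ValueError("target recall cannot be reached")


# Toy data, not real model outputs.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
labels = [1, 1, 0, 1, 1, 0]
print(pick_threshold(scores, labels, target_recall=0.75))  # 0.6
```

The quadratic scan keeps the sketch short; for a large validation set, sorting once and accumulating true positives would be the more efficient choice.
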
**Recall by key category (held-out test set):**
| Category | Recall |
|---|---:|
| Backend | 0.984 |
| Frontend | 1.000 |
| Mobile | 1.000 |
| ML / AI / Data Scientist | 0.976 |
| Data Analyst | 0.916 |
| Business Analyst | 0.895 |
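
A per-category recall table like the one above could be computed along these lines. The record layout (`category`, `prob_tech`) and the toy values are assumptions for illustration, not the evaluation code behind this card.

```python
from collections import defaultdict

THRESHOLD = 0.8791  # decision threshold reported in this card


def recall_by_category(records, threshold=THRESHOLD):
    """Compute recall per category.

    records: iterable of (category, prob_tech) pairs for positive-class vacancies only.
    Returns {category: share of vacancies with prob_tech >= threshold}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, prob_tech in records:
        totals[category] += 1
        if prob_tech >= threshold:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Toy data, not real model outputs.
toy = [
    ("Backend", 0.97),
    ("Backend", 0.91),
    ("Data Analyst", 0.95),
    ("Data Analyst", 0.60),
]
print(recall_by_category(toy))  # {'Backend': 1.0, 'Data Analyst': 0.5}
```
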
## Inference Parameters
- `max_length`: **384** tokens
- Vacancy text: `title + " . " + description`, description truncated to **2000 characters**
- Decision threshold for class `tech`: **0.8791**
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "AndreiTolmachev/dev_da_roles_1"
THRESHOLD = 0.8791  # decision threshold for class `tech`

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()


def is_tech_role(title: str, description: str = "") -> bool:
    # Same input format as in training: title + " . " + description (first 2000 chars)
    text = f"{title.strip()} . {description[:2000].strip()}"
    enc = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Index 1 is the `tech` class (see the Labels table)
    prob_tech = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_tech >= THRESHOLD


# Developer
print(is_tech_role("Backend Python Developer", "FastAPI, PostgreSQL, Docker, Kubernetes..."))
# Data Analyst
print(is_tech_role("Data Analyst", "SQL, Python, dashboards, product metrics, A/B tests..."))
# Business Analyst (Russian-language vacancy)
print(is_tech_role("Бизнес аналитик", "Сбор требований, UML, BPMN, работа с командой разработки..."))
# Manager: expected to return False
print(is_tech_role("Project Manager", "Agile, управление командой, планирование спринтов..."))
```
## Architecture
- Model: `BertForSequenceClassification`
- Base model: `cointegrated/rubert-tiny2`
- Layers: 3, hidden size: 312, attention heads: 12
- Vocab size: 83,828
- Parameters: ~29M
- `max_position_embeddings`: 2048
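
The ~29M figure can be roughly reproduced from the numbers above, and the arithmetic shows that most parameters sit in the token embedding matrix. The sketch assumes a standard BERT layout, a feed-forward `intermediate_size` of 600, 2 token types, and a 2-label classification head; the first two values are not stated in this card and should be checked against the model's `config.json`.

```python
VOCAB, HIDDEN, LAYERS, MAX_POS = 83_828, 312, 3, 2048  # from this card
INTERMEDIATE, TYPE_VOCAB, NUM_LABELS = 600, 2, 2       # assumptions, see the text above


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Token, position and token-type embeddings, followed by a LayerNorm.
embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm

encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, INTERMEDIATE)    # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)    # feed-forward down-projection
    + layer_norm
)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

total = embeddings + LAYERS * encoder_layer + pooler + classifier
print(f"total = {total / 1e6:.1f}M, embeddings share = {embeddings / total:.0%}")
```

Under these assumptions the total comes out slightly above 29M, with roughly nine tenths of it in the embeddings, which is why the model stays small despite the large vocabulary.
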
## Training
- Dataset: internal job-vacancy dataset (`vacancies_labeled.csv`), labeled by an LLM pipeline
- Train/test split: 85% / 15%, stratified by role and team_lead flag
- Loss: weighted cross-entropy (`pos_weight` = 2.115)
- Optimizer: AdamW, `lr=2e-5`, linear warmup 10%, grad clip 1.0
- Early stopping: patience=3 on F1 at target recall ≥ 0.95
- Threshold selected to achieve target recall = **0.95**
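
For reference, class-weighted cross-entropy can be sketched in plain Python as below. It assumes that `pos_weight` is applied as the weight of class `tech` (with weight 1.0 for `other`) and that the batch loss is a weighted mean, which is how `torch.nn.CrossEntropyLoss(weight=...)` reduces by default. The card does not state the exact implementation, so treat both points as assumptions.

```python
import math

CLASS_WEIGHTS = [1.0, 2.115]  # [other, tech]; this mapping of pos_weight is an assumption


def weighted_cross_entropy(logits, labels, class_weights=CLASS_WEIGHTS):
    """Weighted mean of per-example negative log-likelihoods.

    logits: list of [logit_other, logit_tech] rows
    labels: list of class ids (0 = other, 1 = tech)
    """
    total, weight_sum = 0.0, 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]
        total += class_weights[y] * nll
        weight_sum += class_weights[y]
    return total / weight_sum


# Both toy examples are scored as confidently `other`; the second one is a missed `tech`.
logits = [[2.0, 0.0], [2.0, 0.0]]
labels = [0, 1]
print(weighted_cross_entropy(logits, labels))              # weighted loss
print(weighted_cross_entropy(logits, labels, [1.0, 1.0]))  # unweighted baseline
```

The weighted value is larger than the unweighted one because the missed `tech` example contributes more to the loss, which pushes training toward higher recall on the positive class.
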
## Limitations
- Trained primarily on Russian-language IT job vacancies; quality on other domains/languages is not guaranteed.
- Team lead and management roles are treated as `other` by design.
- Description is truncated to 2000 characters before tokenization.
- The model groups developers, Data Analysts, and Business Analysts into one positive class; it does not distinguish between them.
- Data Analyst recall is 0.916 and Business Analyst recall is 0.895, the lowest among the reported categories; Data Analyst vacancies with heavy business/finance framing may be missed.
## Version
Hub tag: `v2.0-dev-da-ba-r95`
**Changelog vs v1:**
- Added Business Analyst (`Бизнес аналитик`) to positive class
- Input context extended: `max_length` 256→384, description 1200→2000 chars
- Precision improved: 0.880 → 0.922
- `lr` lowered to 2e-5, batch size 32→24 to accommodate longer sequences
## License
MIT.