---
language:
- ru
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- tiny-bert
- rubert-tiny2
- binary-classification
- jobs
- developer-classification
- data-analyst-classification
- business-analyst-classification
- dev-plus-da-plus-ba
- r95
- v2
base_model: cointegrated/rubert-tiny2
metrics:
- precision
- recall
- roc_auc
model-index:
- name: dev_da_roles_1
  results:
  - task:
      type: text-classification
      name: Developer / Data Analyst / Business Analyst vs Other Binary Classification
    metrics:
    - type: roc_auc
      value: 0.9815
    - type: precision
      value: 0.9219
    - type: recall
      value: 0.9506
---

# dev_da_roles_1 — Developer + Data Analyst + Business Analyst Classifier

Binary job-vacancy classifier: detects **developer, Data Analyst, or Business Analyst** roles (`tech`) versus **other** roles (`other`).

Built on top of [`cointegrated/rubert-tiny2`](https://huggingface.co/cointegrated/rubert-tiny2), a compact BERT model for Russian and English text.

> **v2**: extends v1 by adding Business Analyst to the positive class and using a longer input context (384 tokens / 2000 chars). Precision improved from 0.880 to 0.922.

## Task Definition

The positive class (`tech`) is defined as:

> `role_category in TECH_CLASSES AND team_lead == 0`

`TECH_CLASSES`:

- Backend
- Desktop / Systems
- Embedded
- Frontend
- Fullstack
- ML / AI / Data Scientist
- Mobile
- Data Analyst
- Бизнес аналитик (Business Analyst)

Team leads and management roles are intentionally excluded from the positive class.
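
The rule above can be sketched as a small labeling function. This is only an illustration: `to_label` is a hypothetical helper name, and `"Marketing"` is a made-up non-tech category used for the example, not a value taken from the dataset.

```python
# Hypothetical helper illustrating the positive-class rule from this card.
TECH_CLASSES = {
    "Backend",
    "Desktop / Systems",
    "Embedded",
    "Frontend",
    "Fullstack",
    "ML / AI / Data Scientist",
    "Mobile",
    "Data Analyst",
    "Бизнес аналитик",
}


def to_label(role_category: str, team_lead: int) -> int:
    """Return 1 (`tech`) or 0 (`other`) for one labeled vacancy."""
    return int(role_category in TECH_CLASSES and team_lead == 0)


print(to_label("Backend", 0))    # non-lead developer
print(to_label("Backend", 1))    # team lead, excluded by design
print(to_label("Marketing", 0))  # made-up non-tech category
```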

## Labels

| id | label |
|----|-------|
| 0  | other |
| 1  | tech  |

## Validation Metrics

| Metric | Value |
|---|---:|
| ROC AUC | 0.9815 |
| Precision @ threshold | 0.9219 |
| Recall @ threshold | 0.9506 |
| Best threshold | 0.8791 |
| Target recall | 0.95 |
| Best epoch | 7 |
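
One way a threshold can be tied to a target recall is sketched below. It assumes the rule "take the highest threshold whose recall still meets the target", which may differ from the selection logic actually used for this model; `pick_threshold` and the toy scores are hypothetical.

```python
def pick_threshold(probs, labels, target_recall=0.95):
    """Return (threshold, precision, recall) for the highest threshold
    whose recall is at least target_recall.

    probs  -- predicted probability of class `tech` per example
    labels -- true labels (1 = tech, 0 = other)
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    # Candidate thresholds are the observed scores, highest first.
    # Recall can only grow as the threshold drops, so the first hit wins.
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        recall = tp / n_pos
        precision = tp / (tp + fp)
        if recall >= target_recall:
            return t, precision, recall
    return None  # unreachable: the lowest score always gives recall 1.0


# Toy validation scores, invented for illustration only.
toy_probs = [0.95, 0.9, 0.8, 0.7, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
print(pick_threshold(toy_probs, toy_labels, target_recall=0.95))
```

Lowering the threshold trades precision for recall, so a recall-first target like 0.95 typically yields a threshold well away from 0.5.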

**Recall by key category (held-out test set):**

| Category | Recall |
|---|---:|
| Backend | 0.984 |
| Frontend | 1.000 |
| Mobile | 1.000 |
| ML / AI / Data Scientist | 0.976 |
| Data Analyst | 0.916 |
| Business Analyst | 0.895 |

## Inference Parameters

- `max_length`: **384** tokens
- Vacancy text: `title + " . " + description`, description truncated to **2000 characters**
- Decision threshold for class `tech`: **0.8791**

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "AndreiTolmachev/dev_da_roles_1"
THRESHOLD = 0.8791

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

def is_tech_role(title: str, description: str = "") -> bool:
    text = f"{title.strip()} . {description[:2000].strip()}"
    enc = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    prob_tech = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_tech >= THRESHOLD

# Developer
print(is_tech_role("Backend Python Developer", "FastAPI, PostgreSQL, Docker, Kubernetes..."))

# Data Analyst
print(is_tech_role("Data Analyst", "SQL, Python, dashboards, product metrics, A/B tests..."))

# Business Analyst
print(is_tech_role("Бизнес аналитик", "Сбор требований, UML, BPMN, работа с командой разработки..."))

# Manager — should return False
print(is_tech_role("Project Manager", "Agile, управление командой, планирование спринтов..."))
```

## Architecture

- Model: `BertForSequenceClassification`
- Base model: `cointegrated/rubert-tiny2`
- Layers: 3, hidden size: 312, attention heads: 12
- Vocab size: 83,828
- Parameters: ~29M
- `max_position_embeddings`: 2048
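
As a rough sanity check, the parameter count can be estimated from the figures above. The sketch assumes the standard BERT layer layout plus two values not stated in this card: a feed-forward width of 600 and two segment embeddings. Treat both as assumptions and check the model's `config.json` for the real values.

```python
# Architecture values from this card; INTERMEDIATE_SIZE and TYPE_VOCAB are
# assumptions (not stated above), as is the standard BERT layer layout.
VOCAB, HIDDEN, LAYERS, MAX_POS = 83_828, 312, 3, 2048
INTERMEDIATE_SIZE = 600  # assumed feed-forward width
TYPE_VOCAB = 2           # assumed number of segment embeddings
NUM_LABELS = 2           # other / tech


def linear(n_in: int, n_out: int) -> int:
    """Parameters in a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm
per_layer = (
    4 * linear(HIDDEN, HIDDEN)           # query, key, value, attention output
    + linear(HIDDEN, INTERMEDIATE_SIZE)  # feed-forward up-projection
    + linear(INTERMEDIATE_SIZE, HIDDEN)  # feed-forward down-projection
    + 2 * layer_norm
)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

total = embeddings + LAYERS * per_layer + pooler + classifier
print(f"embeddings: {embeddings:,}")
print(f"total:      {total:,}")
```

Under these assumptions the total lands close to the ~29M figure, with the embedding table accounting for most of it; the three transformer layers themselves are small.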

## Training

- Dataset: internal job-vacancy dataset (`vacancies_labeled.csv`), labeled by an LLM pipeline
- Train/test split: 85% / 15%, stratified by role and team_lead flag
- Loss: weighted cross-entropy (`pos_weight` = 2.115)
- Optimizer: AdamW, `lr=2e-5`, linear warmup 10%, grad clip 1.0
- Early stopping: patience=3 on F1 at target recall ≥ 0.95
- Threshold selected to achieve target recall = **0.95**
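
The weighted loss listed above can be illustrated in plain Python. This is a minimal sketch, assuming `pos_weight` multiplies the loss of `tech` examples and that the batch loss is normalized by the total weight (the convention PyTorch uses for weighted cross-entropy with mean reduction); the actual training code is not shown in this card, and `weighted_cross_entropy` is a hypothetical name.

```python
import math

POS_WEIGHT = 2.115  # weight on class `tech`, from the training settings


def weighted_cross_entropy(probs_tech, labels, pos_weight=POS_WEIGHT):
    """Weighted binary cross-entropy, normalized by the total weight.

    probs_tech -- predicted probability of class `tech` per example
    labels     -- true labels (1 = tech, 0 = other)
    """
    total_loss = 0.0
    total_weight = 0.0
    for p, y in zip(probs_tech, labels):
        weight = pos_weight if y == 1 else 1.0
        p_true = p if y == 1 else 1.0 - p  # probability given to the true class
        total_loss += -weight * math.log(p_true)
        total_weight += weight
    return total_loss / total_weight


# Missing a `tech` vacancy is penalized more than a false alarm on `other`.
missed_positive = weighted_cross_entropy([0.1, 0.1], [1, 0])
false_alarm = weighted_cross_entropy([0.9, 0.9], [1, 0])
print(missed_positive > false_alarm)
```

Up-weighting the positive class pushes the model toward higher recall on `tech`, which fits the recall-first threshold target.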

## Limitations

- Trained primarily on Russian-language IT job vacancies; quality on other domains/languages is not guaranteed.
- Team lead and management roles are treated as `other` by design.
- Description is truncated to 2000 characters before tokenization.
- The model groups developers, Data Analysts, and Business Analysts into one positive class; it does not distinguish between them.
- Data Analyst recall is ~0.92: vacancies with heavy business/finance framing may be missed.

## Version

Hub tag: `v2.0-dev-da-ba-r95`

**Changelog vs v1:**
- Added Business Analyst (`Бизнес аналитик`) to positive class
- Input context extended: `max_length` 256→384, description 1200→2000 chars
- Precision improved: 0.880 → 0.922
- `lr` lowered to 2e-5; batch size reduced 32→24 to accommodate longer sequences

## License

MIT.