---
language:
- ru
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- tiny-bert
- rubert-tiny2
- binary-classification
- jobs
- developer-classification
- data-analyst-classification
- business-analyst-classification
- dev-plus-da-plus-ba
- r95
- v2
base_model: cointegrated/rubert-tiny2
metrics:
- precision
- recall
- roc_auc
model-index:
- name: dev_da_roles_1
  results:
  - task:
      type: text-classification
      name: Developer / Data Analyst / Business Analyst vs Other Binary Classification
    metrics:
    - type: roc_auc
      value: 0.9815
    - type: precision
      value: 0.9219
    - type: recall
      value: 0.9506
---

# dev_da_roles_1 — Developer + Data Analyst + Business Analyst Classifier

Binary job-vacancy classifier: detects **developer, Data Analyst, or Business Analyst** roles (`tech`) versus **other** roles (`other`).

Built on top of [`cointegrated/rubert-tiny2`](https://huggingface.co/cointegrated/rubert-tiny2), a compact BERT model for Russian and English text.

> **v2**: extends v1 by adding Business Analyst to the positive class and using a longer input context (384 tokens / 2000 chars). Precision improved from 0.880 to 0.922.

## Task Definition

The positive class (`tech`) is defined as:

> `role_category in TECH_CLASSES AND team_lead == 0`

`TECH_CLASSES`:

- Backend
- Desktop / Systems
- Embedded
- Frontend
- Fullstack
- ML / AI / Data Scientist
- Mobile
- Data Analyst
- Бизнес аналитик (Business Analyst)

Team leads and management roles are intentionally excluded from the positive class.
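
The rule above can be sketched as a small labeling function. This is a minimal illustration, not the original labeling code: the function name `make_label` is hypothetical, the category strings are copied from the list above, and `"Project Manager"` is only an example of a category outside `TECH_CLASSES`.

```python
# Category names copied from the TECH_CLASSES list; the exact strings used
# in the internal dataset are an assumption.
TECH_CLASSES = {
    "Backend",
    "Desktop / Systems",
    "Embedded",
    "Frontend",
    "Fullstack",
    "ML / AI / Data Scientist",
    "Mobile",
    "Data Analyst",
    "Бизнес аналитик",  # Business Analyst
}


def make_label(role_category: str, team_lead: int) -> int:
    """Return 1 (`tech`) for a non-lead role in TECH_CLASSES, else 0 (`other`)."""
    return int(role_category in TECH_CLASSES and team_lead == 0)


print(make_label("Backend", 0))          # 1
print(make_label("Backend", 1))          # 0: team leads are excluded
print(make_label("Project Manager", 0))  # 0: hypothetical non-tech category
```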

## Labels

| id | label |
|----|-------|
| 0  | other |
| 1  | tech  |

## Validation Metrics

| Metric | Value |
|---|---:|
| ROC AUC | 0.9815 |
| Precision @ threshold | 0.9219 |
| Recall @ threshold | 0.9506 |
| Best threshold | 0.8791 |
| Target recall | 0.95 |
| Best epoch | 7 |
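
The card does not spell out how the threshold was searched. One common approach, shown here purely as an assumption, is to scan the observed scores from high to low and keep the highest threshold whose recall still meets the target. The helper name `pick_threshold` and the toy scores are hypothetical.

```python
def pick_threshold(probs, labels, target_recall=0.95):
    """Return (threshold, precision, recall) for the highest threshold
    whose recall on `labels` is at least `target_recall`."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive example")
    # Candidate thresholds are the observed scores, scanned from high to low.
    # Recall can only grow as the threshold drops, so the first hit is the highest.
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= t)
        recall = tp / positives
        precision = tp / (tp + fp)
        if recall >= target_recall:
            return t, precision, recall
    raise ValueError("target_recall must not exceed 1.0")


# Toy scores, not real model outputs.
probs = [0.95, 0.90, 0.80, 0.70, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]
print(pick_threshold(probs, labels))  # (0.7, 0.75, 1.0)
```

Choosing the highest qualifying threshold keeps precision as high as possible while still meeting the recall target.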

**Recall by key category (held-out test set):**

| Category | Recall |
|---|---:|
| Backend | 0.984 |
| Frontend | 1.000 |
| Mobile | 1.000 |
| ML / AI / Data Scientist | 0.976 |
| Data Analyst | 0.916 |
| Business Analyst | 0.895 |
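
Per-category recall of this kind can be computed by grouping the positive examples by category and counting how many reach the decision threshold. The sketch below assumes each row is a `(category, label, prob_tech)` tuple; that layout, the function name `recall_by_category`, and the toy rows are illustrative and not taken from the evaluation code.

```python
from collections import defaultdict


def recall_by_category(rows, threshold=0.8791):
    """Recall per category over positive examples only.

    rows: iterable of (category, label, prob_tech) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, label, prob in rows:
        if label != 1:
            continue  # recall ignores negative examples
        totals[category] += 1
        if prob >= threshold:
            hits[category] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Toy rows, not real evaluation data.
rows = [
    ("Backend", 1, 0.97),
    ("Backend", 1, 0.91),
    ("Data Analyst", 1, 0.95),
    ("Data Analyst", 1, 0.60),
    ("Backend", 0, 0.99),
]
print(recall_by_category(rows))  # {'Backend': 1.0, 'Data Analyst': 0.5}
```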

## Inference Parameters

- `max_length`: **384** tokens
- Vacancy text: `title + " . " + description`, description truncated to **2000 characters**
- Decision threshold for class `tech`: **0.8791**

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "AndreiTolmachev/dev_da_roles_1"
THRESHOLD = 0.8791

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

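def is_tech_role(title: str, description: str = "") -> bool:
    """Return True if P(tech) for the vacancy reaches THRESHOLD."""
    # Input format from "Inference Parameters": title + " . " + description,
    # with the description cut to its first 2000 characters.
    text = f"{title.strip()} . {description[:2000].strip()}"
    enc = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Index 1 corresponds to the `tech` label.
    prob_tech = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_tech >= THRESHOLD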

# Developer
print(is_tech_role("Backend Python Developer", "FastAPI, PostgreSQL, Docker, Kubernetes..."))

# Data Analyst
print(is_tech_role("Data Analyst", "SQL, Python, dashboards, product metrics, A/B tests..."))

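# Business Analyst (Russian input: "Requirements gathering, UML, BPMN, working with the development team...")
print(is_tech_role("Бизнес аналитик", "Сбор требований, UML, BPMN, работа с командой разработки..."))

# Manager: should return False (Russian input: "Agile, team management, sprint planning...")
print(is_tech_role("Project Manager", "Agile, управление командой, планирование спринтов..."))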
```

## Architecture

- Model: `BertForSequenceClassification`
- Base model: `cointegrated/rubert-tiny2`
- Layers: 3, hidden size: 312, attention heads: 12
- Vocab size: 83,828
- Parameters: ~29M
- `max_position_embeddings`: 2048
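
A quick back-of-the-envelope check, using only the figures listed above, suggests that most of the ~29M parameters sit in the embedding tables rather than in the three transformer layers. The constant names are illustrative, and the total is the rounded figure from this card, so the resulting share is approximate.

```python
VOCAB_SIZE = 83_828
HIDDEN_SIZE = 312
MAX_POSITIONS = 2048
TOTAL_PARAMS_APPROX = 29_000_000  # the rounded "~29M" figure from the list above

# Each embedding table holds one HIDDEN_SIZE-dimensional vector per entry.
token_embeddings = VOCAB_SIZE * HIDDEN_SIZE
position_embeddings = MAX_POSITIONS * HIDDEN_SIZE
embedding_params = token_embeddings + position_embeddings
share = embedding_params / TOTAL_PARAMS_APPROX

print(f"token embeddings:    {token_embeddings:,}")
print(f"position embeddings: {position_embeddings:,}")
print(f"approx. share of all parameters: {share:.0%}")
```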

## Training

- Dataset: internal job-vacancy dataset (`vacancies_labeled.csv`), labeled by an LLM pipeline
- Train/test split: 85% / 15%, stratified by role and team_lead flag
- Loss: weighted cross-entropy (`pos_weight` = 2.115)
- Optimizer: AdamW, `lr=2e-5`, linear warmup 10%, grad clip 1.0
- Early stopping: patience=3 on F1 at target recall ≥ 0.95
- Threshold selected to achieve target recall = **0.95**
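
To illustrate what a weighted cross-entropy does, the sketch below computes the loss for a single example with two logits `[other, tech]` and scales it by `pos_weight` when the true label is `tech`. It assumes class weights of `[1.0, pos_weight]`, which the card does not state explicitly; the function name is hypothetical. In PyTorch, a comparable setup would typically pass class weights to `torch.nn.CrossEntropyLoss(weight=...)`.

```python
import math

POS_WEIGHT = 2.115  # value from the training list above


def weighted_cross_entropy(logits, label, pos_weight=POS_WEIGHT):
    """Per-example cross-entropy over logits [other, tech].

    Positive (`tech`) examples are scaled by pos_weight; negatives by 1.0
    (the 1.0 weight for `other` is an assumption).
    """
    # Numerically stable log-sum-exp over the two logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    nll = log_sum_exp - logits[label]
    weight = pos_weight if label == 1 else 1.0
    return weight * nll


# With equal logits the unweighted loss is ln(2) for either class.
print(weighted_cross_entropy([0.0, 0.0], label=0))
print(weighted_cross_entropy([0.0, 0.0], label=1))
```

A missed positive therefore costs about twice as much as a missed negative, which pushes the model toward higher recall on the `tech` class.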

## Limitations

- Trained primarily on Russian-language IT job vacancies; quality on other domains/languages is not guaranteed.
- Team lead and management roles are treated as `other` by design.
- Description is truncated to 2000 characters before tokenization.
- The model groups developers, Data Analysts, and Business Analysts into one positive class; it does not distinguish between them.
- Recall is lowest for Data Analyst (0.916) and Business Analyst (0.895); Data Analyst vacancies with heavy business/finance framing may be missed.

## Version

Hub tag: `v2.0-dev-da-ba-r95`

**Changelog vs v1:**
- Added Business Analyst (`Бизнес аналитик`) to positive class
- Input context extended: `max_length` 256→384, description 1200→2000 chars
- Precision improved: 0.880 → 0.922
- `lr` lowered to 2e-5, batch size 32→24 to accommodate longer sequences

## License

MIT.