---
language:
- ru
- en
license: mit
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- tiny-bert
- rubert-tiny2
- binary-classification
- jobs
- developer-classification
- data-analyst-classification
- business-analyst-classification
- dev-plus-da-plus-ba
- r95
- v2
base_model: cointegrated/rubert-tiny2
metrics:
- precision
- recall
- roc_auc
model-index:
- name: dev_da_roles_1
  results:
  - task:
      type: text-classification
      name: Developer / Data Analyst / Business Analyst vs Other Binary Classification
    metrics:
    - type: roc_auc
      value: 0.9815
    - type: precision
      value: 0.9219
    - type: recall
      value: 0.9506
---

# dev_da_roles_1: Developer + Data Analyst + Business Analyst Classifier

Binary job-vacancy classifier: detects **developer, Data Analyst, or Business Analyst** roles (`tech`) versus **other** roles (`other`).

Built on top of [`cointegrated/rubert-tiny2`](https://huggingface.co/cointegrated/rubert-tiny2), a compact BERT model for Russian and English text.

> **v2** extends v1 by adding Business Analyst to the positive class and using a longer input context (384 tokens / 2000 chars). Precision improved from 0.880 to 0.922.

## Task Definition

The positive class (`tech`) is defined as:

> `role_category in TECH_CLASSES AND team_lead == 0`

`TECH_CLASSES`:

- Backend
- Desktop / Systems
- Embedded
- Frontend
- Fullstack
- ML / AI / Data Scientist
- Mobile
- Data Analyst
- Бизнес аналитик (Business Analyst)

Team leads and management roles are intentionally excluded from the positive class.
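
The labeling rule above can be sketched as a small Python predicate. This is only an illustration of the stated rule, not the author's labeling code: the function name `is_positive`, the exact category strings, and the `"QA"` category in the last call are assumptions made for the example.

```python
# Category names copied from the list above; the exact strings used in the
# training data are an assumption.
TECH_CLASSES = {
    "Backend",
    "Desktop / Systems",
    "Embedded",
    "Frontend",
    "Fullstack",
    "ML / AI / Data Scientist",
    "Mobile",
    "Data Analyst",
    "Бизнес аналитик",
}


def is_positive(role_category: str, team_lead: int) -> bool:
    """Return True when a vacancy belongs to the positive class (`tech`)."""
    return role_category in TECH_CLASSES and team_lead == 0


print(is_positive("Backend", 0))  # developer, not a team lead
print(is_positive("Backend", 1))  # team leads are excluded
print(is_positive("QA", 0))       # hypothetical category outside TECH_CLASSES
```

Only the first call returns `True`; the other two fall into `other`.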

## Labels

| id | label |
|----|-------|
| 0 | other |
| 1 | tech |

## Validation Metrics

| Metric | Value |
|---|---:|
| ROC AUC | 0.9815 |
| Precision @ threshold | 0.9219 |
| Recall @ threshold | 0.9506 |
| Best threshold | 0.8791 |
| Target recall | 0.95 |
| Best epoch | 7 |

**Recall by key category (held-out test set):**

| Category | Recall |
|---|---:|
| Backend | 0.984 |
| Frontend | 1.000 |
| Mobile | 1.000 |
| ML / AI / Data Scientist | 0.976 |
| Data Analyst | 0.916 |
| Business Analyst | 0.895 |
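
The "Best threshold" and "Target recall" rows imply that the threshold was tuned so that recall reaches 0.95. The exact selection procedure is not described, so the following is a minimal sketch of one common approach: take the highest threshold that still catches the required share of positives, then report precision and recall at that point. The function names and the toy scores are illustrative assumptions, not data from this model.

```python
import math


def pick_threshold(scores, labels, target_recall=0.95):
    """Return the highest threshold whose recall is at least target_recall."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of positives that must score at or above the threshold.
    # The small epsilon guards against floating-point noise in the product.
    needed = max(1, math.ceil(target_recall * len(positives) - 1e-9))
    return positives[needed - 1]


def precision_recall(scores, labels, threshold):
    """Compute precision and recall for the rule `score >= threshold`."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up): five positives followed by five negatives.
scores = [0.99, 0.95, 0.91, 0.85, 0.60, 0.40, 0.88, 0.30, 0.10, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

threshold = pick_threshold(scores, labels, target_recall=0.8)
precision, recall = precision_recall(scores, labels, threshold)
print(threshold, precision, recall)
```

On the toy data, four of the five positives must be caught, so the threshold lands on the fourth-highest positive score.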

## Inference Parameters

- `max_length`: **384** tokens
- Vacancy text: `title + " . " + description`, description truncated to **2000 characters**
- Decision threshold for class `tech`: **0.8791**
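
Because the decision threshold is well above 0.5, the rule is stricter than taking the argmax of the two logits. The sketch below uses only the threshold listed above; the helper name `prob_tech` and the example logit margins are assumptions for illustration. For two classes, the softmax probability of `tech` equals a sigmoid of the logit difference, so the threshold can also be read as a minimum logit margin.

```python
import math

THRESHOLD = 0.8791  # decision threshold from the list above


def prob_tech(logit_other: float, logit_tech: float) -> float:
    """Softmax probability of class 1 (`tech`) for a two-class logit pair."""
    return 1.0 / (1.0 + math.exp(logit_other - logit_tech))


# The same threshold expressed as a minimum margin (logit_tech - logit_other).
min_margin = math.log(THRESHOLD / (1.0 - THRESHOLD))
print(f"minimum logit margin: {min_margin:.2f}")

# Both margins are positive, so argmax would pick `tech` in both cases,
# but only the larger one clears the threshold.
for margin in (1.5, 2.5):
    p = prob_tech(0.0, margin)
    print(f"margin={margin}: p={p:.3f}, tech={p >= THRESHOLD}")
```

A vacancy is therefore labeled `tech` only when the `tech` logit exceeds the `other` logit by roughly 1.98 or more.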

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "AndreiTolmachev/dev_da_roles_1"
THRESHOLD = 0.8791

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()


def is_tech_role(title: str, description: str = "") -> bool:
    # Build the input the same way as in training: title, separator, truncated description.
    text = f"{title.strip()} . {description[:2000].strip()}"
    enc = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Probability of class 1 (`tech`), compared against the tuned threshold.
    prob_tech = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_tech >= THRESHOLD


# Developer
print(is_tech_role("Backend Python Developer", "FastAPI, PostgreSQL, Docker, Kubernetes..."))

# Data Analyst
print(is_tech_role("Data Analyst", "SQL, Python, dashboards, product metrics, A/B tests..."))

# Business Analyst
print(is_tech_role("Бизнес аналитик", "Сбор требований, UML, BPMN, работа с командой разработки..."))

# Manager: should return False
print(is_tech_role("Project Manager", "Agile, управление командой, планирование спринтов..."))
```

## Architecture

- Model: `BertForSequenceClassification`
- Base model: `cointegrated/rubert-tiny2`
- Layers: 3, hidden size: 312, attention heads: 12
- Vocab size: 83,828
- Parameters: ~29M
- `max_position_embeddings`: 2048
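
The figures above imply that most of the parameters sit in the token embedding matrix rather than in the three encoder layers. The arithmetic below uses only the numbers listed in this section and assumes the standard BERT layout, in which the token embedding matrix has shape vocab size by hidden size. The variable names are illustrative.

```python
VOCAB_SIZE = 83_828
HIDDEN_SIZE = 312
NUM_HEADS = 12
MAX_POSITIONS = 2048
TOTAL_PARAMS_APPROX = 29_000_000  # "~29M" from the list above, a rounded figure

# Embedding tables: one row of HIDDEN_SIZE values per token / per position.
token_embedding_params = VOCAB_SIZE * HIDDEN_SIZE
position_embedding_params = MAX_POSITIONS * HIDDEN_SIZE

# Each attention head works on an equal slice of the hidden vector.
head_dim = HIDDEN_SIZE // NUM_HEADS

print(f"token embeddings:    {token_embedding_params:,}")
print(f"position embeddings: {position_embedding_params:,}")
print(f"per-head dimension:  {head_dim}")
print(f"token-embedding share of ~29M: {token_embedding_params / TOTAL_PARAMS_APPROX:.0%}")
```

The token embeddings alone account for about 26.2M parameters, roughly 90% of the approximate total.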

## Training

- Dataset: internal job-vacancy dataset (`vacancies_labeled.csv`), labeled by an LLM pipeline
- Train/test split: 85% / 15%, stratified by role and team_lead flag
- Loss: weighted cross-entropy (`pos_weight` = 2.115)
- Optimizer: AdamW, `lr=2e-5`, linear warmup 10%, grad clip 1.0
- Early stopping: patience=3 on F1 at target recall ≥ 0.95
- Threshold selected to achieve target recall = **0.95**
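
To illustrate what the weighted loss does, the sketch below computes a per-example cross-entropy in which examples of class `tech` are scaled by `pos_weight`. How the weight is applied in the actual training code is not stated, so this is an assumption, and the helper names and example logits are made up for the illustration.

```python
import math

POS_WEIGHT = 2.115  # value listed above


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def weighted_nll(logits, label, pos_weight=POS_WEIGHT):
    """Cross-entropy for one example, scaled by pos_weight when label == 1."""
    weight = pos_weight if label == 1 else 1.0
    return -weight * log_softmax(logits)[label]


# Two equally confident mistakes: a missed `tech` vacancy and a false alarm.
missed_positive = weighted_nll([2.0, 0.0], label=1)
false_alarm = weighted_nll([0.0, 2.0], label=0)
print(missed_positive / false_alarm)
```

Under this assumption a missed `tech` vacancy costs about 2.115 times as much as an equally confident false alarm, which pushes training toward higher recall on the positive class.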

## Limitations

- Trained primarily on Russian-language IT job vacancies; quality on other domains/languages is not guaranteed.
- Team lead and management roles are treated as `other` by design.
- Description is truncated to 2000 characters before tokenization.
- The model groups developers, Data Analysts, and Business Analysts into one positive class; it does not distinguish between them.
- Data Analyst recall is ~0.92: vacancies with heavy business/finance framing may be missed.

## Version

Hub tag: `v2.0-dev-da-ba-r95`

**Changelog vs v1:**

- Added Business Analyst (`Бизнес аналитик`) to positive class
- Input context extended: `max_length` 256→384, description 1200→2000 chars
- Precision improved: 0.880 → 0.922
- `lr` lowered to 2e-5, batch size 32→24 to accommodate longer sequences

## License

MIT.