	---
	language:
	- ru
	- en
	license: mit
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- text-classification
	- bert
	- tiny-bert
	- rubert-tiny2
	- binary-classification
	- jobs
	- developer-classification
	- data-analyst-classification
	- business-analyst-classification
	- dev-plus-da-plus-ba
	- r95
	- v2
	base_model: cointegrated/rubert-tiny2
	metrics:
	- precision
	- recall
	- roc_auc
	model-index:
	- name: dev_da_roles_1
	  results:
	  - task:
	      type: text-classification
	      name: Developer / Data Analyst / Business Analyst vs Other Binary Classification
	    metrics:
	    - type: roc_auc
	      value: 0.9815
	    - type: precision
	      value: 0.9219
	    - type: recall
	      value: 0.9506
	---

	# dev_da_roles_1 — Developer + Data Analyst + Business Analyst Classifier

	Binary job-vacancy classifier: detects developer, Data Analyst, or Business Analyst roles (`tech`) versus other roles (`other`).

	Built on top of [`cointegrated/rubert-tiny2`](https://huggingface.co/cointegrated/rubert-tiny2), a compact BERT model for Russian and English text.

	> v2 extends v1 by adding Business Analyst to the positive class and lengthening the input context (`max_length` 256 → 384 tokens, description 1200 → 2000 characters). Precision improved from 0.880 to 0.922.

	## Task Definition

	The positive class (`tech`) is defined as:

	> `role_category in TECH_CLASSES AND team_lead == 0`

	`TECH_CLASSES`:

	- Backend
	- Desktop / Systems
	- Embedded
	- Frontend
	- Fullstack
	- ML / AI / Data Scientist
	- Mobile
	- Data Analyst
	- Бизнес аналитик (Business Analyst)

	Team leads and management roles are intentionally excluded from the positive class.
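
The labeling rule above can be written as a small predicate. This is a minimal sketch: the function name `is_positive`, the exact category strings, and the `"Sales"` example category are assumptions made for illustration, not the author's labeling code.

```python
# Role categories that make up the positive class, as listed in this card.
# The exact strings stored in the dataset are an assumption.
TECH_CLASSES = frozenset({
    "Backend",
    "Desktop / Systems",
    "Embedded",
    "Frontend",
    "Fullstack",
    "ML / AI / Data Scientist",
    "Mobile",
    "Data Analyst",
    "Бизнес аналитик",  # Business Analyst
})


def is_positive(role_category: str, team_lead: int) -> int:
    """Return 1 (`tech`) or 0 (`other`) for one labeled vacancy."""
    return int(role_category in TECH_CLASSES and team_lead == 0)


print(is_positive("Backend", 0))  # 1: developer, not a team lead
print(is_positive("Backend", 1))  # 0: team leads are excluded
print(is_positive("Sales", 0))    # 0: hypothetical non-tech category
```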

	## Labels

	| id | label |
	|----|-------|
	| 0 | other |
	| 1 | tech |

	## Validation Metrics

	| Metric | Value |
	|---|---:|
	| ROC AUC | 0.9815 |
	| Precision @ threshold | 0.9219 |
	| Recall @ threshold | 0.9506 |
	| Best threshold | 0.8791 |
	| Target recall | 0.95 |
	| Best epoch | 7 |

	Recall by key category (held-out test set):

	| Category | Recall |
	|---|---:|
	| Backend | 0.984 |
	| Frontend | 1.000 |
	| Mobile | 1.000 |
	| ML / AI / Data Scientist | 0.976 |
	| Data Analyst | 0.916 |
	| Business Analyst | 0.895 |
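
Per-category recall of this kind can be computed by grouping the true-`tech` vacancies by role category and counting how many score at or above the decision threshold. The sketch below assumes each record is a hypothetical `(category, prob_tech)` pair; the function name and the sample scores are made up for illustration, and this is not the author's evaluation script.

```python
from collections import defaultdict

THRESHOLD = 0.8791  # decision threshold reported in this card


def recall_by_category(positives, threshold=THRESHOLD):
    """positives: iterable of (category, prob_tech) for true-`tech` vacancies."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, prob_tech in positives:
        totals[category] += 1
        if prob_tech >= threshold:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Made-up scores for illustration only.
sample = [
    ("Backend", 0.99), ("Backend", 0.95),
    ("Data Analyst", 0.91), ("Data Analyst", 0.40),
]
print(recall_by_category(sample))  # {'Backend': 1.0, 'Data Analyst': 0.5}
```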

	## Inference Parameters

	- `max_length`: 384 tokens
	- Vacancy text: `title + " . " + description`, description truncated to 2000 characters
	- Decision threshold for class `tech`: 0.8791

	## Usage

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	MODEL_ID = "AndreiTolmachev/dev_da_roles_1"
	THRESHOLD = 0.8791  # decision threshold for class `tech`

	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
	model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

	def is_tech_role(title: str, description: str = "") -> bool:
	    # Input format from "Inference Parameters": title + " . " + description[:2000]
	    text = f"{title.strip()} . {description[:2000].strip()}"
	    enc = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")
	    with torch.no_grad():
	        logits = model(**enc).logits
	    # Index 1 is the `tech` label (see "Labels")
	    prob_tech = torch.softmax(logits, dim=-1)[0, 1].item()
	    return prob_tech >= THRESHOLD

	# Developer
	print(is_tech_role("Backend Python Developer", "FastAPI, PostgreSQL, Docker, Kubernetes..."))

	# Data Analyst
	print(is_tech_role("Data Analyst", "SQL, Python, dashboards, product metrics, A/B tests..."))

	# Business Analyst ("Requirements gathering, UML, BPMN, working with the development team...")
	print(is_tech_role("Бизнес аналитик", "Сбор требований, UML, BPMN, работа с командой разработки..."))

	# Manager, should return False ("Agile, team management, sprint planning...")
	print(is_tech_role("Project Manager", "Agile, управление командой, планирование спринтов..."))
	```
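
Because the model has two output classes, the softmax probability of `tech` depends only on the difference between the two logits, so the probability threshold can be converted into an equivalent logit-margin threshold. The following is a short sketch of that arithmetic using only the standard library; the helper names `margin_threshold` and `prob_tech` are illustrative and not part of the model's API.

```python
import math

THRESHOLD = 0.8791  # probability threshold from this card


def margin_threshold(p: float) -> float:
    """Logit margin (z_tech - z_other) at which two-class softmax equals p."""
    return math.log(p / (1.0 - p))


def prob_tech(z_other: float, z_tech: float) -> float:
    """Two-class softmax probability of the `tech` class."""
    return 1.0 / (1.0 + math.exp(z_other - z_tech))


m = margin_threshold(THRESHOLD)
print(round(m, 3))  # 1.984

# Thresholding the probability and thresholding the margin give the same decision.
for z_other, z_tech in [(-1.0, 1.5), (0.0, 1.0), (-2.0, 2.0)]:
    assert (prob_tech(z_other, z_tech) >= THRESHOLD) == (z_tech - z_other >= m)
```

In other words, a vacancy is labeled `tech` only when the `tech` logit exceeds the `other` logit by roughly 2.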

	## Architecture

	- Model: `BertForSequenceClassification`
	- Base model: `cointegrated/rubert-tiny2`
	- Layers: 3, hidden size: 312, attention heads: 12
	- Vocab size: 83,828
	- Parameters: ~29M
	- `max_position_embeddings`: 2048
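
These figures imply that most of the parameters sit in the token embedding table rather than in the three Transformer layers. A quick back-of-the-envelope check, using only the numbers listed above (the total is the card's approximate "~29M", so the resulting share is approximate as well):

```python
VOCAB_SIZE = 83_828
HIDDEN_SIZE = 312
MAX_POSITIONS = 2048
TOTAL_PARAMS_APPROX = 29_000_000  # "~29M" as stated in this card

# One HIDDEN_SIZE-dimensional vector per vocabulary entry / per position.
token_embedding_params = VOCAB_SIZE * HIDDEN_SIZE
position_embedding_params = MAX_POSITIONS * HIDDEN_SIZE

print(token_embedding_params)     # 26154336
print(position_embedding_params)  # 638976

share = token_embedding_params / TOTAL_PARAMS_APPROX
print(f"{share:.0%}")             # 90%
```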

	## Training

	- Dataset: internal job-vacancy dataset (`vacancies_labeled.csv`), labeled by an LLM pipeline
	- Train/test split: 85% / 15%, stratified by role and team_lead flag
	- Loss: weighted cross-entropy (`pos_weight` = 2.115)
	- Optimizer: AdamW, `lr=2e-5`, linear warmup 10%, grad clip 1.0
	- Early stopping: patience of 3 epochs, monitoring F1 at the threshold that meets target recall ≥ 0.95
	- Threshold selected to achieve target recall = 0.95
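
One common way to pick a threshold for a target recall is to sort the scores of the positive validation examples and take the highest value that still keeps the required fraction of them at or above it. The sketch below illustrates that idea in plain Python. It is an assumption about the procedure, not the author's training code; the function names and toy data are hypothetical.

```python
import math


def threshold_for_recall(y_true, scores, target_recall=0.95):
    """Highest threshold t such that recall of (score >= t) is >= target_recall."""
    pos_scores = sorted((s for y, s in zip(y_true, scores) if y == 1), reverse=True)
    if not pos_scores:
        raise ValueError("no positive examples")
    # Number of positives that must score at or above the threshold.
    # The small epsilon guards against float round-off in the product.
    k = max(1, math.ceil(target_recall * len(pos_scores) - 1e-9))
    return pos_scores[k - 1]


def precision_recall_at(y_true, scores, threshold):
    """Precision and recall of the rule `score >= threshold`."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.99, 0.97, 0.90, 0.60, 0.95, 0.30, 0.20, 0.10]

t = threshold_for_recall(y_true, scores, target_recall=0.75)
print(t, precision_recall_at(y_true, scores, t))  # 0.9 (0.75, 0.75)
```

Raising the target recall pushes the threshold down, which admits more false positives and lowers precision.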

	## Limitations

	- Trained primarily on Russian-language IT job vacancies; quality on other domains/languages is not guaranteed.
	- Team lead and management roles are treated as `other` by design.
	- Description is truncated to 2000 characters before tokenization.
	- The model groups developers, Data Analysts, and Business Analysts into one positive class; it does not distinguish between them.
	- Recall is lowest for the analyst categories (Data Analyst ~0.92, Business Analyst ~0.90); Data Analyst vacancies with heavy business/finance framing may be missed.

	## Version

	Hub tag: `v2.0-dev-da-ba-r95`

	Changelog vs v1:
	- Added Business Analyst (`Бизнес аналитик`) to positive class
	- Input context extended: `max_length` 256→384, description 1200→2000 chars
	- Precision improved: 0.880 → 0.922
	- `lr` lowered to 2e-5, batch size 32→24 to accommodate longer sequences

	## License

	MIT.