	---
	language:
	- ru
	license: mit
	library_name: coreml
	pipeline_tag: text-classification
	base_model: cointegrated/rubert-tiny
	base_model_relation: finetune
	inference: false
	num_parameters: 11900000
	datasets:
	- russian-oracle/vacancy-section-classifier-ru
	tags:
	- coreml
	- core-ml
	- text-classification
	- russian
	- rubert
	- rubert-tiny
	- bert
	- ane
	- apple-neural-engine
	- apple-silicon
	- on-device
	- vacancy
	- hr
	- job-postings
	- sequence-classification
	model-index:
	- name: rubert-tiny-vacancy-section-classifier-coreml
	  results:
	  - task:
	      type: text-classification
	      name: Text Classification
	    dataset:
	      name: Vacancy Section Classifier Dataset (RU) — golden-281 (human-labeled)
	      type: russian-oracle/vacancy-section-classifier-ru
	      split: test
	    metrics:
	    - type: accuracy
	      value: 0.765
	      name: Content Accuracy (4 classes, golden-281)
	    - type: accuracy
	      value: 0.687
	      name: 5-class Accuracy (incl. junk, golden-281)
	    - type: recall
	      value: 0.333
	      name: Junk Recall (golden-281)
	  - task:
	      type: text-classification
	      name: Text Classification
	    dataset:
	      name: Vacancy Section Classifier Dataset (RU) — in-domain test split
	      type: russian-oracle/vacancy-section-classifier-ru
	      split: test
	    metrics:
	    - type: accuracy
	      value: 0.893
	      name: Accuracy (in-domain)
	    - type: f1
	      value: 0.869
	      name: Macro F1 (in-domain)
	    - type: f1
	      value: 0.789
	      name: F1 responsibilities
	    - type: f1
	      value: 0.760
	      name: F1 requirements
	    - type: f1
	      value: 0.795
	      name: F1 terms
	    - type: f1
	      value: 0.684
	      name: F1 notes
	    - type: f1
	      value: 0.374
	      name: F1 junk
	---

	# rubert-tiny · Vacancy Section Classifier · CoreML

	On-device CoreML (Apple Neural Engine) classifier that labels fragments of
	Russian-language job postings into 5 structural sections. Built on
	[`cointegrated/rubert-tiny`](https://huggingface.co/cointegrated/rubert-tiny)
	(11.9M params), exported to a `float16` `.mlpackage` for Apple Silicon.

	> 🇬🇧 English card below · 🇷🇺 Русская версия ниже ([перейти](#-русская-версия))

	---

	## 🇬🇧 English

	### What it does

	Given one fragment of a Russian vacancy description, the model predicts which of
	5 sections it belongs to:

	| id | label              | meaning                                                |
	|----|--------------------|--------------------------------------------------------|
	| 0  | `responsibilities` | what the employee will do (задачи / обязанности)       |
	| 1  | `requirements`     | what the candidate must have (требования / навыки)     |
	| 2  | `terms`            | conditions of employment (условия / зарплата / ДМС)    |
	| 3  | `notes`            | meta / "about the company" / soft boilerplate          |
	| 4  | `junk`             | non-informative noise (routed out of structured data)  |

	It is the structured-extraction stage of an HH.ru vacancy-scouting pipeline,
	where it replaced a heavier Qwen-embedding + cosine + rerank approach and
	runs at ~1–10 ms per vacancy on Apple Silicon.

	### Artifact

	This repository ships the CoreML artifact only (no PyTorch weights):

	- `section_classifier.mlpackage` — `float16`, `ComputeUnit.ALL` (ANE-eligible),
	minimum deployment target macOS 13.
	- `tokenizer.json`, `tokenizer_config.json` — the matching BERT WordPiece
	tokenizer (vocab 29 564). Required — the `.mlpackage` consumes token ids,
	not raw text.

	#### CoreML I/O signature

	| name             | dtype   | shape    | notes                           |
	|------------------|---------|----------|---------------------------------|
	| `input_ids`      | int32   | [1, 128] | padded to `max_length=128`      |
	| `attention_mask` | int32   | [1, 128] | 1 = real token, 0 = pad         |
	| `token_type_ids` | int32   | [1, 128] | all zeros (single segment)      |
	| output `logits`  | float32 | [1, 5]   | un-normalized; `argmax` → class |

	`max_seq_len = 128` and the label names are embedded in the model's
	`user_defined_metadata`.
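
To make the signature concrete, here is a minimal sketch that packs a list of token ids into the three `[1, 128]` `int32` arrays from the table. `pack_inputs` is a hypothetical helper, not part of this repository, and `pad_id=0` is an assumption (the usual `[PAD]` id in BERT vocabularies). In practice the shipped tokenizer produces the same layout with `padding="max_length"`.

```python
import numpy as np

MAX_LEN = 128  # fixed sequence length from the I/O table


def pack_inputs(token_ids, pad_id=0, max_len=MAX_LEN):
    """Truncate/pad token ids into the three [1, max_len] int32 arrays.

    Hypothetical helper; pad_id=0 is an assumption about the vocabulary.
    Truncation here is naive: unlike the real tokenizer, it does not
    preserve a trailing [SEP] token.
    """
    ids = list(token_ids)[:max_len]
    n_real = len(ids)
    ids = ids + [pad_id] * (max_len - n_real)  # right-pad to max_len

    input_ids = np.asarray(ids, dtype=np.int32).reshape(1, max_len)
    attention_mask = np.zeros((1, max_len), dtype=np.int32)
    attention_mask[0, :n_real] = 1  # 1 = real token, 0 = pad
    token_type_ids = np.zeros((1, max_len), dtype=np.int32)  # single segment
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


features = pack_inputs([2, 10, 11, 3])  # made-up ids, for shape checking only
print({name: (arr.shape, arr.dtype) for name, arr in features.items()})
```

The returned dictionary has the key names the model expects, so it can be passed to `predict` directly.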

	### Metrics

	The numbers below were measured on the source PyTorch model. The CoreML
	export was then verified at 100% argmax parity against that source on a
	held-out set of probe texts (max absolute logit difference `0.0026`, expected
	for `float16`), so they carry over to this artifact.
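
A parity check of this kind can be sketched as follows. This is an illustration under assumptions, not the script used for this model: `parity_report` is a hypothetical name, and the logits are toy values standing in for the PyTorch and CoreML outputs on the probe texts.

```python
import numpy as np


def parity_report(ref_logits, test_logits):
    """Compare two [N, C] logit arrays from a reference and an exported model."""
    ref = np.asarray(ref_logits, dtype=np.float64)
    test = np.asarray(test_logits, dtype=np.float64)
    if ref.shape != test.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {test.shape}")
    # Fraction of rows where both models pick the same class.
    argmax_parity = float(np.mean(ref.argmax(axis=1) == test.argmax(axis=1)))
    # Largest element-wise deviation between the two sets of logits.
    max_abs_diff = float(np.max(np.abs(ref - test)))
    return argmax_parity, max_abs_diff


# Toy logits, made up for illustration (not the real probe set).
ref = np.array([[2.0, 0.1, -1.0, 0.0, -2.0],
                [0.2, 3.1, 0.0, -0.5, -1.0]])
noise = np.array([[0.002, -0.001, 0.0, 0.001, 0.0],
                  [-0.002, 0.001, 0.0, 0.0, 0.002]])
parity, diff = parity_report(ref, ref + noise)
print(f"argmax parity {parity:.0%}, max abs logit diff {diff:.4f}")
```

Argmax parity is the figure that matters for classification. The logit difference only indicates how much headroom there is before a near-tie could flip.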

	Headline — golden-281 (human-labeled, held-out):

	| metric                                     | value           |
	|--------------------------------------------|-----------------|
	| Content accuracy (4 meaningful classes)    | 76.5% (176/230) |
	| Full 5-class accuracy (incl. junk routing) | 68.7%           |
	| Junk recall (noise correctly routed out)   | 33.3% (17/51)   |

	This is the metric to trust: 281 fragments labeled by a human, never seen in
	training.
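
The three headline figures are consistent with each other, as the arithmetic below shows. It assumes that a content fragment counts as correct in the 5-class evaluation exactly when it is correct in the 4-class one, so the 5-class numerator is the sum of the two reported counts.

```python
# Counts reported in the golden-281 table.
content_correct, content_total = 176, 230  # 4 meaningful classes
junk_correct, junk_total = 17, 51          # junk fragments

total = content_total + junk_total                        # 230 + 51 = 281
content_acc = content_correct / content_total             # 176 / 230
junk_recall = junk_correct / junk_total                   # 17 / 51
five_class_acc = (content_correct + junk_correct) / total  # 193 / 281

print(f"fragments: {total}")
print(f"content accuracy: {content_acc:.3f}")
print(f"junk recall:      {junk_recall:.3f}")
print(f"5-class accuracy: {five_class_acc:.3f}")
```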

	<details>
	<summary>In-domain test split (circular — NOT the headline)</summary>

	Evaluated on the internal test split, whose labels come from the same Claude
	Opus relabeling as the training data, so it overstates real-world
	performance. Reported for monitoring only:

	| metric   | value |
	|----------|-------|
	| Accuracy | 89.3% |
	| Macro-F1 | 86.9% |

	Per-class F1 (in-domain): responsibilities 0.789 · requirements 0.760 ·
	terms 0.795 · notes 0.684 · junk 0.374.
	</details>
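
Macro-F1 is conventionally the unweighted mean of the per-class F1 scores. The check below applies that definition to the five in-domain per-class values listed above. It yields about 0.680 rather than the 0.869 in the table, so the per-class figures and the macro figure may not come from the same evaluation run. Treat the pairing as unverified.

```python
# Per-class F1 values as listed in the in-domain section.
per_class_f1 = {
    "responsibilities": 0.789,
    "requirements": 0.760,
    "terms": 0.795,
    "notes": 0.684,
    "junk": 0.374,
}

# Macro-F1: every class counts equally, regardless of its frequency.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"macro-F1 from per-class values: {macro_f1:.3f}")
```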

	### Usage (Python · coremltools)

	```python
	import numpy as np
	import coremltools as ct
	from transformers import AutoTokenizer

	REPO = "russian-oracle/rubert-tiny-vacancy-section-classifier-coreml"
	LABELS = ["responsibilities", "requirements", "terms", "notes", "junk"]

	tok = AutoTokenizer.from_pretrained(REPO) # tokenizer.json shipped here
	mlmodel = ct.models.MLModel("section_classifier.mlpackage") # hf download ... locally

	text = "Опыт работы с Python от 3 лет, знание Django и PostgreSQL."
	enc = tok(text, return_tensors="np", padding="max_length",
	          truncation=True, max_length=128)
	ids = enc["input_ids"].astype(np.int32)
	out = mlmodel.predict({
	    "input_ids": ids,
	    "attention_mask": enc["attention_mask"].astype(np.int32),
	    "token_type_ids": enc.get("token_type_ids", np.zeros_like(ids)).astype(np.int32),
	})
	logits = np.asarray(out["logits"]).reshape(-1)
	print(LABELS[int(logits.argmax())]) # → requirements
	# probabilities: softmax(logits)
	```

	> Prediction through `coremltools` runs only on macOS and needs the package's
	> native bindings, which ship only in wheels for certain CPython versions
	> (a 3.12 wheel works reliably). Run prediction under such an interpreter.
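
The `softmax(logits)` step mentioned in the code comment can be written as a small, numerically stable function. The logits below are made-up values for illustration; real ones come from `out["logits"]`.

```python
import numpy as np

LABELS = ["responsibilities", "requirements", "terms", "notes", "junk"]


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max(axis=-1, keepdims=True)  # shift so the largest logit is 0
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Made-up logits for one fragment.
logits = np.array([-0.4, 2.9, 0.3, -1.1, -2.0])
probs = softmax(logits)

# Print classes from most to least probable.
for label, p in sorted(zip(LABELS, probs), key=lambda pair: -pair[1]):
    print(f"{label:>16}: {p:.3f}")
```

Softmax does not change the argmax, so it is only needed when a confidence value is wanted alongside the label.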

	Recommended aggregation (how it is used in production): split a full
	description into sentence-level chunks (e.g. `razdel` + newline), classify each,
	take the majority label per chunk; `junk` fragments are routed to an "orphans"
	bucket instead of the structured output.
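
The sketch below shows one reading of this aggregation: each newline-delimited chunk is split into sentences, every sentence is classified, and the chunk takes the majority label. Several parts are assumptions. A naive regex stands in for `razdel`, `chunk_then_vote` is a hypothetical name, and `toy_classify` is a keyword rule that only makes the example runnable. In real use `classify` would wrap the tokenizer and `mlmodel.predict` call shown above.

```python
import re
from collections import Counter, defaultdict

# Naive sentence boundary: whitespace after ., !, ? or an ellipsis.
# This is a stand-in for razdel, not an equivalent.
SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def chunk_then_vote(description, classify):
    """Label each newline-delimited chunk by majority vote over its sentences.

    `classify` is any callable mapping a sentence to one of the five labels.
    Returns (sections, orphans): chunks grouped by label, and junk chunks.
    """
    sections, orphans = defaultdict(list), []
    for line in description.splitlines():
        chunk = line.strip()
        sentences = [s for s in SENTENCE_END.split(chunk) if s]
        if not sentences:
            continue  # skip blank lines
        votes = Counter(classify(s) for s in sentences)
        label = votes.most_common(1)[0][0]  # ties go to the first-seen label
        (orphans if label == "junk" else sections[label]).append(chunk)
    return dict(sections), orphans


def toy_classify(sentence):
    """Keyword rule used only to make the sketch runnable; it is not the model."""
    text = sentence.lower()
    if "опыт" in text:
        return "requirements"
    if "дмс" in text or "зарплата" in text:
        return "terms"
    if "разраб" in text:
        return "responsibilities"
    return "junk"


description = "\n".join([
    "Разработка backend-сервисов. Разработка API для партнёров.",
    "Опыт работы с Python от 3 лет. Опыт с Django. Зарплата обсуждается.",
    "ДМС и гибкий график.",
    "***",
])
sections, orphans = chunk_then_vote(description, toy_classify)
for label, chunks in sections.items():
    print(label, "->", chunks)
print("orphans ->", orphans)
```

Voting per chunk smooths over single-sentence mistakes, at the cost of assigning a mixed chunk to one section only.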

	### Usage (Swift · sketch)

	```swift
	import CoreML

	// url must point at the compiled model (.mlmodelc), e.g. the result of
	// MLModel.compileModel(at:) on section_classifier.mlpackage.
	let model = try MLModel(contentsOf: url)
	// Provide three [1,128] MLMultiArray(.int32) inputs: input_ids, attention_mask,
	// token_type_ids — produced by a BERT WordPiece tokenizer over the input text.
	// Output "logits" is [1,5]; argmax over the last axis gives the class id.
	```

	### Training

	- Base: `cointegrated/rubert-tiny` (BERT, 312 hidden, 3 layers, vocab 29 564).
	- Lineage: multi-stage fine-tune — rubert-tiny → intermediate extractor →
	4-class → 5-class → 5-class "rechunked" (this model). Warm-started from the
	previous 5-class checkpoint.
	- Data: ~12–13k fragments of Russian IT vacancies, **relabeled by Claude
	Opus** (silver → distilled), re-chunked with a `razdel` sentence splitter +
	newline boundaries.
	- Objective: class-weighted cross-entropy (balanced inverse-frequency) to
	counter section imbalance.
	- Schedule: up to 8 epochs with early stopping (patience 3, best ≈ epoch 3),
	batch 32, lr 3e-5, weight decay 0.01, warmup ratio 0.1, linear decay, seed 42,
	`max_length` 128, trained on Apple MPS in fp32.
	- Export: coremltools 9.0, `compute_precision=FLOAT16`,
	`compute_units=ALL`, `position_ids` baked as a constant buffer to work around a
	const-fold limitation; verified at 100% argmax parity with the PyTorch source.
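
The class-weighted objective from the list above can be illustrated with plain NumPy. This is a sketch under assumptions, not the training code. It takes "balanced inverse-frequency" to mean the common `w_c = N / (K * n_c)` heuristic, normalises the loss by the sum of the sample weights (as `torch.nn.CrossEntropyLoss` does when given `weight`), and uses made-up class counts.

```python
import numpy as np


def balanced_class_weights(labels, num_classes):
    """w_c = N / (K * n_c): rarer classes get proportionally larger weights."""
    counts = np.bincount(np.asarray(labels), minlength=num_classes).astype(np.float64)
    if np.any(counts == 0):
        raise ValueError("every class needs at least one example")
    return counts.sum() / (num_classes * counts)


def weighted_cross_entropy(logits, labels, weights):
    """Class-weighted cross-entropy, averaged over the sum of sample weights."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    shifted = logits - logits.max(axis=1, keepdims=True)  # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    w = weights[labels]  # one weight per sample, chosen by its true class
    return float((w * nll).sum() / w.sum())


# Made-up label distribution over the 5 classes (not the real dataset).
labels = np.array([0] * 40 + [1] * 30 + [2] * 20 + [3] * 8 + [4] * 2)
weights = balanced_class_weights(labels, num_classes=5)
print("weights:", np.round(weights, 3))

# With all-zero logits every class has probability 1/5, so the loss is ln(5).
uniform_loss = weighted_cross_entropy(np.zeros((len(labels), 5)), labels, weights)
print(f"loss at uniform predictions: {uniform_loss:.4f}")
```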

	### Limitations & bias

	- Junk recall is low (33.3%). The model often keeps noise rather than
	dropping it; `notes` ↔ `junk` is the hardest boundary (junk F1 0.374). Add a
	downstream filter if clean routing matters.
	- Domain: trained on Russian IT vacancies. Other industries, other
	languages, or non-vacancy text are out of distribution.
	- Granularity: classifies a single fragment, not a whole posting. Use the
	chunk-then-vote pattern above for full descriptions.
	- Sequence length: fixed at 128 tokens; longer fragments are truncated.
	- Labels are distilled from an LLM (Claude Opus), so they inherit its biases.
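
One simple form the downstream filter mentioned in the first bullet could take is a probability threshold on top of the softmax output. The sketch below is hypothetical: `filtered_label` and both threshold values are placeholders, not tuned settings from this project, and would need calibrating on human-labeled data.

```python
import numpy as np

LABELS = ["responsibilities", "requirements", "terms", "notes", "junk"]
JUNK = LABELS.index("junk")


def filtered_label(probs, junk_threshold=0.30, min_confidence=0.50):
    """Return a label, sending junk-leaning or uncertain fragments to 'junk'.

    Both thresholds are hypothetical placeholders, not tuned values.
    """
    probs = np.asarray(probs, dtype=np.float64)
    best = int(probs.argmax())
    if probs[JUNK] >= junk_threshold or probs[best] < min_confidence:
        return "junk"
    return LABELS[best]


# Made-up probability vectors for three fragments.
print(filtered_label([0.05, 0.80, 0.05, 0.05, 0.05]))  # confident content class
print(filtered_label([0.10, 0.10, 0.10, 0.35, 0.35]))  # junk probability too high
print(filtered_label([0.30, 0.30, 0.20, 0.15, 0.05]))  # no class is confident
```

Lowering the thresholds raises junk recall but also discards more genuine content, so the trade-off should be measured rather than assumed.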

	### License

	MIT — same as the base model `cointegrated/rubert-tiny`.

	### Citation

	```bibtex
	@misc{rubert_tiny_vacancy_section_classifier_coreml,
	title = {rubert-tiny Vacancy Section Classifier (CoreML)},
	author = {russian-oracle},
	year = {2026},
	note = {Fine-tuned from cointegrated/rubert-tiny; CoreML/ANE export},
	url = {https://huggingface.co/russian-oracle/rubert-tiny-vacancy-section-classifier-coreml}
	}
	```

	Base model:

	```bibtex
	@misc{dale2021rubert_tiny,
	title = {rubert-tiny},
	author = {Dale, David (cointegrated)},
	url = {https://huggingface.co/cointegrated/rubert-tiny}
	}
	```

	---

	## 🇷🇺 Русская версия

	### Что делает

	По одному фрагменту русскоязычного описания вакансии модель предсказывает, к
	какой из 5 структурных секций он относится:

	| id | метка              | смысл                                              |
	|----|--------------------|----------------------------------------------------|
	| 0  | `responsibilities` | что сотрудник будет делать (задачи / обязанности)  |
	| 1  | `requirements`     | что требуется от кандидата (требования / навыки)   |
	| 2  | `terms`            | условия работы (зарплата / ДМС / график)           |
	| 3  | `notes`            | мета / «о компании» / мягкий boilerplate           |
	| 4  | `junk`             | неинформативный шум (выводится из структуры)       |

	Это этап структурной разметки в пайплайне скаутинга вакансий HH.ru, где модель
	заменила более тяжёлую связку Qwen-эмбеддинги + косинус + reranking — при
	~1–10 мс на вакансию на Apple Silicon.

	### Артефакт

	В репозитории — только CoreML-артефакт (без PyTorch-весов):

	- `section_classifier.mlpackage` — `float16`, `ComputeUnit.ALL` (с поддержкой
	ANE), минимальная цель развёртывания macOS 13.
	- `tokenizer.json`, `tokenizer_config.json` — соответствующий BERT WordPiece
	токенайзер (словарь 29 564). Обязателен — `.mlpackage` принимает id
	токенов, а не сырой текст.

	#### Сигнатура входов/выходов CoreML

	| имя              | тип     | форма    | примечание                         |
	|------------------|---------|----------|------------------------------------|
	| `input_ids`      | int32   | [1, 128] | паддинг до `max_length=128`        |
	| `attention_mask` | int32   | [1, 128] | 1 — реальный токен, 0 — паддинг    |
	| `token_type_ids` | int32   | [1, 128] | все нули (один сегмент)            |
	| выход `logits`   | float32 | [1, 5]   | без нормализации; `argmax` → класс |

	`max_seq_len = 128` и имена классов зашиты в `user_defined_metadata` модели.

	### Метрики

	Цифры ниже измерены на исходной PyTorch-модели. CoreML-экспорт затем
	проверен на 100% совпадение argmax с этим источником на отложенном наборе
	проб (макс. абс. разница логитов `0.0026`, что нормально для `float16`), поэтому
	они переносятся на этот артефакт.

	Headline — golden-281 (ручная разметка, held-out):

	| метрика                                        | значение        |
	|------------------------------------------------|-----------------|
	| Content-accuracy (4 содержательных класса)     | 76.5% (176/230) |
	| Полная 5-class accuracy (включая роутинг junk) | 68.7%           |
	| Junk recall (корректно отсеянный шум)          | 33.3% (17/51)   |

	Это и есть метрика, которой стоит доверять: 281 фрагмент, размеченный человеком
	и не виденный при обучении.

	<details>
	<summary>Внутренний test-split (циркулярный — НЕ headline)</summary>

	Оценка на внутреннем тестовом сплите, у которого то же Claude Opus-распределение
	разметки, что и у обучающих данных, поэтому он завышает реальное качество.
	Приведён только для мониторинга:

	| метрика  | значение |
	|----------|----------|
	| Accuracy | 89.3%    |
	| Macro-F1 | 86.9%    |

	Per-class F1 (in-domain): responsibilities 0.789 · requirements 0.760 ·
	terms 0.795 · notes 0.684 · junk 0.374.
	</details>

	### Использование (Python · coremltools)

	```python
	import numpy as np
	import coremltools as ct
	from transformers import AutoTokenizer

	REPO = "russian-oracle/rubert-tiny-vacancy-section-classifier-coreml"
	LABELS = ["responsibilities", "requirements", "terms", "notes", "junk"]

	tok = AutoTokenizer.from_pretrained(REPO) # tokenizer.json в этом репо
	mlmodel = ct.models.MLModel("section_classifier.mlpackage") # hf download ... локально

	text = "Опыт работы с Python от 3 лет, знание Django и PostgreSQL."
	enc = tok(text, return_tensors="np", padding="max_length",
	          truncation=True, max_length=128)
	ids = enc["input_ids"].astype(np.int32)
	out = mlmodel.predict({
	    "input_ids": ids,
	    "attention_mask": enc["attention_mask"].astype(np.int32),
	    "token_type_ids": enc.get("token_type_ids", np.zeros_like(ids)).astype(np.int32),
	})
	logits = np.asarray(out["logits"]).reshape(-1)
	print(LABELS[int(logits.argmax())]) # → requirements
	# вероятности: softmax(logits)
	```

	> `coremltools` требует нативных биндингов, которые есть только в части сборок
	> CPython (надёжно работает wheel под 3.12). Запускайте предсказание под таким
	> интерпретатором.

	Рекомендуемая агрегация (как используется в продакшене): разбейте полное
	описание на фрагменты по предложениям (например, `razdel` + переводы строк),
	классифицируйте каждый, возьмите мажоритарную метку на чанк; фрагменты `junk`
	отправляются в корзину «orphans», а не в структурированный вывод.

	### Обучение

	- База: `cointegrated/rubert-tiny` (BERT, 312 hidden, 3 слоя, словарь 29 564).
	- Происхождение: многоступенчатый файн-тюн — rubert-tiny → промежуточный
	extractor → 4-class → 5-class → 5-class «rechunked» (эта модель).
	Warm-start с предыдущего 5-class чекпойнта.
	- Данные: ~12–13 тыс. фрагментов русских IT-вакансий, **переразмечены
	Claude Opus** (silver → дистилляция), перечанкованы сплиттером `razdel` по
	предложениям + границам строк.
	- Лосс: взвешенная по классам кросс-энтропия (balanced inverse-frequency)
	против дисбаланса секций.
	- Расписание: 8 эпох с ранней остановкой (patience 3, лучшая ≈ эпоха 3),
	batch 32, lr 3e-5, weight decay 0.01, warmup ratio 0.1, линейный спад, seed 42,
	`max_length` 128, обучение на Apple MPS в fp32.
	- Экспорт: coremltools 9.0, `compute_precision=FLOAT16`,
	`compute_units=ALL`, `position_ids` зашит как константный буфер (обход
	ограничения const-fold); проверено на 100% argmax-parity с PyTorch-источником.

	### Ограничения и смещения

	- Низкий junk recall (33.3%). Модель чаще оставляет шум, чем отсеивает его;
	граница `notes` ↔ `junk` — самая сложная (junk F1 0.374). Если важен чистый
	роутинг — добавьте downstream-фильтр.
	- Домен: обучена на русских IT-вакансиях. Другие отрасли, языки или
	не-вакансионный текст — вне распределения.
	- Гранулярность: классифицирует отдельный фрагмент, а не вакансию целиком.
	Для полных описаний используйте схему chunk-then-vote выше.
	- Длина последовательности: фиксированные 128 токенов; длиннее — обрезается.
	- Метки дистиллированы из LLM (Claude Opus) и наследуют её смещения.

	### Лицензия

	MIT — как и у базовой модели `cointegrated/rubert-tiny`.

	### Цитирование

	См. BibTeX в английской секции выше.