rubert-tiny · Vacancy Section Classifier · CoreML

On-device CoreML (Apple Neural Engine) classifier that labels fragments of Russian-language job postings into 5 structural sections. Built on cointegrated/rubert-tiny (11.9M params), exported to a float16 .mlpackage for Apple Silicon.

🇬🇧 English

What it does

Given one fragment of a Russian vacancy description, the model predicts which of 5 sections it belongs to:

id	label	meaning
0	`responsibilities`	what the employee will do (задачи / обязанности)
1	`requirements`	what the candidate must have (требования / навыки)
2	`terms`	conditions of employment (условия / зарплата / ДМС)
3	`notes`	meta / "about the company" / soft boilerplate
4	`junk`	non-informative noise (routed out of structured data)

It is the structured-extraction stage of an HH.ru vacancy-scouting pipeline, where it replaced a heavier Qwen-embedding + cosine-similarity + rerank approach. Inference takes ~1–10 ms per vacancy on Apple Silicon.

Artifact

This repository ships the CoreML artifact only (no PyTorch weights):

section_classifier.mlpackage — float16, ComputeUnit.ALL (ANE-eligible), minimum deployment target macOS 13.
tokenizer.json, tokenizer_config.json — the matching BERT WordPiece tokenizer (vocab 29 564). Required — the .mlpackage consumes token ids, not raw text.

CoreML I/O signature

name	dtype	shape	notes
`input_ids`	int32	[1, 128]	padded to `max_length=128`
`attention_mask`	int32	[1, 128]	1 = real token, 0 = pad
`token_type_ids`	int32	[1, 128]	all zeros (single segment)
output `logits`	float32	[1, 5]	un-normalized; `argmax` → class

max_seq_len = 128 and the label names are embedded in the model's user_defined_metadata.
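
To make the signature concrete, the sketch below builds the three `[1, 128]` inputs from a list of token ids using plain Python lists. The helper name `build_inputs`, the pad id 0 (the usual BERT `[PAD]` id) and the example ids are assumptions for illustration. In practice the shipped tokenizer produces these arrays directly, and unlike this simplified version it keeps the closing `[SEP]` token when truncating.

```python
MAX_LEN = 128  # fixed sequence length from the I/O signature


def build_inputs(token_ids, max_len=MAX_LEN, pad_id=0):
    """Pad or truncate one sequence into the three [1, max_len] model inputs.

    pad_id=0 is an assumption (the usual BERT [PAD] id); check the tokenizer.
    """
    ids = list(token_ids)[:max_len]            # truncate overly long fragments
    n_real = len(ids)
    n_pad = max_len - n_real
    return {
        "input_ids": [ids + [pad_id] * n_pad],
        "attention_mask": [[1] * n_real + [0] * n_pad],  # 1 = real, 0 = pad
        "token_type_ids": [[0] * max_len],               # single segment
    }


# Hypothetical token ids, not real output of the shipped tokenizer.
example = build_inputs([2, 1045, 873, 3])
print(len(example["input_ids"][0]), sum(example["attention_mask"][0]))  # 128 4
```

Before calling `predict`, each nested list would still need converting to an int32 array, as the usage example in this card does with `astype(np.int32)`.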

Metrics

The numbers below were measured on the source PyTorch model. The CoreML export was then verified at 100% argmax parity against that source on a held-out set of probe texts (max absolute logit difference 0.0026, which is expected for float16), so the metrics carry over to this artifact.
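
A parity check of this kind can be reproduced along the following lines. This is a minimal sketch with made-up logits; the function name `parity_report` and the sample values are illustrative assumptions, not the author's verification script.

```python
def parity_report(ref_logits, test_logits):
    """Compare two lists of per-sample logit vectors.

    Returns (argmax agreement rate, max absolute logit difference).
    """
    if len(ref_logits) != len(test_logits):
        raise ValueError("both models must be run on the same probe texts")

    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    agree = sum(argmax(r) == argmax(t) for r, t in zip(ref_logits, test_logits))
    max_diff = max(
        abs(a - b)
        for r, t in zip(ref_logits, test_logits)
        for a, b in zip(r, t)
    )
    return agree / len(ref_logits), max_diff


# Made-up logits standing in for PyTorch (fp32) and CoreML (fp16) outputs.
pytorch_logits = [[2.10, -0.30, 0.50, -1.00, -2.00],
                  [-0.70, 3.20, 0.10, -0.40, -1.50]]
coreml_logits = [[2.101, -0.299, 0.502, -1.001, -2.000],
                 [-0.702, 3.199, 0.100, -0.401, -1.498]]

rate, diff = parity_report(pytorch_logits, coreml_logits)
print(f"argmax parity: {rate:.0%}, max |diff|: {diff:.4f}")
```

Argmax parity is the property that matters for classification metrics: small float16 drift in the logits is harmless as long as the winning class does not change.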

Headline — golden-281 (human-labeled, held-out):

metric	value
Content accuracy (4 meaningful classes)	76.5% (176/230)
Full 5-class accuracy (incl. junk routing)	68.7%
Junk recall (noise correctly routed out)	33.3% (17/51)

This is the metric to trust: 281 fragments labeled by a human, never seen in training.
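
The three headline numbers are mutually consistent, which the short check below confirms. It assumes that the 5-class accuracy counts the 176 correct content fragments plus the 17 correctly routed junk fragments over all 281 fragments; that breakdown is inferred from the reported counts rather than stated explicitly.

```python
# Counts reported in the golden-281 table.
content_correct, content_total = 176, 230
junk_correct, junk_total = 17, 51

total = content_total + junk_total                 # 281 fragments
content_acc = content_correct / content_total
junk_recall = junk_correct / junk_total
# Assumption: overall accuracy = all correct predictions / all fragments.
full_acc = (content_correct + junk_correct) / total

print(total, f"{content_acc:.1%} {junk_recall:.1%} {full_acc:.1%}")
# prints: 281 76.5% 33.3% 68.7%
```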

In-domain test split (circular — NOT the headline)

Evaluated on the internal test split, whose labels come from the same Claude Opus relabeling as the training data, so it overstates real-world performance. Reported for monitoring only:

metric	value
Accuracy	89.3%
Macro-F1	86.9%

Per-class F1 (in-domain): responsibilities 0.789 · requirements 0.760 · terms 0.795 · notes 0.684 · junk 0.374.

Usage (Python · coremltools)

import numpy as np
import coremltools as ct
from transformers import AutoTokenizer

REPO = "russian-oracle/rubert-tiny-vacancy-section-classifier-coreml"
LABELS = ["responsibilities", "requirements", "terms", "notes", "junk"]

tok = AutoTokenizer.from_pretrained(REPO)                  # tokenizer.json shipped here
mlmodel = ct.models.MLModel("section_classifier.mlpackage")  # hf download ... locally

text = "Опыт работы с Python от 3 лет, знание Django и PostgreSQL."
enc = tok(text, return_tensors="np", padding="max_length",
          truncation=True, max_length=128)
ids = enc["input_ids"].astype(np.int32)
out = mlmodel.predict({
    "input_ids": ids,
    "attention_mask": enc["attention_mask"].astype(np.int32),
    "token_type_ids": enc.get("token_type_ids", np.zeros_like(ids)).astype(np.int32),
})
logits = np.asarray(out["logits"]).reshape(-1)
print(LABELS[int(logits.argmax())])        # → requirements
# probabilities: softmax(logits)

coremltools needs its native bindings, which ship only with certain CPython builds (a 3.12 wheel works reliably), and its `predict` call runs only on macOS. Run prediction under such an interpreter.
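
The `softmax(logits)` comment in the Python snippet above can be expanded as follows. This is a dependency-free sketch; the example logits are invented, and only the label order comes from this card.

```python
import math

LABELS = ["responsibilities", "requirements", "terms", "notes", "junk"]


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)                          # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Invented logits for illustration; real values come from out["logits"].
label, p = top_label([-0.4, 3.1, 0.2, -1.0, -2.2])
print(label, round(p, 3))
```

The probability of the top class gives a rough confidence score, which can be useful for deciding when to fall back to manual review.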

Recommended aggregation (how it is used in production): split a full description into sentence-level chunks (e.g. razdel + newline), classify each, take the majority label per chunk; junk fragments are routed to an "orphans" bucket instead of the structured output.
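
One way to read this recipe is sketched below: newline-delimited chunks are split into sentences, each sentence is classified, the chunk takes the majority label, and `junk` chunks go to an orphans list. The regex splitter stands in for razdel, `toy_classify` is a hypothetical keyword stub replacing the CoreML call, and treating a chunk as a newline-delimited block is an interpretation rather than the confirmed production logic.

```python
import re
from collections import Counter, defaultdict


def split_sentences(chunk):
    """Crude stand-in for a razdel-style sentence splitter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]


def toy_classify(sentence):
    """Hypothetical keyword stub; replace with the CoreML argmax prediction."""
    s = sentence.lower()
    if "опыт" in s or "знание" in s:
        return "requirements"
    if "зарплата" in s or "дмс" in s:
        return "terms"
    if "разраб" in s:
        return "responsibilities"
    return "junk"


def aggregate(description, classify=toy_classify):
    """Group chunks by majority label; junk chunks go to the orphans list."""
    sections, orphans = defaultdict(list), []
    for chunk in filter(None, (c.strip() for c in description.split("\n"))):
        votes = Counter(classify(s) for s in split_sentences(chunk))
        label = votes.most_common(1)[0][0]   # on ties, the first label seen wins
        (orphans if label == "junk" else sections[label]).append(chunk)
    return dict(sections), orphans


text = "Разработка сервисов на Python.\nОпыт от 3 лет. Знание Django.\nЗвоните нам!"
sections, orphans = aggregate(text)
print(sections)
print(orphans)
```

Passing the classifier as a parameter keeps the voting logic testable without a CoreML runtime.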

Usage (Swift · sketch)

import CoreML

// `url` must point to the compiled model (.mlmodelc). Xcode compiles
// section_classifier.mlpackage at build time; MLModel.compileModel(at:)
// can do the same at runtime.
let model = try MLModel(contentsOf: url)
// Provide three [1,128] MLMultiArray inputs of type .int32 (input_ids,
// attention_mask, token_type_ids), produced by a BERT WordPiece tokenizer
// over the input text.
// Output "logits" is [1,5]; argmax over the last axis gives the class id.

Training

Base: cointegrated/rubert-tiny (BERT, 312 hidden, 3 layers, vocab 29 564).
Lineage: multi-stage fine-tune — rubert-tiny → intermediate extractor → 4-class → 5-class → 5-class "rechunked" (this model). Warm-started from the previous 5-class checkpoint.
Data: ~12–13k fragments of Russian IT vacancies, relabeled by Claude Opus (silver → distilled), re-chunked with a razdel sentence splitter + newline boundaries.
Objective: class-weighted cross-entropy (balanced inverse-frequency) to counter section imbalance.
Schedule: 8 epochs with early stopping (patience 3, best ≈ epoch 3), batch 32, lr 3e-5, weight decay 0.01, warmup ratio 0.1, linear decay, seed 42, max_length 128, trained on Apple MPS in fp32.
Export: coremltools 9.0, compute_precision=FLOAT16, compute_units=ALL, position_ids baked as a constant buffer to work around a const-fold limitation; verified at 100% argmax parity with the PyTorch source.
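
Two of the settings above can be made concrete. The sketch below computes balanced inverse-frequency class weights as `N / (K * n_c)` and a linear-warmup, linear-decay learning rate with warmup ratio 0.1 and peak 3e-5. The weight formula is the common "balanced" convention and is assumed here, as are the per-class counts and the total step count, which this card does not report.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights w_c = N / (K * n_c); rare classes weigh more."""
    total, k = sum(counts.values()), len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


def lr_at(step, total_steps, peak_lr=3e-5, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical label counts (the card gives only the ~12-13k total).
counts = {"responsibilities": 4000, "requirements": 4000, "terms": 2500,
          "notes": 1000, "junk": 500}
weights = balanced_class_weights(counts)
print({k: round(v, 2) for k, v in weights.items()})

total_steps = 1000  # hypothetical; depends on dataset size, batch 32 and epochs
print([lr_at(s, total_steps) for s in (0, 50, 100, 550, 1000)])
```

With these illustrative counts the rarest class (`junk`) receives eight times the weight of the most frequent ones, which is the intended effect of weighting against section imbalance.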

Limitations & bias

Junk recall is low (33.3%). The model often keeps noise rather than dropping it; notes ↔ junk is the hardest boundary (junk F1 0.374). Add a downstream filter if clean routing matters.
Domain: trained on Russian IT vacancies. Other industries, other languages, or non-vacancy text are out of distribution.
Granularity: classifies a single fragment, not a whole posting. Use the chunk-then-vote pattern above for full descriptions.
Sequence length: fixed at 128 tokens; longer fragments are truncated.
Labels are distilled from an LLM (Claude Opus), so they inherit its biases.
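
As one possible downstream filter for the low junk recall, the junk decision can be decoupled from argmax: a fragment is routed to `junk` whenever its junk probability reaches a threshold, even if another class scores higher. The sketch below illustrates the idea only. The threshold 0.25 and the example probabilities are arbitrary assumptions that would need tuning on held-out data, and this filter is not part of the published pipeline.

```python
LABELS = ["responsibilities", "requirements", "terms", "notes", "junk"]
JUNK_ID = LABELS.index("junk")


def route(probs, junk_threshold=0.25):
    """Return a label, preferring junk when p(junk) >= junk_threshold.

    Lowering the threshold trades content accuracy for junk recall.
    """
    if probs[JUNK_ID] >= junk_threshold:
        return "junk"
    content = probs[:JUNK_ID]                 # the 4 meaningful classes
    return LABELS[max(range(len(content)), key=content.__getitem__)]


# Invented probabilities: plain argmax would say "notes" in both cases.
print(route([0.05, 0.05, 0.10, 0.45, 0.35]))  # junk
print(route([0.05, 0.05, 0.10, 0.70, 0.10]))  # notes
```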

License

MIT — same as the base model cointegrated/rubert-tiny.

Citation

@misc{rubert_tiny_vacancy_section_classifier_coreml,
  title  = {rubert-tiny Vacancy Section Classifier (CoreML)},
  author = {russian-oracle},
  year   = {2026},
  note   = {Fine-tuned from cointegrated/rubert-tiny; CoreML/ANE export},
  url    = {https://huggingface.co/russian-oracle/rubert-tiny-vacancy-section-classifier-coreml}
}

Base model:

@misc{dale2021rubert_tiny,
  title  = {rubert-tiny},
  author = {Dale, David (cointegrated)},
  url    = {https://huggingface.co/cointegrated/rubert-tiny}
}
