---
language:
- ru
license: mit
library_name: coreml
pipeline_tag: text-classification
base_model: cointegrated/rubert-tiny
base_model_relation: finetune
inference: false
num_parameters: 11900000
datasets:
- russian-oracle/vacancy-section-classifier-ru
tags:
- coreml
- core-ml
- text-classification
- russian
- rubert
- rubert-tiny
- bert
- ane
- apple-neural-engine
- apple-silicon
- on-device
- vacancy
- hr
- job-postings
- sequence-classification
model-index:
- name: rubert-tiny-vacancy-section-classifier-coreml
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Vacancy Section Classifier Dataset (RU) — golden-281 (human-labeled)
      type: russian-oracle/vacancy-section-classifier-ru
      split: test
    metrics:
    - type: accuracy
      value: 0.765
      name: Content Accuracy (4 classes, golden-281)
    - type: accuracy
      value: 0.687
      name: 5-class Accuracy (incl. junk, golden-281)
    - type: recall
      value: 0.333
      name: Junk Recall (golden-281)
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Vacancy Section Classifier Dataset (RU) — in-domain test split
      type: russian-oracle/vacancy-section-classifier-ru
      split: test
    metrics:
    - type: accuracy
      value: 0.893
      name: Accuracy (in-domain)
    - type: f1
      value: 0.869
      name: Macro F1 (in-domain)
    - type: f1
      value: 0.789
      name: F1 responsibilities
    - type: f1
      value: 0.760
      name: F1 requirements
    - type: f1
      value: 0.795
      name: F1 terms
    - type: f1
      value: 0.684
      name: F1 notes
    - type: f1
      value: 0.374
      name: F1 junk
---

# rubert-tiny · Vacancy Section Classifier · CoreML

On-device CoreML classifier (Apple Neural Engine eligible) that assigns each
fragment of a Russian-language job posting to one of 5 structural sections.
Fine-tuned from [`cointegrated/rubert-tiny`](https://huggingface.co/cointegrated/rubert-tiny)
(11.9M params) and exported to a `float16` `.mlpackage` for Apple Silicon.

> 🇬🇧 English card below · 🇷🇺 Русская версия ниже ([перейти](#-русская-версия))

---

## 🇬🇧 English

### What it does

Given one fragment of a Russian vacancy description, the model predicts which of
5 sections it belongs to:

| id | label | meaning |
|----|----------------------|------------------------------------------------------|
| 0 | `responsibilities` | what the employee will do (задачи / обязанности) |
| 1 | `requirements` | what the candidate must have (требования / навыки) |
| 2 | `terms` | conditions of employment (условия / зарплата / ДМС) |
| 3 | `notes` | meta / "about the company" / soft boilerplate |
| 4 | `junk` | non-informative noise (routed out of structured data) |

It is the structured-extraction stage of an HH.ru vacancy-scouting pipeline,
where it replaced a heavier Qwen-embedding + cosine + rerank approach at
~1–10 ms per vacancy on Apple Silicon.

### Artifact

This repository ships the **CoreML artifact only** (no PyTorch weights):

- `section_classifier.mlpackage` — `float16`, `ComputeUnit.ALL` (ANE-eligible),
  minimum deployment target macOS 13.
- `tokenizer.json`, `tokenizer_config.json` — the matching BERT WordPiece
  tokenizer (vocab 29 564). **Required** — the `.mlpackage` consumes token ids,
  not raw text.

#### CoreML I/O signature

| name | dtype | shape | notes |
|------------------|--------|---------|-----------------------------------|
| `input_ids` | int32 | [1, 128] | padded to `max_length=128` |
| `attention_mask` | int32 | [1, 128] | 1 = real token, 0 = pad |
| `token_type_ids` | int32 | [1, 128] | all zeros (single segment) |
| **output** `logits` | float32 | [1, 5] | un-normalized; `argmax` → class |

`max_seq_len = 128` and the label names are embedded in the model's
`user_defined_metadata`.
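
To make the fixed shapes concrete, here is a minimal pure-Python sketch of how one token-id sequence maps onto the three `[1, 128]` inputs. It is an illustration only: `to_fixed_inputs` is a hypothetical helper, `PAD_ID = 0` is an assumption about the shipped vocabulary, the example ids are invented, and plain slicing drops the trailing `[SEP]` that a real BERT tokenizer keeps when truncating. In practice the shipped tokenizer should produce these arrays.

```python
MAX_LEN = 128  # max_seq_len from the I/O table
PAD_ID = 0     # assumption: [PAD] has id 0, as in standard BERT vocabularies


def to_fixed_inputs(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate or right-pad one token-id sequence to the [1, max_len] layout."""
    ids = list(token_ids)[:max_len]
    n_real = len(ids)
    n_pad = max_len - n_real
    return {
        "input_ids": [ids + [pad_id] * n_pad],
        "attention_mask": [[1] * n_real + [0] * n_pad],  # 1 = real token, 0 = pad
        "token_type_ids": [[0] * max_len],               # single segment
    }


# Hypothetical ids for a short fragment: [CLS], two word pieces, [SEP].
inputs = to_fixed_inputs([2, 1054, 877, 3])
print(len(inputs["input_ids"][0]), sum(inputs["attention_mask"][0]))
# prints: 128 4
```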

### Metrics

The numbers below were measured on the **source PyTorch model**. The CoreML
export was then checked against that source on a held-out set of probe texts:
the argmax matched on **100%** of them, with a maximum absolute logit difference
of `0.0026` (expected for `float16`), so the metrics should carry over to this
artifact.
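
A parity check of this kind can be sketched in a few lines. This is not the author's verification script: `parity_report` is a hypothetical helper, and the logits below are made-up values standing in for the PyTorch and CoreML outputs on the same probe texts.

```python
def argmax(values):
    """Index of the largest element (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def parity_report(ref_logits, test_logits):
    """Return (argmax agreement rate, max absolute logit difference)."""
    agree = 0
    max_diff = 0.0
    for ref, test in zip(ref_logits, test_logits):
        agree += argmax(ref) == argmax(test)
        max_diff = max(max_diff, max(abs(r - t) for r, t in zip(ref, test)))
    return agree / len(ref_logits), max_diff


# Made-up logits for two probe texts (not measurements from this model).
pytorch_logits = [[0.10, 2.30, -0.40, 0.00, -1.10], [1.90, 0.20, 0.10, -0.30, -0.80]]
coreml_logits = [[0.101, 2.298, -0.401, 0.001, -1.099], [1.902, 0.199, 0.100, -0.300, -0.801]]
rate, diff = parity_report(pytorch_logits, coreml_logits)
print(f"argmax parity: {rate:.0%}, max |diff|: {diff:.4f}")
```

Argmax parity is the property that matters for classification; the logit difference only indicates how much headroom the `float16` conversion leaves near decision boundaries.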

**Headline — golden-281 (human-labeled, held-out):**

| metric | value |
|----------------------------------------------|---------------------|
| Content accuracy (4 meaningful classes) | **76.5%** (176/230) |
| Full 5-class accuracy (incl. junk routing) | **68.7%** |
| Junk recall (noise correctly routed out) | **33.3%** (17/51) |

These are the numbers to trust: all 281 fragments were labeled by a human and
never seen in training.
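
The three headline figures are consistent with one another, which the arithmetic below reproduces. One assumption is made here that the card does not state explicitly: the 5-class accuracy pools the 176 correct content fragments and the 17 correctly routed junk fragments over all 281 fragments. Under that reading the result matches the reported 68.7%.

```python
content_total, content_correct = 230, 176   # fragments with a non-junk human label
junk_total, junk_correct = 51, 17           # fragments with a junk human label

content_acc = content_correct / content_total
junk_recall = junk_correct / junk_total
# Assumption: a fragment counts as correct when its predicted label matches the human label.
five_class_acc = (content_correct + junk_correct) / (content_total + junk_total)

print(f"content {content_acc:.1%} | junk recall {junk_recall:.1%} | 5-class {five_class_acc:.1%}")
# prints: content 76.5% | junk recall 33.3% | 5-class 68.7%
```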

<details>
<summary>In-domain test split (circular — NOT the headline)</summary>

Evaluated on the internal test split, whose labels come from the same Claude
Opus relabeling as the training data, so it overstates real-world performance.
Reported for monitoring only:

| metric | value |
|------------------|--------|
| Accuracy | 89.3% |
| Macro-F1 | 86.9% |

Per-class F1 (in-domain): responsibilities 0.789 · requirements 0.760 ·
terms 0.795 · notes 0.684 · **junk 0.374**.
</details>

### Usage (Python · coremltools)

```python
import numpy as np
import coremltools as ct
from transformers import AutoTokenizer

REPO = "russian-oracle/rubert-tiny-vacancy-section-classifier-coreml"
LABELS = ["responsibilities", "requirements", "terms", "notes", "junk"]

tok = AutoTokenizer.from_pretrained(REPO)                  # tokenizer.json shipped here
mlmodel = ct.models.MLModel("section_classifier.mlpackage")  # hf download ... locally

text = "Опыт работы с Python от 3 лет, знание Django и PostgreSQL."
# Tokenize to the fixed [1, 128] layout the model expects.
enc = tok(text, return_tensors="np", padding="max_length",
          truncation=True, max_length=128)
ids = enc["input_ids"].astype(np.int32)
out = mlmodel.predict({
    "input_ids": ids,
    "attention_mask": enc["attention_mask"].astype(np.int32),
    # Single-segment input: fall back to zeros if the tokenizer omits this key.
    "token_type_ids": enc.get("token_type_ids", np.zeros_like(ids)).astype(np.int32),
})
logits = np.asarray(out["logits"]).reshape(-1)  # [1, 5] -> [5]
print(LABELS[int(logits.argmax())])        # → requirements
# probabilities: softmax(logits)
```
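
If probabilities are needed rather than just the winning class, the `softmax(logits)` step mentioned in the last comment can be written without extra dependencies. A minimal sketch, with made-up logits in place of real model output:

```python
import math

LABELS = ["responsibilities", "requirements", "terms", "notes", "junk"]


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.3, 2.1, -0.5, 0.0, -1.2]  # made-up values, not real model output
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(LABELS[best], f"{probs[best]:.3f}")
```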

> `coremltools` needs its native bindings, which ship only with certain CPython
> builds (a 3.12 wheel works reliably). Run prediction under such an interpreter.

**Recommended aggregation (how it is used in production):** split a full
description into sentence-level chunks (e.g. with `razdel` plus newline
boundaries), classify each one, and take the majority label per chunk;
fragments labeled `junk` go to an "orphans" bucket instead of the structured
output.
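
One possible reading of this pattern is sketched below: each newline-delimited chunk is split into sentences, every sentence is classified, and the chunk takes the majority label. This is an interpretation, not the production code. The regex splitter is a naive stand-in for `razdel`, and `toy_classify` is a hypothetical keyword stub that stands in for the CoreML model call.

```python
import re
from collections import Counter

CONTENT_LABELS = ("responsibilities", "requirements", "terms", "notes")


def split_sentences(chunk):
    """Naive regex splitter; a stand-in for a real splitter such as razdel."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]


def route(description, classify):
    """Vote per newline-delimited chunk; chunks voted junk go to orphans."""
    sections = {label: [] for label in CONTENT_LABELS}
    orphans = []
    for chunk in description.splitlines():
        sentences = split_sentences(chunk)
        if not sentences:
            continue  # skip blank lines
        votes = Counter(classify(s) for s in sentences)
        label = votes.most_common(1)[0][0]  # ties resolve to the first label seen
        (orphans if label == "junk" else sections[label]).append(chunk.strip())
    return sections, orphans


def toy_classify(sentence):
    """Hypothetical keyword stub standing in for the CoreML model call."""
    if "Python" in sentence:
        return "requirements"
    if "ДМС" in sentence:
        return "terms"
    return "junk"


text = "Опыт работы с Python от 3 лет. Знание Python-фреймворков.\nДМС и гибкий график.\n***"
sections, orphans = route(text, toy_classify)
print(sections["requirements"], sections["terms"], orphans)
```

Passing the classifier in as a function keeps the routing logic testable without loading the `.mlpackage`.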

### Usage (Swift · sketch)

```swift
import CoreML

// url must point to the compiled model (.mlmodelc), e.g. the result of
// MLModel.compileModel(at:) applied to section_classifier.mlpackage.
let model = try MLModel(contentsOf: url)
// Provide three [1,128] MLMultiArray(.int32) inputs: input_ids, attention_mask,
// token_type_ids — produced by a BERT WordPiece tokenizer over the input text.
// Output "logits" is [1,5]; argmax over the last axis gives the class id.
```

### Training

- **Base:** `cointegrated/rubert-tiny` (BERT, 312 hidden, 3 layers, vocab 29 564).
- **Lineage:** multi-stage fine-tune — rubert-tiny → intermediate extractor →
  4-class → 5-class → **5-class "rechunked"** (this model). Warm-started from the
  previous 5-class checkpoint.
- **Data:** ~12–13k fragments of Russian IT vacancies, **relabeled by Claude
  Opus** (silver → distilled), re-chunked with a `razdel` sentence splitter +
  newline boundaries.
- **Objective:** class-weighted cross-entropy (balanced inverse-frequency) to
  counter section imbalance.
- **Schedule:** 8 epochs with early stopping (patience 3, best ≈ epoch 3),
  batch 32, lr 3e-5, weight decay 0.01, warmup ratio 0.1, linear decay, seed 42,
  `max_length` 128, trained on Apple MPS in fp32.
- **Export:** coremltools 9.0, `compute_precision=FLOAT16`,
  `compute_units=ALL`, `position_ids` baked as a constant buffer to work around a
  const-fold limitation; verified at 100% argmax parity with the PyTorch source.
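
Two of the recipe items above lend themselves to a short illustration: the class-weighted loss and the warmup-plus-linear-decay schedule. The sketch below is framework-free and rests on assumptions the card does not confirm: "balanced inverse-frequency" is taken to mean the common `N / (K * n_c)` formula, the schedule follows the usual linear-warmup, linear-decay shape, and the label counts and `total_steps` are invented for the example.

```python
import math
from collections import Counter


def balanced_weights(labels, num_classes):
    """w_c = N / (K * n_c): rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]


def weighted_cross_entropy(logits, target, weights):
    """Class-weighted cross-entropy for a single example."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return weights[target] * (log_z - logits[target])


def lr_at(step, total_steps, base_lr=3e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup = int(total_steps * warmup_ratio)
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, total_steps - step) / max(1, total_steps - warmup)


# Toy label distribution (hypothetical counts, not the real dataset).
labels = [0] * 40 + [1] * 30 + [2] * 20 + [3] * 8 + [4] * 2
weights = balanced_weights(labels, num_classes=5)
print([round(w, 2) for w in weights])
# prints: [0.5, 0.67, 1.0, 2.5, 10.0]
print(lr_at(50, 1000), lr_at(100, 1000), lr_at(1000, 1000))
```

With these toy counts, a mistake on the rarest class costs twenty times as much as one on the most frequent class, which is the intended counterweight to section imbalance.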

### Limitations & bias

- **Junk recall is low (33.3%).** On golden-281 the model keeps about two of
  every three junk fragments instead of routing them out; `notes` ↔ `junk` is
  the hardest boundary (in-domain junk F1 0.374). Add a downstream filter if
  clean routing matters.
- **Domain:** trained on Russian **IT** vacancies. Other industries, other
  languages, or non-vacancy text are out of distribution.
- **Granularity:** classifies a *single fragment*, not a whole posting. Use the
  chunk-then-vote pattern above for full descriptions.
- **Sequence length:** fixed at 128 tokens; longer fragments are truncated.
- Labels are distilled from an LLM (Claude Opus), so they inherit its biases.

### License

**MIT** — same as the base model `cointegrated/rubert-tiny`.

### Citation

```bibtex
@misc{rubert_tiny_vacancy_section_classifier_coreml,
  title  = {rubert-tiny Vacancy Section Classifier (CoreML)},
  author = {russian-oracle},
  year   = {2026},
  note   = {Fine-tuned from cointegrated/rubert-tiny; CoreML/ANE export},
  url    = {https://huggingface.co/russian-oracle/rubert-tiny-vacancy-section-classifier-coreml}
}
```

Base model:

```bibtex
@misc{dale2021rubert_tiny,
  title  = {rubert-tiny},
  author = {Dale, David (cointegrated)},
  url    = {https://huggingface.co/cointegrated/rubert-tiny}
}
```

---

## 🇷🇺 Русская версия

### Что делает

По одному фрагменту русскоязычного описания вакансии модель предсказывает, к
какой из 5 структурных секций он относится:

| id | метка | смысл |
|----|----------------------|------------------------------------------------------|
| 0 | `responsibilities` | что сотрудник будет делать (задачи / обязанности) |
| 1 | `requirements` | что требуется от кандидата (требования / навыки) |
| 2 | `terms` | условия работы (зарплата / ДМС / график) |
| 3 | `notes` | мета / «о компании» / мягкий boilerplate |
| 4 | `junk` | неинформативный шум (выводится из структуры) |

Это этап структурной разметки в пайплайне скаутинга вакансий HH.ru, где модель
заменила более тяжёлую связку Qwen-эмбеддинги + косинус + reranking — при
~1–10 мс на вакансию на Apple Silicon.

### Артефакт

В репозитории — **только CoreML-артефакт** (без PyTorch-весов):

- `section_classifier.mlpackage` — `float16`, `ComputeUnit.ALL` (с поддержкой
  ANE), минимальная цель развёртывания macOS 13.
- `tokenizer.json`, `tokenizer_config.json` — соответствующий BERT WordPiece
  токенайзер (словарь 29 564). **Обязателен** — `.mlpackage` принимает id
  токенов, а не сырой текст.

#### Сигнатура входов/выходов CoreML

| имя | тип | форма | примечание |
|------------------|--------|---------|-----------------------------------|
| `input_ids` | int32 | [1, 128] | паддинг до `max_length=128` |
| `attention_mask` | int32 | [1, 128] | 1 — реальный токен, 0 — паддинг |
| `token_type_ids` | int32 | [1, 128] | все нули (один сегмент) |
| **выход** `logits` | float32 | [1, 5] | без нормализации; `argmax` → класс |

`max_seq_len = 128` и имена классов зашиты в `user_defined_metadata` модели.

### Метрики

Цифры ниже измерены на **исходной PyTorch-модели**. CoreML-экспорт затем
проверен на **100% совпадение argmax** с этим источником на отложенном наборе
проб (макс. абс. разница логитов `0.0026`, что нормально для `float16`), поэтому
они переносятся на этот артефакт.

**Headline — golden-281 (ручная разметка, held-out):**

| метрика | значение |
|--------------------------------------------------|---------------------|
| Content-accuracy (4 содержательных класса) | **76.5%** (176/230) |
| Полная 5-class accuracy (включая роутинг junk) | **68.7%** |
| Junk recall (корректно отсеянный шум) | **33.3%** (17/51) |

Это и есть метрика, которой стоит доверять: 281 фрагмент, размеченный человеком
и не виденный при обучении.

<details>
<summary>Внутренний test-split (циркулярный — НЕ headline)</summary>

Оценка на внутреннем тестовом сплите, у которого то же Claude Opus-распределение
разметки, что и у обучающих данных, поэтому он завышает реальное качество.
Приведён только для мониторинга:

| метрика | значение |
|------------------|--------|
| Accuracy | 89.3% |
| Macro-F1 | 86.9% |

Per-class F1 (in-domain): responsibilities 0.789 · requirements 0.760 ·
terms 0.795 · notes 0.684 · **junk 0.374**.
</details>

### Использование (Python · coremltools)

```python
import numpy as np
import coremltools as ct
from transformers import AutoTokenizer

REPO = "russian-oracle/rubert-tiny-vacancy-section-classifier-coreml"
LABELS = ["responsibilities", "requirements", "terms", "notes", "junk"]

tok = AutoTokenizer.from_pretrained(REPO)                  # tokenizer.json в этом репо
mlmodel = ct.models.MLModel("section_classifier.mlpackage")  # hf download ... локально

text = "Опыт работы с Python от 3 лет, знание Django и PostgreSQL."
enc = tok(text, return_tensors="np", padding="max_length",
          truncation=True, max_length=128)
ids = enc["input_ids"].astype(np.int32)
out = mlmodel.predict({
    "input_ids": ids,
    "attention_mask": enc["attention_mask"].astype(np.int32),
    "token_type_ids": enc.get("token_type_ids", np.zeros_like(ids)).astype(np.int32),
})
logits = np.asarray(out["logits"]).reshape(-1)
print(LABELS[int(logits.argmax())])        # → requirements
# вероятности: softmax(logits)
```

> `coremltools` требует нативных биндингов, которые есть только в части сборок
> CPython (надёжно работает wheel под 3.12). Запускайте предсказание под таким
> интерпретатором.

**Рекомендуемая агрегация (как используется в продакшене):** разбейте полное
описание на фрагменты по предложениям (например, `razdel` + переводы строк),
классифицируйте каждый, возьмите мажоритарную метку на чанк; фрагменты `junk`
отправляются в корзину «orphans», а не в структурированный вывод.

### Обучение

- **База:** `cointegrated/rubert-tiny` (BERT, 312 hidden, 3 слоя, словарь 29 564).
- **Происхождение:** многоступенчатый файн-тюн — rubert-tiny → промежуточный
  extractor → 4-class → 5-class → **5-class «rechunked»** (эта модель).
  Warm-start с предыдущего 5-class чекпойнта.
- **Данные:** ~12–13 тыс. фрагментов русских IT-вакансий, **переразмечены
  Claude Opus** (silver → дистилляция), перечанкованы сплиттером `razdel` по
  предложениям + границам строк.
- **Лосс:** взвешенная по классам кросс-энтропия (balanced inverse-frequency)
  против дисбаланса секций.
- **Расписание:** 8 эпох с ранней остановкой (patience 3, лучшая ≈ эпоха 3),
  batch 32, lr 3e-5, weight decay 0.01, warmup ratio 0.1, линейный спад, seed 42,
  `max_length` 128, обучение на Apple MPS в fp32.
- **Экспорт:** coremltools 9.0, `compute_precision=FLOAT16`,
  `compute_units=ALL`, `position_ids` зашит как константный буфер (обход
  ограничения const-fold); проверено на 100% argmax-parity с PyTorch-источником.

### Ограничения и смещения

- **Низкий junk recall (33.3%).** Модель чаще оставляет шум, чем отсеивает его;
  граница `notes` ↔ `junk` — самая сложная (junk F1 0.374). Если важен чистый
  роутинг — добавьте downstream-фильтр.
- **Домен:** обучена на русских **IT**-вакансиях. Другие отрасли, языки или
  не-вакансионный текст — вне распределения.
- **Гранулярность:** классифицирует *отдельный фрагмент*, а не вакансию целиком.
  Для полных описаний используйте схему chunk-then-vote выше.
- **Длина последовательности:** фиксированные 128 токенов; длиннее — обрезается.
- Метки дистиллированы из LLM (Claude Opus) и наследуют её смещения.

### Лицензия

**MIT** — как и у базовой модели `cointegrated/rubert-tiny`.

### Цитирование

См. BibTeX в английской секции выше.