OccCANINE: I-CeM Occupational Classification (OCCICEM)

Overview

OccCANINE_OCCICEM is a version of OccCANINE fine-tuned to automatically convert English occupational descriptions into I-CeM (Integrated Census Microdata) occupational codes. It pairs a CANINE encoder with a sequential decoder trained with a mixed loss, and it was fine-tuned from the OccCANINE_s2s_mix base model on IPUMS UK census data.
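To make "sequential decoder" concrete, the following is a minimal sketch of greedy symbol-by-symbol decoding. It is not the OccCANINE implementation: `greedy_decode`, `toy_step`, the `<eos>` marker, and the probability table are hypothetical stand-ins, and the output "14" is a made-up string rather than a real I-CeM code. The actual model may also use a different decoding strategy.

```python
def greedy_decode(step_fn, max_len, eos="<eos>"):
    """Build a code one symbol at a time, always taking the most likely next symbol."""
    code = []
    for _ in range(max_len):
        # step_fn returns a dict mapping each candidate symbol to its
        # probability given the symbols decoded so far.
        probs = step_fn(code)
        symbol = max(probs, key=probs.get)
        if symbol == eos:
            break
        code.append(symbol)
    return "".join(code)


def toy_step(prefix):
    """Hypothetical stand-in for the decoder: a fixed lookup table of next-symbol probabilities."""
    table = {
        (): {"1": 0.7, "2": 0.3},
        ("1",): {"0": 0.2, "4": 0.8},
        ("1", "4"): {"<eos>": 0.9, "0": 0.1},
    }
    return table[tuple(prefix)]


decoded = greedy_decode(toy_step, max_len=5)
print(decoded)
```

In a real model, `step_fn` would be the decoder network conditioned on the encoded input text; the loop structure is the part this sketch is meant to illustrate.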

See more on GitHub: https://github.com/christianvedels/OccCANINE

Read the paper on arXiv: https://arxiv.org/abs/2402.13604

Key Features

  • English: Trained and evaluated on English occupational descriptions.
  • Sequential decoding: Outputs I-CeM codes digit-by-digit.
  • Mixed loss training: Combines sequence-level and flat classification losses.
  • Fine-tuned: Initialized from OccCANINE_s2s_mix and fine-tuned on IPUMS UK I-CeM data.
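The mixed loss listed above can be pictured as a weighted sum of a sequence-level loss and a flat classification loss. The sketch below illustrates that idea in plain Python under stated assumptions: the function names, the `seq_weight` parameter and its default of 0.5, and the use of cross-entropy for the flat term are all illustrative choices, not taken from the model's training code, which may differ.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under a probability vector."""
    return -math.log(probs[target])


def sequence_loss(step_probs, target_symbols):
    """Mean cross-entropy over decoding steps (one step per code symbol)."""
    losses = [cross_entropy(p, t) for p, t in zip(step_probs, target_symbols)]
    return sum(losses) / len(losses)


def mixed_loss(step_probs, target_symbols, flat_probs, target_class, seq_weight=0.5):
    """Weighted sum of the sequence loss and a flat classification loss.

    seq_weight is an assumed hyperparameter; 0.5 is an arbitrary default.
    """
    seq = sequence_loss(step_probs, target_symbols)
    flat = cross_entropy(flat_probs, target_class)
    return seq_weight * seq + (1.0 - seq_weight) * flat


# Toy example: two decoding steps over a 2-symbol vocabulary,
# plus a flat head over 4 whole-code classes, all with uniform predictions.
loss = mixed_loss(
    step_probs=[[0.5, 0.5], [0.5, 0.5]],
    target_symbols=[0, 1],
    flat_probs=[0.25, 0.25, 0.25, 0.25],
    target_class=2,
)
print(round(loss, 4))
```

With uniform predictions the sequence term equals ln 2 and the flat term equals ln 4, so the equally weighted mix is 1.5 · ln 2.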

Usage

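from histocc import OccCANINE

# Load the fine-tuned I-CeM model; hf=True fetches the weights from the Hugging Face Hub.
model = OccCANINE(name="OccCANINE_OCCICEM", system="OCCICEM", hf=True)

# Predict the I-CeM code for an English occupational description.
result = model.predict("blacksmith", lang="en")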

Contribution and Support

Developed at the University of Southern Denmark by Christian Møller Dahl, Torben Johansen and Christian Vedel.


Model Details

  • Task: Text Classification / Sequence Generation
  • Base Model: CANINE (fine-tuned from OccCANINE_s2s_mix)
  • Target system: I-CeM (OCCICEM)
  • Language: English
  • Framework: Transformers / PyTorch
  • License: Apache 2.0
  • Paper: arXiv 2402.13604