license: apache-2.0
OccCANINE: I-CeM Occupational Classification (OCCICEM)
Overview
OccCANINE_OCCICEM is a version of OccCANINE fine-tuned to automatically convert English occupational descriptions into I-CeM (Integrated Census Microdata) occupational codes. It uses a CANINE encoder with a sequential decoder trained using a mixed loss, fine-tuned from the OccCANINE_s2s_mix base model on IPUMS UK census data.
See more on GitHub: https://github.com/christianvedels/OccCANINE
Read the paper on arXiv: https://arxiv.org/abs/2402.13604
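To make the idea of a sequential decoder concrete, here is a minimal toy sketch of greedy decoding, where a code is built one character at a time until an end token is produced. It is not the OccCANINE implementation: the `step_fn` callable, the `END` token, the `max_len` limit and the code `042` are all hypothetical stand-ins, and the real decoder is a neural network that may use a different search strategy.

```python
from typing import Callable, Dict, List

END = "<end>"  # hypothetical end-of-sequence token (an assumption for this sketch)


def greedy_decode(
    step_fn: Callable[[List[str]], Dict[str, float]],
    max_len: int = 8,
) -> str:
    """Build a code one character at a time, always taking the best-scoring token."""
    prefix: List[str] = []
    for _ in range(max_len):
        scores = step_fn(prefix)             # scores for the next token given the prefix
        token = max(scores, key=scores.get)  # greedy choice
        if token == END:
            break
        prefix.append(token)
    return "".join(prefix)


def toy_step(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a trained decoder: always spells out the arbitrary code '042'."""
    target = ["0", "4", "2", END]
    wanted = target[len(prefix)]
    vocab = list("0123456789") + [END]
    return {tok: (1.0 if tok == wanted else 0.0) for tok in vocab}


decoded = greedy_decode(toy_step)
print(decoded)
```

In a real model, `step_fn` would be replaced by a forward pass of the decoder conditioned on the encoded occupational description.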
Key Features
- English: Trained and evaluated on English occupational descriptions.
- Sequential decoding: Outputs I-CeM codes digit-by-digit.
- Mixed loss training: Combines sequence-level and flat classification losses.
- Fine-tuned: Initialized from OccCANINE_s2s_mix and fine-tuned on IPUMS UK I-CeM data.
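As a rough illustration of mixed loss training, the sketch below combines a sequence-level loss (mean cross-entropy over the decoding steps) with a flat classification loss (cross-entropy over whole codes). The weight `alpha`, the use of softmax cross-entropy for both terms, and all function names are assumptions made for this example; the exact formulation used to train OccCANINE is described in the paper and may differ.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for a single prediction, computed stably."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[target]


def sequence_loss(
    step_logits: Sequence[Sequence[float]],
    target_tokens: Sequence[int],
) -> float:
    """Mean cross-entropy over decoding steps (one step per code character)."""
    if len(step_logits) != len(target_tokens) or not target_tokens:
        raise ValueError("need one target token per decoding step")
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, target_tokens)]
    return sum(losses) / len(losses)


def mixed_loss(
    step_logits: Sequence[Sequence[float]],
    target_tokens: Sequence[int],
    flat_logits: Sequence[float],
    target_class: int,
    alpha: float = 0.5,  # assumed weighting, not taken from the source
) -> float:
    """Convex combination of the sequence-level and flat classification losses."""
    seq = sequence_loss(step_logits, target_tokens)
    flat = cross_entropy(flat_logits, target_class)
    return alpha * seq + (1.0 - alpha) * flat


# Toy inputs: two decoding steps over a 2-token vocabulary, and 4 flat classes.
loss = mixed_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1], [0.0, 0.0, 0.0, 0.0], 2)
print(loss)
```

Setting `alpha=1.0` recovers a purely sequential objective and `alpha=0.0` a purely flat one, which makes the trade-off between the two terms easy to inspect.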
Usage
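from histocc import OccCANINE

# Load the I-CeM fine-tuned model (hf=True fetches it from Hugging Face)
model = OccCANINE(name="OccCANINE_OCCICEM", system="OCCICEM", hf=True)
# Predict the I-CeM code for an English occupational description
result = model.predict("blacksmith", lang="en")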
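Census files often repeat the same occupational string many times, so it can be worth predicting each distinct description only once. The helper below is a hypothetical sketch of that idea and is not part of `histocc`: `predict_unique` and `predict_fn` are names invented here, `predict_fn` stands for any callable that maps a list of strings to a list of codes, and the lowercasing and whitespace normalisation are assumptions that may not match the model's own preprocessing.

```python
from typing import Callable, Iterable, List


def predict_unique(
    descriptions: Iterable[str],
    predict_fn: Callable[[List[str]], List[str]],
) -> List[str]:
    """Run predict_fn once per distinct (normalised) description."""
    # Normalise case and collapse runs of whitespace (assumed preprocessing).
    cleaned = [" ".join(d.lower().split()) for d in descriptions]
    # dict.fromkeys keeps the first-seen order while dropping duplicates.
    unique = list(dict.fromkeys(cleaned))
    codes = predict_fn(unique)
    lookup = dict(zip(unique, codes))
    # Map the predictions back onto the original row order.
    return [lookup[c] for c in cleaned]


# Stub standing in for a real model call; it records what it was asked to predict.
calls: List[List[str]] = []


def stub_predict(batch: List[str]) -> List[str]:
    calls.append(list(batch))
    return [f"code_{len(s)}" for s in batch]


codes = predict_unique(["Blacksmith", " blacksmith ", "farm  labourer"], stub_predict)
print(codes)
```

How to wrap `model.predict` as `predict_fn` depends on the shape of its return value, which is not specified here.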
Contribution and Support
Developed at the University of Southern Denmark by Christian Møller Dahl, Torben Johansen and Christian Vedel.
Model Details
- Task: Text Classification / Sequence Generation
- Base Model: CANINE (fine-tuned from OccCANINE_s2s_mix)
- Target system: I-CeM (OCCICEM)
- Language: English
- Framework: Transformers / PyTorch
- License: Apache 2.0
- Paper: arXiv 2402.13604