---
license: other
license_name: proprietary-license
license_link: LICENSE
language:
  - es
base_model:
  - intfloat/multilingual-e5-large
pipeline_tag: text-classification
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
---

Model Card for A5-CNO-BOB-ISTAC-D12

This model classifies Spanish text into codes of the CNO (Código Nacional de Ocupaciones). It was fine-tuned from intfloat/multilingual-e5-large.

Model Details

Model Description

  • Developed by: Cátedra Cajasiete de Big Data, Open Data y Blockchain de la Universidad de La Laguna
  • Funded by: Cajasiete y la Universidad de La Laguna
  • Shared by: Cátedra Cajasiete de Big Data, Open Data y Blockchain de la Universidad de La Laguna and Instituto Canario de Estadística
  • Model type: text-classification
  • Language(s) (NLP): Spanish
  • License: Proprietary
  • Finetuned from model: intfloat/multilingual-e5-large

Model Sources

  • Paper [TODO]: TODO

Uses

This model has been trained to classify Spanish text into CNO codes (Código Nacional de Ocupaciones). It is intended for researchers, developers, and organizations interested in analyzing and classifying occupational data in Spanish.

Direct Use

[More Information Needed]

Out-of-Scope Use

The model will not work well on non-Spanish text, as it was trained exclusively on Spanish data.

Bias, Risks, and Limitations

Because the model has been trained on data from socioeconomic surveys, it may inherit biases present in the training data. These biases may manifest in the classification of occupations, especially those that are less well represented in the data. In addition, the model may not generalize well to occupations that are underrepresented in the training set.

Another limitation is that new occupations have appeared since the creation of the CNO, the national occupational classification system used in Spain (described in more detail below), and these are not covered by the model. The model may therefore fail to correctly classify occupations such as streamer or influencer.

Recommendations

Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. More information is needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

Install necessary libraries

pip install torch
pip install transformers

[TODO: Check if it's necessary to do anything more than this]

Load model

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bob-nlp/A5-CNO-BOB-ISTAC-D12")

Load tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bob-nlp/A5-CNO-BOB-ISTAC-D12")

Using the model

import torch
from torch.nn.functional import softmax

text = "text to classify"
# The tokenizer expects a list of strings, even for a single text
text_to_predict = [text] if isinstance(text, str) else list(text)

inputs = tokenizer(text_to_predict, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
probabilities = softmax(logits, dim=1)
id2label = model.config.id2label

sorted_predictions = []
for i in range(logits.shape[0]):
    single_probs = probabilities[i]
    scores_dict = {id2label[j]: single_probs[j].item() for j in range(len(id2label))}
    sorted_prediction = sorted(scores_dict.items(), key=lambda item: item[1], reverse=True)
    sorted_predictions.append(sorted_prediction)

best_label, best_label_prob = sorted_predictions[0][0]

Convert result to CNO

The result given by the model will be in a "LABEL_(NUMBER)" format. In order to translate it to a CNO, you must follow these steps:

  1. Download the file cno_utils.py in the utils folder of this repository.
  2. Add the following to your code:
from cno_utils import convert_to_cno

cno_predicted_code = convert_to_cno(best_label)

You must have previously installed pandas and huggingface_hub for it to work:

pip install huggingface_hub
pip install pandas

Alternatively, download the idxs.csv file found in the data folder of this repository and copy the following into your code:

import pandas as pd
from huggingface_hub import hf_hub_download

def _load_label_mapping():
    csv_path = hf_hub_download(repo_id="bob-nlp/A5-CNO-BOB-ISTAC-D12", filename="LOCAL/PATH/TO/idxs.csv")
    df = pd.read_csv(csv_path)
    return dict(zip(df['label'], df['CNO']))

def convert_to_cno(output_label):
    mapping = _load_label_mapping()
    return mapping.get(output_label, output_label)

And then simply call convert_to_cno().
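For illustration, the conversion itself is just a dictionary lookup. Below is a minimal, self-contained sketch using an invented two-row mapping (the real mapping lives in idxs.csv, with columns label and CNO; the codes shown here are made up):

```python
import pandas as pd

# Invented stand-in for idxs.csv: model output labels -> CNO codes
df = pd.DataFrame({"label": ["LABEL_0", "LABEL_1"], "CNO": ["5120", "2711"]})
label_mapping = dict(zip(df["label"], df["CNO"]))

def convert_to_cno(output_label):
    # Unknown labels fall back to the raw label, mirroring mapping.get(...)
    return label_mapping.get(output_label, output_label)

print(convert_to_cno("LABEL_0"))   # -> 5120
print(convert_to_cno("LABEL_99"))  # -> LABEL_99 (not in the mapping)
```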

Get description of the CNO

  1. Download the file cno_utils.py in the utils folder of this repository.
  2. Add the following to your code:
from cno_utils import get_cno_description

cno_description = get_cno_description(cno_predicted_code)

You must have previously installed pandas and huggingface_hub for it to work:

pip install huggingface_hub
pip install pandas

Alternatively, download the cno11_notas.csv file found in the data folder of this repository and copy the following into your code:

import pandas as pd
from huggingface_hub import hf_hub_download

def _load_description_mapping():
    csv_path = hf_hub_download(repo_id="bob-nlp/A5-CNO-BOB-ISTAC-D12", filename="LOCAL/PATH/TO/cno11_notas.csv")
    df = pd.read_csv(csv_path)
    return dict(zip(df['CNO'], df['DN4']))

def get_cno_description(cno):
    mapping = _load_description_mapping()
    return mapping.get(cno, 'Unknown')

And then simply call get_cno_description().
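As with the label conversion, the description lookup is a plain dictionary access. Here is a self-contained sketch with an invented mapping (the real data is in cno11_notas.csv, columns CNO and DN4; the code and description below are made up for illustration):

```python
# Invented stand-in for cno11_notas.csv: CNO code -> description
description_mapping = {"5120": "Cocineros asalariados"}

def get_cno_description(cno):
    # Codes not in the mapping return 'Unknown'
    return description_mapping.get(cno, "Unknown")

print(get_cno_description("5120"))  # -> Cocineros asalariados
print(get_cno_description("0000"))  # -> Unknown
```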

Training Details

Training Data

This model has been trained using an aggregated data set from various socioeconomic surveys conducted by the Instituto Canario de Estadística (ISTAC). The ISTAC is the official statistical agency of the autonomous community of the Canary Islands, in charge of producing and disseminating statistical information of public interest.

The training dataset is composed of individual responses to surveys designed to capture a representative picture of the social and economic situation of the population in the Canary Islands.
Although the specific dataset used for this model cannot be directly redistributed, the original ISTAC surveys, such as the Survey of Income and Living Conditions of Canarian Households (EICVHC) or the Survey of Socioeconomic Habits and Confidence (ECOSOC), provide insight into the type of information collected. You can consult the microdata and documentation of these and other surveys in the ISTAC data portal.
The variables included in the training dataset are fundamental to the task of occupational classification and reflect a variety of demographic and socioeconomic factors.

The variables used are:

  • EDAD_RANGO: Age range of the respondent.
  • SEXO: Sex of the respondent.
  • INGRESO: Income level of the household or individual.
  • ESTUDIOS: Level of education attained.
  • SITUACION: Employment status (e.g., employed, unemployed, inactive).
  • ACTIVIDAD: Sector of economic activity.
  • TAREA: Description of the main task performed at work.
  • CNO: National Code of Occupations.

The target variable of the model is CNO. The CNO is the national classification system of occupations used in Spain, managed by the National Statistics Institute (INE). This system organizes occupations in a hierarchical structure that facilitates the grouping and analysis of labor data. The model has been trained with the CNO-11 version of this classification.

Training Procedure

Preprocessing

The main challenge of the training data was the class imbalance in the target variable CNO, as the most common occupations in the Canary Islands (e.g., "restaurant services and commerce") were overrepresented. To mitigate the bias towards the majority classes, a data augmentation technique was applied by generating synthetic entries for the underrepresented occupations. This process balances the distribution of classes, improving the generalizability of the model. In addition, categorical variables were coded into numerical format and null values were managed to ensure data quality.
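The card does not detail how the synthetic entries were generated; as a rough, simplified stand-in for the idea of rebalancing classes, minority classes can be brought up to the majority count by random oversampling (the rows and CNO codes below are invented):

```python
import random
from collections import Counter

random.seed(0)

# Invented rows: (task description, CNO label); one class heavily overrepresented
rows = [("sirve mesas en un restaurante", "5120")] * 8 + [("cuida colmenas", "6120")] * 2

counts = Counter(label for _, label in rows)
target = max(counts.values())

balanced = list(rows)
for label, n in counts.items():
    pool = [r for r in rows if r[1] == label]
    # Duplicate minority-class rows until every class reaches the majority count
    balanced.extend(random.choices(pool, k=target - n))

print(Counter(label for _, label in balanced))  # every class now has 8 rows
```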

Preprocessing also included the following standard steps:

  • Coding of categorical variables: Variables such as EDAD_RANGO, SEXO, ESTUDIOS, SITUACION, and ACTIVIDAD were converted to a numerical format (e.g., via One-Hot Encoding) so that they could be processed by the model.
  • Null Value Handling: A strategy was implemented to deal with inputs with missing values.
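The categorical-encoding step can be sketched with pandas get_dummies (column names follow the variable list above; the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "SEXO": ["H", "M", "M"],
    "ESTUDIOS": ["Primaria", "Universitaria", "Primaria"],
})

# One-Hot Encoding: each category becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["SEXO", "ESTUDIOS"])
print(sorted(encoded.columns))
# -> ['ESTUDIOS_Primaria', 'ESTUDIOS_Universitaria', 'SEXO_H', 'SEXO_M']
```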

Training Hyperparameters

The model was fine-tuned from intfloat/multilingual-e5-large using the following configuration:

| Parameter | Value | Description |
|---|---|---|
| Base Model | intfloat/multilingual-e5-large | Pre-trained model used as a starting point. |
| TEST_SIZE | 0.3 | Proportion of the dataset reserved for testing. |
| RANDOM_STATE | 42 | Seed for reproducible data splitting. |
| NUM_TRAIN_EPOCHS | 16 | Maximum number of training epochs. |
| BATCH_SIZE | 24 | Batch size per device. |
| LEARNING_RATE | 2e-05 | Learning rate for the optimizer. |
| EARLY_STOPPING_PATIENCE | 2 | Epochs to wait for improvement before stopping training. |
| EARLY_STOPPING_THRESHOLD | 0.01 | Minimum change to be considered an improvement. |
| LOGGING_STEPS | 500 | Logging frequency (in steps). |
  • Training regime: fp32 (Full Precision)
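These hyperparameters map onto a Hugging Face Trainer configuration roughly as follows. This is a sketch, not the actual training script: output_dir, the evaluation/save strategy, and the best-model metric are assumptions not stated in the card.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="out",                   # assumed; not stated in the card
    num_train_epochs=16,                # NUM_TRAIN_EPOCHS
    per_device_train_batch_size=24,     # BATCH_SIZE
    per_device_eval_batch_size=24,      # assumed equal to BATCH_SIZE
    learning_rate=2e-5,                 # LEARNING_RATE
    logging_steps=500,                  # LOGGING_STEPS
    eval_strategy="epoch",              # "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",  # assumed
    greater_is_better=False,
)

early_stopping = EarlyStoppingCallback(
    early_stopping_patience=2,          # EARLY_STOPPING_PATIENCE
    early_stopping_threshold=0.01,      # EARLY_STOPPING_THRESHOLD
)
```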

Evaluation

Testing Data, Factors & Metrics

Testing Data

The evaluation of the model was performed using a test set that was not used during training. This test set is composed of a representative sample of the population of the Canary Islands, ensuring that the model's performance is evaluated on data that reflects the diversity and complexity of real-world scenarios.

Factors

[More Information Needed]

Metrics

The model's performance was assessed using a set of metrics carefully chosen to reflect the challenges of this classification task, namely the class imbalance and the hierarchical nature of the CNO labels.

  • Accuracy: This is the most straightforward metric, representing the overall percentage of correctly predicted occupations. While it provides a general overview of performance, it can be misleading in imbalanced datasets. A model could achieve high accuracy by simply predicting the most common occupations well, while failing on rarer ones. It is included as a baseline reference.

  • Balanced Accuracy: This metric was chosen specifically to counteract the weakness of standard accuracy. It calculates the average recall across all classes, giving equal weight to each one regardless of how frequently it appears. A high Balanced Accuracy score indicates that the model is performing well on both common and rare occupations, making it a much fairer assessment of a model's true generalization capability on this dataset.

  • Recall (macro): Recall measures the model's ability to correctly identify all relevant instances of a class ("What proportion of actual positives was identified correctly?"). The macro average calculates recall independently for each class and then takes the unweighted mean. This is crucial because it treats a failure to identify a rare occupation as equally important as a failure to identify a common one. It directly measures how well the model "finds" examples from every single category.

  • F1-score (macro): The F1-score is the harmonic mean of precision and recall. By using the macro average, we get a single, balanced measure of performance across all classes. It is one of the most important metrics for this task because a high macro F1-score requires the model to have both good precision (not mislabeling other occupations as the target class) and good recall (finding all instances of the target class), and to do so for rare and common classes alike.

  • H-F1-score (Hierarchical F1-score): This metric was chosen because the CNO classification is inherently hierarchical. A standard F1-score treats all errors equally; for instance, mistaking a "Web Developer" for a "Farmer" is just as bad as mistaking it for a "Software Engineer". The Hierarchical F1-score is more nuanced. It gives partial credit for predictions that are incorrect but "close" in the occupational hierarchy. This provides a more practical measure of the model's utility, as a prediction within the correct professional group is significantly more useful than one that is completely unrelated.
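Two of these metrics are easy to misread, so here is a tiny self-contained illustration with invented labels. Plain accuracy rewards a model that only predicts the majority class, while a prefix-based hierarchical F1 (one common formulation; the card does not specify the exact variant used) gives partial credit to near misses in the CNO hierarchy, whose levels correspond to leading digits of the 4-digit code:

```python
# Plain vs balanced accuracy on an imbalanced toy set
y_true = ["5120"] * 4 + ["6120"]
y_pred = ["5120"] * 5  # always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = []
for c in set(y_true):
    idx = [i for i, t in enumerate(y_true) if t == c]
    recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
balanced_accuracy = sum(recalls) / len(recalls)
print(accuracy, balanced_accuracy)  # -> 0.8 0.5

# Hierarchical F1: expand each 4-digit CNO code into its digit prefixes
def ancestors(cno):
    return {cno[:i] for i in range(1, len(cno) + 1)}

def hierarchical_f1(true_codes, pred_codes):
    tp = fp = fn = 0
    for t, p in zip(true_codes, pred_codes):
        ta, pa = ancestors(t), ancestors(p)
        tp += len(ta & pa)
        fp += len(pa - ta)
        fn += len(ta - pa)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# "5120" vs "5129": same group down to the third level -> partial credit
print(hierarchical_f1(["5120"], ["5129"]))  # -> 0.75
```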

Results

The model achieved the following performance on the test set:

| Metric | Score |
|---|---|
| Accuracy | 0.81 |
| Balanced Accuracy | 0.69 |
| Recall (macro) | 0.65 |
| F1-score (macro) | 0.64 |
| H-F1-score | 0.85 |

Note: Recall and F1-score were calculated using a macro average to provide a fair performance measure across all classes, including the underrepresented ones.

Hardware

This model is a fine-tuned version of intfloat/multilingual-e5-large, a large-sized transformer. As such, the hardware requirements depend on whether you are running the model for inference or for training.

Inference (Using the Model)

For running inference, a GPU is highly recommended for optimal performance, especially for batch processing.

  • CPU: While it is possible to run this model on a multi-core CPU, expect significant latency. This may be acceptable for offline, low-volume tasks, but it is not suitable for real-time applications.

  • GPU (Recommended): For efficient inference, a modern GPU with at least 6-8 GB of VRAM is recommended (e.g., NVIDIA Tesla T4, RTX 3060). This will allow for reasonably fast predictions and the processing of multiple requests in batches.

Training (Reproducing the Fine-Tuning)

Fine-tuning a large-sized model is computationally intensive and requires a high-end GPU.

Citation

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Model Card Contact

[More Information Needed]