Eurovoc Multilabel Classifier 🇪🇺

EuroVoc is a large multidisciplinary, multilingual (24 languages of the 🇪🇺) hierarchical thesaurus of more than 7000 classes covering the activities of EU institutions. Given the number of legal documents produced every day and the huge mass of pre-existing documents to be classified, high-quality automated or semi-automated classification methods are most welcome in this domain.

This model, based on the BERT deep neural network, was trained on more than 3,200,000 documents to perform that task and is used in a production environment via the Hugging Face inference endpoint. It supports the 24 languages of the European Union.

Architecture

This classification model is built on top of EUBERT, with 7331 Eurovoc labels.
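In a multilabel setting, each label is typically scored independently, so a document can receive several Eurovoc labels at once. The following is a minimal sketch of that decoding step, not the model's actual post-processing: the per-label sigmoid, the 0.5 threshold, the `decode_multilabel` helper, and the four toy label names are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_multilabel(logits, labels, threshold=0.5, top_k=None):
    """Turn one logit per label into a ranked list of (label, probability).

    Unlike softmax, each label is scored on its own, so any number of
    labels (including zero) can pass the threshold.
    """
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    scored = [(label, sigmoid(logit)) for label, logit in zip(labels, logits)]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_k is None else kept[:top_k]


# Toy example: 4 hypothetical labels instead of the real 7331.
labels = ["agriculture", "fisheries", "energy policy", "data protection"]
logits = [2.0, -1.5, 0.3, -4.0]
predictions = decode_multilabel(logits, labels)
for label, probability in predictions:
    print(f"{label}: {probability:.3f}")
```

In practice the threshold would be tuned on held-out data, and `top_k` can cap the number of labels returned per document.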

With fewer than 100 million parameters, it can be deployed on commodity hardware without GPU acceleration (around 200 ms per inference for 2000 characters).
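Documents are often much longer than the 2000 characters quoted above. One possible way to handle them, sketched here under stated assumptions, is to split the text into chunks, classify each chunk separately, and keep the best score per label. The helpers `split_into_chunks` and `merge_chunk_scores`, the 2000-character chunk size, and the max-score merge are illustrative choices, not something the model card specifies.

```python
def split_into_chunks(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars."""
    chunks, current, size = [], [], 0
    for word in text.split():
        # A single word longer than max_chars is hard-split.
        while len(word) > max_chars:
            if current:
                chunks.append(" ".join(current))
                current, size = [], 0
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        # Account for the joining space when the chunk is not empty.
        extra = len(word) + (1 if current else 0)
        if size + extra > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
            extra = len(word)
        current.append(word)
        size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


def merge_chunk_scores(per_chunk_scores):
    """Keep the highest score seen for each label across all chunks."""
    merged = {}
    for scores in per_chunk_scores:
        for label, score in scores.items():
            merged[label] = max(score, merged.get(label, 0.0))
    return merged


document = "word " * 1000  # 1000 four-letter words
chunks = split_into_chunks(document)
print(len(chunks), [len(chunk) for chunk in chunks])
```

If the quoted latency scales roughly linearly with the number of chunks (an assumption, not a measurement), a document split into N chunks would take on the order of N × 200 ms on similar hardware.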

Parameters:

Number of epochs: 16
Batch size: 10
Max length: 512
Learning rate: 5e-05
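To give a sense of scale, the hyperparameters above can be combined with the corpus size to estimate the number of optimizer steps. This is a rough sketch: the `TrainingConfig` class and `optimizer_steps` function are hypothetical names, and the calculation assumes that all 3,200,000 documents were used for training, with no gradient accumulation and a single device. Since the corpus is described as "more than" 3,200,000 documents, the result is a lower bound.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as listed in the model card."""
    epochs: int = 16
    batch_size: int = 10
    max_length: int = 512
    learning_rate: float = 5e-5


def optimizer_steps(num_documents: int, config: TrainingConfig) -> tuple[int, int]:
    """Return (steps per epoch, total steps), one step per batch."""
    steps_per_epoch = math.ceil(num_documents / config.batch_size)
    return steps_per_epoch, steps_per_epoch * config.epochs


config = TrainingConfig()
per_epoch, total = optimizer_steps(3_200_000, config)
print(f"{per_epoch:,} steps per epoch, {total:,} steps in total")
```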

Usage

from eurovoc import EurovocTagger
model = EurovocTagger.from_pretrained("EuropeanParliament/eurovoc_eu")

See also the source code.

Author(s)

Sébastien Campion sebastien.campion@europarl.europa.eu

Andreas Papagiannis andreas.papagiannis@europarl.europa.eu
