Different local results from model card?

#1
by stefanmau - opened

Hi,

I am trying to run the model locally to play around with its functionality but seem to get significantly different results from what you show on your model card. Using the English sample sentence from the card I get the following output:

{'results': [{'label': '005536', 'score': 0.6633682250976562},
             {'label': '006373', 'score': 0.6612246632575989},
             {'label': '005403', 'score': 0.6566219329833984},
             {'label': '006009', 'score': 0.6548881530761719},
             {'label': '004286', 'score': 0.6541185975074768},
             {'label': '004702', 'score': 0.6540753245353699},
             {'label': 'c_98d1408a', 'score': 0.6531801223754883},
             {'label': '001570', 'score': 0.653095006942749},
             {'label': '008473', 'score': 0.6530815362930298},
             {'label': '16', 'score': 0.6523226499557495}]}

I use the tokenizer from EUBERT_2025. As the pickle in the repository does not include the text labels associated with the codes, I am consulting the EuroVoc dataset for them.
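For reference, this is roughly how I attach EuroVoc labels to the raw output. It is only a sketch: `EUROVOC_LABELS` is a hypothetical stand-in for whatever id-to-label lookup you build from the EuroVoc dataset, and the IDs/scores are just the ones from my output above.

```python
# Hypothetical id -> label mapping built from the EuroVoc dataset;
# the label texts here are placeholders, not real EuroVoc entries.
EUROVOC_LABELS = {
    "005536": "example label A",
    "006373": "example label B",
}

def attach_labels(results, id_to_label, threshold=0.0):
    """Attach text labels to scored IDs, dropping entries below threshold."""
    out = []
    for item in results:
        if item["score"] >= threshold:
            out.append({
                "id": item["label"],
                "text": id_to_label.get(item["label"], "<unknown id>"),
                "score": item["score"],
            })
    return out

raw = {"results": [{"label": "005536", "score": 0.6633682250976562},
                   {"label": "006373", "score": 0.6612246632575989}]}
print(attach_labels(raw["results"], EUROVOC_LABELS, threshold=0.66))
```

Many of the IDs in my output simply don't resolve to anything sensible this way, which is what prompted the question.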

As these probabilities are not only significantly lower than those shown on the model card but also attached to labels that seem nonsensical, I wanted to ask:

  1. Could you confirm the checkpoint that was used for the model card results?
  2. Was any scaling applied to the outputs to get the results in the model card?

Any guidance on making use of the model would be much appreciated. I can also provide additional debugging details if helpful.

European Parliament org
edited Oct 7, 2025

Hi,
You are indeed right: there is a big discrepancy between the actual model output and the model card. The latter was copied from previous iterations of the EuroVoc models, hence the discrepancies.
The current model outputs EuroVoc IDs instead of EuroVoc labels. Labels are easier to read, but unfortunately they can be duplicated, arbitrarily removed, or reassigned to different IDs.
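To illustrate why IDs are the safer key (a sketch with hypothetical data, not real EuroVoc concepts): if two concepts ever share the same text label, a label-keyed lookup silently loses one of them, while an ID-keyed lookup does not.

```python
# Hypothetical concepts: two distinct IDs sharing one text label.
concepts = [
    {"id": "005536", "label": "fisheries"},
    {"id": "c_98d1408a", "label": "fisheries"},  # duplicate label, distinct ID
    {"id": "006373", "label": "energy policy"},
]

# Keying by label collides: only one "fisheries" entry survives.
by_label = {c["label"]: c["id"] for c in concepts}
# Keying by ID keeps every concept.
by_id = {c["id"]: c["label"] for c in concepts}

print(len(by_label), len(by_id))  # → 2 3
```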
The model card will be updated accordingly.
