Critical requirement for upcoming releases: Prioritization of European linguistic variants

#1
by galisep - opened

Hi,

As you prepare for the upcoming model releases, I must highlight a critical issue observed in the current versions that needs to be addressed.

Currently, the models consistently default to Brazilian Portuguese (pt-BR) regarding syntax, vocabulary, and grammar.

It is crucial to emphasize that pt-BR is not an official language of the European Union. European Portuguese (pt-PT) is.

For a project explicitly branded as a "European LLM" with the mission of preserving European linguistic sovereignty, it is imperative that the models prioritize European linguistic variants (e.g., pt-PT) over non-European ones.

Please ensure that the training data and alignment processes for the new models are corrected to reflect this. The models should default to European Portuguese standards, as defaulting to a non-EU variant contradicts the core purpose of this initiative.

We look forward to seeing this corrected in the next release.

UTTER - Unified Transcription and Translation for Extended Reality org

Dear @galisep

Thanks for your feedback. This is not a critical issue - European and Brazilian Portuguese are two variants of the same language (Portuguese) which is one of the 24 official EU languages. However, we agree that it is useful for the model for distinguish between language variants. In fact, we are working on a separate project whose aim is precisely tuning EuroLLM for European Portuguese. Stay tuned as we plan to release that model soon. In the meantime, we recommend that you request EuroLLM to answer in European Portuguese in the system prompt if this is your intended use, which doesn't completely solve but mitigates the issue you point out.

André

andre-martins changed discussion status to closed

Hi André, thanks for the reply.

While I appreciate that a specific fine-tune is in the works, I must respectfully disagree that the current behavior is "not a critical issue" for a project named EuroLLM.

The problem isn't that the model knows Brazilian Portuguese (which is great); the problem is that an EU-funded/centric model treats a non-EU variant as the default.

Logic dictates that a model's alignment should reflect its origin and purpose. It is inconceivable that a South American LLM initiative (e.g., Mercosur) would release a foundational model that defaults to European Portuguese syntax and grammar. They would rightfully prioritize their own linguistic reality.

If this project aims to represent European digital sovereignty, the default variant for any pluricentric language should be the European one. European Portuguese shouldn't be relegated to a "separate project" or require specific prompting to override a non-European default—it should be the baseline standard for EuroLLM.

Sign up or log in to comment