
Multilingual ModernBERT Base Cased 128k

Pretrained multilingual language model using a masked language modeling (MLM) objective.

Model description

ModernBERT is a transformers model pretrained on 3.2 billion tokens of multilingual Wikipedia and OPUS text in a self-supervised fashion. This means it was pretrained on raw texts only, with no human labelling of any kind (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts.
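The input/label generation step can be sketched in a few lines of plain Python. This is illustrative only: the `[MASK]` token, the 15% masking rate, and the `-100` ignore index follow common MLM conventions, not necessarily this model's exact recipe, and `make_mlm_example` is a hypothetical helper name.

```python
import random

MASK, IGNORE = "[MASK]", -100  # IGNORE marks positions that contribute no loss

def make_mlm_example(tokens, mask_prob=0.15, seed=1):
    """Randomly mask tokens; labels keep the original token only at masked positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # the model must recover this token
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels

inputs, labels = make_mlm_example(["the", "cat", "sat", "on", "the", "mat"])
```

Because masking is random, each pass over the corpus produces different training examples from the same raw text, which is what makes large unlabelled corpora usable.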

This model has the following configuration:

  • 768 embedding dimension
  • 12 hidden layers
  • 1152 hidden dimension
  • 6 attention heads
  • 64M parameters
  • 129k vocabulary size

Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models such as the GPT family.

How to use

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import AutoTokenizer, ModernBertModel

tokenizer = AutoTokenizer.from_pretrained("cservan/multilingual-modernbert-small")
model = ModernBertModel.from_pretrained("cservan/multilingual-modernbert-small")

text = "Replace me by the text you want."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)

Training data

The ModernBERT model was pretrained on 3.2 billion tokens from multilingual Wikipedia (excluding lists, tables and headers) and OPUS.

Training procedure

Preprocessing

The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 128,000 tokens, plus 1,000 unused tokens reserved for downstream adaptation. The inputs of the model are then of the form:

[CLS] Sentence A [SEP] Sentence B [SEP]
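The special-token layout above can be mimicked with a small helper. This is a sketch only: in practice the tokenizer inserts these markers as token ids rather than strings, and `format_pair` is a hypothetical name for illustration.

```python
def format_pair(sentence_a, sentence_b=None):
    """Reproduce the [CLS]/[SEP] layout; the real tokenizer works on token ids."""
    if sentence_b is None:
        return f"[CLS] {sentence_a} [SEP]"
    return f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP]"

print(format_pair("How are you?", "I am fine."))
# [CLS] How are you? [SEP] I am fine. [SEP]
```

Single sentences use only the first `[SEP]`; sentence pairs (as used in tasks like question answering) add the second segment and a closing `[SEP]`.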

Tools

The tools used to pre-train the model are available here.
