---
license: apache-2.0
---

This model is a derivative of ModernBERT specialized for Brazilian Portuguese, **pre-trained** on data from this language.

The training data were gathered from the **BrWac Corpus** and the Portuguese subset of the **Wikipedia dataset**.

In addition, a custom tokenizer was implemented to support ModBERTBr, using the **Unigram algorithm** as the backbone model to generate the vocabulary.

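For illustration, a Unigram tokenizer of this kind can be trained with the Hugging Face `tokenizers` library. This is a minimal sketch on a toy corpus; the actual corpus, vocabulary size, and special-token set used for ModBERTBr are assumptions here, not the authors' configuration:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy corpus standing in for BrWac + Wikipedia text (illustrative only)
corpus = [
    "O Brasil é o maior país da América do Sul.",
    "A língua portuguesa é falada por milhões de pessoas.",
    "Brasília é a capital federal do Brasil.",
]

# Unigram model as the backbone, as described above
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.UnigramTrainer(
    vocab_size=100,  # illustrative; a real vocabulary is far larger
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    unk_token="[UNK]",
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("A capital do Brasil")
print(encoding.tokens)
```

The Unigram trainer starts from a large seed vocabulary of candidate subwords and iteratively prunes the pieces that least reduce the corpus likelihood, which is what distinguishes it from merge-based approaches such as BPE.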

## Usage

You can use this model directly with the `transformers` library starting from v4.48.0:

```sh
pip install -U "transformers>=4.48.0"
```

Since ModBERTBr is a Masked Language Model (MLM), you can use the `fill-mask` pipeline or load it via `AutoModelForMaskedLM`. To use ModBERTBr for downstream tasks like classification, retrieval, or QA, fine-tune it following standard BERT fine-tuning recipes.
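
As a sketch of such a fine-tuning recipe, the helper below assembles a standard `Trainer` run for sequence classification. The dataset arguments, label count, and hyperparameters are placeholders for illustration, not values used by the authors:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

def build_trainer(train_dataset, eval_dataset, num_labels):
    """Assemble a standard BERT-style fine-tuning run on top of ModBERTBr.

    `train_dataset` / `eval_dataset` are placeholders for your own tokenized
    labeled data (e.g. built with the `datasets` library).
    """
    model_id = "wallacelw/ModBERTBr"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # A fresh classification head is initialized on top of the encoder
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    args = TrainingArguments(
        output_dir="modbertbr-finetuned",
        learning_rate=5e-5,          # illustrative hyperparameters
        num_train_epochs=3,
        per_device_train_batch_size=16,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        processing_class=tokenizer,
    )

# trainer = build_trainer(train_ds, eval_ds, num_labels=2)
# trainer.train()
```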

**⚠️ If your GPU supports it, we recommend using ModBERTBr with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:**

```bash
pip install flash-attn
```

Using `AutoModelForMaskedLM`:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "wallacelw/ModBERTBr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "A capital do Brasil é [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Predict the token at the [MASK] position
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
```

Using a pipeline:

```python
from pprint import pprint
from transformers import pipeline

pipe = pipeline("fill-mask", model="wallacelw/ModBERTBr")

input_text = "O planeta [MASK] é o terceiro do sistema solar."
results = pipe(input_text)
pprint(results)
```

**Note:** ModBERTBr does not use token type IDs, unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except you can omit the `token_type_ids` parameter.
