In addition, a custom tokenizer was implemented to support ModBERTBr. This tokenizer uses the Unigram algorithm as the backbone model to generate the vocabulary.
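As an illustration, a Unigram tokenizer of this kind can be trained with the Hugging Face `tokenizers` library. This is a minimal sketch: the toy corpus, vocabulary size, and special tokens below are placeholders, not the actual ModBERTBr settings.

```python
from tokenizers import Tokenizer, pre_tokenizers, trainers
from tokenizers.models import Unigram

# Toy corpus standing in for BrWaC / Portuguese Wikipedia text.
corpus = [
    "O tokenizador foi treinado com o algoritmo Unigram.",
    "A capital do Brasil é Brasília.",
    "Modelos de linguagem aprendem representações do texto.",
]

# Unigram model with whitespace pre-tokenization.
tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Illustrative settings; the real vocabulary size is model-specific.
trainer = trainers.UnigramTrainer(
    vocab_size=200,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    unk_token="[UNK]",
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("O algoritmo Unigram gera o vocabulário.").tokens)
```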
## Usage

You can use these models directly with the `transformers` library starting from v4.48.0:

```sh
pip install -U "transformers>=4.48.0"
```

Since ModernBERT is a Masked Language Model (MLM), you can use the `fill-mask` pipeline or load it via `AutoModelForMaskedLM`. To use ModernBERT for downstream tasks like classification, retrieval, or QA, fine-tune it following standard BERT fine-tuning recipes.
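For instance, a classification head can be attached via `AutoModelForSequenceClassification`. This is a hedged sketch: the checkpoint and `num_labels=2` are illustrative, and the head is randomly initialized, so its logits are meaningless until the model is fine-tuned.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# num_labels=2 is a placeholder for a binary task; this head must be
# fine-tuned before its outputs carry any meaning.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("A frase a ser classificada.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(logits.shape)  # torch.Size([1, 2])
```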

**⚠️ If your GPU supports it, we recommend using ModernBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:**

```bash
pip install flash-attn
```

Using `AutoModelForMaskedLM`:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: Paris
```

Using a pipeline:

```python
import torch
from transformers import pipeline
from pprint import pprint

pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)

input_text = "He walked to the [MASK]."
results = pipe(input_text)
pprint(results)
```

**Note:** ModernBERT does not use token type IDs, unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except you can omit the `token_type_ids` parameter.
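To illustrate, you can inspect which fields the tokenizer returns; this is a sketch assuming the `answerdotai/ModernBERT-base` checkpoint, and the exact key set is a property of that tokenizer's configuration rather than something guaranteed here.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
enc = tok("Uma frase de exemplo.")

# input_ids and attention_mask are all the model's forward pass needs;
# no token_type_ids handling is required.
print(sorted(enc.keys()))
```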