Add `"add_prefix_space": true,`; this allows for much stronger token-level performance (e.g. NER, ColBERT)
Hello!
## Pull Request overview
* Add `"add_prefix_space": true,` to the tokenizer config
## Details
This allows for much stronger token-level performance (e.g. NER, ColBERT): without it, tokens are not prepended with a space, while our model was trained on data where each token is prepended with a space.
We will need to explain somewhere in the model card that users can set `add_prefix_space` to `False`.
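To make the effect concrete, here is a toy sketch in pure Python. It does not use the real ModernBERT tokenizer: `TOY_VOCAB` and `toy_tokenize` are hypothetical, and the greedy longest-match split only stands in for a byte-level BPE in which `Ġ` marks a leading space. It shows how pre-split words (as in NER) can break into different pieces depending on whether a space is prepended.

```python
# Hypothetical toy vocabulary; "Ġ" marks a leading space, as in byte-level BPE.
# This is NOT the ModernBERT vocabulary.
TOY_VOCAB = {"ĠParis", "Ġis", "Ġnice", "Par", "is", "n", "ice"}


def toy_tokenize(word: str, add_prefix_space: bool) -> list[str]:
    """Greedy longest-match split of one pre-split word (illustration only)."""
    text = ("Ġ" + word) if add_prefix_space else word
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(text[i])
            i += 1
    return pieces


words = ["Paris", "is", "nice"]
with_space = [toy_tokenize(w, add_prefix_space=True) for w in words]
without_space = [toy_tokenize(w, add_prefix_space=False) for w in words]

print(with_space)     # [['ĠParis'], ['Ġis'], ['Ġnice']]
print(without_space)  # [['Par', 'is'], ['is'], ['n', 'ice']]
```

With the real tokenizer, the comparable call would be `tokenizer(words, is_split_into_words=True)`; the actual pieces will differ from this toy output.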
cc @bclavie @bwarner @NohTow could one of you take care of that?
P.S. Feel free to hold off on merging for now; this PR can also be used to run some tests first (with `revision="refs/pr/..."`).
Note that you need a `transformers` version that includes https://github.com/huggingface/transformers/pull/35593.
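For testing, loading the tokenizer from this PR's ref could look roughly like the sketch below. `pr_revision` and `load_tokenizer_from_pr` are hypothetical helper names, and the PR number is left as a parameter for you to fill in. The only library call used is `AutoTokenizer.from_pretrained` with its `revision` argument.

```python
MODEL_ID = "answerdotai/ModernBERT-base"


def pr_revision(pr_number: int) -> str:
    """Build the Hub git ref for a pull request (hypothetical helper)."""
    return f"refs/pr/{pr_number}"


def load_tokenizer_from_pr(pr_number: int, **tokenizer_kwargs):
    """Load the tokenizer as it exists on the given PR branch."""
    # Imported lazily so that pr_revision works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(
        MODEL_ID,
        revision=pr_revision(pr_number),
        # Extra keyword arguments are forwarded to the tokenizer. Passing
        # add_prefix_space=False here is an assumption about how to opt out.
        **tokenizer_kwargs,
    )
```

Calling `load_tokenizer_from_pr(n)` with this PR's number downloads the tokenizer files from the PR ref, so it needs network access and a recent enough `transformers`.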
- Tom Aarsen
- tokenizer_config.json +1 -0

    @@ -1,4 +1,5 @@
     {
    +  "add_prefix_space": true,
       "added_tokens_decoder": {
         "0": {
           "content": "|||IP_ADDRESS|||",