Improve model card: Add pipeline tag, paper/code links, and usage example
This PR significantly improves the model card for `finerweb-binary-classifier-mdeberta-gemma3` by:
- Adding the `pipeline_tag: text-classification` to the metadata, enabling easier discovery on the Hub and activating the inference widget.
- Expanding the `tags` to include `text-classification`, `named-entity-recognition`, and `deberta-v2` for better categorization.
- Adding a direct link to the paper [FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition](https://huggingface.co/papers/2512.13884).
- Providing a link to the associated GitHub code repository: [https://github.com/whoisjones/FiNERweb-code](https://github.com/whoisjones/FiNERweb-code).
- Including a link to the Hugging Face Collection for the project: [https://huggingface.co/collections/whoisjones/finerweb](https://huggingface.co/collections/whoisjones/finerweb).
- Populating the "Model description", "Intended uses & limitations", and "Training and evaluation data" sections with detailed information from the paper abstract and GitHub README.
- Including a detailed Python sample usage snippet directly from the GitHub repository, demonstrating how to load and use the model with `transformers`.
- Adding the BibTeX citation for the paper.
Please review and merge if everything looks good.
@@ -1,38 +1,96 @@
 ---
 library_name: transformers
 license: mit
-base_model: microsoft/mdeberta-v3-base
-tags:
-- generated_from_trainer
 metrics:
 - precision
 - recall
 - accuracy
 model-index:
 - name: gemma-fineweb-edu-scorer-mdeberta-binary-lr5e-05-20250411_140230
   results: []
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-
 # finerweb-binary-classifier-mdeberta-gemma3
 
-This model is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base)
 It achieves the following results on the evaluation set:
 - Loss: 0.1614
 
 ## Model description
 
-More information needed
 
 ## Intended uses & limitations
 
-More information needed
 
 ## Training and evaluation data
 
-More information needed
 
 ## Training procedure
 
@@ -53,3 +111,17 @@ The following hyperparameters were used during training:
 - Pytorch 2.6.0+cu124
 - Datasets 3.3.2
 - Tokenizers 0.21.1

 ---
+base_model: microsoft/mdeberta-v3-base
 library_name: transformers
 license: mit
 metrics:
 - precision
 - recall
 - accuracy
+tags:
+- generated_from_trainer
+- text-classification
+- named-entity-recognition
+- deberta-v2
+pipeline_tag: text-classification
 model-index:
 - name: gemma-fineweb-edu-scorer-mdeberta-binary-lr5e-05-20250411_140230
   results: []
 ---
 
 # finerweb-binary-classifier-mdeberta-gemma3
 
+This model is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) for identifying Named Entity Recognition (NER)-relevant passages. It is part of the work presented in the paper [FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition](https://huggingface.co/papers/2512.13884).
+
+- **Paper**: [FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition](https://huggingface.co/papers/2512.13884)
+- **Code**: [https://github.com/whoisjones/FiNERweb-code](https://github.com/whoisjones/FiNERweb-code)
+- **Hugging Face Collection**: [https://huggingface.co/collections/whoisjones/finerweb](https://huggingface.co/collections/whoisjones/finerweb)
+
 It achieves the following results on the evaluation set:
 - Loss: 0.1614
 
 ## Model description
 
+This model, `finerweb-binary-classifier-mdeberta-gemma3`, is a binary classifier fine-tuned from `microsoft/mdeberta-v3-base`. It is designed to identify text passages that are relevant for Named Entity Recognition (NER). The model is a core component of the FiNERweb project, which focuses on creating scalable multilingual NER datasets. The FiNERweb pipeline scales the teacher-student paradigm to 91 languages and 25 scripts, training regression models like this one to identify NER-relevant passages for subsequent annotation by multilingual Large Language Models (LLMs).
 
 ## Intended uses & limitations
 
+This model is primarily intended as a passage scorer to pre-select text segments likely to contain named entities. It can be used as an efficient filter within larger NER data annotation pipelines to streamline the creation of high-quality multilingual NER datasets.
+
+**Intended Uses:**
+- Identifying NER-relevant text passages in large corpora.
+- Supporting scalable teacher-student training paradigms for multilingual NER.
+- Serving as a component in pipelines for generating synthetic NER training data.
+
+**Limitations:**
+- The model performs binary classification (NER-relevant vs. not relevant) and does not directly output specific entity types or spans.
+- While designed for multilingual applicability (91 languages, 25 scripts), performance may vary across languages and text domains. The original paper notes that performance might drop when evaluating with target-language labels compared to English labels.
 
 ## Training and evaluation data
 
+This model was fine-tuned on the FiNERweb dataset, a large-scale multilingual NER dataset generated using a teacher-student paradigm. Building on FineWeb-Edu, the dataset creation process involves training regression models (like this one) to identify NER-relevant passages, which are then annotated with multilingual LLMs. The resulting FiNERweb dataset comprises approximately 225k passages with 235k distinct entity labels across 91 languages and 25 scripts.
+
+The FiNERweb dataset can be loaded using the `datasets` library:
+```python
+from datasets import load_dataset
+
+finerweb = load_dataset('whoisjones/finerweb')
+finerweb_de = load_dataset('whoisjones/finerweb', split='deu')
+```
+
+## How to use
+
+You can load and use this model with the `transformers` library to classify text passages for NER relevance:
+
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+import torch
+
+model = AutoModelForSequenceClassification.from_pretrained("whoisjones/finerweb-binary-classifier-mdeberta-gemma3")
+tokenizer = AutoTokenizer.from_pretrained("whoisjones/finerweb-binary-classifier-mdeberta-gemma3")
+
+good_example = """'Kraft Foods has taken the Cadbury chocolate brand in a new direction, by combining it with cheese for the first time.
+The company is bringing together two of its brands and launching Philadelphia with Cadbury, a chilled chocolate spread made from Philadelphia Light and Cadbury chocolate.
+Kraft believes the new product has the potential to do very well and is targeting £10m in sales in the first year.
+The new cheese and chocolate spread is being launched on 1 February and will be appear in the chilled dairy aisle next to plain Philadelphia Light.
+It is launching in a 160g tub and a 120g four-pack of mini tubs, both with an rsp of £1.62.
+Kraft is supporting the launch with a £3.2m marketing budget in 2012 and is targeting 2,000 tonnes in volume sales – equivalent to about £10m – in the first year.
+If they reached this volume of sales, the new Philadelphia with Cadbury would have the same market value as Garlic & Herb, currently the biggest-selling flavour in the Philadelphia portfolio.
+Kraft already offers chocolate variants of Philadelphia in Italy and Germany, using Milka chocolate and targeting the breakfast occasion.
+In Germany, Philadelphia with Milka has generated €22.2m in sales since its October 2010 launch and has a 6.6% value share of the chocolate spread market.
+Kraft Foods UK marketing manager Bruce Newman said:
+“The UK product would be positioned as a snack.
+“The breakfast market in countries such as Germany is more developed, and our consumer research firmly identified Philadelphia with Cadbury as a snack.”'"""
+
+bad_example = """'|Viewing Single Post From: Spoilers for the Week of February 11th| |Lil||Feb 1 2013, 09:58 AM| Don\'t care about Chloe/Taniel/Jen-Jen . Don\'t care about Sami, really, but hoping that we get some good "SAMANTHA GENE!!" Marlena Death-Stares out of it . And "newfound" feelings . Please . If only . STEFANO!! STEFANO, STEFANO, STEFANO!!!!: cheer: |Spoilers for the Week of February 11th · DAYS: News, Spoilers & Discussion|'"""
+
+with torch.no_grad():
+    good_example_inputs = tokenizer(good_example, return_tensors="pt")
+    bad_example_inputs = tokenizer(bad_example, return_tensors="pt")
+    good_example_outputs = model(**good_example_inputs)
+    bad_example_outputs = model(**bad_example_inputs)
+    print("Good Example Logits:", good_example_outputs.logits)
+    print("Bad Example Logits:", bad_example_outputs.logits)
+```
 
 ## Training procedure
 
@@ -53,3 +111,17 @@ The following hyperparameters were used during training:
 - Pytorch 2.6.0+cu124
 - Datasets 3.3.2
 - Tokenizers 0.21.1
+
+## Citation
+If you find our work useful, please consider citing our [paper](https://arxiv.org/abs/2512.13884)!
+```bibtex
+@misc{golde2025finerwebdatasetsartifactsscalable,
+      title={FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition},
+      author={Jonas Golde and Patrick Haller and Alan Akbik},
+      year={2025},
+      eprint={2512.13884},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2512.13884},
+}
+```
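The usage snippet added by this PR prints raw logits and leaves their interpretation to the reader. As a minimal, self-contained sketch of the post-processing step (assuming, hypothetically, that index 1 of the 2-logit output is the "NER-relevant" class and that 0.5 is a reasonable keep-threshold — neither is stated in the card), a softmax turns logits into probabilities that can drive passage filtering; with `transformers` outputs the equivalent is `torch.softmax(outputs.logits, dim=-1)`:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two passages, shaped like the model's (1, 2)
# output: [not-relevant, relevant]. Real values would come from
# model(**inputs).logits for each passage.
passages = {
    "good": [-1.5, 2.0],
    "bad": [2.4, -0.7],
}

THRESHOLD = 0.5  # assumed cutoff, not taken from the paper or the card

for name, logits in passages.items():
    p_relevant = softmax(logits)[1]
    keep = p_relevant > THRESHOLD
    print(f"{name}: p(relevant)={p_relevant:.3f} keep={keep}")
```

In a corpus-filtering pipeline this loop would run over batches of scored passages, keeping only those above the threshold for downstream LLM annotation.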