nielsr (HF Staff) committed on
Commit cc552bc · verified · 1 Parent(s): b38e1c1

Improve model card: Add pipeline tag, paper/code links, and usage example


This PR significantly improves the model card for `finerweb-binary-classifier-mdeberta-gemma3` by:

- Adding the `pipeline_tag: text-classification` to the metadata, enabling easier discovery on the Hub and activating the inference widget.
- Expanding the `tags` to include `text-classification`, `named-entity-recognition`, and `deberta-v2` for better categorization.
- Adding a direct link to the paper [FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition](https://huggingface.co/papers/2512.13884).
- Providing a link to the associated GitHub code repository: [https://github.com/whoisjones/FiNERweb-code](https://github.com/whoisjones/FiNERweb-code).
- Including a link to the Hugging Face Collection for the project: [https://huggingface.co/collections/whoisjones/finerweb](https://huggingface.co/collections/whoisjones/finerweb).
- Populating the "Model description", "Intended uses & limitations", and "Training and evaluation data" sections with detailed information from the paper abstract and GitHub README.
- Including a detailed Python sample usage snippet directly from the GitHub repository, demonstrating how to load and use the model with `transformers`.
- Adding the BibTeX citation for the paper.
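For quick reference, the updated README front matter should look roughly like this (an abridged sketch assembled from the diff below; field order may differ in the actual file):

```yaml
---
base_model: microsoft/mdeberta-v3-base
library_name: transformers
license: mit
pipeline_tag: text-classification
tags:
- generated_from_trainer
- text-classification
- named-entity-recognition
- deberta-v2
---
```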

Please review and merge if everything looks good.

Files changed (1): README.md (+82 -10)
README.md CHANGED
@@ -1,38 +1,96 @@
---
library_name: transformers
license: mit
- base_model: microsoft/mdeberta-v3-base
- tags:
- - generated_from_trainer
metrics:
- precision
- recall
- accuracy
model-index:
- name: gemma-fineweb-edu-scorer-mdeberta-binary-lr5e-05-20250411_140230
  results: []
---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
# finerweb-binary-classifier-mdeberta-gemma3

- This model is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) on the None dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1614

## Model description

- More information needed

## Intended uses & limitations

- More information needed

## Training and evaluation data

- More information needed

## Training procedure

@@ -53,3 +111,17 @@ The following hyperparameters were used during training:
- Pytorch 2.6.0+cu124
- Datasets 3.3.2
- Tokenizers 0.21.1
---
+ base_model: microsoft/mdeberta-v3-base
library_name: transformers
license: mit
metrics:
- precision
- recall
- accuracy
+ tags:
+ - generated_from_trainer
+ - text-classification
+ - named-entity-recognition
+ - deberta-v2
+ pipeline_tag: text-classification
model-index:
- name: gemma-fineweb-edu-scorer-mdeberta-binary-lr5e-05-20250411_140230
  results: []
---

# finerweb-binary-classifier-mdeberta-gemma3

+ This model is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) for identifying passages relevant to Named Entity Recognition (NER). It is part of the work presented in the paper [FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition](https://huggingface.co/papers/2512.13884).
+
+ - **Paper**: [FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition](https://huggingface.co/papers/2512.13884)
+ - **Code**: [https://github.com/whoisjones/FiNERweb-code](https://github.com/whoisjones/FiNERweb-code)
+ - **Hugging Face Collection**: [https://huggingface.co/collections/whoisjones/finerweb](https://huggingface.co/collections/whoisjones/finerweb)
+
It achieves the following results on the evaluation set:
- Loss: 0.1614

## Model description

+ This model, `finerweb-binary-classifier-mdeberta-gemma3`, is a binary classifier fine-tuned from `microsoft/mdeberta-v3-base`. It is designed to identify text passages that are relevant for Named Entity Recognition (NER). The model is a core component of the FiNERweb project, which focuses on creating scalable multilingual NER datasets: the FiNERweb pipeline scales the teacher-student paradigm to 91 languages and 25 scripts, training scoring models like this one to identify NER-relevant passages for subsequent annotation by multilingual Large Language Models (LLMs).

## Intended uses & limitations

+ This model is primarily intended as a passage scorer that pre-selects text segments likely to contain named entities. It can serve as an efficient filter within larger NER data annotation pipelines to streamline the creation of high-quality multilingual NER datasets.
+
+ **Intended uses:**
+ - Identifying NER-relevant text passages in large corpora.
+ - Supporting scalable teacher-student training paradigms for multilingual NER.
+ - Serving as a component in pipelines that generate synthetic NER training data.
+
+ **Limitations:**
+ - The model performs binary classification (NER-relevant vs. not relevant) and does not output specific entity types or spans.
+ - While designed for multilingual applicability (91 languages, 25 scripts), performance may vary across languages and text domains. The paper notes that performance can drop when evaluating with target-language labels rather than English labels.

## Training and evaluation data

+ This model was fine-tuned on the FiNERweb dataset, a large-scale multilingual NER dataset generated with a teacher-student paradigm. Building on FineWeb-Edu, the dataset creation process trains scoring models (like this one) to identify NER-relevant passages, which are then annotated with multilingual LLMs. The resulting FiNERweb dataset comprises approximately 225k passages with 235k distinct entity labels across 91 languages and 25 scripts.
+
+ The FiNERweb dataset can be loaded with the `datasets` library:
+ ```python
+ from datasets import load_dataset
+
+ finerweb = load_dataset('whoisjones/finerweb')
+ finerweb_de = load_dataset('whoisjones/finerweb', split='deu')
+ ```
+
+ ## How to use
+
+ You can load and use this model with the `transformers` library to classify text passages for NER relevance:
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+ import torch
+
+ model = AutoModelForSequenceClassification.from_pretrained("whoisjones/finerweb-binary-classifier-mdeberta-gemma3")
+ tokenizer = AutoTokenizer.from_pretrained("whoisjones/finerweb-binary-classifier-mdeberta-gemma3")
+
+ good_example = """'Kraft Foods has taken the Cadbury chocolate brand in a new direction, by combining it with cheese for the first time.
+ The company is bringing together two of its brands and launching Philadelphia with Cadbury, a chilled chocolate spread made from Philadelphia Light and Cadbury chocolate.
+ Kraft believes the new product has the potential to do very well and is targeting £10m in sales in the first year.
+ The new cheese and chocolate spread is being launched on 1 February and will be appear in the chilled dairy aisle next to plain Philadelphia Light.
+ It is launching in a 160g tub and a 120g four-pack of mini tubs, both with an rsp of £1.62.
+ Kraft is supporting the launch with a £3.2m marketing budget in 2012 and is targeting 2,000 tonnes in volume sales – equivalent to about £10m – in the first year.
+ If they reached this volume of sales, the new Philadelphia with Cadbury would have the same market value as Garlic & Herb, currently the biggest-selling flavour in the Philadelphia portfolio.
+ Kraft already offers chocolate variants of Philadelphia in Italy and Germany, using Milka chocolate and targeting the breakfast occasion.
+ In Germany, Philadelphia with Milka has generated €22.2m in sales since its October 2010 launch and has a 6.6% value share of the chocolate spread market.
+ Kraft Foods UK marketing manager Bruce Newman said:
+ “The UK product would be positioned as a snack.
+ “The breakfast market in countries such as Germany is more developed, and our consumer research firmly identified Philadelphia with Cadbury as a snack.”'"""
+
+ bad_example = """'|Viewing Single Post From: Spoilers for the Week of February 11th| |Lil||Feb 1 2013, 09:58 AM| Don\\'t care about Chloe/Taniel/Jen-Jen . Don\\'t care about Sami, really, but hoping that we get some good "SAMANTHA GENE!!" Marlena Death-Stares out of it . And "newfound" feelings . Please . If only . STEFANO!! STEFANO, STEFANO, STEFANO!!!!: cheer: |Spoilers for the Week of February 11th · DAYS: News, Spoilers & Discussion|'"""
+
+ with torch.no_grad():
+     good_example_inputs = tokenizer(good_example, return_tensors="pt")
+     bad_example_inputs = tokenizer(bad_example, return_tensors="pt")
+     good_example_outputs = model(**good_example_inputs)
+     bad_example_outputs = model(**bad_example_inputs)
+ print("Good Example Logits:", good_example_outputs.logits)
+ print("Bad Example Logits:", bad_example_outputs.logits)
+ ```
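The snippet above prints raw logits. To turn them into a relevance score you can apply a softmax over the two classes; the sketch below does this in plain Python with made-up logit values, and assumes class index 1 corresponds to "NER-relevant" (the label mapping is not documented in this card, so confirm it via `model.config.id2label` before relying on it):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-class logits of the shape the classifier head returns;
# these are stand-in values, not real model output.
good_logits = [-2.1, 3.4]
bad_logits = [2.8, -1.9]

# Assuming index 1 = "NER-relevant" (check model.config.id2label to confirm).
good_rel = softmax(good_logits)[1]
bad_rel = softmax(bad_logits)[1]
print(f"good passage relevance: {good_rel:.3f}")
print(f"bad passage relevance: {bad_rel:.3f}")
```

With real model output you would read the logits from `outputs.logits[0].tolist()` and apply the same conversion.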

## Training procedure

- Pytorch 2.6.0+cu124
- Datasets 3.3.2
- Tokenizers 0.21.1
+
+ ## Citation
+ If you find our work useful, please consider citing our [paper](https://arxiv.org/abs/2512.13884)!
+ ```bibtex
+ @misc{golde2025finerwebdatasetsartifactsscalable,
+   title={FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition},
+   author={Jonas Golde and Patrick Haller and Alan Akbik},
+   year={2025},
+   eprint={2512.13884},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2512.13884},
+ }
+ ```