Update README.md
README.md
CHANGED
@@ -1,3 +1,97 @@
---
language:
- en
license: mit
tags:
- token-classification
- entity-recognition
- foundation-model
- feature-extraction
- RoBERTa
- generic
datasets:
- numind/NuNER
pipeline_tag: token-classification
inference: false
---

# SOTA Entity Recognition English Foundation Model by NuMind 🔥

This model provides token embeddings for the entity recognition task in English and achieves the best few-shot scores among the encoders compared in the Metrics table of this README. It is an improved version of the model from our [**paper**](https://arxiv.org/abs/2402.15343).

**Check out other models by NuMind:**
* SOTA Multilingual Entity Recognition Foundation Model: [link](https://huggingface.co/numind/entity-recognition-multilingual-general-sota-v1)
* SOTA Sentiment Analysis Foundation Model: [English](https://huggingface.co/numind/generic-sentiment-v1), [Multilingual](https://huggingface.co/numind/generic-sentiment-multi-v1)

## About

NuNER v2.0 is [RoBERTa-base](https://huggingface.co/roberta-base) fine-tuned on an expanded version of the [NuNER data](https://huggingface.co/datasets/numind/NuNER) using the contrastive learning procedure from [**NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data**](https://arxiv.org/abs/2402.15343).
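
As a rough illustration of what a contrastive objective over token embeddings can look like, the sketch below scores every token vector against every concept vector with a dot product and applies binary cross-entropy to the result. This is a generic formulation, not the authors' training code: the function name `token_concept_bce`, the toy vectors, and the assumption that concepts are embedded as single vectors are all hypothetical, and the paper remains the reference for the actual procedure.

```python
import math


def token_concept_bce(token_vecs, concept_vecs, targets):
    """Mean binary cross-entropy over all (token, concept) pairs.

    token_vecs:   list of token embedding vectors (hypothetical inputs)
    concept_vecs: list of concept embedding vectors (hypothetical inputs)
    targets:      targets[i][j] is 1 if token i carries concept j, else 0
    """
    total, count = 0.0, 0
    for token, row in zip(token_vecs, targets):
        for concept, y in zip(concept_vecs, row):
            # Dot product is the similarity logit for this pair.
            z = sum(a * b for a, b in zip(token, concept))
            # Numerically stable form of BCE-with-logits.
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# Toy example: two tokens, two concepts, 2-dimensional embeddings.
tokens = [[2.0, 0.0], [0.0, 2.0]]
concepts = [[2.0, 0.0], [0.0, 2.0]]
matching = [[1, 0], [0, 1]]      # each token labelled with its aligned concept
mismatched = [[0, 1], [1, 0]]    # labels swapped

print(token_concept_bce(tokens, concepts, matching))
print(token_concept_bce(tokens, concepts, mismatched))
```

Minimising such a loss pulls a token's embedding toward the concepts it is annotated with and pushes it away from the others, which is why the matching labels give the lower value in this toy case.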

**Metrics:**

Read more about the evaluation protocol and datasets in our [**paper**](https://arxiv.org/abs/2402.15343).

The table below shows the performance of the models aggregated over several datasets. Here, k=X means that only X examples per class were used as training data; the model was trained on them and then evaluated on the full test set.

| Model | k=1 | k=4 | k=16 | k=64 |
|----------|----------|----------|----------|----------|
| RoBERTa-base | 24.5 | 44.7 | 58.1 | 65.4 |
| RoBERTa-base + NER-BERT pre-training | 32.3 | 50.9 | 61.9 | 67.6 |
| NuNER v1.0 | 39.4 | 59.6 | 67.8 | 71.5 |
| NuNER v2.0 | **43.6** | **60.1** | **68.2** | **72.0** |
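
The k=X setting described above can be sketched as a per-class subsampling step. This is a simplified illustration rather than the evaluation code from the paper: it assumes each training example is a `(text, label)` pair with a single label, and the helper name `sample_k_per_class` and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def sample_k_per_class(examples, k, seed=0):
    """Return at most k examples per label from (text, label) pairs."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        # Classes with fewer than k examples contribute everything they have.
        subset.extend(rng.sample(pool, min(k, len(pool))))
    return subset


# Toy data, purely illustrative.
train = [
    ("Paris", "LOC"), ("Berlin", "LOC"), ("Rome", "LOC"),
    ("NuMind", "ORG"), ("ACME", "ORG"),
]
few_shot_train = sample_k_per_class(train, k=2)
print(few_shot_train)
```

A model would then be trained on `few_shot_train` only and scored on the untouched test set, once for each value of k.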

NuNER v1.0 performs similarly to 7B LLMs built specifically for the NER task, which are about 70 times larger than NuNER v1.0. Because NuNER v2.0 outperforms v1.0 in the table above, we expect it to compare even more favorably with such 7B LLMs:

| Model | k=8~16 | k=64~128 |
|----------|----------|----------|
| UniversalNER (7B) | 57.89 ± 4.34 | 71.02 ± 1.53 |
| NuNER v1.0 (100M) | 58.75 ± 0.93 | 70.30 ± 0.35 |
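
The comparison above rests on two pieces of arithmetic: the size ratio between the models and whether the reported scores overlap. The sketch below checks both from the numbers in the table. It assumes the value after ± is a symmetric spread around the mean (for example a standard deviation over runs), which this README does not state; the helper names are hypothetical.

```python
def interval(mean, spread):
    """Closed interval [mean - spread, mean + spread]."""
    return (mean - spread, mean + spread)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# (mean, spread) pairs copied from the table above.
scores = {
    "k=8~16": {"UniversalNER (7B)": (57.89, 4.34), "NuNER v1.0 (100M)": (58.75, 0.93)},
    "k=64~128": {"UniversalNER (7B)": (71.02, 1.53), "NuNER v1.0 (100M)": (70.30, 0.35)},
}

# 7B parameters versus 100M parameters.
size_ratio = 7_000_000_000 / 100_000_000

similar = {
    setting: overlaps(
        interval(*row["UniversalNER (7B)"]),
        interval(*row["NuNER v1.0 (100M)"]),
    )
    for setting, row in scores.items()
}

print(size_ratio)
print(similar)
```

Under that assumption the intervals overlap in both settings, which is one way to read "similar performance" for a model 70 times smaller.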

## Usage

The embeddings can be used out of the box, or the model can be fine-tuned on a specific dataset.

Get embeddings:

```python
import torch
import transformers

# Load the encoder; output_hidden_states=True exposes the hidden states of every layer.
model = transformers.AutoModel.from_pretrained(
    'numind/NuNER-v2.0',
    output_hidden_states=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    'numind/NuNER-v2.0'
)

text = [
    "NuMind is an AI company based in Paris and USA.",
    "See other models from us on https://huggingface.co/numind"
]
# Pad to the longest sentence in the batch and truncate to the model's maximum length.
encoded_input = tokenizer(
    text,
    return_tensors='pt',
    padding=True,
    truncation=True
)

# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    output = model(**encoded_input)

# Last-layer hidden states, shape (batch, sequence_length, hidden_size).
emb = output.hidden_states[-1]
```
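
One way to use the embeddings out of the box, without fine-tuning, is a nearest-prototype classifier: average the token vectors of a few labelled examples per entity type, then assign each new token to the most similar prototype. The sketch below is only an illustration of that idea and is not described in this README or attributed to the authors. The helper names are hypothetical, and it works on plain Python lists with toy 2-dimensional vectors so that it runs without downloading the model.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_prototypes(vectors, labels):
    """One prototype (mean vector) per label."""
    grouped = defaultdict(list)
    for vector, label in zip(vectors, labels):
        grouped[label].append(vector)
    return {label: mean_vector(group) for label, group in grouped.items()}


def predict(vector, prototypes):
    """Label whose prototype is most similar to the given vector."""
    return max(prototypes, key=lambda label: cosine(vector, prototypes[label]))


# Toy stand-ins for token embeddings; real vectors would come from `emb`.
support_vectors = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
support_labels = ["ORG", "ORG", "LOC", "LOC"]

prototypes = fit_prototypes(support_vectors, support_labels)
print(predict([0.8, 0.2], prototypes))
```

With the snippet above, `emb[i, j].tolist()` gives the vector for token `j` of sentence `i`. Positions where `encoded_input['attention_mask'][i, j]` is 0 are padding and should be skipped.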

## Citation
```
@misc{bogdanov2024nuner,
      title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data},
      author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
      year={2024},
      eprint={2402.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```