---
library_name: transformers
license: mit
language:
- nl
pipeline_tag: token-classification
widget:
- text: >-
    Vandaag bespreken we Turks Fruit, een meesterwerk van de Nederlandse auteur Jan Wolkers. 
    Dit boek, dat oorspronkelijk werd gepubliceerd in 1969, is een van de meest iconische en controversiële werken in de Nederlandse literatuur.
- text: >-
    Gisteren heb ik het boek Nijntje in de dierentuin gelezen. Ik kan niet
    anders zeggen dat dit boek fantastisch was!
metrics:
- f1
tags:
- Literature
- PyTorch
---

# Model Card for Dutch Book Title Extraction

This Named Entity Recognition (NER) model is designed to extract book titles from Dutch texts.

## Model Details

The model was fine-tuned and evaluated on a Dutch dataset of 12,535 book reviews from the Leeuwarder Courant, containing 23,529 annotated book titles. The dataset uses the IO tagging scheme. The data was split into a training set (70%), a validation set (15%), and a test set (15%). Training used the Majority or Minority loss function. On the test set, the model achieves an F1 score of 84.3%, a precision of 83.4%, and a recall of 85.2%.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/661fcac6ccc447675983951b/Ap95lefSlrwJGDg6eupVF.png)
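
To make the IO tagging scheme mentioned above concrete, here is a minimal sketch of how a tokenized sentence could be labeled: every token inside a title gets an inside tag, and every other token gets `O`. The function name `io_tags` and the label name `I-TITLE` are assumptions for illustration; the actual label names used by this model are not stated in this card.

```python
from typing import List


def io_tags(tokens: List[str], title_tokens: List[str], label: str = "I-TITLE") -> List[str]:
    """Tag every occurrence of title_tokens inside tokens with `label`, the rest with "O".

    `I-TITLE` is a hypothetical label name, not taken from the model's config.
    """
    tags = ["O"] * len(tokens)
    n = len(title_tokens)
    if n == 0:
        return tags
    for i in range(len(tokens) - n + 1):
        # Exact token-sequence match marks the whole window as inside a title.
        if tokens[i:i + n] == title_tokens:
            tags[i:i + n] = [label] * n
    return tags


tokens = "Vandaag bespreken we Turks Fruit van Jan Wolkers".split()
tags = io_tags(tokens, ["Turks", "Fruit"])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Unlike BIO tagging, IO has no separate "begin" tag, so two titles that directly follow each other cannot be told apart from one longer title.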

## Model Description

- **Model type:** XML-RoBERTa
- **Language(s):** Dutch
- **Fine-tuned from model:** [FacebookAI/xlm-roberta-large-finetuned-conll03-english](https://huggingface.co/FacebookAI/xlm-roberta-large-finetuned-conll03-english)

## Model Flaws

- Struggles with accurately identifying subtitles of book titles.
- When a book title is mentioned multiple times within the same review, the model tends to mark it only once, missing subsequent occurrences.
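
One possible workaround for the second flaw is a post-processing step that searches the review for every literal occurrence of a title the model has already found. The sketch below is not part of the model or its training setup; the helper name `find_all_mentions` and the example review are hypothetical, and exact string matching will miss inflected or abbreviated mentions.

```python
import re
from typing import List, Tuple


def find_all_mentions(text: str, titles: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, title) for every literal occurrence of each title in text."""
    mentions = []
    for title in set(titles):
        if not title.strip():
            continue  # skip empty predictions
        for match in re.finditer(re.escape(title), text):
            mentions.append((match.start(), match.end(), title))
    return sorted(mentions)


# Hypothetical review in which the same title appears twice.
review = "Turks Fruit verscheen in 1969. Velen kennen Turks Fruit van de verfilming."
result = find_all_mentions(review, ["Turks Fruit"])
print(result)
```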

## Uses

This model is intended for extracting book titles from Dutch texts, which is particularly useful for text analysis in the literary domain.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("nielsaxe/BookTitleNERDutch")
model = AutoModelForTokenClassification.from_pretrained("nielsaxe/BookTitleNERDutch")

# Create a NER pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# Example usage
text = "Gisteren heb ik het boek Nijntje in de dierentuin gelezen. Ik kan niet anders zeggen dat dit boek fantastisch was!"
entities = nlp(text)
print(entities)
```
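
Without an aggregation strategy, the `ner` pipeline returns one dictionary per predicted token, each with `start` and `end` character offsets. The sketch below shows one way to merge neighbouring tokens into full title strings using those offsets. The helper `merge_title_tokens`, the `max_gap` parameter, and the label `I-TITLE` in the sample data are assumptions for illustration; the sample list is hand-built and is not real model output.

```python
from typing import Dict, List


def merge_title_tokens(text: str, entities: List[Dict], max_gap: int = 1) -> List[str]:
    """Merge token-level predictions into title strings via character offsets.

    Tokens whose start lies at most `max_gap` characters after the previous
    span's end are treated as part of the same title.
    """
    spans: List[List[int]] = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        if spans and ent["start"] - spans[-1][1] <= max_gap:
            spans[-1][1] = max(spans[-1][1], ent["end"])
        else:
            spans.append([ent["start"], ent["end"]])
    return [text[start:end] for start, end in spans]


text = "Gisteren heb ik het boek Nijntje in de dierentuin gelezen."

# Hand-built stand-in for token-level pipeline output (hypothetical label name).
sample = []
pos = text.index("Nijntje")
for word in "Nijntje in de dierentuin".split():
    start = text.index(word, pos)
    sample.append({"entity": "I-TITLE", "word": word, "start": start, "end": start + len(word)})
    pos = start + len(word)

titles = merge_title_tokens(text, sample)
print(titles)
```

Alternatively, passing `aggregation_strategy="simple"` to `pipeline(...)` lets the library group adjacent tokens itself; whether its grouping suits this model's labels has not been verified here.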