---
library_name: transformers
language:
- fr
- de
- en
- it
- lb
license: agpl-3.0
tags:
- language-identification
- multilingual
- historical
- impresso
---

# Model Card for `impresso-project/language-identifier`

## Overview

`impresso-project/language-identifier` is a multilingual language identification model trained for use on historical newspaper content. It supports **German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb)**, the core languages of the [Impresso Project](https://impresso-project.ch), which focuses on analyzing historical media across national and linguistic borders.

This model has been adapted for short, OCR-noisy and fragmentary inputs typical of historical digitized texts.


## Model Details

This model is a supervised [floret model](https://github.com/explosion/floret), trained with the following parameters:
```
{'bucket': 200000,
 'dimension': 40,
 'hash_function': 'N/A',
 'loss': 'softmax',
 'maxn': 4,
 'minn': 1,
 'model_type': 'supervised',
 'vocab_size': 3}
```
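To make `minn`, `maxn`, and `bucket` more concrete, here is a minimal, simplified sketch of character n-gram extraction followed by hashing into a fixed number of buckets. It only illustrates the general idea behind fastText-style subword features. The helper names are hypothetical, and `zlib.crc32` is a stand-in, not the hash function floret uses internally.

```python
import zlib

# Values taken from the parameter listing above.
MINN, MAXN, BUCKET = 1, 4, 200_000


def char_ngrams(word, minn=MINN, maxn=MAXN):
    """Return all character n-grams (minn..maxn) of the word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def bucket_ids(word, bucket=BUCKET):
    """Map each n-gram to a row index in [0, bucket).

    Assumption: crc32 is only a placeholder hash for illustration.
    """
    return [zlib.crc32(gram.encode("utf-8")) % bucket for gram in char_ngrams(word)]


grams = char_ngrams("le")
ids = bucket_ids("le")
print(grams)
print(ids)
```

With `minn` set to 1, even single characters contribute features, which plausibly helps on short or OCR-damaged tokens where longer n-grams are unreliable.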

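On the [impresso language identification challenge test set](https://github.com/impresso/dataset-challenge-lid) it achieves the following performance. In the confusion matrix, rows are gold labels and columns are predicted labels, so each row sums to the support given in the classification report. `la` and `nl` occur only as predictions and have no gold examples in the test set, which is why their scores are 0.00 and why the macro average is much lower than the weighted average.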

```
      de   en    fr   it  la   lb  nl
de  2854    0    79    3   0   38   0
en     0  156     1    0   0    0   0
fr    14   11  1515    1   7    9   0
it     0    0     0  136   0    0   0
la     0    0     0    0   0    0   0
lb     6    1    20    0   0  775   1
nl     0    0     0    0   0    0   0

Detailed Classification Report:

              precision    recall  f1-score   support

          de       0.99      0.96      0.98      2974
          en       0.93      0.99      0.96       157
          fr       0.94      0.97      0.96      1557
          it       0.97      1.00      0.99       136
          la       0.00      0.00      0.00         0
          lb       0.94      0.97      0.95       803
          nl       0.00      0.00      0.00         0

    accuracy                           0.97      5627
   macro avg       0.68      0.70      0.69      5627
weighted avg       0.97      0.97      0.97      5627
```
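The classification report can be recomputed from the confusion matrix. The sketch below assumes rows are gold labels and columns are predictions, which is consistent with the support column above. The function name is illustrative.

```python
LABELS = ["de", "en", "fr", "it", "la", "lb", "nl"]

# Confusion matrix copied from above: rows = gold label, columns = predicted label.
CONFUSION = [
    [2854, 0, 79, 3, 0, 38, 0],
    [0, 156, 1, 0, 0, 0, 0],
    [14, 11, 1515, 1, 7, 9, 0],
    [0, 0, 0, 136, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [6, 1, 20, 0, 0, 775, 1],
    [0, 0, 0, 0, 0, 0, 0],
]


def per_class_metrics(labels, matrix):
    """Compute precision, recall, F1 and support for each label."""
    metrics = {}
    for k, label in enumerate(labels):
        true_positives = matrix[k][k]
        support = sum(matrix[k])                  # gold examples of this label
        predicted = sum(row[k] for row in matrix)  # times this label was predicted
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return metrics


metrics = per_class_metrics(LABELS, CONFUSION)
total = sum(sum(row) for row in CONFUSION)
accuracy = sum(CONFUSION[k][k] for k in range(len(LABELS))) / total
macro_precision = sum(m["precision"] for m in metrics.values()) / len(LABELS)

for label, m in metrics.items():
    print(f"{label}  P={m['precision']:.2f}  R={m['recall']:.2f}  "
          f"F1={m['f1']:.2f}  n={m['support']}")
print(f"accuracy={accuracy:.2f}  macro precision={macro_precision:.2f}")
```

Averaging over only the five labels that have gold examples would give a noticeably higher macro score than the seven-label average in the report.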
### Model Description

- **Developed by:** The University of Zurich (UZH), as part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** Language identification using a supervised floret classifier (see Model Details), exposed through a custom `transformers` pipeline
- **Languages:** French, German, English, Italian, Luxembourgish
- **License:** AGPL-3.0
- **Finetuned from:** Custom model trained on historical newspaper data from the Impresso corpus

## How to Use

```python
from transformers import pipeline

MODEL_NAME = "impresso-project/language-identifier"

lang_pipeline = pipeline(
    "langident",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
face à une opportunité."""

langs = lang_pipeline(text)
print(langs)
```

## Output Format

The output is a single dictionary with the predicted language and confidence score:

```python
{
  "language": "fr",
  "score": 1.0
}
```
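Downstream code may want to ignore low-confidence predictions. The following is a minimal sketch, assuming the output has exactly the shape shown above. The helper name, the `0.5` threshold, and the use of the ISO 639-2 code `und` (undetermined) as a fallback are illustrative choices, not part of the model's API.

```python
def accept_prediction(prediction, min_score=0.5, fallback="und"):
    """Return the predicted language, or a fallback tag when the score is too low.

    `prediction` is assumed to be a dict with "language" and "score" keys,
    as in the example output above. The threshold is a hypothetical default.
    """
    if prediction["score"] >= min_score:
        return prediction["language"]
    return fallback


print(accept_prediction({"language": "fr", "score": 1.0}))
print(accept_prediction({"language": "lb", "score": 0.31}))
```

A suitable threshold depends on the corpus and is best chosen on a held-out sample.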


## Use Cases

- Preprocessing for OCR and NLP tasks on historical corpora
- Document and segment-level language tagging
- Filtering and sorting multilingual newspaper archives

## Limitations

- Works best on **sentence- or paragraph-length** texts
- May struggle with code-switching or OCR-degraded text that mixes languages
- Primarily optimized for **Impresso-like sources** (19th–20th century newspapers)
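
Because the model works best on sentence- or paragraph-length input, one option for longer or mixed-language documents is to tag each paragraph separately and then aggregate. The sketch below is one possible approach, not the Impresso pipeline itself. All function names are hypothetical, and `identify` stands for any callable that returns a dict with `language` and `score` keys, such as the `lang_pipeline` object from the usage example if it behaves as shown under Output Format.

```python
from collections import Counter


def split_paragraphs(document):
    """Split on blank lines and drop empty segments."""
    return [part.strip() for part in document.split("\n\n") if part.strip()]


def tag_segments(document, identify):
    """Return (paragraph, prediction) pairs, one per paragraph.

    `identify` is an assumed callable: text -> {"language": ..., "score": ...}.
    """
    return [(paragraph, identify(paragraph)) for paragraph in split_paragraphs(document)]


def document_language(tagged):
    """Pick the language covering the most characters across all paragraphs."""
    votes = Counter()
    for paragraph, prediction in tagged:
        votes[prediction["language"]] += len(paragraph)
    return votes.most_common(1)[0][0] if votes else None


# Stand-in classifier for demonstration only; replace with the real pipeline.
def fake_identify(text):
    return {"language": "fr" if text.startswith("Le") else "de", "score": 1.0}


doc = "Le chat dort.\n\nDer Hund schläft im Garten hinter dem Haus."
tagged = tag_segments(doc, fake_identify)
print([prediction["language"] for _, prediction in tagged])
print(document_language(tagged))
```

Weighting votes by paragraph length keeps a short foreign-language quotation from deciding the label of a whole article. The per-paragraph tags remain available for segment-level use.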

## Installation

```bash
pip install transformers floret
```

## Contact

- Website: [https://impresso-project.ch](https://impresso-project.ch)

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
</p>