---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- multilingual
license: agpl-3.0
language:
- de
- fr
- en
- lb
base_model:
- Alibaba-NLP/gte-multilingual-base
---

# THIS IS A PREVIEW MODEL for the IMPRESSO HALLOWEEN WORKSHOP

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

## Model Details

This model was adapted to be more robust to OCR noise in German and French. It is particularly useful for libraries and archives in Central Europe that want to perform semantic search and longitudinal studies within their collections.

This is an [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) model that was further adapted by Michail et al. (2025).

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('impresso-project/halloween_workshop_ocr_robust_preview')
embeddings = model.encode(sentences)
print(embeddings)
```
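For semantic search, encoded documents can be ranked against an encoded query by similarity. A minimal numpy sketch of the ranking step (the vectors below are stand-ins; in practice they come from `model.encode(...)`):

```python
import numpy as np

def rank_by_similarity(query_vec, corpus_vecs):
    # This model's outputs are unit-normalized (see the Normalize() module
    # in the architecture below), so a plain dot product is already a
    # cosine similarity.
    scores = corpus_vecs @ query_vec
    order = np.argsort(-scores)  # best match first
    return order, scores[order]

# Stand-in unit vectors, not actual model outputs.
query = np.array([1.0, 0.0])
corpus = np.array([[0.6, 0.8],
                   [1.0, 0.0],
                   [0.0, 1.0]])
order, scores = rank_by_similarity(query, corpus)
print(order)  # [1 0 2]
```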


## Evaluation Results

I will add the model-specific evaluation results once the instance is running again.

## Training Details

### Training Dataset

### Contrastive Training
The model was trained with the following parameters:

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss` with parameters:
  ```
  {'scale': 20.0, 'similarity_fct': 'cos_sim'}
  ```
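`MultipleNegativesRankingLoss` treats each (anchor, positive) pair in a batch as the correct match and uses the other positives in the batch as negatives: scaled cosine similarities form a score matrix, and a cross-entropy loss pushes the diagonal (matching pairs) above the off-diagonal entries. A minimal numpy sketch of what this loss computes, assuming unit-normalized embeddings and the `scale=20.0` listed above:

```python
import numpy as np

def mnrl_loss(anchors, positives, scale=20.0):
    # anchors, positives: (batch, dim) unit-normalized embeddings.
    # Score matrix: cosine similarity of every anchor with every positive,
    # multiplied by the scale factor from the training config.
    scores = scale * anchors @ positives.T
    # Cross-entropy where the matching positive (the diagonal) is the label.
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))
```

With perfectly matching pairs (each anchor identical to its positive and orthogonal to the rest) the loss approaches zero; with mismatched pairs it grows large.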

Parameters of the fit()-Method:
```
{
    "epochs": 1,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 250,
    "weight_decay": 0.01
}
```


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
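The final `Normalize()` module L2-normalizes every output, so embeddings are 768-dimensional unit vectors and a dot product between two of them equals their cosine similarity. A small numpy illustration of that invariant (using random stand-in vectors, not actual model outputs):

```python
import numpy as np

# Stand-in for model.encode(sentences): two random 768-d vectors.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2, 768))
# Reproduce what the Normalize() module does: L2-normalize each row.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

assert embeddings.shape[1] == 768  # matches word_embedding_dimension
assert np.allclose(np.linalg.norm(embeddings, axis=1), 1.0)
```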

## Citation

### BibTeX

#### Cheap Character Noise for OCR-Robust Multilingual Embeddings (introducing paper)

For details on the adaptation methodology, please refer to our paper (published in ACL 2025 Findings). If you use our models or methodology, please cite our work.

```bibtex
@inproceedings{michail-etal-2025-cheap,
    title = "Cheap Character Noise for {OCR}-Robust Multilingual Embeddings",
    author = "Michail, Andrianos  and
      Opitz, Juri  and
      Wang, Yining  and
      Meister, Robin  and
      Sennrich, Rico  and
      Clematide, Simon",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.609/",
    doi = "10.18653/v1/2025.findings-acl.609",
    pages = "11705--11716",
    ISBN = "979-8-89176-256-5",
}
```


#### Original Multilingual GTE Model

```bibtex
@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}
```

## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2025 The Impresso team.

### License

This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>