emanuelaboros committed
Commit 0a08c88 · verified · 1 Parent(s): 3d7d5dd

Update README.md

Files changed (1):
  1. README.md (+11 -73)
README.md CHANGED
@@ -33,10 +33,20 @@ The model architecture consists of the following components:
 
 These additional Transformer layers help in mitigating the effects of OCR noise, spelling variation, and non-standard linguistic usage found in historical documents. The entire stack is fine-tuned end-to-end for token classification.
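The stacking described above can be sketched in PyTorch. This is an illustrative assumption, not the released implementation: two extra `nn.TransformerEncoderLayer`s run over the base encoder's hidden states before a token-classification head. The hidden size of 512 matches `bert-medium`; the label count of 9 and the random stand-in hidden states are placeholders.

```python
import torch
import torch.nn as nn

# Sketch (assumption, not the released code): two extra Transformer
# encoder layers stacked on the base encoder, then a classification head.
hidden_size, num_labels = 512, 9  # 512 = bert-medium width; 9 is a placeholder

extra_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size, nhead=8, batch_first=True
)
stacked = nn.TransformerEncoder(extra_layer, num_layers=2)
classifier = nn.Linear(hidden_size, num_labels)

# Stand-in for the base model's last hidden states: (batch, seq_len, hidden)
hidden_states = torch.randn(1, 16, hidden_size)
logits = classifier(stacked(hidden_states))
print(logits.shape)  # torch.Size([1, 16, 9]) -- one logit vector per token
```

Fine-tuning end-to-end then simply means backpropagating the token-classification loss through these extra layers and the base encoder together.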
 
-## Training and Evaluation Results (v2)
 
 This evaluation corresponds to the **HIPE-2020 dataset (v2.1)**, using **French and German** combined for training,
 **German (`dev-de`)** for validation, and **French (`test-fr`)** for testing.
 The results below show performance on the **French test set** across multiple evaluation settings.
 
 | **Evaluation** | **Label** | **P** | **R** | **F1** |
@@ -161,78 +171,6 @@ print(entities)
 ]
 ```
 
-## Training Details
-
-### Training Data
-
-The model was trained on the Impresso HIPE-2020 dataset, a subset of the [HIPE-2022 corpus](https://github.com/hipe-eval/HIPE-2022-data), which includes richly annotated OCR-transcribed historical newspaper content.
-
-### Training Procedure
-
-#### Preprocessing
-
-OCR content was cleaned and segmented. Entity types follow the HIPE-2020 typology.
-
-#### Training Hyperparameters
-
-- **Training regime:** Mixed precision (fp16)
-- **Epochs:** 5
-- **Max sequence length:** 512
-- **Base model:** `dbmdz/bert-medium-historic-multilingual-cased`
-- **Stacked Transformer layers:** 2
-
-#### Speeds, Sizes, Times
-
-- **Model size:** ~500MB
-- **Training time:** ~1h on 1 GPU (NVIDIA TITAN X)
-
-## Evaluation
-
-#### Testing Data
-
-Held-out portion of HIPE-2020 (French, German)
-
-#### Metrics
-
-- F1-score (micro, macro)
-- Entity-level precision/recall
-
-### Results
-
-| Language | Precision | Recall | F1-score |
-|----------|-----------|--------|----------|
-| French   | 84.2      | 81.6   | 82.9     |
-| German   | 82.0      | 78.7   | 80.3     |
-
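The F1 column in the results table above is the harmonic mean of precision and recall; a quick check reproduces both rows:

```python
# F1 is the harmonic mean of precision (P) and recall (R):
# F1 = 2 * P * R / (P + R)
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

print(round(f1(84.2, 81.6), 1))  # French: 82.9
print(round(f1(82.0, 78.7), 1))  # German: 80.3
```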
-#### Summary
-
-The model performs robustly across noisy OCR historical content with support for fine-grained entity typologies.
-
-## Environmental Impact
-
-- **Hardware Type:** NVIDIA TITAN X (Pascal, 12GB)
-- **Hours used:** ~1 hour
-- **Cloud Provider:** EPFL, Switzerland
-- **Carbon Emitted:** ~0.022 kg CO₂eq (estimated)
-
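The carbon figure quoted above is consistent with the usual estimate of energy times grid carbon intensity. The ~250 W draw (the TITAN X's rated TDP) and the ~0.09 kg CO₂eq/kWh grid intensity below are assumptions for illustration, not values from the card:

```python
# Rough carbon estimate: energy (kWh) x grid carbon intensity (kg CO2eq/kWh).
# The 0.25 kW draw (TITAN X TDP) and 0.09 kg/kWh intensity are assumptions.
def co2_kg(power_kw: float, hours: float, intensity_kg_per_kwh: float) -> float:
    return power_kw * hours * intensity_kg_per_kwh

estimate = co2_kg(power_kw=0.25, hours=1.0, intensity_kg_per_kwh=0.09)
# estimate is about 0.0225 kg CO2eq, i.e. the ~0.022 kg quoted above
```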
-## Technical Specifications
-
-### Model Architecture and Objective
-
-Stacked BERT architecture with multitask token classification head supporting HIPE-type entity labels.
-
-### Compute Infrastructure
-
-#### Hardware
-
-1x NVIDIA TITAN X (Pascal, 12GB)
-
-#### Software
-
-- Python 3.11
-- PyTorch 2.0
-- Transformers 4.36
-
 ## Citation
 
 **BibTeX:**
 
 
 These additional Transformer layers help in mitigating the effects of OCR noise, spelling variation, and non-standard linguistic usage found in historical documents. The entire stack is fine-tuned end-to-end for token classification.
 
+## Training and Evaluation Results
 
 This evaluation corresponds to the **HIPE-2020 dataset (v2.1)**, using **French and German** combined for training,
 **German (`dev-de`)** for validation, and **French (`test-fr`)** for testing.
+
+#### Training Hyperparameters
+
+- **Training regime:** Mixed precision (fp16)
+- **Epochs:** 5
+- **Max sequence length:** 512
+- **Base model:** `dbmdz/bert-medium-historic-multilingual-cased`
+- **Stacked Transformer layers:** 2
+
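Mapped to typical Hugging Face `Trainer`-style argument names, the hyperparameter list above would look roughly like this; the key names are illustrative assumptions, not the project's actual training script:

```python
# Illustrative mapping of the card's hyperparameters to common
# Trainer-style argument names (the key names are assumptions).
training_config = {
    "model_name_or_path": "dbmdz/bert-medium-historic-multilingual-cased",
    "num_train_epochs": 5,
    "max_seq_length": 512,
    "fp16": True,             # mixed-precision training
    "num_stacked_layers": 2,  # extra Transformer layers on top of the encoder
}

for key, value in sorted(training_config.items()):
    print(f"{key}: {value}")
```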
+#### Results
 The results below show performance on the **French test set** across multiple evaluation settings.
 
 | **Evaluation** | **Label** | **P** | **R** | **F1** |
 ]
 ```
 
 ## Citation
 
 **BibTeX:**