Update README.md

README.md (changed)
```diff
@@ -33,10 +33,20 @@ The model architecture consists of the following components:
 
 These additional Transformer layers help mitigate the effects of OCR noise, spelling variation, and non-standard linguistic usage found in historical documents. The entire stack is fine-tuned end-to-end for token classification.
 
 ## Training and Evaluation Results
 
 This evaluation corresponds to the **HIPE-2020 dataset (v2.1)**, using **French and German** combined for training,
 **German (`dev-de`)** for validation, and **French (`test-fr`)** for testing.
+
+#### Training Hyperparameters
+
+- **Training regime:** Mixed precision (fp16)
+- **Epochs:** 5
+- **Max sequence length:** 512
+- **Base model:** `dbmdz/bert-medium-historic-multilingual-cased`
+- **Stacked Transformer layers:** 2
+
+#### Results
 The results below show performance on the **French test set** across multiple evaluation settings.
 
 | **Evaluation** | **Label** | **P** | **R** | **F1** |
```
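The stacking described in the card might look roughly like the sketch below in PyTorch. This is an illustration, not the released implementation: the class name, vocabulary size, and label count are invented, and a random embedding stands in for the pretrained `dbmdz/bert-medium-historic-multilingual-cased` encoder (hidden size 512 for the medium model) so the sketch runs without downloads.

```python
import torch
import torch.nn as nn

class StackedTokenClassifier(nn.Module):
    """Sketch: extra Transformer layers stacked on a base encoder,
    followed by a token-classification head. The embedding below is a
    stand-in for the actual pretrained BERT encoder."""

    def __init__(self, vocab_size=30000, hidden_size=512,
                 num_labels=9, num_stacked_layers=2):
        super().__init__()
        self.base = nn.Embedding(vocab_size, hidden_size)  # stand-in encoder
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=8, batch_first=True)
        self.stacked = nn.TransformerEncoder(layer, num_layers=num_stacked_layers)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        hidden = self.base(input_ids)    # (batch, seq, hidden)
        hidden = self.stacked(hidden)    # the 2 extra layers
        return self.classifier(hidden)   # per-token label logits

model = StackedTokenClassifier()
logits = model(torch.randint(0, 30000, (4, 32)))
print(logits.shape)  # torch.Size([4, 32, 9])
```

In the actual model, `self.base` would be the pretrained encoder, fine-tuned end-to-end together with the stacked layers and the head.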
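The listed hyperparameters map naturally onto Hugging Face `TrainingArguments`. A hedged sketch only: the output path, batch size, and learning-rate choice are assumptions, not values stated in the card.

```python
from transformers import TrainingArguments

# Sketch: the card's hyperparameters as Trainer configuration.
args = TrainingArguments(
    output_dir="stacked-bert-hipe2020",  # hypothetical path (assumption)
    num_train_epochs=5,                  # "Epochs: 5"
    fp16=True,                           # "Mixed precision (fp16)"
    per_device_train_batch_size=16,      # assumption, not from the card
)

# The max sequence length (512) is applied at tokenization time, e.g.:
# tokenizer(text, truncation=True, max_length=512)
```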
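In the results tables, F1 is the harmonic mean of entity-level precision (P) and recall (R). A minimal illustration; the TP/FP/FN counts below are invented to roughly reproduce the French-row figures and are not output of the HIPE scorer.

```python
def entity_prf1(tp: int, fp: int, fn: int):
    """Entity-level precision, recall, and F1 (harmonic mean)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts, chosen only to illustrate the relationship:
p, r, f1 = entity_prf1(tp=80, fp=15, fn=18)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.842 0.816 0.829
```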
````diff
@@ -161,78 +171,6 @@ print(entities)
 ]
 ```
 
-## Training Details
-
-### Training Data
-
-The model was trained on the Impresso HIPE-2020 dataset, a subset of the [HIPE-2022 corpus](https://github.com/hipe-eval/HIPE-2022-data), which includes richly annotated OCR-transcribed historical newspaper content.
-
-### Training Procedure
-
-#### Preprocessing
-
-OCR content was cleaned and segmented. Entity types follow the HIPE-2020 typology.
-
-#### Training Hyperparameters
-
-- **Training regime:** Mixed precision (fp16)
-- **Epochs:** 5
-- **Max sequence length:** 512
-- **Base model:** `dbmdz/bert-medium-historic-multilingual-cased`
-- **Stacked Transformer layers:** 2
-
-#### Speeds, Sizes, Times
-
-- **Model size:** ~500MB
-- **Training time:** ~1h on 1 GPU (NVIDIA TITAN X)
-
-## Evaluation
-
-#### Testing Data
-
-Held-out portion of HIPE-2020 (French, German)
-
-#### Metrics
-
-- F1-score (micro, macro)
-- Entity-level precision/recall
-
-### Results
-
-| Language | Precision | Recall | F1-score |
-|----------|-----------|--------|----------|
-| French | 84.2 | 81.6 | 82.9 |
-| German | 82.0 | 78.7 | 80.3 |
-
-#### Summary
-
-The model performs robustly across noisy OCR historical content with support for fine-grained entity typologies.
-
-## Environmental Impact
-
-- **Hardware Type:** NVIDIA TITAN X (Pascal, 12GB)
-- **Hours used:** ~1 hour
-- **Cloud Provider:** EPFL, Switzerland
-- **Carbon Emitted:** ~0.022 kg CO₂eq (estimated)
-
-## Technical Specifications
-
-### Model Architecture and Objective
-
-Stacked BERT architecture with multitask token classification head supporting HIPE-type entity labels.
-
-### Compute Infrastructure
-
-#### Hardware
-
-1x NVIDIA TITAN X (Pascal, 12GB)
-
-#### Software
-
-- Python 3.11
-- PyTorch 2.0
-- Transformers 4.36
-
````
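The ~0.022 kg CO₂eq figure above is consistent with a simple power × time × grid-intensity estimate. The GPU wattage and grid intensity below are assumptions chosen for illustration, not values from the card.

```python
# Back-of-envelope check of the carbon estimate (assumed inputs):
gpu_power_kw = 0.250     # TITAN X (Pascal) board power ~250 W (assumption)
training_hours = 1.0     # "~1 hour" from the card
grid_kg_per_kwh = 0.088  # assumed low-carbon grid mix (assumption)

emissions_kg = gpu_power_kw * training_hours * grid_kg_per_kwh
print(f"{emissions_kg:.3f} kg CO2eq")  # 0.022 kg CO2eq
```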
## Citation
|
| 237 |
|
| 238 |
**BibTeX:**
|
|
|
|