GEODE/bert-base-multilingual-cased-geography-entry-classification

Text Classification


lmoncla committed on Apr 15, 2025

Commit ef97fd1 · verified · 1 Parent(s): addc834

Update README.md

Files changed (1)

README.md +18 -1


@@ -44,6 +44,24 @@ The tagset is as follows:
 - **Misc**: encyclopedia entry describing any other type of entity (such as abstract geographic concepts, cross-references to other entries, etc.)
 ## How to Get Started with the Model
@@ -66,7 +84,6 @@ samples = [
 for sample in samples:
     print(pipe(sample))
 # Output
 [{'label': 'Place', 'score': 0.9984947443008423}]
 [{'label': 'Person', 'score': 0.9661000370979309}]

 - **Misc**: encyclopedia entry describing any other type of entity (such as abstract geographic concepts, cross-references to other entries, etc.)
+## Dataset
+The model was trained on a set of 2200 paragraphs randomly selected from 2001 entries of the Encyclopédie.
+All paragraphs were written in French and are distributed as follows among the Encyclopédie knowledge domains:
+The spans/entities were labeled by the project team, with pre-labelling by early models used to speed up the labelling process.
+A train/val/test split was used.
+The validation and test sets each contain 200 paragraphs: 100 classified as 'Géographie' and 100 from another knowledge domain.
+The splits have the following breakdown by label.
+| | Train | Validation | Test |
+|---|:---:|:---:|:---:|
+| Place | 707 | 125 | 147 |
+| Person | 123 | 22 | 26 |
+| Misc | 197 | 35 | 41 |
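As a quick sanity check on the table above, the following Python sketch computes the total count and each label's share for every split. The counts are copied from the table; the names `counts` and `label_shares` are illustrative only and are not part of the model or its training code.

```python
# Counts copied from the table above (label -> count, per split).
counts = {
    "Train": {"Place": 707, "Person": 123, "Misc": 197},
    "Validation": {"Place": 125, "Person": 22, "Misc": 35},
    "Test": {"Place": 147, "Person": 26, "Misc": 41},
}


def label_shares(split_counts):
    """Return each label's fraction of the split total."""
    total = sum(split_counts.values())
    return {label: n / total for label, n in split_counts.items()}


# Total number of labeled items per split, and per-label fractions.
totals = {split: sum(c.values()) for split, c in counts.items()}
shares = {split: label_shares(c) for split, c in counts.items()}

for split in counts:
    summary = ", ".join(
        f"{label}: {share:.1%}" for label, share in shares[split].items()
    )
    print(f"{split} (n={totals[split]}): {summary}")
```

With these counts, Place makes up roughly 69% of the labels in every split, so the three splits have closely matching label distributions.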
 ## How to Get Started with the Model
 for sample in samples:
     print(pipe(sample))
 # Output
 [{'label': 'Place', 'score': 0.9984947443008423}]
 [{'label': 'Person', 'score': 0.9661000370979309}]