lmoncla committed on
Commit
ef97fd1
·
verified ·
1 Parent(s): addc834

Update README.md

Files changed (1)
  1. README.md +18 -1
README.md CHANGED
@@ -44,6 +44,24 @@ The tagset is as follows:
44
  - **Misc**: encyclopedia entry describing any other type of entity (such as abstract geographic concepts, cross-references to other entries, etc.)
45
 
46
 
47
 
48
  ## How to Get Started with the Model
49
 
@@ -66,7 +84,6 @@ samples = [
66
  for sample in samples:
67
  print(pipe(sample))
68
 
69
-
70
  # Output
71
  [{'label': 'Place', 'score': 0.9984947443008423}]
72
  [{'label': 'Person', 'score': 0.9661000370979309}]
 
44
  - **Misc**: encyclopedia entry describing any other type of entity (such as abstract geographic concepts, cross-references to other entries, etc.)
45
 
46
 
47
+ ## Dataset
48
+
49
+
50
+ The model was trained on 2200 paragraphs randomly selected from 2001 entries of the Encyclopédie.
51
+ All paragraphs were written in French and are distributed as follows among the Encyclopédie knowledge domains:
52
+
53
+ The spans/entities were labelled by the project team, with pre-labelling by early models used to speed up the labelling process.
54
+ The data were split into training, validation, and test sets.
55
+ Validation and test sets are composed of 200 paragraphs each: 100 classified as 'Géographie' and 100 from another knowledge domain.
56
+ The datasets have the following breakdown of tokens and spans/entities.
57
+
58
+ | | Train | Validation | Test |
59
+ |---|:---:|:---:|:---:|
60
+ | Place | 707 | 125 | 147 |
61
+ | Person | 123 | 22 | 26 |
62
+ | Misc | 197 | 35 | 41 |
63
+
64
+
65
 
66
  ## How to Get Started with the Model
67
 
 
84
  for sample in samples:
85
  print(pipe(sample))
86
 
 
87
  # Output
88
  [{'label': 'Place', 'score': 0.9984947443008423}]
89
  [{'label': 'Person', 'score': 0.9661000370979309}]
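
The output above is a list with one dict per prediction, each carrying a `label` and a `score`. A minimal sketch of reading the top label from such results, using only the two output lines shown in this diff; `top_label` and its `threshold` parameter are illustrative names, not part of the model card:

```python
# Results copied from the "# Output" lines in the README;
# each call to pipe(sample) yields a list of {'label', 'score'} dicts.
outputs = [
    [{'label': 'Place', 'score': 0.9984947443008423}],
    [{'label': 'Person', 'score': 0.9661000370979309}],
]


def top_label(prediction, threshold=0.5):
    """Return the highest-scoring label, or None if its score is below
    the threshold (the 0.5 default is an arbitrary, hypothetical choice)."""
    best = max(prediction, key=lambda p: p['score'])
    return best['label'] if best['score'] >= threshold else None


labels = [top_label(p) for p in outputs]
print(labels)  # ['Place', 'Person']
```

Raising `threshold` lets low-confidence predictions fall through as `None`, which may be useful when entries need manual review.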