Update raw_data/readme.md
raw_data/readme.md CHANGED +13 -0
@@ -0,0 +1,13 @@
Labeled datasets for the Standard Based Impact Classification method, extracted from approximately 230 CSR GRI reports (150 international companies, 2017-2021 period).
- labels_ipnit_paragraphs 22k records
- labels_ipnit_sentences 75k records
- labels_iporit_paragraphs 57k records
- labels_iporit_sentences 193k records

Automatic labeling
*ipnit* stands for the "index-page **AND** in-text" criterion for label identification. *iporit* stands for the "index-page **OR** in-text" criterion for label identification.
index-page means the algorithm searches for an index page within the PDF file and extracts page numbers from it. in-text means the algorithm searches for the label using a regex on each page of the report.
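
The two signals described above can be sketched roughly as follows. This is only an illustration, not the labeling code used to build these datasets: the regex patterns, the index-line layout (`<code> <title> <page numbers>`), and the function names `in_text_hits` and `index_page_hits` are all assumptions, and the pages are taken as already-extracted plain text.

```python
import re

# ASSUMPTION: a GRI disclosure code looks like "GRI 302-1".
# The pattern actually used for labeling is not given in this README.
GRI_CODE = re.compile(r"\bGRI\s+(\d{3}-\d{1,2})\b")

# ASSUMPTION: an index line ends with the page numbers,
# e.g. "GRI 302-1 Energy consumption 45, 47".
INDEX_LINE = re.compile(r"(\d{3}-\d{1,2})\b.*?((?:\d+\s*,\s*)*\d+)\s*$")


def in_text_hits(pages):
    """Return {(page_number, code)} for codes matched in each page's text."""
    hits = set()
    for page_no, text in enumerate(pages, start=1):  # pages are 1-based
        for code in GRI_CODE.findall(text):
            hits.add((page_no, code))
    return hits


def index_page_hits(index_text):
    """Return {(page_number, code)} read from the lines of an index page."""
    hits = set()
    for line in index_text.splitlines():
        match = INDEX_LINE.search(line)
        if match is None:
            continue  # line has no code followed by page numbers
        code, page_list = match.groups()
        for page in re.findall(r"\d+", page_list):
            hits.add((int(page), code))
    return hits


# Toy input standing in for text extracted from a report PDF.
pages = ["Introduction", "Our GRI 302-1 energy data", "GRI 305-1 and GRI 302-1"]
index_text = "GRI 302-1 Energy consumption 2, 3\nGRI 305-1 Direct emissions 7"

text_hits = in_text_hits(pages)
index_hits = index_page_hits(index_text)
```

Representing both signals as sets of `(page, code)` pairs is a design choice made here for simplicity; it makes the two sources directly comparable.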
*ipnit* is the stricter condition: text counts as labeled only when both of the above methods return true. *iporit* is the more relaxed version: text counts as labeled when either of the two returns true.
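
A minimal sketch of how the two criteria differ, assuming each method yields a set of `(page, label)` pairs. The function name `combine_hits` and the set-of-pairs representation are hypothetical and chosen only for illustration; the README does not describe the actual data structures.

```python
def combine_hits(index_hits, text_hits, mode):
    """Combine index-page and in-text hits under the given criterion.

    ASSUMPTION: both arguments are sets of (page, label) pairs.
    """
    if mode == "ipnit":   # index-page AND in-text: both methods must agree
        return index_hits & text_hits
    if mode == "iporit":  # index-page OR in-text: either method suffices
        return index_hits | text_hits
    raise ValueError(f"unknown mode: {mode!r}")


# Toy hits, not taken from the datasets.
index_hits = {(45, "302-1"), (47, "302-1")}
text_hits = {(45, "302-1"), (50, "305-1")}

strict = combine_hits(index_hits, text_hits, "ipnit")
relaxed = combine_hits(index_hits, text_hits, "iporit")
```

Under this reading the *ipnit* result is always a subset of the *iporit* result, which is consistent with the smaller record counts listed for the *ipnit* datasets.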
No significant increase in the accuracy of the prediction model was observed when using *ipnit* compared to *iporit*.