juhoinkinen committed on
Commit 3d570df · verified · 1 Parent(s): db11fa2

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .dvc/.gitignore +3 -0
  2. .dvc/config +6 -0
  3. .dvcignore +4 -0
  4. .gitignore +2 -0
  5. LICENSE +121 -0
  6. README.md +49 -0
  7. corpora/.gitignore +1 -0
  8. corpora/fulltext-test/en/.gitignore +4 -0
  9. corpora/fulltext-test/en/abo-theses.dvc +12 -0
  10. corpora/fulltext-test/en/jyu-theses.dvc +13 -0
  11. corpora/fulltext-test/en/kirjaesittelyt2021.dvc +13 -0
  12. corpora/fulltext-test/en/vapaakappaleet-orig.dvc +12 -0
  13. corpora/fulltext-test/fi/.gitignore +8 -0
  14. corpora/fulltext-test/fi/jyu-theses.dvc +13 -0
  15. corpora/fulltext-test/fi/kirjaesittelyt2021.dvc +13 -0
  16. corpora/fulltext-test/fi/kirjastonhoitaja.dvc +13 -0
  17. corpora/fulltext-test/fi/satakunnan-kansa-1.dvc +12 -0
  18. corpora/fulltext-test/fi/satakunnan-kansa-2.dvc +12 -0
  19. corpora/fulltext-test/fi/satakunnan-kansa-3.dvc +12 -0
  20. corpora/fulltext-test/fi/satakunnan-kansa-4.dvc +12 -0
  21. corpora/fulltext-test/fi/vapaakappaleet-orig.dvc +12 -0
  22. corpora/fulltext-test/sv/.gitignore +4 -0
  23. corpora/fulltext-test/sv/abo-theses.dvc +13 -0
  24. corpora/fulltext-test/sv/jyu-theses.dvc +13 -0
  25. corpora/fulltext-test/sv/kirjaesittelyt2021.dvc +13 -0
  26. corpora/fulltext-test/sv/vapaakappaleet-orig.dvc +12 -0
  27. corpora/fulltext-train/en/.gitignore +4 -0
  28. corpora/fulltext-train/en/abo-theses.dvc +12 -0
  29. corpora/fulltext-train/en/jyu-theses.dvc +13 -0
  30. corpora/fulltext-train/en/kirjaesittelyt2021.dvc +13 -0
  31. corpora/fulltext-train/en/vapaakappaleet-orig.dvc +12 -0
  32. corpora/fulltext-train/fi/.gitignore +5 -0
  33. corpora/fulltext-train/fi/jyu-theses.dvc +13 -0
  34. corpora/fulltext-train/fi/kirjaesittelyt2021.dvc +13 -0
  35. corpora/fulltext-train/fi/kirjastonhoitaja.dvc +13 -0
  36. corpora/fulltext-train/fi/satakunnan-kansa.dvc +12 -0
  37. corpora/fulltext-train/fi/vapaakappaleet-orig.dvc +12 -0
  38. corpora/fulltext-train/sv/.gitignore +4 -0
  39. corpora/fulltext-train/sv/abo-theses.dvc +13 -0
  40. corpora/fulltext-train/sv/jyu-theses.dvc +12 -0
  41. corpora/fulltext-train/sv/kirjaesittelyt2021.dvc +13 -0
  42. corpora/fulltext-train/sv/vapaakappaleet-orig.dvc +12 -0
  43. corpora/shorttext-train/en/.gitignore +1 -0
  44. corpora/shorttext-train/en/yso-finna-en.tsv.gz.dvc +11 -0
  45. corpora/shorttext-train/fi/.gitignore +4 -0
  46. corpora/shorttext-train/fi/yso-finna-fi-01.tsv.gz.dvc +11 -0
  47. corpora/shorttext-train/fi/yso-finna-fi-02.tsv.gz.dvc +11 -0
  48. corpora/shorttext-train/fi/yso-finna-fi-03.tsv.gz.dvc +11 -0
  49. corpora/shorttext-train/fi/yso-finna-fi-04.tsv.gz.dvc +11 -0
  50. corpora/shorttext-train/sv/.gitignore +1 -0
.dvc/.gitignore ADDED
@@ -0,0 +1,3 @@
+ /config.local
+ /tmp
+ /cache
.dvc/config ADDED
@@ -0,0 +1,6 @@
+ [cache]
+ dir = /data/dvc-cache/FintoAI-data-YSO
+ shared = group
+ type = symlink
+ [core]
+ autostage = true
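Reviewer note: the config above points the DVC cache at a directory outside the repository, shares it at group level, links workspace files into the cache as symlinks, and lets DVC stage its own metafiles in Git. The sketch below only illustrates how these settings can be inspected; it uses Python's `configparser` as a stand-in for DVC's own config reader (an assumption made for illustration), and `load_dvc_config` is a hypothetical helper name.

```python
import configparser

# The config text as added in this commit.
CONFIG_TEXT = """\
[cache]
dir = /data/dvc-cache/FintoAI-data-YSO
shared = group
type = symlink
[core]
autostage = true
"""


def load_dvc_config(text: str) -> dict:
    """Parse INI-style config text into a plain nested dict.

    Assumption: configparser is used here only for illustration;
    DVC reads its config with its own parser.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


config = load_dvc_config(CONFIG_TEXT)
print(config["cache"]["dir"])
print(config["cache"]["type"])
```

Because the cache directory is an absolute path on the training host, a fresh clone on another machine would presumably need its own setting, for example in the untracked `config.local`.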
.dvcignore ADDED
@@ -0,0 +1,4 @@
+ # Add patterns of files dvc should ignore, which could improve
+ # the performance. Learn more at
+ # https://dvc.org/doc/user-guide/dvcignore
+ *.pdf
.gitignore ADDED
@@ -0,0 +1,2 @@
+ venv
+ venv-installed
LICENSE ADDED
@@ -0,0 +1,121 @@
+ Creative Commons Legal Code
+
+ CC0 1.0 Universal
+
+ CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
+ LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
+ ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
+ INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
+ REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
+ PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
+ THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
+ HEREUNDER.
+
+ Statement of Purpose
+
+ The laws of most jurisdictions throughout the world automatically confer
+ exclusive Copyright and Related Rights (defined below) upon the creator
+ and subsequent owner(s) (each and all, an "owner") of an original work of
+ authorship and/or a database (each, a "Work").
+
+ Certain owners wish to permanently relinquish those rights to a Work for
+ the purpose of contributing to a commons of creative, cultural and
+ scientific works ("Commons") that the public can reliably and without fear
+ of later claims of infringement build upon, modify, incorporate in other
+ works, reuse and redistribute as freely as possible in any form whatsoever
+ and for any purposes, including without limitation commercial purposes.
+ These owners may contribute to the Commons to promote the ideal of a free
+ culture and the further production of creative, cultural and scientific
+ works, or to gain reputation or greater distribution for their Work in
+ part through the use and efforts of others.
+
+ For these and/or other purposes and motivations, and without any
+ expectation of additional consideration or compensation, the person
+ associating CC0 with a Work (the "Affirmer"), to the extent that he or she
+ is an owner of Copyright and Related Rights in the Work, voluntarily
+ elects to apply CC0 to the Work and publicly distribute the Work under its
+ terms, with knowledge of his or her Copyright and Related Rights in the
+ Work and the meaning and intended legal effect of CC0 on those rights.
+
+ 1. Copyright and Related Rights. A Work made available under CC0 may be
+ protected by copyright and related or neighboring rights ("Copyright and
+ Related Rights"). Copyright and Related Rights include, but are not
+ limited to, the following:
+
+ i. the right to reproduce, adapt, distribute, perform, display,
+ communicate, and translate a Work;
+ ii. moral rights retained by the original author(s) and/or performer(s);
+ iii. publicity and privacy rights pertaining to a person's image or
+ likeness depicted in a Work;
+ iv. rights protecting against unfair competition in regards to a Work,
+ subject to the limitations in paragraph 4(a), below;
+ v. rights protecting the extraction, dissemination, use and reuse of data
+ in a Work;
+ vi. database rights (such as those arising under Directive 96/9/EC of the
+ European Parliament and of the Council of 11 March 1996 on the legal
+ protection of databases, and under any national implementation
+ thereof, including any amended or successor version of such
+ directive); and
+ vii. other similar, equivalent or corresponding rights throughout the
+ world based on applicable law or treaty, and any national
+ implementations thereof.
+
+ 2. Waiver. To the greatest extent permitted by, but not in contravention
+ of, applicable law, Affirmer hereby overtly, fully, permanently,
+ irrevocably and unconditionally waives, abandons, and surrenders all of
+ Affirmer's Copyright and Related Rights and associated claims and causes
+ of action, whether now known or unknown (including existing as well as
+ future claims and causes of action), in the Work (i) in all territories
+ worldwide, (ii) for the maximum duration provided by applicable law or
+ treaty (including future time extensions), (iii) in any current or future
+ medium and for any number of copies, and (iv) for any purpose whatsoever,
+ including without limitation commercial, advertising or promotional
+ purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
+ member of the public at large and to the detriment of Affirmer's heirs and
+ successors, fully intending that such Waiver shall not be subject to
+ revocation, rescission, cancellation, termination, or any other legal or
+ equitable action to disrupt the quiet enjoyment of the Work by the public
+ as contemplated by Affirmer's express Statement of Purpose.
+
+ 3. Public License Fallback. Should any part of the Waiver for any reason
+ be judged legally invalid or ineffective under applicable law, then the
+ Waiver shall be preserved to the maximum extent permitted taking into
+ account Affirmer's express Statement of Purpose. In addition, to the
+ extent the Waiver is so judged Affirmer hereby grants to each affected
+ person a royalty-free, non transferable, non sublicensable, non exclusive,
+ irrevocable and unconditional license to exercise Affirmer's Copyright and
+ Related Rights in the Work (i) in all territories worldwide, (ii) for the
+ maximum duration provided by applicable law or treaty (including future
+ time extensions), (iii) in any current or future medium and for any number
+ of copies, and (iv) for any purpose whatsoever, including without
+ limitation commercial, advertising or promotional purposes (the
+ "License"). The License shall be deemed effective as of the date CC0 was
+ applied by Affirmer to the Work. Should any part of the License for any
+ reason be judged legally invalid or ineffective under applicable law, such
+ partial invalidity or ineffectiveness shall not invalidate the remainder
+ of the License, and in such case Affirmer hereby affirms that he or she
+ will not (i) exercise any of his or her remaining Copyright and Related
+ Rights in the Work or (ii) assert any associated claims and causes of
+ action with respect to the Work, in either case contrary to Affirmer's
+ express Statement of Purpose.
+
+ 4. Limitations and Disclaimers.
+
+ a. No trademark or patent rights held by Affirmer are waived, abandoned,
+ surrendered, licensed or otherwise affected by this document.
+ b. Affirmer offers the Work as-is and makes no representations or
+ warranties of any kind concerning the Work, express, implied,
+ statutory or otherwise, including without limitation warranties of
+ title, merchantability, fitness for a particular purpose, non
+ infringement, or the absence of latent or other defects, accuracy, or
+ the present or absence of errors, whether or not discoverable, all to
+ the greatest extent permissible under applicable law.
+ c. Affirmer disclaims responsibility for clearing rights of other persons
+ that may apply to the Work or any use thereof, including without
+ limitation any person's Copyright and Related Rights in the Work.
+ Further, Affirmer disclaims responsibility for obtaining any necessary
+ consents, permissions or other rights required for any use of the
+ Work.
+ d. Affirmer understands and acknowledges that Creative Commons is not a
+ party to this document and has no duty or obligation with respect to
+ this CC0 or use of the Work.
README.md ADDED
@@ -0,0 +1,49 @@
+ ---
+ license: cc0-1.0
+ language:
+ - fi
+ - sv
+ - en
+ pipeline_tag: text-classification
+ thumbnail: https://raw.githubusercontent.com/NatLibFi/FintoAI/main/ai.finto.fi/static/img/finto-ai-social.png
+ tags:
+ - glam
+ - lam
+ - subject indexing
+ - annif
+ ---
+ # FintoAI-data-YSO
+ This repository is for the Annif projects with the
+ [YSO vocabulary](https://finto.fi/yso)
+ used at the [Finto AI service](https://ai.finto.fi/).
+ The current models were published there 2023-09-04.
+ The models have been trained on Python 3.8.10 with [Annif](https://annif.org) version 1.0.0.
+ See [projects.toml](projects.toml) for the configurations of the models.
+
+ This repository is mirrored from GitHub to the 🤗 Hugging Face Hub;
+ the GitHub repository does not contain the model files, but only the configurations for the projects and the DVC pipeline, see below.
+
+ The training corpora that are public can be found from the [Annif-corpora repository](https://github.com/NatLibFi/Annif-corpora/).
+
+ The [notebook](/repository-metrics-analysis/analyse-theseus-tietolinja.ipynb) contains analysis of Annif suggestions in [Theseus repository](https://www.theseus.fi/).
+
+ ## Models
+ The downloadable directories for projects and vocabularies are stored in the
+ [`/data`](https://huggingface.co/juhoinkinen/FintoAI-data-YSO/tree/main/data)
+ directory of this repository in the 🤗 Hugging Face Hub.
+
+ ## DVC pipeline
+ The projects are trained and evaluated using a [DVC (Data Version Control) pipeline](https://dvc.org/doc/start/data-management/data-pipelines) defined in [dvc.yaml](./dvc.yaml).
+
+ The pipeline takes care of
+
+ 1. installing Annif in a venv,
+ 2. loading the vocabulary,
+ 3. training the projects,
+ 4. evaluating the projects.
+
+ When the necessary vocabulary and training corpora are in place the pipeline can be run using the command
+
+     dvc repro
+
+ For more information about using DVC with Annif projects see the [DVC exercise of Annif tutorial](https://github.com/NatLibFi/Annif-tutorial/blob/master/exercises/OPT_dvc.md).
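Reviewer note: the README's `dvc repro` step can also be driven from a script. The following is a minimal sketch, not part of this commit; `build_repro_command` and `run_pipeline` are hypothetical helper names, and any stage names passed as targets would have to match those defined in `dvc.yaml`, which is not shown in this view. It defaults to `--dry`, which only prints the commands DVC would run.

```python
import shutil
import subprocess


def build_repro_command(dry_run: bool = False, targets: tuple = ()) -> list:
    """Build the argument list for `dvc repro`.

    `targets` are optional stage names (hypothetical here, since
    dvc.yaml is not part of the visible diff).
    """
    cmd = ["dvc", "repro"]
    if dry_run:
        # --dry prints the commands without executing them.
        cmd.append("--dry")
    cmd.extend(targets)
    return cmd


def run_pipeline(dry_run: bool = True) -> int:
    """Run the pipeline from the repository root and return the exit code."""
    if shutil.which("dvc") is None:
        raise RuntimeError("dvc executable not found on PATH")
    completed = subprocess.run(build_repro_command(dry_run), check=True)
    return completed.returncode


print(" ".join(build_repro_command(dry_run=True)))
```

Starting with a dry run is a cautious choice here because training the projects is likely to be slow, and the corpora referenced by the `.dvc` files must already be present.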
corpora/.gitignore ADDED
@@ -0,0 +1 @@
+ /yso-skos.ttl
corpora/fulltext-test/en/.gitignore ADDED
@@ -0,0 +1,4 @@
+ /abo-theses
+ /jyu-theses
+ /kirjaesittelyt2021
+ /vapaakappaleet-orig
corpora/fulltext-test/en/abo-theses.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: 3eb7c18baf9f361da8535b971115dc94
+ deps:
+ - md5: 7743e923695da632586b6426318d6ac6.dir
+   size: 181887549
+   nfiles: 306
+   path: /data/Annif-corpora/fulltext/abo-theses/eng/test/
+   hash: md5
+ outs:
+ - md5: 8543bf23d8ae5c415973beeac864929c.dir
+   size: 10823806
+   nfiles: 227
+   path: abo-theses
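Reviewer note: each `.dvc` file in this commit records one dependency (an absolute path on the training host) and one output (a directory or file inside `corpora/`), which appears to be the layout DVC writes for imported data. For this file the dependency and the output differ in size and file count. The sketch below compares the two; `parse_dvc` is a hypothetical helper that handles only the small YAML subset seen in these files and is not a general YAML parser.

```python
# Contents of corpora/fulltext-test/en/abo-theses.dvc as added in this commit.
DVC_TEXT = """\
md5: 3eb7c18baf9f361da8535b971115dc94
deps:
- md5: 7743e923695da632586b6426318d6ac6.dir
  size: 181887549
  nfiles: 306
  path: /data/Annif-corpora/fulltext/abo-theses/eng/test/
  hash: md5
outs:
- md5: 8543bf23d8ae5c415973beeac864929c.dir
  size: 10823806
  nfiles: 227
  path: abo-theses
"""


def parse_dvc(text: str) -> dict:
    """Parse top-level keys plus the `deps` and `outs` lists.

    Assumption: the file has the flat shape shown above, with any
    top-level scalar keys appearing before `deps:` and `outs:`.
    """
    result = {"deps": [], "outs": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in ("deps:", "outs:"):
            section = line[:-1]
            continue
        new_entry = line.startswith("- ")
        if new_entry:
            line = line[2:]
        key, _, value = line.partition(": ")
        if section is None:
            result[key] = value
        else:
            if new_entry:
                result[section].append({})
            result[section][-1][key] = value
    return result


info = parse_dvc(DVC_TEXT)
dep, out = info["deps"][0], info["outs"][0]
missing_files = int(dep["nfiles"]) - int(out["nfiles"])
size_ratio = int(out["size"]) / int(dep["size"])
print(f"{missing_files} fewer files in the output, size ratio {size_ratio:.3f}")
```

The diff does not say why the output is smaller than the dependency; the `*.pdf` pattern in `.dvcignore` is one possible explanation, but that is a guess.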
corpora/fulltext-test/en/jyu-theses.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: 8d33b71700855688cf72611c81332782
+ deps:
+ - md5: 35f96b1279030a69ca1c67c792768faf.dir
+   size: 70776290
+   nfiles: 603
+   path: /data/Annif-corpora/fulltext/jyu-theses/eng-test/
+   hash: md5
+ outs:
+ - md5: 35f96b1279030a69ca1c67c792768faf.dir
+   size: 70776290
+   nfiles: 603
+   path: jyu-theses
+   hash: md5
corpora/fulltext-test/en/kirjaesittelyt2021.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: c689b1b0c1c84f3fcf11afcb41675684
+ deps:
+ - md5: 7d714ce2b7222e511507946d7eed43ca.dir
+   size: 603188
+   nfiles: 894
+   path: /data/Annif-corpora-restricted/kirjaesittelyt2021/yso/eng/test/
+   hash: md5
+ outs:
+ - md5: 7d714ce2b7222e511507946d7eed43ca.dir
+   size: 603188
+   nfiles: 894
+   path: kirjaesittelyt2021
+   hash: md5
corpora/fulltext-test/en/vapaakappaleet-orig.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: 2259d4d7676ffa10002217ce68653b73
+ deps:
+ - md5: d2fe684741eb7b4ef338da37a5c24fbf.dir
+   size: 250449809
+   nfiles: 2102
+   path: /data/Annif-corpora-local/vapaakappaleet-orig/en/test/
+   hash: md5
+ outs:
+ - md5: 015ae9b1152b530ea0b90e430f97ac97.dir
+   size: 250449809
+   nfiles: 2102
+   path: vapaakappaleet-orig
corpora/fulltext-test/fi/.gitignore ADDED
@@ -0,0 +1,8 @@
+ /kirjastonhoitaja
+ /jyu-theses
+ /satakunnan-kansa-1
+ /satakunnan-kansa-2
+ /satakunnan-kansa-3
+ /satakunnan-kansa-4
+ /kirjaesittelyt2021
+ /vapaakappaleet-orig
corpora/fulltext-test/fi/jyu-theses.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: 788eb07f26d934aaf602601da4c5ce7d
+ deps:
+ - md5: 84ad598c69b51ccf24fba7cf5770a8e9.dir
+   size: 156492348
+   nfiles: 1532
+   path: /data/Annif-corpora/fulltext/jyu-theses/fin-test/
+   hash: md5
+ outs:
+ - md5: 84ad598c69b51ccf24fba7cf5770a8e9.dir
+   size: 156492348
+   nfiles: 1532
+   path: jyu-theses
+   hash: md5
corpora/fulltext-test/fi/kirjaesittelyt2021.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: 0686f9e231846e10e78c35062a1439f7
+ deps:
+ - md5: 669296255c4cc5e064bee77d5d0e9f90.dir
+   size: 1915579
+   nfiles: 2622
+   path: /data/Annif-corpora-restricted/kirjaesittelyt2021/yso/fin/test/
+   hash: md5
+ outs:
+ - md5: 669296255c4cc5e064bee77d5d0e9f90.dir
+   size: 1915579
+   nfiles: 2622
+   path: kirjaesittelyt2021
+   hash: md5
corpora/fulltext-test/fi/kirjastonhoitaja.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: 63d69ce511c1262977bf8db01fe2f752
+ deps:
+ - md5: 90d818aa356f6b5d8e3992817702a0df.dir
+   size: 466317
+   nfiles: 624
+   path: /data/Annif-corpora/fulltext/kirjastonhoitaja/test/
+   hash: md5
+ outs:
+ - md5: 90d818aa356f6b5d8e3992817702a0df.dir
+   size: 466317
+   nfiles: 624
+   path: kirjastonhoitaja
+   hash: md5
corpora/fulltext-test/fi/satakunnan-kansa-1.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: ff82a7714027f4a3fd2c5b98b82ac0d8
+ deps:
+ - md5: 70c57356422bf59c87ad4545c6975393.dir
+   size: 122589
+   nfiles: 100
+   path: /data/Annif-corpora-restricted/satakunnan-kansa/test1/
+   hash: md5
+ outs:
+ - md5: 70c57356422bf59c87ad4545c6975393.dir
+   size: 122589
+   nfiles: 100
+   path: satakunnan-kansa-1
corpora/fulltext-test/fi/satakunnan-kansa-2.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: 94f4c590e519ceb83745134706d4941d
+ deps:
+ - md5: 438f14417893527d7d4ab74cbc2b63a0.dir
+   size: 119038
+   nfiles: 100
+   path: /data/Annif-corpora-restricted/satakunnan-kansa/test2/
+   hash: md5
+ outs:
+ - md5: 438f14417893527d7d4ab74cbc2b63a0.dir
+   size: 119038
+   nfiles: 100
+   path: satakunnan-kansa-2
corpora/fulltext-test/fi/satakunnan-kansa-3.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: 4d840ec899949b09e42ec29f7c0ec592
+ deps:
+ - md5: e8011b405e848a6c7ecb1d54b5f63456.dir
+   size: 122239
+   nfiles: 100
+   path: /data/Annif-corpora-restricted/satakunnan-kansa/test3/
+   hash: md5
+ outs:
+ - md5: e8011b405e848a6c7ecb1d54b5f63456.dir
+   size: 122239
+   nfiles: 100
+   path: satakunnan-kansa-3
corpora/fulltext-test/fi/satakunnan-kansa-4.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: d88ed0e407d028bfffd012a84873fe44
+ deps:
+ - md5: 43d672dfadc16ffffd6537f2ffbbd82b.dir
+   size: 122546
+   nfiles: 100
+   path: /data/Annif-corpora-restricted/satakunnan-kansa/test4/
+   hash: md5
+ outs:
+ - md5: 43d672dfadc16ffffd6537f2ffbbd82b.dir
+   size: 122546
+   nfiles: 100
+   path: satakunnan-kansa-4
corpora/fulltext-test/fi/vapaakappaleet-orig.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: e3eacb5db12f34d67376e6c6efc022e0
+ deps:
+ - md5: 4d6ea13ca02024a016142c25567ded7d.dir
+   size: 226413101
+   nfiles: 2080
+   path: /data/Annif-corpora-local/vapaakappaleet-orig/fi/test/
+   hash: md5
+ outs:
+ - md5: ea34a6a16b98b497edb2ea21cf41bb96.dir
+   size: 226413101
+   nfiles: 2080
+   path: vapaakappaleet-orig
corpora/fulltext-test/sv/.gitignore ADDED
@@ -0,0 +1,4 @@
+ /abo-theses
+ /jyu-theses
+ /kirjaesittelyt2021
+ /vapaakappaleet-orig
corpora/fulltext-test/sv/abo-theses.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: f3c85e3ead529de7d073ece9515cd097
+ deps:
+ - md5: 49e25736b4df7cdddd4eddc4018ba461.dir
+   size: 135949330
+   nfiles: 452
+   path: /data/Annif-corpora/fulltext/abo-theses/swe/test/
+   hash: md5
+ outs:
+ - md5: 92766d2134bd95255b00015baa4e5537.dir
+   size: 17219952
+   nfiles: 337
+   path: abo-theses
+   hash: md5
corpora/fulltext-test/sv/jyu-theses.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: cc9aa22b429e0c9f7e45ab53e0b75379
+ deps:
+ - md5: dc55be91f5b3157e4bc2e4b010bc6847.dir
+   size: 5734687
+   nfiles: 64
+   path: /data/Annif-corpora/fulltext/jyu-theses/swe-test/
+   hash: md5
+ outs:
+ - md5: dc55be91f5b3157e4bc2e4b010bc6847.dir
+   size: 5734687
+   nfiles: 64
+   path: jyu-theses
+   hash: md5
corpora/fulltext-test/sv/kirjaesittelyt2021.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: 5c70d93cc7f357ab78ea013751785a08
+ deps:
+ - md5: 03528a8434a6e4ba9e35468ce1d8bc7c.dir
+   size: 677165
+   nfiles: 1120
+   path: /data/Annif-corpora-restricted/kirjaesittelyt2021/yso/swe/test
+   hash: md5
+ outs:
+ - md5: 03528a8434a6e4ba9e35468ce1d8bc7c.dir
+   size: 677165
+   nfiles: 1120
+   path: kirjaesittelyt2021
+   hash: md5
corpora/fulltext-test/sv/vapaakappaleet-orig.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: 1983227e4acbc6ea45546111ddcd20a8
+ deps:
+ - md5: 19cc037d5da0a35ca7df8d587789aed9.dir
+   size: 43104636
+   nfiles: 352
+   path: /data/Annif-corpora-local/vapaakappaleet-orig/sv/test/
+   hash: md5
+ outs:
+ - md5: 9a0bbb9a19c71df69ec7623ccbff005e.dir
+   size: 43104636
+   nfiles: 352
+   path: vapaakappaleet-orig
corpora/fulltext-train/en/.gitignore ADDED
@@ -0,0 +1,4 @@
+ /vapaakappaleet-orig
+ /kirjaesittelyt2021
+ /abo-theses
+ /jyu-theses
corpora/fulltext-train/en/abo-theses.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: ef3e52d478dfc68126670c72a123f172
+ deps:
+ - md5: 3fad4fdc3b9b767e91e6d5f8c78f66f9.dir
+   size: 844863786
+   nfiles: 1343
+   path: /data/Annif-corpora-local/fulltext-train/en/abo-theses
+   hash: md5
+ outs:
+ - md5: bba9a6f747e56765005c9a33c32f078a.dir
+   size: 47738259
+   nfiles: 997
+   path: abo-theses
corpora/fulltext-train/en/jyu-theses.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: ad203010df2d637157aac2f7596d576f
+ deps:
+ - md5: 6520f92450253e62d8165bd62e4ad0a2.dir
+   size: 119290546
+   nfiles: 998
+   path: /data/Annif-corpora-local/fulltext-train/en/jyu-theses
+   hash: md5
+ outs:
+ - md5: 6520f92450253e62d8165bd62e4ad0a2.dir
+   size: 119290546
+   nfiles: 998
+   path: jyu-theses
+   hash: md5
corpora/fulltext-train/en/kirjaesittelyt2021.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: c34a43c0b010c387892ef193b06865e0
+ deps:
+ - md5: d5cb69f6a8969fcdb45661c4cd07fd38.dir
+   size: 1232470
+   nfiles: 1652
+   path: /data/Annif-corpora-local/fulltext-train/en/kirjaesittelyt2021
+   hash: md5
+ outs:
+ - md5: d5cb69f6a8969fcdb45661c4cd07fd38.dir
+   size: 1232470
+   nfiles: 1652
+   path: kirjaesittelyt2021
+   hash: md5
corpora/fulltext-train/en/vapaakappaleet-orig.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: d349357d0fe88fcd67853ddf180ac966
+ deps:
+ - md5: fa2f6d99b3b80057c3277a459b71878b.dir
+   size: 100089232
+   nfiles: 932
+   path: /data/Annif-corpora-local/fulltext-train/en/vapaakappaleet-orig
+   hash: md5
+ outs:
+ - md5: 4e970e8206aa0b8db67cf8e29bac8c4b.dir
+   size: 100089232
+   nfiles: 932
+   path: vapaakappaleet-orig
corpora/fulltext-train/fi/.gitignore ADDED
@@ -0,0 +1,5 @@
+ /vapaakappaleet-orig
+ /kirjaesittelyt2021
+ /kirjastonhoitaja
+ /jyu-theses
+ /satakunnan-kansa
corpora/fulltext-train/fi/jyu-theses.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: 51402df088f59db105b4100fcc6d0a22
+ deps:
+ - md5: 5915d784b3dee464cf79dfc7ed8972fd.dir
+   size: 111873334
+   nfiles: 1000
+   path: /data/Annif-corpora-local/fulltext-train/fi/jyu-theses
+   hash: md5
+ outs:
+ - md5: 5915d784b3dee464cf79dfc7ed8972fd.dir
+   size: 111873334
+   nfiles: 1000
+   path: jyu-theses
+   hash: md5
corpora/fulltext-train/fi/kirjaesittelyt2021.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: 2f01d0b71ea32c24598860cafe5510b6
+ deps:
+ - md5: 3900e0f22556c3c69f84115ee81bf6fd.dir
+   size: 693029
+   nfiles: 1000
+   path: /data/Annif-corpora-local/fulltext-train/fi/kirjaesittelyt2021
+   hash: md5
+ outs:
+ - md5: 3900e0f22556c3c69f84115ee81bf6fd.dir
+   size: 693029
+   nfiles: 1000
+   path: kirjaesittelyt2021
+   hash: md5
corpora/fulltext-train/fi/kirjastonhoitaja.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: c53fbdaa4c77397e06e6ec6ced29fda9
+ deps:
+ - md5: 8979846e372fb084dec5663088f7cdc2.dir
+   size: 758017
+   nfiles: 1000
+   path: /data/Annif-corpora-local/fulltext-train/fi/kirjastonhoitaja
+   hash: md5
+ outs:
+ - md5: 8979846e372fb084dec5663088f7cdc2.dir
+   size: 758017
+   nfiles: 1000
+   path: kirjastonhoitaja
+   hash: md5
corpora/fulltext-train/fi/satakunnan-kansa.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: 54e7cd77b751dccfb90c850b9fadf144
+ deps:
+ - md5: 477591160a10d6289a6eb92204e64b48.dir
+   size: 128726
+   nfiles: 100
+   path: /data/Annif-corpora-local/fulltext-train/fi/satakunnan-kansa
+   hash: md5
+ outs:
+ - md5: 477591160a10d6289a6eb92204e64b48.dir
+   size: 128726
+   nfiles: 100
+   path: satakunnan-kansa
corpora/fulltext-train/fi/vapaakappaleet-orig.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: efcd970176b36e6cf1dc11f135bb48e5
+ deps:
+ - md5: a3fba8d50ca8bab8d53c49dc32c32cf3.dir
+   size: 300405374
+   nfiles: 2477
+   path: /data/Annif-corpora-local/fulltext-train/fi/vapaakappaleet-orig
+   hash: md5
+ outs:
+ - md5: 8f479ecda100df8c244485b1419724dc.dir
+   size: 300405374
+   nfiles: 2477
+   path: vapaakappaleet-orig
corpora/fulltext-train/sv/.gitignore ADDED
@@ -0,0 +1,4 @@
+ /vapaakappaleet-orig
+ /kirjaesittelyt2021
+ /abo-theses
+ /jyu-theses
corpora/fulltext-train/sv/abo-theses.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: 3ef7f112a8a284fb154ca2e7c6c0994f
+ deps:
+ - md5: 97cd40ae5e815b3ca56955a87628f048.dir
+   size: 815900768
+   nfiles: 2104
+   path: /data/Annif-corpora-local/fulltext-train/sv/abo-theses
+   hash: md5
+ outs:
+ - md5: 2566088025265f800f55e1f04db01625.dir
+   size: 81123768
+   nfiles: 1565
+   path: abo-theses
+   hash: md5
corpora/fulltext-train/sv/jyu-theses.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: 5ddae68412821047810dfcda806dd38a
+ deps:
+ - md5: c1af325f14aabba73fb65baf77d85543.dir
+   size: 19596525
+   nfiles: 206
+   path: /data/Annif-corpora-local/fulltext-train/sv/jyu-theses
+   hash: md5
+ outs:
+ - md5: c1af325f14aabba73fb65baf77d85543.dir
+   size: 19596525
+   nfiles: 206
+   path: jyu-theses
corpora/fulltext-train/sv/kirjaesittelyt2021.dvc ADDED
@@ -0,0 +1,13 @@
+ md5: 847ede919aacc94af9fef348e8021721
+ deps:
+ - md5: 7bde899cf4bbbf7c8baa0014a3d8634e.dir
+   size: 918716
+   nfiles: 1488
+   path: /data/Annif-corpora-local/fulltext-train/sv/kirjaesittelyt2021
+   hash: md5
+ outs:
+ - md5: 7bde899cf4bbbf7c8baa0014a3d8634e.dir
+   size: 918716
+   nfiles: 1488
+   path: kirjaesittelyt2021
+   hash: md5
corpora/fulltext-train/sv/vapaakappaleet-orig.dvc ADDED
@@ -0,0 +1,12 @@
+ md5: 87bd7c808ae8e6c24981c9505c1bf175
+ deps:
+ - md5: 7da4409207e17533342df239c1fa7986.dir
+   size: 53458333
+   nfiles: 490
+   path: /data/Annif-corpora-local/fulltext-train/sv/vapaakappaleet-orig
+   hash: md5
+ outs:
+ - md5: cc91ab54440572bffee62cfa75ecaee2.dir
+   size: 53458333
+   nfiles: 490
+   path: vapaakappaleet-orig
corpora/shorttext-train/en/.gitignore ADDED
@@ -0,0 +1 @@
+ /yso-finna-en.tsv.gz
corpora/shorttext-train/en/yso-finna-en.tsv.gz.dvc ADDED
@@ -0,0 +1,11 @@
+ md5: 71c24280ff9d20993ef67d4ad4d0de61
+ deps:
+ - md5: c45518688682af2848246b8d6db617a5
+   size: 98828900
+   path: /data/Annif-corpora/training/yso-finna-en.tsv.gz
+   hash: md5
+ outs:
+ - md5: c45518688682af2848246b8d6db617a5
+   size: 98828900
+   path: yso-finna-en.tsv.gz
+   hash: md5
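Reviewer note: for single-file outputs such as this `.tsv.gz` corpus, the recorded `md5` can be compared against a local copy. The sketch below assumes that `hash: md5` denotes a plain MD5 digest of the file bytes, which is my understanding of DVC 3.x but is not stated in this diff; `stream_md5` and `file_matches` are hypothetical helper names. Directory outputs with a `.dir` suffix are hashed differently and are not covered.

```python
import hashlib
import io


def stream_md5(stream, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def file_matches(path: str, recorded_md5: str) -> bool:
    """Compare a local file with the md5 value recorded in a .dvc file."""
    with open(path, "rb") as handle:
        return stream_md5(handle) == recorded_md5


# Small in-memory demonstration, so the sketch runs without the corpus file.
demo_digest = stream_md5(io.BytesIO(b"hello"))
print(demo_digest)
```

With the corpus in place, a call such as `file_matches("corpora/shorttext-train/en/yso-finna-en.tsv.gz", "c45518688682af2848246b8d6db617a5")` would be expected to return `True` if the assumption about the hash holds.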
corpora/shorttext-train/fi/.gitignore ADDED
@@ -0,0 +1,4 @@
+ /yso-finna-fi-01.tsv.gz
+ /yso-finna-fi-02.tsv.gz
+ /yso-finna-fi-03.tsv.gz
+ /yso-finna-fi-04.tsv.gz
corpora/shorttext-train/fi/yso-finna-fi-01.tsv.gz.dvc ADDED
@@ -0,0 +1,11 @@
+ md5: 564736494bd3e9f6dcbad43db05e4fcd
+ deps:
+ - md5: 3d5ecada7cbe4ad320c0e5f8a54846f7
+   size: 99401083
+   path: /data/Annif-corpora/training/yso-finna-fi-01.tsv.gz
+   hash: md5
+ outs:
+ - md5: 3d5ecada7cbe4ad320c0e5f8a54846f7
+   size: 99401083
+   path: yso-finna-fi-01.tsv.gz
+   hash: md5
corpora/shorttext-train/fi/yso-finna-fi-02.tsv.gz.dvc ADDED
@@ -0,0 +1,11 @@
+ md5: f6efdf2e9a81c0369226f14d007a7af8
+ deps:
+ - md5: 3dac9f314f705870ed66e9243a07ab70
+   size: 99477086
+   path: /data/Annif-corpora/training/yso-finna-fi-02.tsv.gz
+   hash: md5
+ outs:
+ - md5: 3dac9f314f705870ed66e9243a07ab70
+   size: 99477086
+   path: yso-finna-fi-02.tsv.gz
+   hash: md5
corpora/shorttext-train/fi/yso-finna-fi-03.tsv.gz.dvc ADDED
@@ -0,0 +1,11 @@
+ md5: e5b52fed49c0c9f64110aa4254e17885
+ deps:
+ - md5: be00ff7dc5fd91d7e13b366e0089089c
+   size: 99411108
+   path: /data/Annif-corpora/training/yso-finna-fi-03.tsv.gz
+   hash: md5
+ outs:
+ - md5: be00ff7dc5fd91d7e13b366e0089089c
+   size: 99411108
+   path: yso-finna-fi-03.tsv.gz
+   hash: md5
corpora/shorttext-train/fi/yso-finna-fi-04.tsv.gz.dvc ADDED
@@ -0,0 +1,11 @@
+ md5: a7af220a29741c0b213676bd77c221ee
+ deps:
+ - md5: a8f16f5d33de8c534010e7d2191a5e71
+   size: 60270897
+   path: /data/Annif-corpora/training/yso-finna-fi-04.tsv.gz
+   hash: md5
+ outs:
+ - md5: a8f16f5d33de8c534010e7d2191a5e71
+   size: 60270897
+   path: yso-finna-fi-04.tsv.gz
+   hash: md5
corpora/shorttext-train/sv/.gitignore ADDED
@@ -0,0 +1 @@
+ /yso-finna-sv.tsv.gz