josephimperial commited on
Commit
70e7a70
·
verified ·
1 Parent(s): a5ba722

Update README.md

Files changed (1)
  1. README.md +31 -10
README.md CHANGED
@@ -1,10 +1,31 @@
- ---
- title: README
- emoji: 😻
- colorFrom: purple
- colorTo: red
- sdk: static
- pinned: false
- ---
-
- Edit this `README.md` markdown file to author your organization card.
+ # The UniversalCEFR Data Directory
+
+ UniversalCEFR is a large-scale, multilingual, multidimensional dataset of texts annotated according to the [CEFR (Common European Framework of Reference)](https://www.coe.int/en/web/common-european-framework-reference-languages/level-descriptions). The collection comprises a total of **505,807 CEFR-labeled texts** in **13 languages** across 4 scripts (Latin, Arabic, Devanagari, and Cyrillic).
+
+ - English (en)
+ - Spanish (es)
+ - German (de)
+ - Dutch (nl)
+ - Czech (cs)
+ - Italian (it)
+ - French (fr)
+ - Estonian (et)
+ - Portuguese (pt)
+ - Arabic (ar)
+ - Hindi (hi)
+ - Russian (ru)
+ - Welsh (cy)
+
+ ## UniversalCEFR Data Format / Schema
+ To ensure interoperability, transformation, and machine readability, UniversalCEFR adopts a **standardised JSON format** for each CEFR-labeled text. The fields include the source dataset, language, granularity (document, paragraph, sentence, discourse), production category (learner or reference), and license.
+
+ | **Field** | **Description** |
+ |-------------------|-----------------|
+ | `title` | The unique title of the text retrieved from its original corpus (`NA` if the text has no title, as with CEFR-assessed sentences or paragraphs). |
+ | `lang` | The source language of the text in ISO 639-1 format (e.g., `en` for English). |
+ | `source_name` | The name of the source dataset from which the text was collected, as indicated in its source dataset, paper, and/or documentation (e.g., `cambridge-exams` from Xia et al., 2016). |
+ | `format` | The format of the text in terms of level of granularity, as indicated in its source dataset, paper, and/or documentation. The recognized formats are the following: [`document-level`, `paragraph-level`, `discourse-level`, `sentence-level`]. |
+ | `category` | The classification of the text in terms of who created the material. The recognized categories are `reference` for texts created by experts, teachers, and language learning professionals and `learner` for texts written by language learners and students. |
+ | `cefr_level` | The CEFR level associated with the text. The six recognized CEFR levels are the following: [`A1`, `A2`, `B1`, `B2`, `C1`, `C2`]. A small fraction (<1%) of the texts in UniversalCEFR are unlabelled, carry plus signs (e.g., `A1+`), or have only a band letter without a level number (e.g., `A`, `B`). |
+ | `license` | The licensing information associated with the text (e.g., `CC-BY-NC-SA` or `Unknown` if not stated). |
+ | `text` | The actual content of the text itself. |
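
As an illustration of the schema described in the table, the following is a minimal sketch of how a single record might be checked in Python. The field names and allowed values come from the table; the helper `check_record` and the sample record (including its text and the pairing of values) are hypothetical and not taken from the dataset.

```python
import json

# Field names and allowed values follow the schema table in the README.
REQUIRED_FIELDS = (
    "title", "lang", "source_name", "format",
    "category", "cefr_level", "license", "text",
)
FORMATS = {"document-level", "paragraph-level", "discourse-level", "sentence-level"}
CATEGORIES = {"reference", "learner"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}


def check_record(record):
    """Return a list of problems found in one record (hypothetical helper)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if "format" in record and record["format"] not in FORMATS:
        problems.append(f"unrecognized format: {record['format']!r}")
    if "category" in record and record["category"] not in CATEGORIES:
        problems.append(f"unrecognized category: {record['category']!r}")
    if "cefr_level" in record and record["cefr_level"] not in CEFR_LEVELS:
        # The README notes that <1% of texts carry labels such as "A1+" or "A",
        # so these are reported here rather than rejected outright.
        problems.append(f"non-standard cefr_level: {record['cefr_level']!r}")
    return problems


# An invented record, for illustration only.
sample = json.loads("""
{
  "title": "NA",
  "lang": "en",
  "source_name": "cambridge-exams",
  "format": "sentence-level",
  "category": "reference",
  "cefr_level": "A2",
  "license": "CC-BY-NC-SA",
  "text": "This is a short example sentence."
}
""")

print(check_record(sample))
```

Reporting non-standard levels instead of raising an error is one possible design choice; a stricter pipeline could drop or remap such records.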