| # The UniversalCEFR Data Directory | |
| UniversalCEFR is a large-scale, multilingual, multidimensional dataset of texts annotated according to the [CEFR (Common European Framework of Reference)](https://www.coe.int/en/web/common-european-framework-reference-languages/level-descriptions). The collection comprises a total of 505,807 CEFR-labeled texts in the 13 languages listed below: | |
| English (en), Spanish (es), German (de), Dutch (nl), Czech (cs), Italian (it), French (fr), Estonian (et), Portuguese (pt), Arabic (ar), Hindi (hi), Russian (ru), Welsh (cy) | |
| ## UniversalCEFR Data Format / Schema | |
| To ensure interoperability, ease of transformation, and machine readability, UniversalCEFR adopts a **standardised JSON format** for each CEFR-labeled text. The fields include the source dataset, language, granularity (document, paragraph, sentence, discourse), production category (learner or reference), and license. | |
| | **Field** | **Description** | | |
| |-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | |
| | `title` | The unique title of the text retrieved from its original corpus (`NA` if the text has no title, as is the case for CEFR-assessed sentences or paragraphs). | | |
| | `lang` | The source language of the text in ISO 639-1 format (e.g., `en` for English). | | |
| | `source_name` | The name of the source dataset from which the text was collected, as given in its source dataset, paper, and/or documentation (e.g., `cambridge-exams` from Xia et al., 2016). | | |
| | `format` | The granularity of the text, as indicated in its source dataset, paper, and/or documentation. The recognized formats are the following: [`document-level`, `paragraph-level`, `discourse-level`, `sentence-level`]. | | |
| | `category` | The classification of the text in terms of who created the material. The recognized categories are `reference` for texts created by experts, teachers, and language learning professionals and `learner` for texts written by language learners and students. | | |
| | `cefr_level` | The CEFR level associated with the text. The six recognized CEFR levels are the following: [`A1`, `A2`, `B1`, `B2`, `C1`, `C2`]. A small fraction (<1%) of the texts in UniversalCEFR are unlabelled, carry a plus sign (e.g., `A1+`), or carry only a band letter with no level number (e.g., `A`, `B`). | | |
| | `license` | The licensing information associated with the text (e.g., `CC-BY-NC-SA` or `Unknown` if not stated). | | |
| | `text` | The actual content of the text itself. | |
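
As an illustration of the schema above, here is a minimal Python sketch that checks whether a single record has all eight fields and uses the recognized values for `format`, `category`, and `cefr_level`. The function name `validate_record` and the example record are hypothetical and are not part of the UniversalCEFR codebase; only the field names and allowed values come from the table.

```python
from typing import Any

# Field names and allowed values are taken from the schema table above.
REQUIRED_FIELDS = (
    "title", "lang", "source_name", "format",
    "category", "cefr_level", "license", "text",
)
FORMATS = {"document-level", "paragraph-level", "discourse-level", "sentence-level"}
CATEGORIES = {"reference", "learner"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none).

    Hypothetical helper for illustration only.
    """
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems

    if record["format"] not in FORMATS:
        problems.append(f"unrecognized format: {record['format']!r}")
    if record["category"] not in CATEGORIES:
        problems.append(f"unrecognized category: {record['category']!r}")
    # Fewer than 1% of texts carry labels such as "A1+", "A", or no label at all,
    # so this is reported rather than treated as a hard failure.
    if record["cefr_level"] not in CEFR_LEVELS:
        problems.append(f"non-standard cefr_level: {record['cefr_level']!r}")
    return problems


# Made-up record for demonstration; the text content is invented.
example = {
    "title": "NA",
    "lang": "en",
    "source_name": "cambridge-exams",
    "format": "sentence-level",
    "category": "reference",
    "cefr_level": "B1",
    "license": "CC-BY-NC-SA",
    "text": "She has lived in this town for ten years.",
}

print(validate_record(example))
```

Returning a list of problems instead of raising an exception makes it easy to decide downstream whether to drop, keep, or relabel the small share of texts with non-standard levels.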
| ## Accessing UniversalCEFR | |
| If you're interested in a specific dataset or group of datasets from UniversalCEFR, you may access their transformed, standardised versions here: https://huggingface.co/UniversalCEFR | |
| A separate GitHub organization containing the code from the UniversalCEFR paper is also available: https://github.com/UniversalCEFR | |
| If you use any of the datasets indexed in UniversalCEFR, **please cite the original dataset papers** they are associated with. You can find them when you open each dataset in this organization. | |
| Note that a few datasets in UniversalCEFR (`EFCAMDAT`, `APA-LHA`, `BEA Shared Task 2019 Write and Improve`, and `DEPlain`) are not directly available from the UniversalCEFR Hugging Face organization, as they require users to agree to their Terms of Use before using them for non-commercial research. Once you've done this, you can use the preprocessing Python scripts in the [`universalcefr-experiments`](https://github.com/UniversalCEFR/universalcefr-experiments) repository to transform the raw data into the UniversalCEFR format. | |
| ## Do you want to get updates? / Do you have datasets we can add to UniversalCEFR? | |
| We want to grow this community of researchers, language experts, and educators to further advance openly accessible CEFR/language proficiency assessment datasets for all. | |
| If you're interested in this direction and/or have open datasets you want to add to UniversalCEFR for better exposure and utility to researchers, please fill out this **[form](https://forms.office.com/e/hjd7ew0M8C)**. | |
| When we index your dataset in UniversalCEFR, we will cite you and the paper/project from which the dataset came across the UniversalCEFR platforms. | |
| ## Contact | |
| For questions, concerns, clarifications, and issues, please contact [Joseph Marvin Imperial](https://www.josephimperial.com/) (jmri20@bath.ac.uk). | |
| ## Reference | |
| When using datasets from this resource, you should cite the original dataset papers (references are included on each data card) in addition to the UniversalCEFR paper. | |
| ``` | |
| @inproceedings{imperial-etal-2025-universalcefr, | |
| title = "{U}niversal{CEFR}: Enabling Open Multilingual Research on Language Proficiency Assessment", | |
| author = "Imperial, Joseph Marvin and | |
| Barayan, Abdullah and | |
| Stodden, Regina and | |
| Wilkens, Rodrigo and | |
| Mu{\~n}oz S{\'a}nchez, Ricardo and | |
| Gao, Lingyun and | |
| Torgbi, Melissa and | |
| Knight, Dawn and | |
| Forey, Gail and | |
| Jablonkai, Reka R. and | |
| Kochmar, Ekaterina and | |
| Reynolds, Robert Joshua and | |
| Ribeiro, Eug{\'e}nio and | |
| Saggion, Horacio and | |
| Volodina, Elena and | |
| Vajjala, Sowmya and | |
| Fran{\c{c}}ois, Thomas and | |
| Alva-Manchego, Fernando and | |
| Tayyar Madabushi, Harish", | |
| editor = "Christodoulopoulos, Christos and | |
| Chakraborty, Tanmoy and | |
| Rose, Carolyn and | |
| Peng, Violet", | |
| booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing", | |
| month = nov, | |
| year = "2025", | |
| address = "Suzhou, China", | |
| publisher = "Association for Computational Linguistics", | |
| url = "https://aclanthology.org/2025.emnlp-main.491/", | |
| doi = "10.18653/v1/2025.emnlp-main.491", | |
| pages = "9714--9766", | |
| ISBN = "979-8-89176-332-6"} | |
| ``` | |