	# The UniversalCEFR Data Directory

	UniversalCEFR is a large-scale, multilingual, multidimensional dataset comprising texts annotated according to the [CEFR (Common European Framework of Reference)](https://www.coe.int/en/web/common-european-framework-reference-languages/level-descriptions). The collection comprises a total of 505,807 CEFR-labeled texts in the 13 languages listed below:

	English (en), Spanish (es), German (de), Dutch (nl), Czech (cs), Italian (it), French (fr), Estonian (et), Portuguese (pt), Arabic (ar), Hindi (hi), Russian (ru), Welsh (cy)

	## UniversalCEFR Data Format / Schema
	To ensure interoperability, ease of transformation, and machine readability, we adopted a standardised JSON format for each CEFR-labeled text. Its fields include the source dataset, language, granularity (document, paragraph, sentence, discourse), production category (learner or reference), and license.

	| Field | Description |
	|---------------|-------------|
	| `title` | The unique title of the text retrieved from its original corpus (`NA` if the text has no title, as with CEFR-assessed sentences or paragraphs). |
	| `lang` | The source language of the text in ISO 639-1 format (e.g., `en` for English). |
	| `source_name` | The name of the source dataset from which the text was collected, as indicated in the source dataset, paper, and/or documentation (e.g., `cambridge-exams` from Xia et al., 2016). |
	| `format` | The format of the text in terms of level of granularity, as indicated in the source dataset, paper, and/or documentation. The recognized formats are the following: [`document-level`, `paragraph-level`, `discourse-level`, `sentence-level`]. |
	| `category` | The classification of the text in terms of who created the material. The recognized categories are `reference` for texts created by experts, teachers, and language learning professionals, and `learner` for texts written by language learners and students. |
	| `cefr_level` | The CEFR level associated with the text. The six recognized CEFR levels are the following: [`A1`, `A2`, `B1`, `B2`, `C1`, `C2`]. A small fraction (<1%) of the texts in UniversalCEFR are unlabelled, carry plus signs (e.g., `A1+`), or have a letter without a level number (e.g., `A`, `B`). |
	| `license` | The licensing information associated with the text (e.g., `CC-BY-NC-SA`, or `Unknown` if not stated). |
	| `text` | The actual content of the text itself. |
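
As a rough illustration of the schema above, the sketch below checks a single record against the documented fields and maps irregular `cefr_level` labels onto the six standard levels. The helper names (`validate_record`, `normalise_level`) and the example record values are hypothetical and not part of UniversalCEFR. Folding plus-levels such as `A1+` into their base level is an assumption made here for illustration, not a rule stated by the dataset.

```python
import json

# Field names and allowed values are copied from the schema table above.
REQUIRED_FIELDS = ("title", "lang", "source_name", "format",
                   "category", "cefr_level", "license", "text")
FORMATS = {"document-level", "paragraph-level", "discourse-level", "sentence-level"}
CATEGORIES = {"reference", "learner"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}


def validate_record(record):
    """Return a list of problems found; an empty list means none were found."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in record]
    if "format" in record and record["format"] not in FORMATS:
        problems.append(f"unrecognized format: {record['format']!r}")
    if "category" in record and record["category"] not in CATEGORIES:
        problems.append(f"unrecognized category: {record['category']!r}")
    return problems


def normalise_level(label):
    """Map a raw cefr_level to one of the six standard levels, or None.

    Assumption (not from the README): a plus-level such as "A1+" is folded
    into its base level. Bare letters ("A", "B") and unlabelled texts give None.
    """
    if not isinstance(label, str):
        return None
    cleaned = label.strip().upper().rstrip("+")
    return cleaned if cleaned in CEFR_LEVELS else None


# A made-up record; the values are illustrative, not taken from the dataset.
example = json.loads("""
{
  "title": "NA",
  "lang": "en",
  "source_name": "cambridge-exams",
  "format": "document-level",
  "category": "reference",
  "cefr_level": "B1+",
  "license": "Unknown",
  "text": "An invented text used only for illustration."
}
""")

print(validate_record(example))                 # []
print(normalise_level(example["cefr_level"]))   # B1
```

Returning `None` for labels that cannot be mapped keeps the decision about the irregular <1% of texts (drop them, or handle them separately) with the caller.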

	## Accessing UniversalCEFR

	If you're interested in a specific individual or group of datasets from UniversalCEFR, you may access their transformed, standardised version here: https://huggingface.co/UniversalCEFR

	A separate GitHub organization containing the code from the UniversalCEFR paper is also available: https://github.com/UniversalCEFR

	If you use any of the datasets indexed in UniversalCEFR, please cite the original dataset papers they are associated with. You can find them when you open each dataset in this organization.

	Note that a few datasets in UniversalCEFR (`EFCAMDAT`, `APA-LHA`, `BEA Shared Task 2019 Write and Improve`, and `DEPlain`) are not directly available from the UniversalCEFR Hugging Face organization, as they require users to agree to their Terms of Use before using them for non-commercial research. Once you've done this, you can use the preprocessing Python scripts in the [`universalcefr-experiments`](https://github.com/UniversalCEFR/universalcefr-experiments) repository to transform the raw version into the UniversalCEFR version.


	## Do you want to get updates? / Do you have datasets we can add to UniversalCEFR?
	We want to grow this community of researchers, language experts, and educators to further advance openly accessible CEFR/language proficiency assessment datasets for all.

	If you're interested in this direction and/or have open datasets you want to add to UniversalCEFR for better exposure and utility to researchers, please fill out this [form](https://forms.office.com/e/hjd7ew0M8C).

	When we index your dataset in UniversalCEFR, we will cite you and the originating paper/project across the UniversalCEFR platforms.

	## Contact
	For questions, concerns, clarifications, and issues, please contact [Joseph Marvin Imperial](https://www.josephimperial.com/) (jmri20@bath.ac.uk).

	## Reference
	When using datasets from this resource, you should cite the original dataset papers (references are included on each data card) in addition to the UniversalCEFR paper.

	```
	@inproceedings{imperial-etal-2025-universalcefr,
	title = "{U}niversal{CEFR}: Enabling Open Multilingual Research on Language Proficiency Assessment",
	author = "Imperial, Joseph Marvin and
	Barayan, Abdullah and
	Stodden, Regina and
	Wilkens, Rodrigo and
	Mu{\~n}oz S{\'a}nchez, Ricardo and
	Gao, Lingyun and
	Torgbi, Melissa and
	Knight, Dawn and
	Forey, Gail and
	Jablonkai, Reka R. and
	Kochmar, Ekaterina and
	Reynolds, Robert Joshua and
	Ribeiro, Eug{\'e}nio and
	Saggion, Horacio and
	Volodina, Elena and
	Vajjala, Sowmya and
	Fran{\c{c}}ois, Thomas and
	Alva-Manchego, Fernando and
	Tayyar Madabushi, Harish",
	editor = "Christodoulopoulos, Christos and
	Chakraborty, Tanmoy and
	Rose, Carolyn and
	Peng, Violet",
	booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
	month = nov,
	year = "2025",
	address = "Suzhou, China",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2025.emnlp-main.491/",
	doi = "10.18653/v1/2025.emnlp-main.491",
	pages = "9714--9766",
	ISBN = "979-8-89176-332-6"}
	```