	---
	license: mit
	language:
	- de
	metrics:
	- accuracy
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- Science
	---
	## G-SciEdBERT: A Contextualized LLM for Science Assessment Tasks in German
	G-SciEdBERT is a contextualized German Science Education BERT,
	a language model tailored for scoring German-written responses to science tasks.
	Starting from G-BERT, we pre-trained G-SciEdBERT on a corpus of 50K German-written science responses (5M tokens) to the Programme for International Student Assessment (PISA) 2015.
	We then fine-tuned G-SciEdBERT on 59 assessment items, examined its scoring accuracy, and compared its performance with G-BERT.
	Our findings reveal a substantial improvement in scoring accuracy with G-SciEdBERT, demonstrating a 10% increase in quadratic weighted kappa (QWK) compared to G-BERT
	(mean accuracy difference = 0.096, SD = 0.024). These results underline the value of specialized language models like G-SciEdBERT,
	which are trained to improve the accuracy of automated scoring, for the field of AI in education.

	## Dataset
	G-SciEdBERT is pre-trained on written German responses to science assessment items from the PISA test.
	PISA is an international test, led by the OECD (Organisation for Economic Co-operation and Development), that monitors education trends.
	PISA items are developed to assess scientific literacy, emphasizing real-world problem-solving skills and the needs of the future workforce.
	This study analyzed data collected for 59 constructed-response science assessment items in German at the middle school level.
	A total of 6,116 German students from 257 schools participated in PISA 2015.
	Given the geographical diversity of participants, the PISA data reflect the science literacy of German students in general.

	The PISA items selected require either short (around one sentence) or extended (up to five sentences) responses.
	The minimum score for all items is 0, with the maximum being 3 or 4 for short responses and 4 or 5 for extended responses.
	Student responses average 20 words. Our pre-training dataset contains more than 50,000 student-written German responses,
	which corresponds to approximately 1,000 human-scored student responses per item for contextual learning through fine-tuning.
	More than 10 human raters scored each response in the training dataset, which was organized by the OECD.
	The responses were graded irrespective of the student's ethnicity, race, or gender to ensure fairness.

	## Architecture
	The model is pre-trained starting from [G-BERT](https://huggingface.co/dbmdz/bert-base-german-uncased?text=Ich+mag+dich.+Ich+liebe+%5BMASK%5D), and the pre-training method is illustrated below:
	![architecture](https://huggingface.co/ai4stem-uga/G-SciEdBERT/resolve/main/G-SciEdBERT_architecture.png)


	## Evaluation Results
	The table below compares G-BERT and G-SciEdBERT on five randomly picked PISA assessment items, along with the average accuracy (QWK)
	reported for all datasets combined. G-SciEdBERT consistently outperformed G-BERT on automatic scoring of student-written responses.
	Based on the QWK values, the differences in accuracy range from 4.2% to 13.6%, with an average increase of 10.0% (from .7136 to .8137).
	Item S268Q02 saw the largest improvement, 13.6% (from .757 to .893).
	These findings indicate that G-SciEdBERT is more effective than G-BERT at comprehending and assessing complex science-related writing.

	The results of our analysis strongly support the adoption of G-SciEdBERT for the automatic scoring of German-written science responses in large-scale
	assessments such as PISA, given its superior accuracy over the general-purpose G-BERT model.
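
Because the accuracy figures here are quadratic weighted kappa values, a short sketch of how QWK is commonly computed may help when reading them. This is a generic, pure-Python illustration, not the authors' evaluation code. The function name `quadratic_weighted_kappa` and its arguments are hypothetical, and labels are assumed to be integers from 0 to `num_labels - 1`.

```python
def quadratic_weighted_kappa(human_scores, model_scores, num_labels):
    """Agreement between two integer label sequences, penalizing large disagreements more."""
    n = len(human_scores)
    if n == 0 or n != len(model_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    if num_labels < 2:
        raise ValueError("at least two labels are required")

    # Observed confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for h, m in zip(human_scores, model_scores):
        observed[h][m] += 1

    # Marginal label counts for each rater.
    human_hist = [sum(row) for row in observed]
    model_hist = [sum(observed[i][j] for i in range(num_labels)) for j in range(num_labels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            # Quadratic penalty: 0 on the diagonal, 1 for the most distant labels.
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            # Count expected by chance if the two raters were independent.
            expected = human_hist[i] * model_hist[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters used one identical label throughout; treated here as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement with the human scores, 0 means chance-level agreement, and negative values mean systematic disagreement. scikit-learn's `cohen_kappa_score` with `weights="quadratic"` provides the same metric.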


	| Item    | Training Samples | Testing Samples | Labels        | G-BERT | G-SciEdBERT |
	|---------|------------------|-----------------|---------------|--------|-------------|
	| S131Q02 | 487              | 122             | 5             | 0.761  | 0.852       |
	| S131Q04 | 478              | 120             | 5             | 0.683  | 0.725       |
	| S268Q02 | 446              | 112             | 2             | 0.757  | 0.893       |
	| S269Q01 | 508              | 127             | 2             | 0.837  | 0.953       |
	| S269Q03 | 500              | 126             | 4             | 0.702  | 0.802       |
	| Average | 665.95           | 166.49          | 2-5 (min-max) | 0.7136 | 0.8137      |
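
The per-item gains can be recomputed directly from the table. The snippet below is only an arithmetic check on the reported QWK values; the variable names are illustrative.

```python
# QWK per item as reported in the table: (G-BERT, G-SciEdBERT).
reported_qwk = {
    "S131Q02": (0.761, 0.852),
    "S131Q04": (0.683, 0.725),
    "S268Q02": (0.757, 0.893),
    "S269Q01": (0.837, 0.953),
    "S269Q03": (0.702, 0.802),
}

# Absolute QWK gain of G-SciEdBERT over G-BERT for each item.
gains = {item: round(sci - base, 3) for item, (base, sci) in reported_qwk.items()}

largest_item = max(gains, key=gains.get)
smallest_item = min(gains, key=gains.get)

# Gain on the average row, which covers all items rather than only the five shown.
overall_gain = round(0.8137 - 0.7136, 4)

for item, gain in gains.items():
    print(f"{item}: +{gain:.3f}")
print(f"largest: {largest_item}, smallest: {smallest_item}, overall: +{overall_gain:.4f}")
```

The differences are absolute QWK gains, so the 4.2% to 13.6% range corresponds to 0.042 (S131Q04) and 0.136 (S268Q02) on the kappa scale.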


	## Usage

	With Transformers >= 2.3, G-SciEdBERT can be loaded like this:

	```python
	from transformers import AutoModel, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("ai4stem-uga/G-SciEdBERT")
	model = AutoModel.from_pretrained("ai4stem-uga/G-SciEdBERT")
	```
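
For scoring, the encoder is usually paired with a classification head. The sketch below assumes a checkpoint that has already been fine-tuned on a single assessment item and saved locally. `model_dir`, `score_responses`, and `argmax_scores` are hypothetical names and are not part of this repository. Loading `ai4stem-uga/G-SciEdBERT` directly this way may attach a randomly initialized head, which would need fine-tuning before its predictions mean anything.

```python
def argmax_scores(logit_rows):
    """Map each row of per-label logits to the index of its largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in logit_rows]


def score_responses(responses, model_dir, num_labels):
    """Predict an integer score for each German response (illustrative only).

    model_dir is a hypothetical path to a checkpoint fine-tuned for one item;
    num_labels is the number of score levels for that item.
    """
    # Imported lazily so the helper above can be used without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_dir, num_labels=num_labels
    )
    model.eval()

    # Pad and truncate so responses of different lengths fit in one batch.
    inputs = tokenizer(responses, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    return argmax_scores(logits.tolist())
```

Here the predicted score is simply the label with the highest logit. Whether label indices map one-to-one onto PISA score points depends on how the fine-tuning data were encoded.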

	## Acknowledgments
	This project is supported by the Alexander von Humboldt Foundation (PI Xiaoming Zhai, xiaoming.zhai@uga.edu).

	## Citation

	```bibtex
	@article{Latif_2024_G-SciEdBERT,
	author = {Latif, Ehsan and Lee, Gyeong-Geon and Neumann, Knut and Kastorff, Tamara and Zhai, Xiaoming},
	title = {G-SciEdBERT: A Contextualized LLM for Science Assessment Tasks in German},
	journal = {arXiv preprint arXiv:2402.06584},
	year = {2024},
	pages = {1-9}
	}
	```

	*This model is trained and shared by Ehsan Latif, Ph.D. (ehsan.latif@uga.edu).*