
	---
	license: cc-by-4.0
	language:
	- vi
	metrics:
	- accuracy
	pipeline_tag: text-classification
	datasets:
	- ICCIES-2025-DetectAI/vietnamese_news_human_ai
	---

	# Multilingual-E5 with RoBERTa-base for AI-Generated Vietnamese News Detection

	## Overview

	This repository hosts the implementation of a hybrid model that combines Multilingual-E5 embeddings with a RoBERTa-base classification head to distinguish between human-authored and AI-generated Vietnamese news articles.

	Developed as part of the research published in Computational Intelligence in Engineering Science (Springer CCIS, vol. 2587), the model achieves a classification accuracy of over 99%, offering a reliable tool for combating misinformation and enhancing journalistic integrity in the Vietnamese context.
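A minimal inference sketch follows. This card does not document a loading procedure, so several things here are assumptions rather than the authors' method: the repository id `ICCIES-2025-DetectAI/multilingual_e5_roberta` is inferred from the organization and model name, the checkpoint is assumed to load through the standard Transformers `text-classification` pipeline (suggested by the `pipeline_tag` in the metadata), and the label names returned by the model are unknown. The helper names `top_label` and `classify` are hypothetical.

```python
from typing import Any, Dict, List

# Assumption: repository id inferred from the organization and model name.
MODEL_ID = "ICCIES-2025-DetectAI/multilingual_e5_roberta"


def top_label(scores: List[Dict[str, Any]]) -> str:
    """Return the label with the highest score from a list of {label, score} dicts."""
    return max(scores, key=lambda item: item["score"])["label"]


def classify(texts: List[str], model_id: str = MODEL_ID) -> List[str]:
    """Classify articles with the Transformers text-classification pipeline.

    Assumes the checkpoint loads through the standard pipeline API. If the
    hybrid architecture requires custom model code, this call will not work
    as written.
    """
    # Imported lazily so the helper above stays usable without the dependency.
    # Requires: pip install transformers torch
    from transformers import pipeline

    clf = pipeline("text-classification", model=model_id, top_k=None)
    # With top_k=None and a list input, each result is a list of
    # {label, score} dicts covering every class.
    return [top_label(result) for result in clf(texts, truncation=True)]
```

Calling `classify(["<Vietnamese news article text>"])` would return one predicted label per input. The mapping from label strings to "human" and "AI-generated" should be checked against the model's configuration before relying on the output.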

	By combining the semantic richness of Multilingual-E5 with the optimized pre-training of RoBERTa-base, the model captures subtle linguistic and stylistic differences between human-written and AI-generated text. It was trained on a balanced dataset of 200,000 articles:

	- 100,000 human-written texts sourced from reputable outlets (Thanh Niên, VnExpress)
	- 100,000 AI-generated texts produced by advanced large language models (LLMs) such as GPT-4o Mini, Gemini Flash 1.5, Llama 3.3, and DeepSeek
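Because the two classes are the same size, a classifier that always predicts one class would score 50% accuracy, which is the baseline to compare the reported figure against. The sketch below illustrates the accuracy metric on toy data. The label encoding (0 for human, 1 for AI) and the sample predictions are made up for illustration and are not taken from the paper.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy balanced sample; hypothetical encoding: 0 = human-written, 1 = AI-generated.
y_true = [0] * 5 + [1] * 5
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # one human article misclassified as AI

acc = accuracy(y_true, y_pred)

# Accuracy of always predicting the most frequent class.
majority_baseline = max(y_true.count(0), y_true.count(1)) / len(y_true)

print(f"accuracy={acc:.2f}, majority baseline={majority_baseline:.2f}")
```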

	---

	## Citation

	If you use this model or dataset, please cite the following paper:

	```bibtex
	@InProceedings{10.1007/978-3-031-98170-8_11,
	author = {Huynh, Minh-Phuc and Nguyen, Hoang-Anh and Le, Anh-Cuong and Truong, Dinh-Tu},
	title = {Detecting AI-Generated Vietnamese News Articles with Multilingual-E5 and BERT},
	booktitle = {Computational Intelligence in Engineering Science},
	year = {2026},
	publisher = {Springer Nature Switzerland},
	address = {Cham},
	pages = {130--144},
	isbn = {978-3-031-98170-8}
	}
	```


	## Contact

	For questions or clarifications regarding the model, the dataset, or the evaluation procedure, please contact Lê Anh Cường at leanhcuong@tdtu.edu.vn.