---
license: cc-by-4.0
language:
- vi
metrics:
- accuracy
pipeline_tag: text-classification
datasets:
- ICCIES-2025-DetectAI/vietnamese_news_human_ai
---

# Multilingual-E5 with RoBERTa-base for AI-Generated Vietnamese News Detection

## Overview

This repository hosts the implementation of a **hybrid model** that combines **Multilingual-E5 embeddings** with a **RoBERTa-base classification head** to distinguish between **human-authored** and **AI-generated** Vietnamese news articles.

Developed as part of the research published in *Computational Intelligence in Engineering Science (Springer CCIS, vol. 2587)*, the model achieves a classification accuracy of **over 99%**, offering a reliable tool for combating misinformation and enhancing journalistic integrity in the Vietnamese context.

By leveraging the **semantic richness** of Multilingual-E5 and the **optimized pre-training** of RoBERTa-base, the model effectively captures subtle linguistic and stylistic differences. Training was performed on a **balanced dataset of 200,000 articles**:

- 100,000 human-written texts sourced from reputable outlets (*Thanh Niên*, *VnExpress*)
- 100,000 AI-generated texts produced by advanced large language models (LLMs) such as **GPT-4o Mini, Gemini Flash 1.5, Llama 3.3, and DeepSeek**
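
The front matter lists accuracy as the reported metric, which is a reasonable headline number here because the two classes are balanced. As a minimal sketch (not the authors' evaluation code), the snippet below computes accuracy and per-class recall on a toy sample. The label convention `HUMAN = 0`, `AI = 1` and the helper name `evaluate` are assumptions made for illustration; the model card does not state them.

```python
from collections import Counter

# Hypothetical label convention (not stated in the model card).
HUMAN, AI = 0, 1


def evaluate(y_true, y_pred):
    """Return accuracy and per-class recall for binary human/AI labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot evaluate an empty label list")
    # Number of examples per true class.
    support = Counter(y_true)
    # Number of correctly predicted examples per true class.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    accuracy = sum(hits.values()) / len(y_true)
    recall = {label: hits[label] / support[label] for label in support}
    return accuracy, recall


# Toy balanced sample: 4 human and 4 AI articles, one AI article misclassified.
y_true = [HUMAN] * 4 + [AI] * 4
y_pred = [HUMAN] * 4 + [AI] * 3 + [HUMAN]
accuracy, recall = evaluate(y_true, y_pred)
print(accuracy, recall)
```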

---

## Citation

If you use this model or dataset, please cite the following paper:

```bibtex
@InProceedings{10.1007/978-3-031-98170-8_11,
  author    = {Huynh, Minh-Phuc and Nguyen, Hoang-Anh and Le, Anh-Cuong and Truong, Dinh-Tu},
  title     = {Detecting AI-Generated Vietnamese News Articles with Multilingual-E5 and BERT},
  booktitle = {Computational Intelligence in Engineering Science},
  year      = {2026},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {130--144},
  isbn      = {978-3-031-98170-8}
}
```

## Contact

For questions or clarifications regarding the dataset or evaluation procedure, please contact Lê Anh Cường (leanhcuong@tdtu.edu.vn).