Update README.md
README.md
CHANGED
@@ -5,5 +5,41 @@ language:
metrics:
- accuracy
pipeline_tag: text-classification
datasets:
- ICCIES-2025-DetectAI/vietnamese_news_human_ai
---

# Multilingual-E5 with RoBERTa-base for AI-Generated Vietnamese News Detection

## Overview

This repository hosts the implementation of a **hybrid model** that combines **Multilingual-E5 embeddings** with a **RoBERTa-base classification head** to distinguish between **human-authored** and **AI-generated** Vietnamese news articles.

Developed as part of the research published in *Computational Intelligence in Engineering Science (Springer CCIS, vol. 2587)*, the model achieves a classification accuracy of **over 99%**, offering a reliable tool for combating misinformation and enhancing journalistic integrity in the Vietnamese context.

By leveraging the **semantic richness** of Multilingual-E5 and the **optimized pre-training** of RoBERTa-base, the model effectively captures subtle linguistic and stylistic differences. Training was performed on a **balanced dataset of 200,000 articles**:

- 100,000 human-written texts sourced from reputable outlets (*Thanh Niên*, *VnExpress*)
- 100,000 AI-generated texts produced by advanced large language models (LLMs) such as **GPT-4o Mini, Gemini Flash 1.5, Llama 3.3, and DeepSeek**
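
The metadata above tags the model for `text-classification`, so a minimal usage sketch might look like the following. It assumes the checkpoint loads through the standard `transformers` pipeline, which this README does not confirm; a hybrid architecture may need custom loading code. The function names `load_detector` and `summarize` are hypothetical helpers introduced for illustration, and the model id is left as a parameter because the repository id is not stated here.

```python
from typing import Dict, List


def load_detector(model_id: str):
    """Build a text-classification pipeline for a Hub model id.

    Assumption: the checkpoint is compatible with the stock pipeline.
    """
    # Imported lazily so the pure-Python helper below works
    # even when `transformers` is not installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def summarize(predictions: List[Dict[str, object]]) -> Dict[str, int]:
    """Count how many texts were assigned to each predicted label."""
    counts: Dict[str, int] = {}
    for pred in predictions:
        # Label names depend on the model config and are not documented here.
        label = str(pred["label"])
        counts[label] = counts.get(label, 0) + 1
    return counts
```

With a detector built by `load_detector`, calling it on a list of article strings (for example `detector(articles, truncation=True)`) should return one `{"label": ..., "score": ...}` dictionary per article, which `summarize` then tallies by label.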

---

## Citation

If you use this model or dataset, please cite the following paper:

```bibtex
@InProceedings{10.1007/978-3-031-98170-8_11,
  author    = {Huynh, Minh-Phuc and Nguyen, Hoang-Anh and Le, Anh-Cuong and Truong, Dinh-Tu},
  title     = {Detecting AI-Generated Vietnamese News Articles with Multilingual-E5 and BERT},
  booktitle = {Computational Intelligence in Engineering Science},
  year      = {2026},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {130--144},
  isbn      = {978-3-031-98170-8}
}
```

## Contact

For questions or clarifications regarding the dataset or evaluation procedure, please contact Lê Anh Cường at leanhcuong@tdtu.edu.vn.