---
license: cc-by-4.0
language:
- vi
metrics:
- accuracy
pipeline_tag: text-classification
datasets:
- ICCIES-2025-DetectAI/vietnamese_news_human_ai
---

# Multilingual-E5 with RoBERTa-base for AI-Generated Vietnamese News Detection  

## Overview  

This repository hosts the implementation of a **hybrid model** that combines **Multilingual-E5 embeddings** with a **RoBERTa-base classification head** to distinguish between **human-authored** and **AI-generated** Vietnamese news articles.  

The model was developed as part of research published in *Computational Intelligence in Engineering Science (Springer CCIS, vol. 2587)* and reports a classification accuracy of **over 99%**. It is intended as a tool for countering misinformation and supporting journalistic integrity in Vietnamese-language news.  

The model pairs the **semantic richness** of Multilingual-E5 with the **optimized pre-training** of RoBERTa-base to capture subtle linguistic and stylistic differences between human-written and AI-generated text. It was trained on a **balanced dataset of 200,000 articles**:  

- 100,000 human-written texts sourced from reputable outlets (*Thanh Niên*, *VnExpress*)  
- 100,000 AI-generated texts produced by advanced large language models (LLMs) such as **GPT-4o Mini, Gemini Flash 1.5, Llama 3.3, and DeepSeek**  
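
Because the two classes are the same size, a classifier that always predicts one class would score 50% accuracy, which is the baseline against which the reported figure should be read. The following is a minimal sketch of that comparison on toy labels, not the evaluation code from the paper. The label encoding (`HUMAN = 0`, `AI = 1`) and the helper names `accuracy` and `majority_baseline` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical label encoding; the dataset's actual label values may differ.
HUMAN, AI = 0, 1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    if not y_true:
        raise ValueError("y_true must be non-empty")
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)


# Toy balanced sample: 5 human-written and 5 AI-generated articles.
y_true = [HUMAN] * 5 + [AI] * 5
# Toy predictions with a single AI-generated article misclassified as human.
y_pred = [HUMAN] * 5 + [AI] * 4 + [HUMAN]

print(f"accuracy:          {accuracy(y_true, y_pred):.2f}")
print(f"majority baseline: {majority_baseline(y_true):.2f}")
```

On an imbalanced dataset the majority baseline would be higher than 50%, so accuracy alone would be a weaker signal there.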

---

## Citation  

If you use this model or dataset, please cite the following paper:  

```bibtex
@InProceedings{10.1007/978-3-031-98170-8_11,
  author    = {Huynh, Minh-Phuc and Nguyen, Hoang-Anh and Le, Anh-Cuong and Truong, Dinh-Tu},
  title     = {Detecting AI-Generated Vietnamese News Articles with Multilingual-E5 and BERT},
  booktitle = {Computational Intelligence in Engineering Science},
  year      = {2026},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {130--144},
  isbn      = {978-3-031-98170-8}
}
```


## Contact

For questions or clarifications regarding the dataset or evaluation procedure, please contact Lê Anh Cường (leanhcuong@tdtu.edu.vn).