# Vietnamese News Classification based on BoW with Keywords Extraction and Neural Network

**Toan Pham Van**
*Framgia Inc. R&D Group*
*13F Keangnam Landmark 72 Tower*
*Plot E6, Pham Hung, Nam Tu Liem, Ha Noi*
*pham.van.toan@framgia.com*

**Ta Minh Thanh**
*Dept. of Network Technology*
*Le Quy Don Technical University*
*236 Hoang Quoc Viet, Cau Giay, Ha Noi*
*thanhtm@mta.edu.vn*

---

## Abstract

Text classification (TC) is a primary application of Natural Language Processing (NLP). While many studies classify text documents using methods such as Random Forest, Support Vector Machines, and Naive Bayes, most target English; research on Vietnamese text classification remains limited. This paper proposes methods to address Vietnamese news classification problems using a Vietnamese news corpus. By employing Bag of Words (BoW) with keyword extraction and Neural Network approaches, a machine learning model was trained that achieved an average accuracy of approximately 99.75%. The study also analyzes the merits and demerits of each method to identify the best one for this task.

**Keywords:** Vietnamese Keywords Extraction, Vietnamese News Categorization, Text Classification, Neural Network, SVM, Random Forest, Natural Language Processing.

---

## I. Introduction

Text classification is a machine learning problem that involves labeling a text document with categories from a predefined set. The goal is to build a system that can automatically label incoming news stories with a topic from a set of categories $C = (c_1, \dots, c_m)$.
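This setup can be illustrated with a toy classifier that scores every category in $C$ and returns the best-scoring one. The per-topic keyword lists below are hypothetical, purely for illustration, and are not the paper's method:

```python
# Minimal sketch of the classification setup: a classifier maps a document
# to the best-scoring topic in a predefined category set C.
# CATEGORY_KEYWORDS is a hypothetical, illustrative keyword dictionary.
CATEGORY_KEYWORDS = {
    "sports": {"match", "goal", "team", "player"},
    "business": {"market", "stock", "bank", "profit"},
    "technology": {"software", "chip", "internet", "data"},
}

def classify(document: str) -> str:
    tokens = document.lower().split()
    # Score each topic by how many of its keywords occur in the document,
    # then return the argmax over the category set.
    scores = {c: sum(t in keywords for t in tokens)
              for c, keywords in CATEGORY_KEYWORDS.items()}
    return max(scores, key=scores.get)

print(classify("The team scored a late goal to win the match"))  # sports
```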
With advancements in hardware, TC has become a crucial subfield of NLP. This paper applies popular multiclass classification algorithms such as Naive Bayes, Random Forest, and multiclass SVM to Vietnamese text and compares their accuracy with a custom Neural Network.

A key challenge in processing Vietnamese, compared to English, is word boundary identification: Vietnamese word boundaries are not always marked by space characters. The process of recognizing these linguistic units is called word segmentation, a critical step in text preprocessing. Inaccurate word segmentation leads to low accuracy in keyword extraction and, consequently, to wrong classifications. After keyword extraction, a dictionary is created and used to train the classification model.

## II. Related Works

### A. Text Classification

TC assigns documents to one or more predefined categories. Modern TC methods use a predefined corpus for training: features are extracted for each text category, and a classifier estimates similarities between texts to predict the category. State-of-the-art methods for English include Naive Bayes (NB), Support Vector Machine (SVM), and Convolutional Neural Network (CNN) classifiers.

### B. Vietnamese Corpus

While standard corpora such as Reuters and 20 Newsgroups are available for English, Vietnamese datasets are often restricted and small. This research uses a comprehensive Vietnamese corpus created by Vu Cong Duy and colleagues, constructed from four well-known Vietnamese online newspapers. The dataset contains a training set of 33,759 documents and a testing set of 50,373 documents across 10 main topics.

### C. Keyword Extraction

Keyword extraction is a vital technique for text classification. It involves finding unique, non-stop-word words and ordering them by frequency. This paper uses the top ten keywords to calculate a Keyword Score and build a dictionary of keywords from the corpus.

### D. Feature Selection

1. **Bag of Words (BoW)**: a common method for representing text documents, in which a document is described as a set of words with their associated frequencies, independent of the word sequence.
2. **Word Segmentation**: a robust word segmentation method is crucial for Vietnamese document classification. The study uses vnTokenizer for this purpose.
3. **Stop-words Removal**: common words that do not discriminate between classes, such as "và" ("and") and "bị" (a passive marker), are removed. A manually collected list of about 2,000 stop-words was used.

## III. Text Classification Methods

After preprocessing the text and extracting numeric features from the BoW, supervised learning algorithms are applied.

### A. Random Forest

Random Forest (RF) is a classifier consisting of a collection of tree-structured classifiers. It uses averaging to improve prediction accuracy and to control over-fitting. For classification problems, each tree casts a vote for the most popular class, and the final prediction aggregates the votes of all trees.

### B. SVM

Support Vector Machines (SVMs) work by determining the optimal hyperplane that best separates the different classes. For multiclass problems, the classifier maps a feature vector to a label by choosing the class with the highest similarity score.
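The feature-selection steps described in Section II-D can be sketched in pure Python. This is a minimal illustration under two simplifying assumptions, not the paper's implementation: whitespace splitting stands in for vnTokenizer word segmentation, and `STOP_WORDS` is a tiny placeholder for the manually collected list of about 2,000 Vietnamese stop-words:

```python
from collections import Counter

# Tiny illustrative stop-word list (the paper uses ~2000 entries).
STOP_WORDS = {"và", "bị", "là", "của"}

def top_keywords(document: str, k: int = 10) -> list[str]:
    """Keyword extraction: rank non-stop-word tokens by frequency, keep top k."""
    # Whitespace splitting stands in for proper Vietnamese word segmentation.
    tokens = [t for t in document.lower().split() if t not in STOP_WORDS]
    return [word for word, _ in Counter(tokens).most_common(k)]

def bow_vector(document: str, vocabulary: list[str]) -> list[int]:
    """BoW: word frequencies over a fixed vocabulary, independent of word order."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]
```

In this sketch, the keyword dictionary built by running `top_keywords` over the corpus would define the fixed vocabulary that `bow_vector` maps each document onto before training a classifier.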
### C. Neural Network (NN)

The proposed Neural Network architecture consists of neurons that receive a set of inputs (the BoW feature vector) and use a set of weights to compute an output. This study employs a multi-layered feed-forward neural network with 6 hidden layers, the `tanh` activation function, and stochastic gradient descent for optimization. The input layer corresponds to the BoW feature vector, and the output layer represents the document's label vector.

## IV. Results

The classification models were evaluated using precision, recall, and F1-score. The proposed keyword extraction with BoW method (KEBOW) was compared against the N-gram method and against other machine learning algorithms such as SVM and Random Forest. The results showed that the KEBOW feature selection method was more effective than the other methods on the same dataset.

The Neural Network's performance was compared with the other algorithms, as shown in Table I.

**TABLE I: Accuracy Comparison Result**

| | SVM | Random Forest | SVC | Neural Network |
| :--- | :---: | :---: | :---: | :---: |
| **10 Topics Dataset** | 0.9652 | 0.9921 | 0.9922 | **0.9975** |
| **27 Topics Dataset** | 0.9780 | 0.9925 | 0.9965 | **0.9969** |

## V. Conclusion and Future Works

The research proposed a new neural network architecture that achieved an average accuracy of 99.75% for Vietnamese text classification, outperforming methods such as SVM and Random Forest on the same dataset. This result confirms the effectiveness of the proposed feature selection method combining keyword extraction and BoW.

Identified limitations include:

* The stop-words list was built subjectively.
* The corpus has ambiguities between topics.
* Word segmentation is limited by a third-party library.

Future work will focus on improving the Neural Network's accuracy, addressing the preprocessing disadvantages, and incorporating more semantic and contextual features.

### Application of Research

The results of this research were applied in Viblo, a technical knowledge-sharing service, to automatically classify posts upon publication.