Vietnamese Text Classification with TextRank and Jaccard Similarity Coefficient

Article history: Received: 03 September, 2020 Accepted: 28 October, 2020 Online: 20 November, 2020


Introduction
This paper is an extension of work initially presented in IEEE RIVF 2020 [1] as an invited paper.
In an era of information overload, manually processing large volumes of multilingual data is challenging and time-consuming. An electronic library, for example, might need to identify documents quickly for archiving and management. Manual classification is also overwhelming for humans and often less accurate. Therefore, applying machine-based approaches to automate text classification is essential. Automation makes the classified results more reliable and less subjective, reduces human involvement and information overload, and improves the efficiency of knowledge retrieval. Many text classification methods have been studied, e.g. Bayes [2], decision trees [3], K-nearest neighbor [4], and neural networks [5]. The literature shows that automated text classification is a mainstream research topic in natural language processing [6,7]. Many studies have addressed problems such as email message filtering [8], topic modeling [9], geo-localization [10], and document categorization [11]. The flowchart of standard text classification is presented in Figure 1.
The curse of dimensionality, which arises when analyzing and organizing textual data in high-dimensional spaces, affects both accuracy and computing time [9]. However, machine learning models do not need to learn all tokens in a text to categorize it. Instead, the text's label can be identified through its key tokens, which contribute the most to its meaning. Consequently, if we can extract the main keywords of a text, we might accurately classify it into an appropriate topic. The TextRank algorithm [12,13] extracts a list of representative keywords from textual content. On that basis, the authors propose a method for automatic Vietnamese text classification based on analysis of the representative keywords of the text. Textual datasets were downloaded from several news websites covering 15 main topics.
Then, keyword sets that represent each topic are built. When the topic of a text needs to be identified, the system extracts that text's specific keywords. The class of the text is determined by computing its coverage of the representative keywords of each topic. This approach is well suited to text classification applications in areas such as electronic library management. Although many Vietnamese documents are electronically available, no previous research has applied keyword-based text classification to them.
This work describes our effort to construct a practical framework for classifying Vietnamese news texts. In this extended article, the authors make several contributions. First, we discuss the paper [1] in much more detail by extending the related work and technical background. Second, we extend the experiments by investigating more scenarios.

Related Work
An early effort to address the task of Vietnamese text classification was conducted more than a decade ago [14]. In that paper, the authors addressed the problem of automatically categorizing given textual sources into predefined categories. Statistical N-gram language modeling and bag-of-words approaches were compared on their collected dataset. Several researchers have applied spam filtering to Vietnamese text sources [15]. Short messages such as conversational texts have also been exploited to address the task of suggestion intents [16]. The authors proposed a general definition of user suggestion intent for conversational texts at the level of a functional segment unit. The task of automatic text categorization has been studied by comparing the performance of several term weighting schemes rather than analyzing the actual classification task [17]. Regarding Vietnamese sources, full-text representation has been exploited by many other research papers [18,19,20]. Having reviewed the literature, we believe this is the first attempt to represent Vietnamese texts using representative keywords.

Design Concept
The Vietnamese text classification system proposed by the authors includes two main components. One is the keyword extraction module, and the other is the comparison module, which compares a new document against the training set to identify its topic. We present the overall design of the system in Figure (??). The process begins by representing the text sources as representative keyword vectors. For the training set, the vectors are marked with the topic label. For the test data, the keyword extraction module is applied to convert the original text into its keyword-based representation. Next, the system compares this representation to the training data by calculating similarity scores. Finally, the predicted topic is assigned to the test document.

Text Pre-processing
Vietnamese is an isolating language in which every syllable is pronounced separately and is written as a separate word. This feature is evident in all aspects of phonetics, vocabulary, and grammar. Data pre-processing is the first important step of any data mining process. It makes data in its original form easier to observe and explore. For text classification, each language poses its own challenges because of its specific characteristics. Pre-processing helps improve classification efficiency and reduces the complexity of the training algorithm. Depending on the purpose of the classifier, different pre-processing methods apply, such as:
• Convert text to lowercase and correct spelling errors.
• Remove punctuation marks (if no sentence separation is performed).
• Separate tokens by compound words (Vietnamese).
• Remove the stopwords, i.e. the words that appear most often in the text but carry no meaning for text classification. We utilize a list of 1942 Vietnamese stopwords (see footnote 1) in our data processing.
• For the tokenization step, we utilize vnTokenizer [21] in our research. A comparison of the tokenization accuracy achievable with different software is beyond the scope of this research paper.
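The pre-processing steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `preprocess`, the four-word stopword set, and the sample sentence are hypothetical, and the sketch assumes that a tokenizer has already joined the syllables of compound words with underscores.

```python
import string

# Hypothetical miniature stopword set; the paper uses a list of 1942 entries.
STOPWORDS = {"la", "va", "cua", "cac"}

# Keep the underscore so that compound tokens such as "sinh_vien" survive.
PUNCTUATION = string.punctuation.replace("_", "")

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower()
    # Replace every remaining punctuation character with a space.
    table = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))
    cleaned = lowered.translate(table)
    return [tok for tok in cleaned.split() if tok not in stopwords]

tokens = preprocess("Sinh_vien la tuong_lai cua dat_nuoc.")
```

Spelling correction is left out of the sketch because it depends on language resources that are not described here.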

Vietnamese Text Tokenization
Phonetic characteristics. In Vietnamese, there is a special type of unit called "tieng" (literally, a sound). Phonetically, each "tieng" is a syllable. For example, the word "student" is translated into two syllables, "sinh vien", which are written as two separate words. As a result, these two words should be joined to form a meaningful token.
Vocabulary characteristics. Each "tieng" is, in general, a meaningful element. Continuing the previous example, the words "sinh" and "vien" each have their own meaning when they stand alone. But when they combine to form the single word "sinh vien", it means student, as in English. The vocabulary of Vietnamese is based on single words (one syllable) and their countless combinations. Creating new words is very easy and flexible. If we can pronounce a sound, we can write it down as a word.
Grammatical characteristics. Vietnamese words do not change their morphology. For example, verbs in Vietnamese do not have -ed, -s, or -ing forms. This feature dominates the other grammatical characteristics. When words are combined into sentences, it is important to know the word order, the phrases, and the keywords used for tense recognition. Arranging words in a certain order is one of the main ways to express syntactic relations.
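To make the "sinh vien" example concrete, the following sketch groups syllables into word tokens with a greedy longest-match lookup. It is only an illustration of why segmentation is needed: the function `segment`, the toy lexicon, and the `max_len` parameter are assumptions, and the technique shown is a simple dictionary lookup, not the method implemented in vnTokenizer.

```python
def segment(syllables, lexicon, max_len=3):
    """Greedy longest-match grouping of syllables into word tokens."""
    tokens = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                # Join the syllables of a multi-syllable word with underscores.
                tokens.append(candidate.replace(" ", "_"))
                i += n
                break
    return tokens

LEXICON = {"sinh vien", "dai hoc"}  # hypothetical toy lexicon
words = segment("toi la sinh vien dai hoc".split(), LEXICON)
```

With this toy lexicon, "sinh" and "vien" end up in one token, so later steps treat "student" as a single unit rather than as two unrelated words.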

Keyword Extraction by TextRank
The TextRank algorithm [22] was developed from the main idea of the PageRank algorithm that the Google search engine uses to rank websites [zhou2019chip, langville2008google, berkhout2016google]. The core idea of TextRank is to represent the text as a graph and to score the importance of each word from the structure of that graph. In other words, TextRank produces a group of keywords that represents the entire text. It ranks words by their importance, arranges them in descending order of computed score, and extracts the most important ones. The number of keywords to extract is a hyperparameter that the user sets before the algorithm is run. The algorithm has been applied successfully to keyword extraction from a single text, which is also its main advantage.
The TextRank algorithm represents the textual source as a graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges; $E$ is a subset of $V \times V$. Each vertex of $G$ corresponds to one word extracted from the text. An edge between two vertices is created when their words co-occur in the text within a window of between 2 and N words. The importance score of the vertex $V_i$ is calculated using the following formula:

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j) \qquad (1)$$

where $d \in [0, 1]$ is the damping factor. In our experiments, $d$ is set to 0.85 by default [22]. $In(V_i)$ is the set of vertices that point to $V_i$, and $Out(V_i)$ is the set of vertices that $V_i$ points to. The TextRank algorithm is presented in Algorithm (1).
Algorithm 1 Implementation of the TextRank algorithm.
Input: Textual data
Output: Extracted keywords K
1: Build the graph G = (V, E)
2: Compute the importance score of each vertex by Equation (1)
3: Sort the vertices by their scores
4: Select the top K vertices
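The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the graph is treated as undirected and unweighted (so the in- and out-neighbors of a vertex coincide), the co-occurrence window defaults to adjacent tokens, and a fixed number of iterations stands in for a convergence test. The function name `textrank_keywords` and its parameters are hypothetical.

```python
from collections import defaultdict

def textrank_keywords(tokens, top_k=5, window=2, d=0.85, iterations=50):
    """Return the top_k tokens ranked by an undirected TextRank score."""
    # Step 1: build the graph; two words are linked when they are fewer
    # than `window` positions apart in the token sequence.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        neighbors[word]  # register the word even if it stays isolated
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Step 2: iterate S(Vi) = (1 - d) + d * sum(S(Vj) / |Out(Vj)|).
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - d) + d * sum(scores[n] / len(neighbors[n])
                                    for n in neighbors[word])
            for word in neighbors
        }

    # Steps 3 and 4: sort by descending score (ties broken alphabetically)
    # and keep the top_k words.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]

keywords = textrank_keywords(
    ["a", "hub", "b", "hub", "c", "hub", "d"], top_k=1)
```

In the toy input, the word "hub" is adjacent to every other word, so it collects the highest score. Here `top_k` plays the role of the user-defined number of keywords.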

Keyword Extraction for Topics
At this stage, we build a sample set of keywords for each topic from the keywords extracted from the topic-labeled texts in the training set. The process of building the keyword sample sets for the topics is summarized in Figure (3). The keywords of a topic are determined statistically, by counting the number of occurrences of each word across the keyword sets of the training texts. A word that is a keyword of one topic cannot be a keyword of another topic.
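A possible way to build such topic keyword sets is sketched below. The counting follows the description above, but how a word shared by several topics is resolved is an assumption of this sketch: the word is kept by the topic in which it occurs most often, and ties go to the alphabetically first topic. The function `build_topic_keywords`, the `top_n` parameter, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def build_topic_keywords(labeled_keywords, top_n=50):
    """labeled_keywords: iterable of (topic, keyword_list) pairs."""
    # Count in how many training texts of each topic a keyword appears.
    counts = defaultdict(Counter)
    for topic, keywords in labeled_keywords:
        counts[topic].update(set(keywords))

    # Assumption: a shared word belongs to the topic where it is most
    # frequent; on a tie the alphabetically first topic keeps it.
    owner = {}
    for topic in sorted(counts):
        for word, freq in counts[topic].items():
            if word not in owner or freq > counts[owner[word]][word]:
                owner[word] = topic

    # Keep the top_n most frequent exclusive keywords of each topic.
    topic_keywords = {}
    for topic, counter in counts.items():
        exclusive = [(w, f) for w, f in counter.items() if owner[w] == topic]
        exclusive.sort(key=lambda wf: (-wf[1], wf[0]))
        topic_keywords[topic] = [w for w, _ in exclusive[:top_n]]
    return topic_keywords

training = [
    ("sports", ["goal", "team", "match"]),
    ("sports", ["team", "coach"]),
    ("economy", ["market", "team"]),
    ("economy", ["market", "bank"]),
]
topic_sets = build_topic_keywords(training)
```

In the sample data, "team" appears in both topics but more often under "sports", so it is removed from the "economy" set.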

Similarity Measurement by Jaccard Distance
Mathematically, there are many ways to calculate the similarity between any two keyword lists $R_i$ and $R_j$, provided they are of the same length. However, to measure the similarity between two documents, we do not need to include all the words in the text, only the representative keywords $T$. The number of representative keywords is much smaller than the total number of words in the text. The similarity is then measured with a weighted version of the Jaccard [23] distance.
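As a point of reference, the sketch below computes the plain (unweighted) Jaccard coefficient, the size of the intersection of two keyword sets divided by the size of their union, and uses it to pick a topic. This is the standard coefficient and not the authors' weighted variant; the function names `jaccard` and `classify` and the sample keyword sets are hypothetical.

```python
def jaccard(a, b):
    """Unweighted Jaccard coefficient |A ∩ B| / |A ∪ B| of two keyword lists."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty lists
    return len(set_a & set_b) / len(union)

def classify(doc_keywords, topic_keywords):
    """Return the topic whose keyword set is most similar to the document."""
    # Iterating over sorted topic names makes tie-breaking deterministic.
    return max(sorted(topic_keywords),
               key=lambda topic: jaccard(doc_keywords, topic_keywords[topic]))

topic_keywords = {
    "sports": ["team", "goal", "match"],
    "economy": ["market", "bank", "stock"],
}
predicted = classify(["goal", "team", "fans"], topic_keywords)
```

The corresponding Jaccard distance is one minus this coefficient, so choosing the largest coefficient is the same as choosing the smallest distance.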

Datasets
Data are collected from highly reputable Vietnamese websites. We used Teleport Pro software (see footnote 2) for automated data collection. The downloaded data is converted to plain text files and saved to folders named after the predefined topics. Specifically, data is downloaded from four websites (see footnotes 3 to 6) covering 15 main topics. These topics are summarized in Table (1). The collected data was split into 80% training data and 20% test data.
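An 80/20 split of this kind can be sketched as follows. How the documents were actually assigned to the two groups is not described, so the random shuffle, the fixed seed, and the function name `split_dataset` are assumptions of this sketch.

```python
import random

def split_dataset(documents, train_ratio=0.8, seed=0):
    """Shuffle the documents and split them into training and test lists."""
    shuffled = list(documents)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train_docs, test_docs = split_dataset(range(100))
```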

Evaluation Metric
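We define the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). We also define $m_+$ as the total number of condition positives, $m_-$ as the total number of condition negatives, $\hat{m}_+$ as the total number of predicted condition positives, $\hat{m}_-$ as the total number of predicted condition negatives, and $m$ as the total population. Then, we compute the sensitivity, or recall, by using:

$$\text{recall} = \frac{TP}{m_+} = \frac{TP}{TP + FN}$$

We compute precision as follows:

$$\text{precision} = \frac{TP}{\hat{m}_+} = \frac{TP}{TP + FP}$$

Then, we compute F1-score as follows:

$$\text{F1-score} = \frac{2 \cdot \text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}}$$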
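The metrics defined above can be computed for one topic at a time by treating that topic as the positive class. The sketch below does this from two label lists; the function name `precision_recall_f1`, the zero-division convention, and the toy labels are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1-score with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Return 0.0 when a denominator is zero (a convention chosen here).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

y_true = ["a", "a", "a", "b", "b"]
y_pred = ["a", "a", "b", "a", "b"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="a")
```

In the toy labels, topic "a" has two true positives, one false positive, and one false negative, so precision, recall, and F1-score all equal 2/3.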

Experimental Results
The authors also report statistics for all experimental scenarios in Table (2). Note that "all" in the table means all the tokens in a particular text. The length of the texts varies, and the average number of tokens is 500. We plot the relationship between the number of keywords and the accuracy scores in Figure (4). The accuracy stabilizes at 90% at 50 keywords. In Figure (5), we observe a considerable increase from 60 to 70 keywords, while in Figure (6), the test time eventually grows.

Conclusion
In this article, the authors have described an approach that classifies texts by extracting their specific representative keywords. We discussed the proposed system in detail, covering the abstract design, text pre-processing, and the characteristics of Vietnamese. Then we described the graph-based TextRank algorithm, which scores important information about the text's structure. Extensive experiments have been conducted to demonstrate the stability and robustness of the proposed system. A high accuracy of 90.07% has been achieved. Although many Vietnamese documents are electronically available, this is the first work to conduct text classification on them based on keywords. This research describes our effort to construct a practical framework for classifying Vietnamese texts from news websites.