Sentiment Analysis on Utilizing Online Transportation of Indonesian Customers Using Tweets in the Normal Era and the Pandemic Covid-19 Era with Support Vector Machine

Online transportation in Indonesia is a new mode of transportation that is currently used across the lower to upper levels of society. The change in behavior began in 2011 and continues to grow to this day. The comments that accumulate on social media are very important for online transportation companies: negative comments lower the number of users, while positive comments increase it. Thus, the comments influence the overall trust of their customers. Among social media, Twitter is a place where many people convey feelings of pleasure and displeasure in a timely manner, especially at a time when the COVID-19 pandemic has become a serious outbreak. Through these “tweets,” many customers express their experience with the service. In this paper, we aim to analyze the experience of online transportation consumers using a Support Vector Machine. The data were taken in two periods, i.e., April 2019 (“the normal era”) and June 2020 (“the COVID-19 pandemic era”). Class labeling was done based on 3 categories, namely positive, negative, and neutral, and the data were mined with the keywords @grabID and @gojekindonesia. In total, 1618 data were obtained, of which 1183 are normal era data and 435 are data from the era of the COVID-19 pandemic. The highest accuracy occurred in the normal era with 10% of the data as test data and 90% as training data, where the linear and sigmoid kernels reached 0.8060, while the COVID-19 era only reached a highest accuracy of 0.59 with the linear kernel at a ratio of 60:40. This is a sign that the COVID-19 pandemic does not contribute to decreasing trust in the service. © 2020 ASTES Publishers. All rights reserved.


Introduction
Online transportation in Indonesia began to operate in early 2011 and has continued to develop since 2015. It is now one of the most in-demand services in Indonesia [1] and is predominantly run by two large companies. Despite the availability of in-app reviews, many customers choose to post their responses, suggestions, or even disappointment on social media, such as Twitter.
Sentiment analysis is a part of data mining that uses text as the object of analysis. Simply stated, it extracts knowledge by appraising the opinions or sentiments expressed in text.
Some scholars also refer to sentiment analysis as opinion mining. People give their reviews in unstructured formats through blogs, forums, etc. These unstructured reviews are processed to extract opinions from them, and these opinions can be positive, negative, or neutral [2].
The purpose of this paper is to unravel the underlying online transportation reviews on Twitter, or tweets, and compare them in two periods: the normal era and the COVID-19 pandemic era. The data were analyzed using a Support Vector Machine (SVM) with the Python language.

Related Work
Several papers explain the use of SVM and its performance. In a previous study [3], the authors divided the 800 data obtained into two parts, 500 for training data and 300 for test data, and the accuracy obtained was fairly high at 0.80. However, this research has a weakness in the features used for classification.
In [4], the authors compared the performance of the SVM and Naïve Bayes algorithms. In the preprocessing stage, the data set was converted into structured reviews and then into numerical form using a lexicon-based approach. This preprocessing uses feature selection and semantic analysis, namely word segmentation, POS tagging, and the calculation of sentiment scores with the SentiWordNet dictionary. The results were then put into the classification process using both SVM and Naïve Bayes. On the 3 data sets tested (Gold, Film, and Twitter), Naïve Bayes achieved 69.10%, 74.55%, and 76.67%, while SVM achieved 72.74%, 74.73%, and 76.92%. Apart from these two algorithms, a comparison was also performed using the RBF kernel, where the results were 73.56%, 74.745%, and 78.18%. The RBF kernel gave the highest accuracy, so its use is highly recommended.
In [5], the authors presented a sentiment analysis of regional head candidates using an SVM algorithm to build the model. The data sets were taken from news documents about regional head candidates in East Java in 2018, i.e., Kompas, JPNN, Republika, and Detik. The result was that Gus Ipul, as regional head candidate number two, had precision, recall, and AUC values of 0.940, 0.948, and 0.890, respectively. Models built on the data resulted in a good performance with accurate estimation.
Sentiment analysis in other domains, including the use of transportation, is presented in the following papers. The first, [6], applied SVM and nonnegative matrix factorization to stock market prediction using Twitter and historical data. The stock dataset used the LQ45 index from August 2018 to January 2019. The features consist of closing price, volume, percentage of topics, and sentiment. The price and volume were taken from Yahoo Finance data, while topics and sentiments were retrieved from the comments on every stock in LQ45. The result was an accuracy of 60.16%.
The second, [7], also showed that SVM provides a fairly good level of accuracy: with 1183 data split into 90% training data and 10% test data, the accuracy was 0.8 for the linear kernel, 0.78 for RBF, 0.8 for sigmoid, and 0.77 for polynomial.
In [8], the authors reviewed comments from Twitter related to the use of "TransJakarta", the transportation service managed by the government of DKI Jakarta. The algorithm used was a Convolutional Neural Network, and the architectures tested were VGG with 16 layers, ResNet with 34 layers, and GoogleNet with 22 layers. The ResNet 34-layer architecture provided the best F-score of 98.11%, better than VGG, whose highest F-score was 96.74%, and GoogleNet with 96.80%.
In [9], the authors presented a paper titled "Exploring the Performance Characteristics of the Naïve Bayes Classifier in the Sentiment Analysis of an Airline's Social Media Data," whose purpose was to understand the comments of airline users on Twitter using Naïve Bayes as the model. In the dataset obtained, negative sentiment accounted for 37.47% and positive sentiment for 65.53%. They highlighted that a large training data set has the potential to improve the accuracy of the classifier.

Data Collecting / Methods and Technique
In its early stages, this paper is limited to the two largest online transportation companies in Indonesia, namely GoJek Indonesia and GrabId. The data for April 2019 were taken from [7], and the process of crawling data using the Twitter APIs then continued from April to June 2020, a period referred to as the era of the COVID-19 pandemic.

Research Planning
The planning of the research consists of 5 stages as follows:

Data Selecting
Data retrieval was carried out in the data crawling process. The credentials required were the consumer key, which serves as the entry key into the Twitter APIs, the access token, and the access token secret. The keywords @grabid and @gojekindonesia were used, and tweets published from April 2019 until June 2020 were selected. The data were retrieved randomly, so not all days were covered. The data were then examined with the help of the Indonesian dictionary in the literary package as a Natural Language Processing (NLP) standard and subsequently labeled. The labels were validated by a team of Indonesian language experts at the language center of Universitas Singaperbangsa Karawang.
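The selection step described above can be sketched as a simple filter over already-crawled tweets. This is only an illustration under stated assumptions: the tweet records are plain dictionaries with hypothetical `text` and `date` fields, the function name `select_tweets` is invented for this example, and the crawling itself (which needs Twitter API credentials) is not shown.

```python
from datetime import date

# Keywords named in the text; matching is done case-insensitively (an assumption).
KEYWORDS = ("@grabid", "@gojekindonesia")

def select_tweets(tweets, start, end, keywords=KEYWORDS):
    """Keep tweets that mention a keyword and were posted within [start, end]."""
    selected = []
    for tweet in tweets:
        text = tweet["text"].lower()
        in_window = start <= tweet["date"] <= end
        if in_window and any(keyword in text for keyword in keywords):
            selected.append(tweet)
    return selected

# Hypothetical sample records, not data from the study.
sample = [
    {"text": "@GrabID driver was friendly", "date": date(2019, 4, 12)},
    {"text": "traffic is terrible today", "date": date(2019, 4, 13)},
    {"text": "@gojekindonesia order cancelled", "date": date(2020, 7, 1)},
]
kept = select_tweets(sample, date(2019, 4, 1), date(2020, 6, 30))
```

Only the first record survives: the second has no keyword and the third falls outside the date window.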

Preprocessing
• Cleaning
This stage is the process by which the data is cleaned. The items that must be removed include symbols such as "@" and "#" and characters such as "!$%^&*()_", so that the data can be read clearly.

• Case folding
Case folding is a stage where the data is standardized. Letters, words, or sentences in sentence case, upper case, capitalized case, and toggle case are changed into lowercase so that the uniformity of the data is maintained.
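A minimal sketch of cleaning followed by case folding, using Python's standard `re` module. The function name `clean_tweet` and the sample tweet are hypothetical, and dropping links and digits is an assumption that goes beyond the symbols listed above.

```python
import re

def clean_tweet(text):
    """Strip symbols and non-letter characters, then fold to lowercase."""
    text = re.sub(r"https?://\S+", " ", text)  # assumption: links are dropped too
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # strip @, #, digits, punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text.lower()                        # case folding

cleaned = clean_tweet("@GrabID Driver TELAT 30 menit!!! #kecewa https://t.co/abc")
```

Here only the symbols are removed, so the words behind "@" and "#" are kept; removing whole mentions would be an equally plausible reading of the step.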

• Tokenizing
Tokenization is a process in which sentences are cut into tokens. Punctuation marks such as commas, periods, and others were removed. Tokens are read as words according to Indonesian grammar. To make the text clearer, the spelling was normalized so that the words are properly spelled. Many abbreviations were also normalized, such as min to admin and telp to telephon (make/made a phone call). The final step was to remove unused words, a process also called stop word removal.
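The tokenizing, normalization, and stop word steps can be sketched as follows. The two lookup tables are hypothetical and deliberately tiny: the abbreviation pairs come from the examples in the text, while the stop word list is an assumption made only for illustration.

```python
# Hypothetical lookup tables; real ones would be far larger.
NORMALIZATION = {"min": "admin", "telp": "telephon"}
STOP_WORDS = {"yang", "di", "dan", "ke"}

def tokenize(text, normalization=NORMALIZATION, stop_words=STOP_WORDS):
    """Split cleaned text into tokens, normalize abbreviations, drop stop words."""
    tokens = text.split()                                # whitespace tokenization
    tokens = [normalization.get(t, t) for t in tokens]   # expand abbreviations
    return [t for t in tokens if t not in stop_words]    # stop word removal

tokens = tokenize("min tolong telp driver yang di lokasi")
```

The input is assumed to be already cleaned and lowercased, so splitting on whitespace is enough to separate the words.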

Data Transformation
After the preprocessing phase is completed, the transformation process applies TF-IDF (Term Frequency-Inverse Document Frequency), a step that converts the text from categorical values to numerical values. The data are then extracted into these values.
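To illustrate how TF-IDF turns tokens into numbers, the sketch below uses the plain textbook form, term frequency times log(N / document frequency). This is an assumption: the paper does not state which variant was used, and library implementations typically add smoothing and normalization. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized tweets, not data from the study.
docs = [["driver", "ramah"], ["driver", "telat"], ["aplikasi", "error"]]
weights = tf_idf(docs)
```

A term that appears in fewer documents, such as "ramah", receives a higher weight than a common one, such as "driver".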

Text mining with SVM
The next step is data mining with SVM as the model. Three classes were used: Positive, Negative, and Neutral.
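The SVM kernels compared in this paper (linear, polynomial, RBF, and sigmoid) differ only in the similarity function applied to a pair of feature vectors. The sketch below writes out the standard definitions. The hyperparameter values (`gamma`, `coef0`, `degree`) are assumptions, since the paper does not report them, and this is not the training procedure itself.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, gamma=1.0, coef0=1.0, degree=3):
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)

# Two hypothetical TF-IDF-style vectors.
x, y = [1.0, 0.0, 2.0], [0.5, 1.0, 1.0]
values = {
    "linear": linear_kernel(x, y),
    "polynomial": polynomial_kernel(x, y),
    "rbf": rbf_kernel(x, y),
    "sigmoid": sigmoid_kernel(x, y),
}
```

In practice, a library classifier would be given the kernel name and would compute these values internally during training.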

Evaluation
The evaluation was carried out to assess the performance of SVM, with accuracy as the benchmark. The accuracy reflects how well the initial data collection, preprocessing, and transformation were carried out.
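The evaluation can be sketched as a train/test split at a given ratio followed by an accuracy calculation. Shuffling with a fixed seed is an assumption, as the paper does not describe how the split was drawn. The function names and the sample labels are hypothetical.

```python
import random

def split_data(samples, test_ratio, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# 1183 placeholder items split 90:10, as in the normal era experiment.
train, test = split_data(list(range(1183)), test_ratio=0.10)

# Hypothetical labels, not results from the study.
score = accuracy(
    ["positive", "negative", "neutral", "negative"],
    ["positive", "negative", "negative", "negative"],
)
```

With 1183 items and a 10% test ratio, the split yields 1065 training items and 118 test items.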

Result
More than 3,000 data were gained in the crawling process. After preprocessing, 1618 data remained: 1183 for the normal era and 435 for the COVID-19 era. In Figure 12, the data are separated into categories based on the object: 733 data for Gojek Indonesia and 885 data for Grab ID. The pattern of each category is shown in Table 3 for the normal era and in a separate table for the pandemic era. Table 4 shows that there is a significant increase in the sentiment toward online transportation in the COVID-19 era. The results show that, following the additional data from the COVID-19 era, there is a decrease in accuracy. Hence, these are very poor data with which high accuracy is rather difficult to achieve. It is assumed that the trend is heavily influenced by transportation policies in the pandemic era, both from the Government and from the companies, such as health protocols for online transportation partners that reduce the performance of online transportation.

Conclusion
This study aimed to compare sentiment toward online transportation services before and during the COVID-19 pandemic era using SVM. It was found that the performance of every kernel decreased in the pandemic era. In the normal era, the highest accuracy, 0.8, was obtained with the linear and sigmoid kernels at a ratio of 90:10. Meanwhile, in the pandemic era, the highest accuracy was found only with the linear kernel at a ratio of 60:40, which is 0.59. The accuracy likely declined due to several factors, such as poor and unstructured initial data and the customers' reluctance to adopt the health protocol, which contributes to the rise of displeasure with the service.