Exploring the Performance Characteristics of the Naïve Bayes Classifier in the Sentiment Analysis of an Airline’s Social Media Data

ARTICLE INFO
Article history: Received: 20 April, 2020; Accepted: 08 July, 2020; Online: 28 July, 2020

ABSTRACT
Airline operators receive a great deal of feedback from their customers, which is vital for both operational and strategic planning. Social media has become one of the most popular platforms for obtaining such feedback. However, analyzing, categorizing, and generating useful insight from the huge quantity of data on social media is not a trivial task. This study investigates the capability of the Naïve Bayes classifier for analyzing sentiments about an airline's brand image. It further examines the impact of data size on the accuracy of the classifier. As a case study, we collected data from online conversations about an incident in which an airline's security operatives roughly handled a passenger. It was reported that the incident resulted in a loss of about $1 billion of the company's corporate value. Data were extracted from Twitter, preprocessed, and analyzed using the Naïve Bayes classifier. The findings showed 62.53% negative and 37.47% positive sentiment about the incident, with a classification accuracy of over 0.97. To assess the impact of training size on the accuracy of the classifier, the size of the training set was varied. A direct linear relationship between the training size and the classifier's accuracy was observed. This implies that large training data sets have the potential to increase the classification accuracy of the classifier. However, it was also observed that a continuous increase in the training size could lead to overfitting. Hence, there is a need to develop mechanisms for determining the optimum training size for the best accuracy of the classifier. The negative perceptions of customers could have a damaging effect on a brand and ultimately lead to a catastrophic loss for the organization.


Introduction
The evaluation of customers' perception of the services offered by a firm is crucial to continued patronage and, subsequently, to the growth of the firm's business. How a customer feels about the service of a company over time is the sum of conscious events, which can be seen as a coordinated series of interactions between the customer and the brand of the company's service or product. Thus, a customer's perception is entirely his or her personal view of the brand of service [1]. The major ways of capturing customer feedback on a service or product include customer contact forms, customer feedback surveys (offline or online), social listening, and interviews. Social listening is fast becoming the most prominent of these approaches.
Businesses can engage in social listening through social media platforms such as Twitter, Facebook, YouTube, blogs, and email. Examples of brands that use Twitter to engage with their customers include Starbucks, PlayStation, Samsung Mobile US, Sony, Whole Foods, McDonald's, Amazon, and JetBlue Airways, among others. Social media are probably the most trusted medium through which people express their views about various public issues, compared with other media. This may be because most brands are talked about on social media by their customers, and such brands can obtain feedback from their customers without the customers necessarily knowing that such data are being collected about them. Sentiment analysis is one of the ways of scrutinizing people's perceptions. In [2], it was described as an area of study concerned with examining people's views, perceptions, assessments, judgements, attitudes, and emotions towards services, concerns, and topical issues, to mention a few. It was described in [3] as "contextual mining of text which recognizes and extracts subjective information in the source material to help an enterprise understand the societal perception of their brand, product or service through monitoring online conversations". It has become a major area of natural language processing (NLP). Its major task is the classification of text-based conversations to obtain insight into the intention of the author of the text [4].

ASTESJ ISSN: 2415-6698
The New York Times [5] reported an incident in which three security operatives and a sergeant roughly handled and violently removed a passenger from an airline's Flight 3411 on April 9, 2017. It was further reported that the airline staff provided false information and consciously removed material facts from their reports. Subsequently, the parent company of the airline lost about $1 billion in market value. Could the incident have contributed to the loss? What was people's perception of the incident? These and other questions are what we seek to answer in this work by analyzing people's comments on social media using artificial intelligence. Advances in artificial intelligence, particularly natural language processing, have considerably improved the ability of algorithms to analyze text. This study, therefore, explores the performance characteristics of a Naïve Bayes classification model, and the effect of data size on it, in analyzing people's sentiment in the social media data of a company.

Related Works
There have been a number of research efforts on sentiment analysis through social media platforms in recent times. In this section, we review selected literature. Hand-crafted and automatic models for removing factual or neutral comments that have no sentiment attached were reviewed in [6]. It was found that hand-crafted models perform well with strong sentiments but could not identify content with weak sentiments. Deep learning, an automatic technique, provides a meta-level feature representation that generalizes well to new domains and languages. Multi-modal methods can combine the abundant audio and video forms of social data with text using multiple kernels. The high dimensionality of n-gram features and the temporal nature of sentiments in long product reviews were identified as major challenges in sentiment mining.
In [4], Naïve Bayes (NB) and Support Vector Machine (SVM) classifiers were implemented to classify the sentiment of movie reviews. In the study, the SVM classifier outperformed the NB classifier in predicting the sentiment of a review. Other supervised learning classification techniques, such as the stochastic gradient classifier, K-nearest neighbour, and maximum entropy classifiers, were suggested for future study. The classification of Twitter sentiment using the average semantic orientation of phrases containing adverbs and adjectives was conducted in [7]. Phrases with good associations are said to have a positive semantic orientation, while those with bad associations have a negative semantic orientation. The semantic orientation of a phrase is calculated as the difference between the mutual information of the given phrase and the word "excellent" and the mutual information of the given phrase and the word "poor". Experiments revealed that the proposed technique outperformed the existing methods. The performances of Naïve Bayes, maximum entropy, and SVM on Twitter data streams were compared in [8]. SVM and Naïve Bayes showed superior accuracy and thus could be regarded as baseline learning methods. The sentiments of mobile users' reviews were analyzed in [9]. Statistical analysis was carried out on a large amount of real data, and four characteristics that distinguish phone reviews from those of PCs were identified, namely: short average length; large span of length; power-law distribution; and a significant difference in the polarity of mobile application reviews. The results further showed that, for classification, the Bayesian algorithm outperformed the SVM algorithm.
The extent to which sentiment analysis techniques could provide more useful new insights than traditional quality assessment methods was investigated in [10], using a dataset of 4392 tweets. Sentiment analysis techniques identified 23 attributes that could be used for comparison with other American Society for Quality (ASQ) scales. Results indicated that the rate at which passengers respond to the attributes of a scale differed greatly in certain instances, and that identifying these differences could provide insights for airport management to improve airport service quality.
The performances of a back propagation neural network (BPN), a probabilistic neural network (PNN), and a homogeneous ensemble of PNNs (HEN) were compared in [11], using varying levels of word granularity as features for feature-level sentiment classification. A product review dataset collected from Amazon review websites was used to validate the algorithms. The results of the ANN-based methods were compared with those of two individual statistical methods using five quality parameters. The results showed that the homogeneous ensemble of neural networks offers superior performance. In addition, PNNs outperformed BPN during classification, and the combination of neural network-based sentiment classification methods with principal component analysis (PCA) as a feature reduction method offered superior performance in terms of training time. A hybrid approach for classifying product reviews from Amazon was proposed in [12]. The results showed that the hybrid approach outperformed the individual classifiers (Random Forest and Support Vector Machine) on this Amazon dataset. Various sentiment analysis techniques and their performance at analyzing sentiments were presented in [13]. The study was aimed at developing a sentiment analysis technique that would categorize various reviews efficiently. The work described machine learning techniques such as SVM, NB, maximum entropy, and other techniques that could improve the analysis process. The authors suggested n-gram evaluation for semantic analysis instead of word-by-word analysis.
An evaluation of features for sentiment analysis carried out in [14] showed that the unigram was the best feature for extracting sentiment from a review. Unigram with stemming and with stop words gave 82.9% accuracy, and unigram with stemming and without stop words gave 83% accuracy in the positive class, while unigram with stemming and without stop words gave a better accuracy of 83.1% in the negative class. The two classes offered improved performance with information gain. The authors suggested an ensemble feature selection technique for additional experiments in further studies.
In [15], a sentiment analysis system was developed for a resource-poor language, namely Roman Urdu. The authors noted that most work on sentiment analysis has been on the resource-rich languages of the world, while very little has been done on resource-poor languages. The work involved four different studies based on word-level features, character-level features, and feature union. The results showed that the error rate could be reduced by 12% from the baseline (80.07%), and that the results of the studies were statistically significantly different from the baseline.
The reviewed works show that sentiment analysis is an active research area in computing, with diverse classification approaches that seek to provide effective and efficient mechanisms for analyzing sentiments. It was particularly noted that the popularity of machine learning techniques has been on the increase. However, little or no effort has been made to examine the impact of data size on the classification performance of the various algorithms. We have, therefore, considered the relationship between data size and performance, using one of the highly rated machine learning algorithms in the literature, Naïve Bayes, in the sentiment analysis of an airline's social media data.

Methodology
This section presents the sentiment classification process and the methods applied in this study. A summary of the process is presented in Figure 1, and a description of the methods follows. Tweets were extracted from Twitter with the Python library pyQuery, a process known as web scraping; the JSON responses were written to a flat file. The data were then loaded into a data frame for further processing. Datasets for training the classifier were downloaded from CrowdFlower.com, which hosts a large number of already labelled tweet texts in areas such as airline sentiment, hotel reviews, and product reviews, each classified as positive, negative, or neutral. The labelled airline tweets extracted from CrowdFlower.com were used as the training set for the classifier, and the newly downloaded tweets were used for testing the classification accuracy.

Data Collection and Description
Data collection was achieved using a "forked" Tweet API built on the Jsoup library. The standard Twitter API allows retrieval of tweets with a maximum lag time of seven days and permits scraping of only 18,000 tweets per 15-minute window (https://towardsdatascience.com/how-to-scrape-tweets-fromtwitter-59287e20f0f1). However, the tweets used for this study went live between April and May 2017. Because of this restriction, a "forked program" available on GitHub (https://github.com/Jefferson-Henrique/GetOldTweets-java) was used instead to collect the data. The extracted data were saved to files. Jsoup is a Java library that works with data in HTML format. It scrapes and parses HTML from a URL, file, or string using the WHATWG HTML5 specification. Forty thousand rows of tweets with the keyword "United Airlines" were scraped for the period from 12th May 2017 to 31st May 2017. This was done by querying with the following command in the "forked" Tweet API:
java -jar got.jar QuerySearch="United Airlines" since=2017-05-12 until=2017-05-31 maxtweets=40000

Filtering
Filtering was then carried out to exclude metadata and other irrelevant text from the sentiment classification. Items such as identity numbers, dates, times, irrelevant tags, hyperlinks, hashtags, punctuation, and special characters were removed.
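A filtering step of this kind can be sketched with regular expressions. The patterns below are illustrative assumptions rather than the rules used in the study; in particular, whether a hashtag is dropped entirely or only stripped of its "#" sign is a choice the text does not settle, and clean_tweet is a hypothetical helper name.

```python
import re
import string

# Illustrative patterns only; the study's actual filter rules are not given.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NUMBER_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Strip links, mentions, hashtags, digits and ASCII punctuation."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)   # assumption: drop the whole hashtag
    text = NUMBER_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # Collapse the whitespace left behind by the removals, then lowercase.
    return " ".join(text.split()).lower()


sample = "@united Flight 3411 was awful!!! https://t.co/abc123 #boycott"
print(clean_tweet(sample))  # flight was awful
```

Links are removed before punctuation so that a URL is deleted as a unit instead of being broken into stray word fragments.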

Feature Extraction
Several methods could be used to classify texts as positive, negative, or neutral. One such method uses as seed words a subset of adjectives manually classified as positive and a subset manually classified as negative [7]. Equation (1) describes such a method.
Let ADJp and ADJn denote the positive and negative seed sets, respectively. Let L(Wi, POSj) represent a modified log-likelihood ratio for a word Wi with part of speech POSj, defined as the ratio of its collocation frequency with ADJp to its collocation frequency with ADJn within a sentence. Then L(Wi, POSj) can be computed as follows:

$$L(W_i, POS_j) = \log \frac{\left(Freq(W_i, POS_j, ADJ_p) + \epsilon\right) / Freq(W_{all}, POS_j, ADJ_p)}{\left(Freq(W_i, POS_j, ADJ_n) + \epsilon\right) / Freq(W_{all}, POS_j, ADJ_n)} \qquad (1)$$

where Freq(W_all, POS_j, ADJ_p) is the collocation frequency of all words of part of speech POS_j with ADJ_p, ε is a smoothing constant, and Freq(W_all, POS_j, ADJ_n) is similarly defined for ADJ_n. Brill's tagger was used in [7] to obtain part-of-speech information. In this study, the datasets for training the classifier were downloaded from CrowdFlower.com. The site hosts many datasets in areas such as airline sentiment, hotel reviews, and product reviews. The dataset contains tweets already classified as positive, negative, or neutral, based on a cloud of words that indicates whether a new tweet should be classified as positive, negative, or neutral.

Training and Test Datasets
The training dataset contained tagged data from CrowdFlower.com, in which the tweets were already labelled by sentiment. The dataframe df (which contains the data loaded from airline_sentiment_main.csv) was split into training (80%) and test (20%) datasets. df_train holds the training data, while df_test holds the test data. To begin training, the tweets with positive sentiments were copied into the dataframe df_pos. The same was done for negative (df_neg) and neutral (df_neu) sentiments. The dataframe df_train was then passed to the Naïve Bayes classifier (TweetNBClassifier) for classification.
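The 80/20 split and the grouping by label can be illustrated with a short Python sketch. This is not the study's code: plain lists of dictionaries stand in for the dataframes, the rows are made up, and the names split_rows and train_by_label are hypothetical.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and split it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labelled tweets standing in for airline_sentiment_main.csv.
labels = ["positive", "negative", "neutral", "negative", "negative"] * 20
rows = [{"text": f"tweet {i}", "sentiment": label} for i, label in enumerate(labels)]

train_rows, test_rows = split_rows(rows)

# Group the training rows by label, mirroring df_pos, df_neg and df_neu.
train_by_label = {}
for row in train_rows:
    train_by_label.setdefault(row["sentiment"], []).append(row)

print(len(train_rows), len(test_rows))  # 80 20
```

Shuffling before the cut matters when the source file is ordered by date or by label, since an unshuffled split could leave one class under-represented in the test set.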

Naïve Bayes Classifier
The Bayes classification algorithm for text mining is based on Bayes' rule, which defines conditional probability in a way that allows the condition to be reversed conveniently. Given events X and Y, the conditional probability P(X|Y) of event X, given that event Y has occurred, is defined in [7] as equation (2):

$$P(X|Y) = \frac{P(X \cap Y)}{P(Y)} \qquad (2)$$

where P(X|Y) denotes the probability of X given that Y has occurred, P(X∩Y) is the probability of the intersection of events X and Y, and P(Y) is the probability of event Y occurring.
By the multiplication rule,

$$P(X \cap Y) = P(X|Y) \times P(Y) \qquad (3)$$

$$P(Y \cap X) = P(Y|X) \times P(X) \qquad (4)$$

Since P(X∩Y) = P(Y∩X), Bayes' law (also called Bayes' rule or Bayes' theorem) relates the two conditional probabilities. Hence, P(X|Y) can be computed more conveniently by equation (5) [13, 7]:

$$P(X|Y) = \frac{P(Y|X) \times P(X)}{P(Y)} \qquad (5)$$

The rule can be restated in terms of the probability of a document occurring given that it has been predetermined to be positive or negative, since we have examples of positive and negative sentiments in our data set. Thus, we have

$$P(Sentiment|Sentence) = \frac{P(Sentence|Sentiment) \times P(Sentiment)}{P(Sentence)} \qquad (6)$$

We computed P(Sentence|Sentiment) as a product of P(token|Sentiment) as in [7], and estimated P(token|Sentiment) as

$$P(token|Sentiment) = \frac{Count(\text{this token} \in \text{class}) + 1}{Count(\text{all tokens} \in \text{class}) + Count(\text{all tokens})} \qquad (7)$$

The added one and the count of all tokens are referred to as add-one or Laplace smoothing. The Naïve Bayes classifier chooses the most probable classification Vnb for a given set of attribute values a1, a2, ..., an [13,15].
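The add-one estimate and the choice of the most probable class can be sketched in a few lines of Python. This is a minimal illustration, not the TweetNBClassifier used in the study: the class name TinyNaiveBayes and the four training sentences are hypothetical, "count of all tokens" is read here as the number of distinct tokens in the vocabulary (an assumption), and P(Sentence) is omitted because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, sentences, labels):
        self.token_counts = defaultdict(Counter)  # class -> token -> count
        self.class_counts = Counter(labels)       # class -> number of sentences
        self.vocabulary = set()
        for sentence, label in zip(sentences, labels):
            tokens = sentence.lower().split()
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_posterior(self, sentence, label):
        """log P(Sentiment) + sum of log P(token|Sentiment); P(Sentence) dropped."""
        total_sentences = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_sentences)
        # Assumption: the smoothing term is the vocabulary size.
        denominator = sum(self.token_counts[label].values()) + len(self.vocabulary)
        for token in sentence.lower().split():
            score += math.log((self.token_counts[label][token] + 1) / denominator)
        return score

    def predict(self, sentence):
        # Pick the class with the highest (log) posterior score.
        return max(self.class_counts, key=lambda lab: self.log_posterior(sentence, lab))


sentences = [
    "terrible service never again",
    "awful rude staff",
    "great crew friendly service",
    "lovely smooth flight",
]
labels = ["negative", "negative", "positive", "positive"]
model = TinyNaiveBayes().fit(sentences, labels)
print(model.predict("rude terrible staff"))  # negative
print(model.predict("friendly crew"))        # positive
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long sentences, and the added one keeps a token unseen in a class from forcing the whole product to zero.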

Measure of Accuracy
While scanning through the list of sentiment features, if the predicted value of a feature is equal to the actual (target) value, the value 'correct' is appended to a list; otherwise, the value 'incorrect' is appended. The classification accuracy was then computed by equation (8):

$$Accuracy = \frac{Count(\text{correct})}{Count(\text{correct}) + Count(\text{incorrect})} \qquad (8)$$
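The correct/incorrect tally described above can be sketched as follows. The function name classification_accuracy and the sample labels are hypothetical; this is a minimal sketch and not the study's own score method.

```python
def classification_accuracy(predicted, actual):
    """Fraction of predictions that match their target label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one labelled example is required")
    # Tag each prediction, then count the share tagged 'correct'.
    outcomes = ["correct" if p == a else "incorrect" for p, a in zip(predicted, actual)]
    return outcomes.count("correct") / len(outcomes)


predicted = ["negative", "negative", "positive", "negative"]
actual = ["negative", "positive", "positive", "negative"]
print(classification_accuracy(predicted, actual))  # 0.75
```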

Results and Discussion
The numbers of tweets with positive and negative sentiments are shown in Figure 2. The sentiments followed a regular pattern: negative sentiments were consistently higher than positive sentiments on each day. This suggests that the majority of the tweets opposed the action of the airline operatives against the passenger.
A sample classification of ten tweets about the airline is shown in Table 1. The tweets have been classified as expressing either positive or negative sentiment. From the classification, a count of moods, which refers to the count of both negative and positive opinions, was computed using the code shown in Figure 4. The code for the mood distribution is shown in Figure 5, and the resulting graphical distribution of the mood follows in Figure 6.
Figure 7 shows the code segment that defines the method, called "score", for computing the classification accuracy using equation (8). To measure the accuracy of the classification on the test data sets, the code segment shown in Figure 8 was used to call the method score. An accuracy of 0.9 was recorded with a training dataset of 11,000 tweets.
To measure the impact of training data size on the classification accuracy, experiments were run with various training dataset sizes, and a subset of the results is shown in Table 2. Figure 9 shows a direct linear relationship between training size and classification accuracy; that is, the accuracy increased in direct proportion to the training size. It further revealed, however, that at a particular point overfitting arose as we continued to increase the training size. This finding agrees with what has been reported for a number of machine learning techniques in the literature [16,17]. Therefore, some experimentation with various data sizes may be required to determine the optimal training size.
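An experiment of this shape, training on progressively larger subsets and scoring each model on the same held-out test set, can be sketched as below. Everything here is an assumption for illustration: the tweets are synthetic, the word lists and function names are invented, and the printed figures do not reproduce Table 2 or Figure 9.

```python
import math
import random
from collections import Counter

POSITIVE_WORDS = ["great", "friendly", "smooth", "thanks", "comfortable"]
NEGATIVE_WORDS = ["awful", "rude", "delayed", "boycott", "dragged"]
NEUTRAL_WORDS = ["flight", "airline", "seat", "gate", "crew", "today"]


def make_tweet(label, rng):
    """Build a synthetic labelled tweet; purely a stand-in for real data."""
    own = POSITIVE_WORDS if label == "positive" else NEGATIVE_WORDS
    other = NEGATIVE_WORDS if label == "positive" else POSITIVE_WORDS
    words = rng.choices(own, k=2) + rng.choices(NEUTRAL_WORDS, k=3)
    if rng.random() < 0.3:  # noise: occasionally mix in an opposing word
        words.append(rng.choice(other))
    return " ".join(words), label


def train(examples):
    """Count tokens per class and examples per class."""
    counts = {"positive": Counter(), "negative": Counter()}
    priors = Counter()
    for text, label in examples:
        counts[label].update(text.split())
        priors[label] += 1
    vocab = set(counts["positive"]) | set(counts["negative"])
    return counts, priors, vocab


def predict(model, text):
    counts, priors, vocab = model
    best_label, best_score = None, -math.inf
    for label in counts:
        total = sum(counts[label].values()) + len(vocab)
        # Smooth the prior too, so a class unseen in a tiny subset avoids log(0).
        score = math.log((priors[label] + 1) / (sum(priors.values()) + len(counts)))
        score += sum(math.log((counts[label][t] + 1) / total) for t in text.split())
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy_by_training_size(train_pool, test_set, sizes):
    """Train on the first `size` examples and score on a fixed test set."""
    results = {}
    for size in sizes:
        model = train(train_pool[:size])
        hits = sum(predict(model, text) == label for text, label in test_set)
        results[size] = hits / len(test_set)
    return results


rng = random.Random(0)
data = [make_tweet(rng.choice(["positive", "negative"]), rng) for _ in range(600)]
train_pool, test_set = data[:500], data[500:]
results = accuracy_by_training_size(train_pool, test_set, [5, 20, 100, 500])
for size, acc in results.items():
    print(f"training size {size:>4}: accuracy {acc:.3f}")
```

Keeping the test set fixed while only the training subset grows means that any change in accuracy can be attributed to training size alone, which is the comparison the experiment relies on.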

Conclusion
The study explored the Naïve Bayes classifier for analyzing sentiment about an airline's brand image during a public relations crisis in which a customer was dragged off an airplane. The findings revealed 62.53% negative and 37.47% positive perceptions of the image of the airline. The negative perceptions in customer feedback on the case study could lead to a negative brand image and, subsequently, to a loss of patronage for the organization. Furthermore, measuring the classification accuracy with a number of training sizes showed a direct linear relationship between training size and accuracy. It was also observed that a continuous increase in the training size could lead to overfitting. These findings are consistent with the behaviour of machine learning techniques reported in [16,17]. Future work will consider the development of mechanisms for determining the optimal training size.