Evaluation of Disadvantaged Regions in East Java Based on the 33 Indicators of the Ministry of Villages, Development of Disadvantaged Regions, and Transmigration Using the Ensemble ROCK (Robust Clustering Using Link) Method

ARTICLE INFO
ASTESJ ISSN: 2415-6698
Article history: Received: 01 July, 2020; Accepted: 14 August, 2020; Online: 10 September, 2020

ABSTRACT
East Java is a large province in Indonesia, and its city of Surabaya is the second largest metropolitan city after Jakarta. Various problems of development inequality caused East Java to be listed as a province with disadvantaged regions in 2015. Disadvantaged regions are determined every 5 years using 6 criteria and 33 indicators set by the Ministry of Villages, Development of Disadvantaged Regions, and Transmigration. However, among the studies that have been conducted on the determination of disadvantaged regions, none applies all 33 indicators. This study therefore evaluates the determination of disadvantaged regions using all 33 indicators set by the Ministry of Villages, Development of Disadvantaged Regions, and Transmigration. The criteria data are the results of the 2014 and 2018 surveys and consist of both numerical and categorical data. The method used is ensemble Robust Clustering Using Link (ROCK), a clustering method that can accommodate mixed categorical and numerical data and that uses the concept of links to measure the similarity or closeness between a pair of data points. The best cluster result for evaluating the determination of disadvantaged regions in 2020 consists of 4 clusters, with the smallest ratio of Sw to Sb of 0.3873984 and an optimum threshold value of 0.04. The clustering places Trenggalek, Bondowoso, Situbondo, Probolinggo, Tuban, Pamekasan, Sumenep, Bangkalan, and Sampang as disadvantaged regions in East Java.


Introduction
Based on Presidential Regulation of the Republic of Indonesia Number 131 of 2015 concerning the Determination of Disadvantaged Regions in 2015-2019, East Java is one of the 21 provinces in Indonesia that are lagging. East Java is also the only province on the island of Java that has disadvantaged districts or cities. Therefore, a study is needed to evaluate the problems of development inequality that have left some regions in East Java behind.
Government Regulation Number 78 of 2014, Article 6 Paragraph 1, states that disadvantaged regions are determined every 5 years based on criteria and indicators established by the Ministry of Villages, Development of Disadvantaged Regions, and Transmigration. The most recent determination was made in 2015, as listed in Presidential Regulation Number 131 of 2015, and the regions will be re-determined in 2020. In this study, the criteria data are survey data from 2014 and 2018, obtained from the Central Statistics Agency in the form of village potential data, people's welfare statistics, and the profile of each province over a number of years. The 2014 data are compared with the government's 2015 determination of disadvantaged regions, while the 2018 data are used to predict the determination of disadvantaged regions in 2020. The results of this study are expected to give a relevant picture of which regions have the potential to be left behind in the future, so that district/city governments can adopt policies adjusted to the characteristics of each region in order to lift the region out of disadvantaged status.
In practice, the government determines disadvantaged regions based on Presidential Regulation Number 131 of 2015, Article 6 Paragraph 2, using composite aspects and range values. Statistically, these two methods are suitable only for numerical data. In reality, however, the indicators that determine the status of disadvantaged regions are not only numerical; several are categorical. Thus, an analysis based on composite aspects and range values cannot accommodate all 6 criteria and 33 indicators. To overcome this, a method is needed that can accommodate all types of data, both categorical and numerical. A statistical method that can be used for clustering mixed data is the ensemble method [1]- [3]. In this study, the ensemble method used is Robust Clustering Using Link (ROCK). The ensemble ROCK method is a clustering method that uses the concept of links to measure the similarity or closeness between a pair of data points [4], [5]. Its advantage is that it has better accuracy than the agglomerative hierarchical method, with good scalability [6].
The ensemble ROCK method has proven effective for clustering mixed data in various cases [7]. For example, Shashi Sharma and Ram Lal Yadav showed that the ensemble ROCK method performs better than the K-Means method in cluster analysis [8]. Similarly, Dwi Harid Setiadi applied the ensemble ROCK method to mapping disadvantaged regions and found that it performs better than the SWFM ensemble method [9]. Alvionita compared the SWFM and ROCK methods for grouping orange accessions and found that the ROCK method had better grouping performance than the SWFM method [10]. Therefore, this study evaluates disadvantaged regions in East Java based on the indicators of the Ministry of Villages, Development of Disadvantaged Regions, and Transmigration using the ensemble ROCK method.

Related Works
In the last few years, several evaluations of disadvantaged regions have been carried out. Anik Djuridah evaluated the status of disadvantaged regions using discriminant analysis [11]. That research determined only the number of indicators that influence the disadvantaged status of an area, without establishing which regions belong to the group of disadvantaged regions and which do not. Similarly, Satria, Herman, and Fajar analyzed the development of disadvantaged regions in East Java using Location Quotient and Esteban-Marquillas Shift Share analysis [12]. That study used only the GRDP (Gross Regional Domestic Product) variable.
Furthermore, Dwi Hariadi Setiadi, in his final project, mapped disadvantaged districts/cities using the ensemble Similarity Weight and Filter Method (SWFM) and Robust Clustering Using Link (ROCK) [9]. That study used only 5 criteria and 13 indicators. The five criteria are infrastructure, regional characteristics, economy, human resources (HR), and regional financial capacity; the accessibility criterion was not included. However, Government Regulation No. 78 of 2014, as explained in Presidential Regulation No. 131 of 2015, Article 2 Paragraphs 1, 2, and 3, states that the determination of disadvantaged regions by the Ministry of Villages, Development of Disadvantaged Regions, and Transmigration uses six criteria (community economy, human resources, facilities and infrastructure, regional financial capacity, accessibility, and regional characteristics) consisting of 33 indicators.
Based on the related studies mentioned above, no research has evaluated disadvantaged regions using all of the criteria and indicators that have been determined. This study therefore evaluates disadvantaged regions based on all the criteria and indicators set by the Indonesian Ministry of Villages, Development of Disadvantaged Regions, and Transmigration.

Factor Analysis
Factor analysis is a step to reduce the research variables, both numerical and categorical, using the Principal Component Analysis (PCA) method. The technique finds the relationships among variables that were originally treated separately and groups the strongly correlated ones into a set of new variables that are fewer in number than the original variables [13]- [15]. The first step is to test whether the variables are adequate for processing, using the Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy and the Bartlett test. If the KMO value is more than 0.5, the variable adequacy requirement is fulfilled and the data are adequate to be factored. The hypotheses for the Bartlett test are as follows:
$H_0$: The partial correlation formed from the data is not enough to be factored
$H_1$: The partial correlation formed from the data is enough to be factored
If the significance value (p-value) $< \alpha$ ($\alpha = 0.05$), then $H_0$ is rejected, and it can be concluded that the partial correlation formed from the data is sufficient to be factored [16], [17].
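For reference, the two adequacy statistics are commonly written as below. This is the standard textbook form rather than notation taken from this study, and every symbol is an assumption introduced here: $r_{ij}$ is the simple correlation between variables $i$ and $j$, $a_{ij}$ is their partial correlation, $R$ is the correlation matrix, $n$ is the number of observations, and $p$ is the number of variables.

```latex
\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}
                    {\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}},
\qquad
\chi^{2} = -\left( n - 1 - \frac{2p + 5}{6} \right) \ln \lvert R \rvert,
\qquad
df = \frac{p(p-1)}{2}
```

A KMO value close to 1 means the partial correlations are small relative to the simple correlations, which is why values above 0.5 are read as adequate for factoring.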

K-Means
The K-Means clustering method partitions data into $K$ groups, where $K$ is the number of groups determined by the researcher. In this research, the numerical data are clustered using K-Means. The K-Means algorithm is as follows [17], [18]:
a. Determine the desired number of clusters $K$.
b. Determine $K$ initial centroids at random.
c. Assign each observation object to the closest cluster center, where closeness is measured by the Euclidean distance:
$$d(x_i, x_l) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{lj})^2} \qquad (1)$$
where $d(x_i, x_l)$ is the distance between objects $i$ and $l$, $x_{ij}$ is the value of object $i$ on variable $j$, $x_{lj}$ is the value of object $l$ on variable $j$, and $p$ is the number of variables.
d. Determine the average value of each cluster:
$$\bar{x}_{kj} = \frac{1}{n_k} \sum_{i \in C_k} x_{ij}$$
where $\bar{x}_{kj}$ is the average value of cluster $k$ on variable $j$ and $n_k$ is the amount of data in cluster $k$.
e. Reassign each object to the new centroid at the closest distance, using the Euclidean distance in (1).
f. If the cluster memberships still change, return to step c.
The optimum grouping is validated using the R-Square and Pseudo F-statistic values. The optimum number of groups is indicated by the highest R-Square and Pseudo F-statistic values [19]. The Pseudo F-statistic is calculated by:
$$\text{Pseudo } F = \frac{R^2 / (K - 1)}{(1 - R^2) / (n - K)}$$
where $n$ is the number of observations and the value of $R^2$ is
$$R^2 = \frac{SST - SSW}{SST}$$
The R-Square calculation involves several measures of diversity: the total diversity (SST), the diversity within groups (SSW), and the diversity between groups (SSB) [10]. These can be calculated by:
$$SST = \sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{j})^2, \qquad SSW = \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2, \qquad SSB = SST - SSW$$
where $\bar{x}_{j}$ is the overall average of variable $j$.
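The validation step can be illustrated with a minimal Python sketch. It assumes the usual definitions $R^2 = SSB/SST$ and Pseudo $F = [R^2/(K-1)] / [(1-R^2)/(n-K)]$; the function names and the four toy observations are hypothetical and are not data from this study.

```python
from math import fsum


def sums_of_squares(data, labels):
    """Return (SST, SSW, SSB) for numeric rows grouped by cluster labels."""
    n, p = len(data), len(data[0])
    grand_mean = [fsum(row[j] for row in data) / n for j in range(p)]
    sst = fsum((row[j] - grand_mean[j]) ** 2 for row in data for j in range(p))
    ssw = 0.0
    for k in set(labels):
        members = [row for row, lab in zip(data, labels) if lab == k]
        centroid = [fsum(r[j] for r in members) / len(members) for j in range(p)]
        ssw += fsum((r[j] - centroid[j]) ** 2 for r in members for j in range(p))
    return sst, ssw, sst - ssw


def r_square_and_pseudo_f(data, labels):
    """R-Square and Pseudo F-statistic of one clustering result."""
    n, k = len(data), len(set(labels))
    sst, ssw, ssb = sums_of_squares(data, labels)
    r2 = ssb / sst
    pseudo_f = (r2 / (k - 1)) / ((1 - r2) / (n - k))
    return r2, pseudo_f


# Hypothetical toy data: two well-separated groups of two observations each.
toy_data = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
toy_labels = [0, 0, 1, 1]
r2, pseudo_f = r_square_and_pseudo_f(toy_data, toy_labels)
```

Running the function once per candidate number of clusters and keeping the candidate with the largest values mirrors the selection rule described above. The sketch does not guard against a perfect fit ($R^2 = 1$), where the Pseudo F-statistic is undefined.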

K-Modes
K-Modes is a development of the K-Means method designed specifically for categorical data [20], [18]. It uses an efficient frequency-based algorithm to find the modes [21], [18].
K-Modes modifies the K-Means method as follows:
a. The distance between two data points $X$ and $Y$ is the number of features on which $X$ and $Y$ differ. The dissimilarity between objects $X$ and $Y$ is given by:
$$d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j)$$
where $m$ is the number of features and $\delta(x_j, y_j)$ is the matching value, defined by:
$$\delta(x_j, y_j) = \begin{cases} 0, & x_j = y_j \\ 1, & x_j \neq y_j \end{cases}$$
b. The mean value (average) is replaced by the mode value.
c. The mode values are found from the data frequencies. The centroid point is obtained from the mode of each feature.
The most optimum grouping of categorical data is validated using the value of $r$, given by:
$$r = \frac{1}{n} \sum_{h=1}^{K} a_h$$
where $n$ is the number of observations and $a_h$ is the highest number of objects (dominance) in group $h$, with $h = 1, 2, \ldots, K$.
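The two building blocks of K-Modes, the simple matching distance and the mode-based centroid, can be sketched in a few lines of Python. The function names and the example records are hypothetical; ties between equally frequent categories are resolved by first occurrence here, which is an assumption and not a rule stated in the source.

```python
from collections import Counter


def matching_distance(x, y):
    """Number of features on which two categorical records differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def cluster_mode(records):
    """Centroid of a cluster: the most frequent category of each feature."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def nearest_mode(record, modes):
    """Index of the mode closest to the record (first one wins on ties)."""
    return min(range(len(modes)), key=lambda h: matching_distance(record, modes[h]))


# Hypothetical records with two categorical features each.
cluster = [("a", "x"), ("a", "y"), ("b", "y")]
mode = cluster_mode(cluster)
```

Alternating `nearest_mode` (assignment) and `cluster_mode` (update) until the assignments stop changing gives the same loop structure as K-Means, with modes in place of means.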

Ensemble ROCK
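The ensemble ROCK method uses the concept of a link to measure the similarity and closeness between a pair of data points [4], [22], [23]. The steps for clustering data with the ensemble ROCK method are as follows:
a. Calculate $sim(X_i, X_j)$ as a measurement of similarity:
$$sim(X_i, X_j) = \frac{|X_i \cap X_j|}{|X_i \cup X_j|}$$
where $X_i$ is the $i$-th observation group and $X_j$ is the $j$-th observation group.
b. Determine the neighbors and calculate the link value. Two observations are neighbors if $sim(X_i, X_j) \geq \theta$, and the link value is the number of neighbors they have in common:
$$link(X_i, X_j) = |N(X_i) \cap N(X_j)|, \qquad N(X_i) = \{X_l : sim(X_i, X_l) \geq \theta\}$$
c. Calculate the goodness measure $g(C_i, C_j)$:
$$g(C_i, C_j) = \frac{link[C_i, C_j]}{(n_i + n_j)^{1 + 2f(\theta)} - n_i^{1 + 2f(\theta)} - n_j^{1 + 2f(\theta)}}, \qquad f(\theta) = \frac{1 - \theta}{1 + \theta}$$
where $link[C_i, C_j]$ is the number of links between clusters $C_i$ and $C_j$, and $n_i$ and $n_j$ are the numbers of members of the two clusters.
The validity of the ensemble ROCK method can be assessed from the ratio of $S_W$ (sum within) to $S_B$ (sum between), $S_W / S_B$. The smaller the ratio, the better the grouping performance [18], [24]. The values of $S_W$ and $S_B$ are:
$$S_W = \left[\frac{SSW}{n - K}\right]^{1/2}, \qquad S_B = \left[\frac{SSB}{K - 1}\right]^{1/2}$$
SSW and SSB for categorical data can be formulated by:
$$SST = \frac{n}{2} - \frac{1}{2n} \sum_{q=1}^{Q} n_{q}^2, \qquad SSW = \sum_{c=1}^{K} \left(\frac{n_c}{2} - \frac{1}{2 n_c} \sum_{q=1}^{Q} n_{qc}^2\right), \qquad SSB = SST - SSW$$
where $n$ is the number of observations, $K$ is the number of clusters, $n_c$ is the number of observations in cluster $c$, $n_q$ is the number of observations in category $q$, and $n_{qc}$ is the number of observations in category $q$ within cluster $c$.

This study uses two types of variables, namely alternative variables and criterion variables. The alternative variables are all districts/cities in East Java Province, consisting of 29 districts and 9 cities, as detailed in Table 1. The criterion variables consist of 6 criteria comprising 33 indicators, and they include both numerical and categorical data. The 27 numerical indicator variables for the determination of disadvantaged regions are shown in Table 2, and the 6 categorical indicator variables are shown in Table 3.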
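The similarity, neighbor, link, and goodness steps of ROCK can be sketched in Python as follows. This is a minimal illustration that assumes the standard ROCK definitions (Jaccard similarity, links as counts of common neighbors, and $f(\theta) = (1-\theta)/(1+\theta)$); it also assumes that a record is not counted as its own neighbor, which varies between implementations. All names and the four example records are hypothetical.

```python
def jaccard(a, b):
    """Similarity of two records given as sets of (attribute, value) items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def neighbor_sets(records, theta):
    """For each record, the indices of the other records with sim >= theta.

    Assumption: a record is not counted as its own neighbor.
    """
    n = len(records)
    return [
        {j for j in range(n) if j != i and jaccard(records[i], records[j]) >= theta}
        for i in range(n)
    ]


def link(neighbors, i, j):
    """Number of neighbors shared by records i and j."""
    return len(neighbors[i] & neighbors[j])


def goodness(cross_links, n_i, n_j, theta):
    """Goodness measure for merging two clusters of sizes n_i and n_j."""
    exponent = 1 + 2 * (1 - theta) / (1 + theta)
    expected = (n_i + n_j) ** exponent - n_i ** exponent - n_j ** exponent
    return cross_links / expected


# Hypothetical records: one numeric-cluster label and one categorical-cluster label.
records = [
    {("numeric", 1), ("categorical", 1)},
    {("numeric", 1), ("categorical", 1)},
    {("numeric", 1), ("categorical", 2)},
    {("numeric", 2), ("categorical", 3)},
]
neighbors = neighbor_sets(records, theta=0.3)
```

In a full ROCK run, the pair of clusters with the largest goodness value is merged repeatedly until the requested number of clusters remains. Note that `goodness` is undefined at $\theta = 1$, where the denominator becomes zero.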

Data analysis
In this study, data analysis using the ensemble ROCK method was carried out in the following steps:
1. Separate the categorical data and the numerical data.
2. Reduce the research variables with factor analysis, for both the numerical and the categorical data.
3. Cluster the numerical data using the K-Means method.
4. Validate the optimum grouping using the Pseudo F-statistic and $R^2$.
5. Cluster the categorical data using the K-Modes method.
6. Validate the optimum grouping using the value of r (highest accuracy).
7. Cluster the combined (numerical and categorical) results using the ROCK method.
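One common way to carry out the final step, and the one assumed in this sketch, is to treat the K-Means label and the K-Modes label of each region as two categorical attributes and to pass the resulting records to ROCK. The source does not spell out this encoding, so the function name, the attribute names, and the labels below are all hypothetical.

```python
def ensemble_records(kmeans_labels, kmodes_labels):
    """Pair the two label vectors into one categorical record per region."""
    if len(kmeans_labels) != len(kmodes_labels):
        raise ValueError("both label lists must cover the same regions")
    return [
        frozenset({("kmeans", a), ("kmodes", b)})
        for a, b in zip(kmeans_labels, kmodes_labels)
    ]


# Hypothetical cluster labels for four regions (not data from this study).
combined = ensemble_records([1, 1, 2, 3], [2, 2, 2, 4])
```

Tagging each label with its source keeps a K-Means label 2 distinct from a K-Modes label 2, so the set-based similarity only counts agreement within the same clustering.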

Determination of Disadvantaged Regions in 2015
The first step of this research is to cluster the 2014 survey data, which were used for the determination of disadvantaged regions in 2015. The 2014 data processing is not elaborated in this section; a detailed discussion is provided for the 2018 data processing in the next sub-section.
The regional clusters in East Java obtained from the 2014 data using the ROCK method are shown in Table 4.
Based on Table 4, the cluster of disadvantaged regions consists of Trenggalek, Jember, Banyuwangi, Bondowoso, Situbondo, Probolinggo, Bangkalan, Sampang, Pamekasan, Sumenep, and Probolinggo City. This is in line with the determination of disadvantaged regions made by the Government: based on Presidential Regulation No. 131/2015, the districts/cities in East Java Province included in disadvantaged regions are Bangkalan, Sampang, Bondowoso, and Situbondo, all of which fall in this cluster.

Predicting the Designation of Disadvantaged Regions in 2020
After the determination of disadvantaged regions in 2015, the next determination will be carried out in 2020. Based on data compiled from the 2018 survey, it can be predicted which regions are potentially designated as disadvantaged regions in East Java. To analyze this case, we use the ensemble ROCK method. This section provides a detailed discussion of the data processing.
The first step in the ROCK method is factor analysis. Factor analysis involves several assumption tests, including the adequacy of the correlation between variables using the KMO test and the dependency between variables using the Bartlett test. The KMO test result (0.524) > 0.50 and the Bartlett test gives sig. (0.000) < (α = 0.05). These show that the research data fulfill the correlation requirement and are sufficient to be factored. The results of the factor analysis of the numerical data were obtained from the factoring results and the largest factor loadings, and are shown in Table 5.
The results of the factor analysis are then clustered using the K-Means method. In this case, cluster analysis is carried out with 2, 3, and 4 clusters. The optimal number of clusters is the one with the largest Pseudo F-statistic and R-Square values. Table 6 gives the Pseudo F-statistic and R-Square values for each number of clusters. Based on Table 6, the optimal number of clusters is 4. The results of clustering the numerical data for the districts/cities of East Java are shown in Table 7.
Next, factor analysis is carried out for the categorical data. The KMO test result (0.683) > 0.50 and the Bartlett test gives sig. (0.000) < (α = 0.05). These show that the research data fulfill the correlation requirement and are sufficient to be factored. The results of the factor analysis for the categorical data are shown in Table 8. The results of the factor analysis are then clustered using the K-Modes method. In this case, cluster analysis is conducted with 2, 3, and 4 clusters. The optimal number of clusters is the one with the highest accuracy value. Table 9 gives the accuracy value r for each number of clusters.
Based on Table 9, the optimal number of clusters is 4. The results of clustering the categorical data for the districts/cities of East Java are shown in Table 10.
After the clusters of categorical and numerical data are obtained, the next step is to cluster the mixed data using ensemble ROCK. In this study, the mixed data are clustered into 3 and 4 clusters. Ten threshold values (θ) are tested, including 0.1, 0.2, 0.3, 0.01, 0.02, 0.03, 0.04, and 0.05. Among these thresholds, the one that produces the optimum clusters is chosen by finding the smallest ratio of $S_W$ to $S_B$. The ratios of $S_W$ to $S_B$ for each threshold with 3 and 4 clusters are given in Table 11. Based on Table 11, the smallest ratio of $S_W$ to $S_B$ is obtained with the 0.02 threshold and 4 clusters. The results of the mixed data cluster analysis with 4 clusters and a threshold value of 0.02 are shown in Table 12.
Based on Figure 1, the cluster with the highest percentage of negative composite level, and therefore the highest backwardness, is cluster 3, followed by cluster 1, cluster 4, and cluster 2. The clusters are accordingly grouped, in that order, into disadvantaged, developing, independent, and developed regions. Table 13 lists the clusters of disadvantaged, developing, independent, and developed regions.
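The ratio used to compare thresholds can be sketched in Python. The sketch assumes the CATANOVA-style sums of squares for categorical data, with $S_W = \sqrt{SSW/(n-K)}$ and $S_B = \sqrt{SSB/(K-1)}$, and it sums SSW and SSB over the categorical variables; these definitions, the function names, and the six example values are assumptions for illustration and are not taken from the tables of this study.

```python
from collections import Counter
from math import sqrt


def categorical_ss(values, labels):
    """CATANOVA-style (SST, SSW, SSB) for one categorical variable."""
    n = len(values)
    sst = n / 2 - sum(c * c for c in Counter(values).values()) / (2 * n)
    ssw = 0.0
    for g in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == g]
        n_g = len(group)
        ssw += n_g / 2 - sum(c * c for c in Counter(group).values()) / (2 * n_g)
    return sst, ssw, sst - ssw


def sw_sb_ratio(columns, labels):
    """Ratio S_W / S_B over a list of categorical columns (smaller is better)."""
    n, k = len(labels), len(set(labels))
    parts = [categorical_ss(column, labels) for column in columns]
    ssw = sum(part[1] for part in parts)
    ssb = sum(part[2] for part in parts)
    return sqrt(ssw / (n - k)) / sqrt(ssb / (k - 1))


# Hypothetical example: one categorical variable, six regions, two clusters.
toy_values = ["a", "a", "a", "b", "b", "b"]
toy_clusters = [0, 0, 0, 0, 1, 1]
ratio = sw_sb_ratio([toy_values], toy_clusters)
```

Evaluating `sw_sb_ratio` for the cluster labels produced at each threshold and keeping the threshold with the smallest value reproduces the selection rule described above. The ratio is undefined when SSB is zero, that is, when the clusters do not differ at all on the variables.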

Discussion
Based on Table 13, the regions designated as disadvantaged in East Java are Trenggalek, Bondowoso, Situbondo, Probolinggo, Tuban, Bangkalan, Pamekasan, Sumenep, and Sampang. This result is reasonable, and facts compiled from online mass media support it.
In the prediction of the 2020 determination of disadvantaged regions, Tuban enters the group of disadvantaged regions. This is because the percentage of poor people in Tuban increased from the previous year. According to suarabanyuurip.com (accessed 30-May-2019), the poverty level of Tuban ranks fifth among all districts/cities in East Java [25]. In addition, the percentage of households using clean water in Tuban District is very low, similar to Trenggalek and Situbondo Districts. Here is a graph of 5 districts/cities that are pockets of poverty in East Java.
On the other hand, based on Presidential Regulation of the Republic of Indonesia Number 63 of 2020 concerning the Determination of Disadvantaged Regions in 2020-2024, East Java is no longer among the provinces with disadvantaged regions. This is possibly because local governments have conducted evaluations and made various efforts to lift their regions out of disadvantaged status. Nonetheless, the results of this research can provide an overview for district/city governments so that they can anticipate problems early and make policies adjusted to the characteristics of each region, so that their regions do not become disadvantaged again.

Conclusion
Clustering with the ensemble ROCK method predicts that the disadvantaged regions in 2020 consist of Trenggalek, Bondowoso, Situbondo, Probolinggo, Tuban, Pamekasan, Sumenep, Bangkalan, and Sampang. The best cluster result for evaluating the determination of disadvantaged regions in 2020 consists of 4 clusters, with the smallest ratio of Sw to Sb of 0.3873984 and an optimum threshold value of 0.04.
The characteristics of the regions designated as disadvantaged that need to be improved in order to lift them out of disadvantaged status include the level of fiscal decentralization, the percentage of poor population, life expectancy, the average length of schooling, the number of health infrastructure facilities, the percentage of households using electricity, the percentage of households using telephones, the low percentage of households using clean water, access to the nearest service facility, and the low percentage of protected forest area.