Comparison of Algorithms for Diabetes Classification with Consideration of Mutual Information and Importance Feature

— Diabetes is a prevalent disease caused by excessive sugar levels in the body. If left untreated, it can lead to severe consequences such as paralysis, tissue decay in parts of the body, and even death. Unfortunately, early detection of diabetes is difficult, and many cases go untreated until it is too late. However, the development of technology has opened up new possibilities for early detection and treatment of diabetes. One such approach is classification, a commonly used method in the field of Computer Science. Classification is used in various fields, including health, agriculture, and animal diseases, to draw conclusions from input data using cause-and-effect relationships. Many different learning concepts and methods can be used in classification, the Decision Tree being one of the most popular. This study compares several classification methods, namely Decision Tree, Random Forest, AdaBoost, and Stochastic Gradient Boost, with feature selection carried out using Mutual Information (MI) and Importance Feature (IF). The study aims to evaluate the effectiveness of these methods and the influence of feature selection on their performance. Based on the results, feature selection using Mutual Information and Importance Feature can improve classification accuracy in some methods, particularly Random Forest, AdaBoost, and Stochastic Gradient Boost, while the Decision Tree algorithm showed no improvement in accuracy after feature selection. The best overall classification accuracy was achieved by the Random Forest method with Mutual Information feature selection, whereas after Importance Feature selection the Stochastic Gradient Boost method gave the best accuracy. Overall, the results suggest that feature selection can be a useful technique for improving the performance of classification algorithms in diabetes prediction. The study suggests that future research could investigate other classification methods, such as Neural Networks or Deep Learning, and use optimization algorithms like the Genetic Algorithm or Particle Swarm Optimization to improve feature selection results.


INTRODUCTION
Diabetes is a disease that is often found in humans. It is the effect of excessive sugar levels in the body. Diabetes is often ignored, so its effects grow worse over time. The consequences of diabetes range from paralysis and tissue decay in parts of the body to death. Early treatment of diabetes is rare because the disease is not easy to detect other than through a blood test. The development of knowledge, especially in the field of technology, can help treat diabetes early. One way to achieve this is through early detection, commonly known in the field of technology as classification.
Classification is a method that is often used in the field of Computer Science. It is applied in various fields such as health, agriculture, and animal diseases. The concept of classification is to draw a conclusion from the input obtained using cause-and-effect relationships. Many methods and learning concepts are used in classification; one of the most frequently used examples is the Decision Tree.
Decision Trees are often used for classification. The decision tree adheres to the IF-THEN learning concept, where a conclusion is given once its prerequisites are met. Previous research conducted by Clim et al. stated that the Decision Tree method could be used for health applications such as heart attack or hypertension detection. That study also reported good accuracy and fast decisions.
However, the decision tree also has a weakness: the more features are used, the more difficult it is for the algorithm to form a decision. One algorithm built on the decision tree is Random Forest. Random Forest uses the same basic concept as a decision tree, but differs in that it makes a decision based on majority voting across many trees. Previous research conducted by Primajaya and Sari on rain prediction stated that Random Forest could produce an accuracy of 99.45% [4].
Another algorithm that uses the basic concept of a decision tree is AdaBoost. AdaBoost is an ensemble technique that uses an exponential loss function to improve the accuracy of the predictions made. Previous research conducted by Mahesh et al., which implemented AdaBoost for early detection of heart disease, stated that AdaBoost outperformed the Naïve Bayes method, the Decision Tree, and other ensemble methods [8].
A factor that lowers classification performance is using too many features. Too many features cause a bias that can lead to misclassification. A method to deal with this problem is feature selection. Feature selection eliminates features that can cause bias and that have no influence on the decision. One method that can perform feature selection is Mutual Information.
Mutual Information (MI) is a feature selection model that measures how much information a feature contributes to making a correct classification decision. With MI, features are sorted from the largest to the smallest value to see how much influence each has on the classification, and the feature with the least effect is then discarded. Research conducted by Peng et al. states that using MI for feature selection increased classification accuracy compared with no feature selection [2].
Another method for selecting features is the Importance Feature. Importance Feature (IF) is a feature selection method that ranks how important a feature is to the classification. The concept of IF is almost the same as that of MI, but it differs in how the ranking is determined and calculated. Previous research conducted by Jovic et al. stated that IF could be used for feature selection in classification, clustering, and regression, and could give better results [6].
Based on this background, this study compares several classification methods: Decision Tree, Random Forest, AdaBoost, and Stochastic Gradient Boost. Feature selection is also carried out using MI and IF. The tests in this study examine how well each classification method performs and how much each selection method improves that performance.

Decision Tree
Decision Tree classification is used to convert large collections of facts into a decision tree that represents the rules in a database language such as Structured Query Language (SQL), so that records in a particular category can be searched. The tree classification method is able to classify and show the relationship between attributes. It shows the factors that influence alternative decisions along with the expected final results if a decision is accepted. The advantage of decision trees is that they break complex decision-making processes down into simple ones, so the decision maker can more easily interpret the solution to the problem. The concept of a decision tree is to transform data into decisions represented by trees and rules [4].

Fig. 1. Structure of Decision Tree
Decision Tree uses a hierarchical structure consisting of root nodes, internal nodes, and leaf nodes, as illustrated in Figure 1. The root node is the node located at the top. An internal node is a branch node that has exactly one input and at least two outputs. A leaf node is a terminal node that has one input and no output. Several algorithms can be used to form a decision tree, such as ID3, CART, SPRINT, SLIQ, PUBLIC, CLS, Random Forest, Random Tree, ID3+, OC1, CLOUDS, and C4.5. The C4.5 algorithm is a development of the ID3 algorithm [1].
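As an illustration of how such a tree is trained and how its IF-THEN rules can be read off, the sketch below uses scikit-learn; the file name diabetes.csv, the "class" column name, and the hyperparameters are illustrative assumptions, not details taken from the study.

```python
# A minimal sketch, assuming scikit-learn and a Pima-style diabetes CSV
# with 8 feature columns plus a binary "class" label; names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("diabetes.csv")                 # assumed file name
X, y = df.drop(columns=["class"]), df["class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(criterion="entropy",  # entropy-based splitting
                              max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("Decision Tree accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(X.columns)))  # the IF-THEN rules
```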

Random Forest
The Multivariate Random Forest is basically the same as the Random Forest. The difference between these two methods lies in the number of decision results, where Random Forest only gives one result, while Multivariate Random Forest can give more than one result [1].
Generally, Random Forests have hundreds or even thousands of Decision Trees, each of which predicts an individual class. The purpose of the Random Forest is to form one representative decision from many decision trees. The majority vote of the entire set of trees is taken as the predicted class. The Random Forest structure can be seen in Figure 2 [5]. The stages of forming a Random Forest are as follows:
a. Do a random resampling of the same size as the training data using the bootstrap method.
b. Choose K attributes from a total of M attributes, where K < M, using the random subspace method; usually K is equal to the square root of M.
c. Form a decision tree using a bootstrap sample and pre-selected attributes.
d. Repeat steps a through c to form the desired number of trees. The number of trees in the Random Forest is determined based on the out-of-bag (OOB) error rate. A minimal sketch of these stages follows this list.
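The sketch below walks through steps a–d directly, assuming NumPy arrays X (features) and y (binary 0/1 labels); the tree count and the helper names random_forest_fit/random_forest_predict are illustrative assumptions. In practice, scikit-learn's RandomForestClassifier packages the same procedure.

```python
# A minimal sketch of steps a-d above; data, seed and tree count are
# illustrative, not the study's configuration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    k = max(1, int(np.sqrt(m)))                   # step b: K = sqrt(M)
    forest = []
    for _ in range(n_trees):                      # step d: repeat to size
        rows = rng.integers(0, n, size=n)         # step a: bootstrap sample
        cols = rng.choice(m, size=k, replace=False)   # step b: random subspace
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[np.ix_(rows, cols)], y[rows])  # step c: tree on the sample
        forest.append((tree, cols))
    return forest

def random_forest_predict(forest, X):
    # Majority vote across all trees decides the final class (binary case).
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```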

AdaBoost
AdaBoost, short for Adaptive Boosting, is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire. It can be used in conjunction with many other types of learning algorithms to improve performance. The outputs of the other learning algorithms ('weak learners') are combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of examples that were misclassified by previous classifiers. In some cases, it can be less prone to overfitting than other learning algorithms. The individual learners can be weak, but as long as each performs slightly better than random guessing, the final model can be proven to converge to a strong learner [8].
Each learning algorithm tends to be better suited to some types of problems than others, and usually has many different parameters and configurations to adjust before achieving optimal performance on a data set. AdaBoost (with a decision tree as a weak learner) is often cited as the best out-of-the-box classifier. When used with decision tree learning, the information collected at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree growth algorithm so that subsequent trees tend to focus on examples that are more difficult to classify [8].
The stages of the AdaBoost Classifier algorithm are as follows [8]:
a. Initially, AdaBoost selects a training subset at random.
b. It trains the machine learning model iteratively, selecting each training set based on the accuracy of the predictions from the previous round of training.
c. It gives higher weight to misclassified observations so that in the next iteration these observations get a higher probability of being classified correctly.
d. It also assigns a weight to the trained classifier in each iteration according to the classifier's accuracy; a more accurate classifier gets a higher weight.
e. This process is repeated until the complete training data fits without error or until the specified maximum number of estimators is reached.
f. To classify, a weighted 'vote' is taken across all the learners that were built, as in the sketch after this list.
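A minimal sketch of this procedure, assuming scikit-learn and reusing the X_train/X_test split from the decision tree sketch; the stump depth and estimator count are illustrative choices.

```python
# A minimal sketch; a depth-1 tree ("stump") is the weak learner, as is
# conventional for AdaBoost. On scikit-learn versions before 1.2 the
# first argument is named base_estimator instead of estimator.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner
    n_estimators=50,       # maximum number of boosting rounds
    learning_rate=1.0,
    random_state=42,
)
ada.fit(X_train, y_train)  # sample re-weighting happens internally
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```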

Stochastic Gradient Boost
Stochastic Gradient Boosting (SGB) is a modification of Gradient Boosting (GB) motivated by the Bootstrap Aggregating (bagging) method. Friedman, the inventor of SGB, proposed that at each iteration of the algorithm the base learner be fit on a subsample of the training data drawn at random without replacement, and observed a substantial increase in GB accuracy with this modification. Gradient Boosting itself is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, such as decision trees, and has shown great success in various applications. The SGB modification of training on subsamples can prevent overfitting, the situation where the training results match the training data too well but do not match the test data; it thus acts as a form of regularization and produces lower errors than the unmodified algorithm [3].
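In scikit-learn, Friedman's stochastic variant corresponds to setting subsample below 1.0 on the gradient boosting classifier. A minimal sketch, reusing the earlier train/test split; the hyperparameter values are illustrative.

```python
# A minimal sketch; subsample < 1.0 makes each tree fit a random fraction
# of the training rows drawn without replacement, which is the stochastic
# modification described above.
from sklearn.ensemble import GradientBoostingClassifier

sgb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    subsample=0.5,         # the stochastic part: half the rows per tree
    random_state=42,
)
sgb.fit(X_train, y_train)
print("SGB accuracy:", sgb.score(X_test, y_test))
```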

Mutual Information
Mutual information is developed from ideas in information theory, just like information entropy. Feature selection is the primary application of mutual information in the field of data mining, and its use in other disciplines is rarely investigated. Mutual information is used to identify the subset of features with the least amount of redundancy and to quantify the mutual dependence between attributes [2].
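For a discrete feature X and class label Y, mutual information can be written in the standard information-theoretic form (standard background, not an equation reproduced from the study):

I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}

so that I(X;Y) = 0 exactly when X and Y are independent, which corresponds to the "evenly distributed across all categories" case described next.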
The correlation of random variables can also be represented via mutual information. An attribute of the sample set has a weak link with the category if it is evenly distributed across all categories, which is indicated by a mutual information value of 0 [2].
Mutual information can also be incorporated into decision tree classification. Reported outcomes demonstrate that a mutual information-based decision tree is a superior classifier: its accuracy improves significantly, and the classifier is constructed faster than an ID3 classifier based on information entropy [2].
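A minimal sketch of MI-based feature selection, assuming scikit-learn and the X/y frames from the decision tree sketch; mutual_info_classif estimates the MI between each feature and the class label, and the name X_selected is illustrative.

```python
# Rank features by MI with the label and drop the lowest-scoring one,
# mirroring the procedure described above.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

mi = pd.Series(mutual_info_classif(X, y, random_state=42), index=X.columns)
mi = mi.sort_values(ascending=False)          # sort from large to small
print(mi)                                     # inspect the ranking
X_selected = X.drop(columns=[mi.index[-1]])   # discard the least informative
```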

Importance Feature
Findings in previous research demonstrate that the random forest classifier's variable importance measure is a very helpful foundation for wrapper algorithms that solve the all-relevant feature selection problem. The variety of artificial data sets used to evaluate its usefulness poses a highly challenging problem for a tree-based classifier. Nevertheless, in the random forest importance ranking, all of the relevant attributes were typically ranked higher than the irrelevant ones [6].
Heuristic approaches based on artificial contrasts can effectively use the ranking from the random forest to separate unimportant features from truly relevant ones. The outcomes of the heuristic process are only somewhat worse than what the underlying feature ranking would allow. Studies on both synthetic and semi-synthetic data sets show that heuristic algorithms can produce false positive findings, and that their number may be correlated with the number of attributes. In contrast, when tested on the Golub data set, the approach did not produce any evident false positives, independently confirming the significance of the attributes discovered by alternative methods and discovering a large number of additional significant attributes. This finding implies that the characteristics of the irrelevant features influence how sensitive the artificial-contrast-based heuristic is to false positive findings [6].
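The study does not spell out its exact IF computation, so the sketch below uses scikit-learn's impurity-based feature_importances_ from a random forest as one common realization, applied in the two elimination stages used later in the results; the importance_ranking helper and the X1/X2 names are illustrative assumptions.

```python
# A minimal sketch of two-stage importance-based feature elimination.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def importance_ranking(X, y):
    rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
    return pd.Series(rf.feature_importances_, index=X.columns).sort_values()

rank1 = importance_ranking(X, y)
X1 = X.drop(columns=[rank1.index[0]])   # stage 1: drop the least important
rank2 = importance_ranking(X1, y)
X2 = X1.drop(columns=[rank2.index[0]])  # stage 2: drop the next least important
```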

RESULTS AND DISCUSSION
In this study, a diabetes dataset from the UCI Machine Learning Repository was used. The dataset contains 8 features and 1 output label. Testing was carried out in 3 stages: using all features, with feature selection using Mutual Information, and with feature selection using Importance Feature.
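A minimal sketch of these three stages, assuming scikit-learn and the feature sets built in the earlier sketches: X (all 8 features), X_selected (after MI selection), and X2 (after IF selection); the split ratio and model settings are illustrative, not the study's configuration.

```python
# Run all four classifiers on each of the three feature sets.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Stochastic Gradient Boost": GradientBoostingClassifier(
        subsample=0.5, random_state=42),
}

stages = {"all features": X, "Mutual Information": X_selected,
          "Importance Feature": X2}
for stage, features in stages.items():
    Xtr, Xte, ytr, yte = train_test_split(features, y, test_size=0.2,
                                          random_state=42)
    for name, model in models.items():
        acc = model.fit(Xtr, ytr).score(Xte, yte)
        print(f"{stage:20s} {name:26s} accuracy={acc:.3f}")
```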
The first test was done using all the features, with every classification method used in this study. The results can be seen in Table 1. Based on Table 1, the best classification result is obtained by the Random Forest algorithm, which reaches 80.5% accuracy using all features. The Decision Tree algorithm gives the lowest classification result, with an accuracy of 68.8%. The graph of the first test results can be seen in Figure 3.

The next test uses Mutual Information feature selection. Here, MI values are calculated to determine which features will be eliminated. The results of the Mutual Information calculation can be seen in Table 2. Based on Table 2, the triceps_skinfold_thickness feature has the lowest Mutual Information value and is therefore eliminated. The classification test was then repeated with all the methods in this study; the results can be seen in Table 3. Based on Table 3, accuracy increases for 3 methods: Random Forest, AdaBoost, and Stochastic Gradient Boost. The accuracy of the Decision Tree algorithm does not change even after feature selection. The best result is obtained by the Random Forest algorithm with an accuracy of 81.8%. The graph of the results of testing with this feature selection can be seen in Figure 3.

The last test uses Importance Feature (IF) selection, which calculates how important a feature is to the classification. The calculation is carried out in 2 stages. The results of the first stage can be seen in Table 4, which shows that the triceps_skinfold_thickness feature is the least influential, with the lowest IF value of 70.296. That feature is therefore eliminated and the IF values are recalculated. The results of the final IF calculation, without the triceps_skinfold_thickness feature, can be seen in Table 5: the feature with the lowest IF value is serum_insulin, with a value of 86.716. Based on this calculation, the serum_insulin feature is also eliminated before the classification stage. This test then uses all the classification methods in this study with the features selected by the IF method. The results can be seen in Table 6 and Figure 4, which show no change in classification accuracy after feature selection, except for the Random Forest method, which experienced a decrease in accuracy compared with the other methods after using only six features. The best accuracy is obtained by the Stochastic Gradient Boost method, with a value of 80.5%.

CONCLUSION
Based on all test results, it can be seen that feature selection can improve classification performance. The best overall accuracy is achieved by the Random Forest method with Mutual Information feature selection, at 81.8%. However, with Importance Feature selection, Random Forest's accuracy decreased. This is likely because reducing the number of features degrades the performance of Random Forest, which relies on majority voting across many trees.

Firstly, using all features, the Random Forest algorithm achieved the highest accuracy of 80.5%, while the Decision Tree algorithm had the lowest accuracy of 68.8%. Secondly, after applying feature selection using the Mutual Information and Importance Feature methods, the Random Forest algorithm still achieved the highest accuracy, with a slight improvement to 81.8% under Mutual Information but a decrease to 77.9% under the Importance Feature method. The Decision Tree algorithm, on the other hand, showed no improvement in accuracy after feature selection. Thirdly, the Stochastic Gradient Boost method showed consistent performance, with an accuracy of 76.6% in the first test and 80.5% in both the second and third tests.

Overall, the study suggests that feature selection methods can improve classification performance, but the choice of algorithm plays a crucial role in achieving the best accuracy. The Random Forest and Stochastic Gradient Boost methods appear to be the most effective algorithms for this particular diabetes dataset. Further research can be done with other classification methods, such as Neural Networks or Deep Learning. For feature selection, the Genetic Algorithm or Particle Swarm Optimization can be used to improve the feature selection results.