Effect of Selecting Validation Dataset on Building Random Forest and Decision Tree Models

Mona Mohammed; Reem Alsunosi

Authors

Mona Mohammed, Department of Computer Science, Faculty of Sciences, University of Omar Almukhtar, Albaida, Libya
Reem Alsunosi, Department of Computer Science, Faculty of Sciences, University of Omar Almukhtar, Albaida, Libya, https://orcid.org/0000-0001-5660-7125

Keywords:

Training, Validation and Testing Data, Train-Test Data Split, Random Forest (RF), Decision Tree (DT).

Abstract

Background and aims. Machine learning models are trained using an appropriate learning algorithm and training data. The dataset is partitioned into training and testing data: the training data are used by the model to learn, and the testing data are used to evaluate model performance through predictions on unseen data. The train-test split procedure is used to estimate the performance of machine learning algorithms when they make predictions on data not used to train the model. Putting machine learning models into production requires much more than creating and validating models; data validation is used to check that the model can return useful predictions in a real-world setting. The basic aim of this paper was to take a closer, critical look at training data split methods for building the best models and to point out their weaknesses and limitations, especially for evaluating and comparing the performance of random forest and decision tree models. Methods. Experiments were carried out with different combinations of training and validation data to examine how the method of selecting the validation dataset affects the performance of random forest and decision tree models on both classification and regression problems. The experiments also tested the effect of increasing the training data size. Results. In classification tasks, a 60/40 training/validation split was optimal for big data sets and an 80/20 split was optimal for small data sets in most experiments. In regression tasks, model performance increased as fold size increased in most cross-validation experiments. Conclusion. The performance of Random Forest classification, Decision Tree classification, Random Forest regression, and Decision Tree regression under different train/validation split ratios was better than their performance using cross-validation.
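
The two validation strategies compared in the abstract, a hold-out split at a fixed training/validation ratio and k-fold cross-validation, can be illustrated with a minimal sketch. This is not the authors' experimental code. The function names (`holdout_split`, `kfold_splits`, `compare_strategies`), the synthetic labels, and the fold counts are assumptions made for illustration, and the majority-class scorer is only a placeholder standing in for a random forest or decision tree model.

```python
import random
from statistics import mean


def holdout_split(n_samples, train_ratio, seed=0):
    """Shuffle sample indices and split them into (train, validation) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_ratio))
    return indices[:cut], indices[cut:]


def kfold_splits(n_samples, k, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def majority_class_score(y, train_idx, val_idx):
    """Placeholder model (NOT a random forest or decision tree):
    predict the most common training label and return validation accuracy."""
    train_labels = [y[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return mean(1.0 if y[i] == majority else 0.0 for i in val_idx)


def compare_strategies(y, score_fn, ratios=(0.6, 0.8), k_values=(5, 10)):
    """Score one model under several hold-out ratios and several fold counts."""
    n = len(y)
    results = {}
    for ratio in ratios:
        train_idx, val_idx = holdout_split(n, ratio)
        pct = round(ratio * 100)
        results[f"holdout {pct}/{100 - pct}"] = score_fn(y, train_idx, val_idx)
    for k in k_values:
        fold_scores = [score_fn(y, tr, va) for tr, va in kfold_splits(n, k)]
        results[f"{k}-fold CV"] = mean(fold_scores)
    return results


# Synthetic labels, assumed for illustration only: 70 of class 0, 30 of class 1.
labels = [0] * 70 + [1] * 30
scores = compare_strategies(labels, majority_class_score)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

To reproduce a comparison like the one described, `majority_class_score` would be replaced by a function that fits a random forest or decision tree on the training indices and scores it on the validation indices. The split helpers do not depend on the model used.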