One of the key steps in using analytics to generate insights is model fitting. Typical projects involve a lot of data cleaning so that the model achieves high accuracy, and competitions are all about data cleaning and models. Various models can be fitted to data under different conditions, and one of the most intuitive is the decision tree. Decision trees classify data into buckets through a series of “decisions” on feature values. Most competitions start with benchmarking against results from an ensemble of trees, known as a random decision forest. Random forests, as they are called, use ensembles of trees and are the best-known example of ‘bagging’ techniques. R, the popular language for model fitting, offers a variety of random forest packages. Let’s discuss a few of them (this list is in no way exhaustive); a minimal usage sketch for each package follows the list.
- randomForest: The ‘classic’ package in R, which implements the basic random forest logic and is very robust. The package is user friendly and gives you the option to tune parameters such as the number of trees and the depth of trees. It can optionally derive feature importance and proximity measures. Feature importance is based on the increase in error when a feature’s OOB data is permuted while everything else is kept the same. The proximity measure, on the other hand, is a matrix whose (i, j) element indicates the fraction of trees in which elements i and j fall in the same terminal node. The package can be used for classification or regression problems and can be learnt with ease.
- cforest: This package (from party) is computationally more expensive than randomForest but better in terms of accuracy. cforest makes fuller use of OOB data, which means more information and higher accuracy, but it is also slower and can handle less data in the same memory. It uses a weighted average of the trees to get the final ensemble. The main reason cforest gives more reliable predictions, however, is that it produces unbiased trees: the simple randomForest algorithm is invariably biased towards features with many cut points, so continuous features, or features with many categories, end up being preferred. Whenever you have large computational resources at your disposal, do use cforest for accuracy.
- obliqueRF: “Oblique” forests are an underrated, advanced yet useful concept based on separating classes with hyperplanes instead of single-feature splits. They can easily outperform randomForest, especially when all the features are discrete or we have spectral data. Just like randomForest, oblique forests are governed by the subspace dimension (number of features) and ensemble size (number of trees). However, since they make oblique cuts rather than orthogonal ones, recursive binary splits and ridge regression are also involved in splitting. I have seen a cool implementation of oblique random forests as prize-winning code in a Kaggle competition, so they certainly pack a punch. obliqueRF does end up having higher bias and lower variance than randomForest.
- ParallelForest: ParallelForest is an implementation that grows random forests using parallel computing; the model-fitting function is grow.forest. It is pretty handy when there are millions of rows in the training set: a data set that took the randomForest package days to fit was handled by ParallelForest in under an hour. However, there are still doubts about whether the accuracy is the same for both packages under all conditions and whether classification can be implemented using parallel processing. (Another package, bigrf, also uses multi-threading and caching, but it was built to handle very large data rather than to speed up processing.)
- randomUniformForest: This package produces unpruned trees and is useful for regression, classification and unsupervised learning. If cforest is the slower-but-more-accurate counterpart to randomForest, randomUniformForest falls at the other end as the faster but slightly less accurate version. Its trees have lower correlation with each other, resulting in lower bias but higher variance, and cut points are drawn from a uniform distribution. Since we don’t care much about bias, as perfectly randomized trees will cancel it out, randomUniformForest is useful in situations where the features themselves follow specific distributions.
- randomForestSRC: Survival, Regression and Classification (SRC) are the three types of models for which this package provides a unified function. Additionally, there are multivariate and unsupervised extensions, as well as parallel processing through OpenMP. I reach for this package whenever there is doubt about the best approach to fitting a model to the data. Coupled with missing value imputation, it provides a first-look model that is useful for further exploration and deep-dive analysis.
- ranger: ranger comes to the rescue when you have high-dimensional data and want a memory-efficient yet fast implementation of random forests. The name comes from RANdom forest GEneRator. I have mainly used ranger to build models quickly and to find optimal parameter values through tuning.
- Rborist: Rborist is a high-performance implementation of random forests. Compared to the original randomForest, this package optimizes the algorithms so that model fitting involves less data movement within memory, creating opportunities for scaling up performance. Hence, as the number of features increases, processing time increases only linearly (as opposed to the exponential increase expected with randomForest). The package also supports missing value imputation. In projects where we generate a lot of features ourselves, this package becomes all the more suitable.
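To make the list concrete, here is a minimal randomForest sketch on the built-in iris data, showing the ntree/mtry tuning knobs and the optional importance and proximity outputs described above.

```r
library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500,        # number of trees
                    mtry = 2,           # features tried at each split
                    nodesize = 5,       # indirectly controls tree depth
                    importance = TRUE,  # permutation-based feature importance
                    proximity = TRUE)   # (i, j) = fraction of trees where i and j share a terminal node

importance(fit)          # error increase when a feature's OOB values are permuted
fit$proximity[1:3, 1:3]  # corner of the proximity matrix
predict(fit, newdata = iris[1:5, ])
```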
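A comparable sketch for cforest from the party package; cforest_unbiased() supplies the settings behind the unbiased trees mentioned above.

```r
library(party)

set.seed(42)
ct <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 500, mtry = 2))

predict(ct, OOB = TRUE)[1:5]  # out-of-bag predictions
varimp(ct)                    # permutation importance, unbiased w.r.t. cut points
```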
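A sketch for obliqueRF. As far as I know the package handles two-class problems only, so the example subsets iris to two species; the training_method argument (ridge regression for the split hyperplanes) follows my reading of the package docs.

```r
library(obliqueRF)

# obliqueRF expects a numeric matrix x and a binary response y
two <- iris[iris$Species != "setosa", ]
x <- as.matrix(two[, 1:4])
y <- as.numeric(two$Species == "versicolor")

fit <- obliqueRF(x, y,
                 ntree = 100, mtry = 2,
                 training_method = "ridge")  # ridge regression picks each split hyperplane
predict(fit, x[1:5, ])
```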
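For ParallelForest, a sketch based on the usage I recall from the package vignette (the low_high_earners data ships with the package; argument names such as numboots, i.e. the number of trees, may differ across versions).

```r
library(ParallelForest)

data(low_high_earners)  # example data bundled with the package
fit <- grow.forest(Y ~ ., data = low_high_earners,
                   min_node_obs = 1000,  # minimum observations per node
                   max_depth = 13,       # cap on tree depth
                   numsamps = 100000,    # observations sampled per tree
                   numvars = 5,          # features tried per split
                   numboots = 20)        # number of trees

pred <- predict(fit, newdata = low_high_earners)
```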
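A randomUniformForest sketch; the interface mirrors randomForest's formula interface, and the package also exposes an unsupervised mode.

```r
library(randomUniformForest)

set.seed(42)
fit <- randomUniformForest(Species ~ ., data = iris,
                           ntree = 200)  # cut points drawn uniformly at random
summary(fit)

# unsupervised variant (clustering-style use of the forest)
# uf <- unsupervised.randomUniformForest(iris[, 1:4])
```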
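A randomForestSRC sketch using its unified rfsrc() function on the survival data shipped with the package, with on-the-fly missing value imputation.

```r
library(randomForestSRC)

data(veteran, package = "randomForestSRC")  # survival data shipped with the package
fit <- rfsrc(Surv(time, status) ~ ., data = veteran,
             ntree = 500,
             na.action = "na.impute")  # impute missing values while growing the forest
print(fit)

# classification and regression go through the same function, e.g.
# rfsrc(Species ~ ., data = iris)
```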
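A ranger sketch, including the kind of quick mtry sweep I mean by parameter tuning; fit$prediction.error is the OOB error.

```r
library(ranger)

# quick sweep over mtry using the OOB prediction error
for (m in 1:4) {
  fit <- ranger(Species ~ ., data = iris,
                num.trees = 500, mtry = m, seed = 42)
  cat("mtry =", m, "OOB error =", fit$prediction.error, "\n")
}

best <- ranger(Species ~ ., data = iris, num.trees = 500, mtry = 2,
               importance = "impurity", seed = 42)
predict(best, data = iris[1:5, ])$predictions
```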
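Finally, a minimal Rborist sketch; it takes a predictor frame and response directly rather than a formula (the nTree argument and the yPred slot of the prediction object follow the package docs as I recall them).

```r
library(Rborist)

x <- iris[, 1:4]
y <- iris$Species
fit <- Rborist(x, y, nTree = 500)

pred <- predict(fit, newdata = iris[1:5, 1:4])
pred$yPred  # predicted classes
```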
Since the idea was first suggested in the 1990s, random forests have become a popular method of model fitting and are used in various forms. There are even more implementations, such as rotationForest (based on fitting trees over principal components of the features), xgboost (extreme gradient boosting, a clever tree-based technique that uses boosting), rFerns (useful for comparing images) and regularized random forests. This article will be useful for those who have gone through decision tree and basic random forest concepts and want to learn about their different variations in R.