Variations Of Random Forest In R

Ankit Katiyar
14/07/2017

One of the important steps in using analytics to generate insights is model fitting. Typical projects involve a lot of data cleaning so that the model achieves high accuracy when it is applied. Competitions are all about data cleaning and models. Many different models can be fitted to data under different conditions, and one of the most intuitive is the decision tree. Decision trees classify data into buckets through a series of “decisions” based on the feature values. Most competitions start by benchmarking against the results from an ensemble of trees, known as a random decision forest. Random forests, as they are called, combine many trees and are among the best-known examples of ‘bagging’ techniques. R, the popular language for model fitting, offers a variety of random forest packages. Let’s discuss a few of them (this list is by no means exhaustive).
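To make the bagging idea concrete, here is a minimal sketch in plain Python (chosen only for illustration; it does not reproduce any R package). It assumes the usual bagging recipe: each tree is trained on a bootstrap sample drawn with replacement, the rows that were never drawn form that tree's out-of-bag (OOB) set, and the forest combines per-tree predictions by majority vote. The function names `bootstrap_sample` and `majority_vote` are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement.

    Returns the in-bag indices (with repeats) and the sorted
    out-of-bag indices that were never drawn.
    """
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def majority_vote(predictions):
    """Combine the class labels predicted by each tree into one label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)          # fixed seed so the sketch is repeatable
n_rows = 1000
in_bag, oob = bootstrap_sample(n_rows, rng)
oob_fraction = len(oob) / n_rows  # roughly a third of the rows on average

# Hypothetical predictions from three trees for a single row
vote = majority_vote(["yes", "no", "yes"])
```

On average about 37% of the rows end up out-of-bag for any one tree, which is why OOB data can serve as a built-in validation set.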

  • randomForest: The ‘classic’ package in R, which implements the most basic random forest logic and is really robust. The package is very user friendly and lets the user tune settings such as the number of trees (ntree), the number of features tried at each split (mtry) and the size of the trees (nodesize, maxnodes). It can optionally compute feature importance and proximity measures. Feature importance is based on the increase in error when the values of one feature are permuted in the OOB data while everything else is kept the same. The proximity measure, on the other hand, is a matrix whose (i, j) element is the fraction of trees in which observations i and j fall in the same terminal node. The package can be used for classification or regression problems and is easy to learn.
  • cforest: This function, from the party package, is computationally more expensive than randomForest but generally more accurate. cforest makes use of the OOB data, which means more information and higher accuracy. At the same time it is slower and can handle less data for the same amount of memory. It uses a weighted average of the trees to get the final ensemble. However, the main reason cforest gives more reliable predictions is that it produces unbiased trees. randomForest has the drawback that its simple splitting algorithm is biased towards features with many possible cut points, so features that are continuous or have many categories tend to be preferred. Whenever you have large computational resources at your disposal, do use cforest for accuracy.
  • obliqueRF: “Oblique” forests are an underrated, advanced yet useful concept: each node splits on a hyperplane (a linear combination of features) instead of a single feature. They can easily outperform randomForest, especially when all the features are numeric or when we have spectral data. Just like randomForest, oblique forests are governed by the subspace dimension (the number of features tried at each split) and the ensemble size (the number of trees). However, since they make oblique cuts rather than orthogonal ones, the recursive binary splits are found with the help of ridge regression. I have seen a cool implementation of oblique random forests as the prize-winning code in a Kaggle competition! Hence oblique random forests sure pack a punch. obliqueRF does end up having a higher bias and lower variance than randomForest.
  • ParallelForest: ParallelForest is an implementation that runs a random forest using parallel computing. The package provides functions such as grow.forest. It’s pretty handy when there are millions of rows in the training set. A data set that took days to fit with the randomForest package was handled by ParallelForest in under an hour. However, there are still doubts about whether the accuracy is the same for both packages under all conditions, and whether classification can be implemented using parallel processing. (Another package, bigrf, also uses multi-threading and caching for very large data, but it was built to handle very large data sets rather than to speed up processing.)
  • randomUniformForest: This package produces unpruned trees and is useful for regression, classification and unsupervised learning. If cforest is slower but more accurate than randomForest, then randomUniformForest sits at the other end: faster but slightly less accurate. The trees have lower correlation with each other, resulting in lower bias but higher variance. Moreover, the cut points are drawn using the uniform distribution. Since we don’t care much about bias, as perfectly randomized trees will cancel it out, random uniform forests are useful in situations where the features themselves follow specified distributions.
  • randomForestSRC: Survival, Regression and Classification (SRC) are the three types of models for which this package provides a unified function. Additionally, there are multivariate and unsupervised extensions, as well as parallel processing through OpenMP. I have come to use this package whenever there is doubt about the best approach to model fitting. Coupled with missing value imputation, the package provides a first-look model that is useful for further exploration and deep-dive analysis.
  • ranger: ranger comes to the rescue when you have high-dimensional data and want a memory-efficient yet fast implementation of random forests. The name comes from RANdom forest GEneRator. I have mainly used ranger to build models quickly and to find optimal parameter values through parameter tuning.
  • Rborist: Rborist is a high-performance implementation of random forests. Compared to the original randomForest, this package optimizes the algorithm so that model fitting involves less data movement within memory, which creates opportunities for scaling up performance. Hence, as the number of features increases, the processing time increases only linearly (as opposed to the much steeper increase seen with randomForest). The package also supports missing value imputation. Hence, in projects where we ourselves generate a lot of features, this package becomes more suitable.
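The proximity measure described for randomForest above can be sketched in a few lines. This is an illustrative plain-Python version, not the package's own code: it assumes we already know, for every tree, which terminal node each observation lands in. The `leaf_ids` table and the function name `proximity_matrix` are made up for the example.

```python
def proximity_matrix(leaf_ids):
    """Fraction of trees in which rows i and j share a terminal node.

    leaf_ids[t][i] is the terminal node id that tree t assigns to row i.
    """
    n_trees = len(leaf_ids)
    n_rows = len(leaf_ids[0])
    counts = [[0] * n_rows for _ in range(n_rows)]
    for tree in leaf_ids:
        for i in range(n_rows):
            for j in range(n_rows):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    # Convert co-occurrence counts into fractions of the ensemble
    return [[c / n_trees for c in row] for row in counts]

# Hypothetical terminal-node assignments: 4 trees, 4 observations
leaf_ids = [
    [1, 1, 2, 2],
    [1, 2, 2, 3],
    [1, 1, 1, 2],
    [1, 1, 2, 2],
]
prox = proximity_matrix(leaf_ids)
```

Here observations 0 and 1 share a terminal node in three of the four trees, so their proximity is 0.75, while observations 0 and 3 never share one and get 0. The matrix is symmetric with ones on the diagonal, which is what makes it usable as a similarity measure for clustering or outlier detection.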

Since the idea was first suggested in the 1990s, random forests have become a popular method of model fitting and are used in various forms. There are even more implementations, such as rotationForest (based on fitting features over principal components), xgboost (extreme gradient boosting, a clever tree-based technique that uses boosting), rFerns (useful for comparing images) and regularized random forests. This article will be useful for those who have gone through decision tree and basic random forest concepts and want to learn about their different variations in R.
