Outliers
* An outlier is an observation that lies far from the other observations.
* An outlier may indicate an experimental error, or it may be due to variability in the measurement.
* Outliers are different from noise. Noise is random error or variance in a measured variable, and it should be removed before outlier detection.
* We can check how a model performs on the dataset without the outliers by comparing its scores with and without them.
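As a rough illustration of "distant from other observations", the sketch below flags values outside 1.5 times the interquartile range (IQR). The 1.5 x IQR rule is one common convention and an assumption here, not something these notes prescribe; the function name `iqr_outliers` and the sample values are made up.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second and third quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up sample: most values sit near 11, one is far away.
data = [10, 12, 11, 13, 12, 11, 10, 50]
print(iqr_outliers(data))
```

The multiplier `k` is a tuning choice; a larger value flags fewer points.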
Categories of Outlier
a) Global Outlier / Point Anomalies
* They are defined as data points that differ significantly from the rest of the data.
* A particular type of global outlier is the influential point. It is defined as an outlier whose presence noticeably changes the fitted model, for example the slope of the regression line.
* We can evaluate an outlier by comparing the R2 score and the regression line with and without it.
* Removing an outlier may or may not increase R2, so each outlier has to be studied individually before deciding whether to remove it.
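The comparison of R2 and the regression line with and without a suspected outlier can be sketched as follows. This is a minimal example on made-up data using a hand-written least-squares fit; the helper names `fit_line` and `r2_score` are illustrative, not taken from any particular library.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def r2_score(xs, ys):
    """R2 of the least-squares line fitted to (xs, ys)."""
    slope, intercept = fit_line(xs, ys)
    y_bar = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up data: y is roughly 2x, except for the last point.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.0, 9.9, 30.0]

slope_with, _ = fit_line(xs, ys)
slope_without, _ = fit_line(xs[:-1], ys[:-1])
r2_with = r2_score(xs, ys)
r2_without = r2_score(xs[:-1], ys[:-1])

print(f"with outlier:    slope={slope_with:.2f}, R2={r2_with:.3f}")
print(f"without outlier: slope={slope_without:.2f}, R2={r2_without:.3f}")
```

In this sample the last point is influential: dropping it changes the slope and raises R2. On other data, removing a point can leave R2 unchanged or lower it, which is why each case needs to be looked at individually.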
b) Contextual / Conditional Outlier
* It is defined as a data point that deviates significantly from the rest of the data within a specific context.
* It can be a point that follows the trend in some other context but not in the context defined for the rest of the dataset.
* We have to study it and remove it carefully.
* We shall focus on two kinds of attributes:
Contextual attributes: e.g., time, location, etc.
Behavioural attributes: e.g., temperature, calories taken
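To make the two kinds of attributes concrete, the sketch below treats the month as the contextual attribute and the temperature as the behavioural attribute, and flags readings that sit far from the median of their own context. The function name `contextual_outliers`, the fixed tolerance of 10 degrees, and the readings are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median

def contextual_outliers(readings, tolerance):
    """Flag (context, value) pairs far from the median of their context."""
    by_context = defaultdict(list)
    for context, value in readings:
        by_context[context].append(value)
    centers = {c: median(vals) for c, vals in by_context.items()}
    return [(c, v) for c, v in readings if abs(v - centers[c]) > tolerance]

# Made-up temperatures in degrees Celsius, keyed by month.
readings = [
    ("Jan", 2), ("Jan", 4), ("Jan", 3), ("Jan", 28),
    ("Jul", 27), ("Jul", 29), ("Jul", 28), ("Jul", 30),
]
print(contextual_outliers(readings, tolerance=10))
```

A reading of 28 is ordinary in July, so it is not a global outlier, but it stands out among the January readings.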
c) Collective Outlier
* It is defined as a set of data points that, taken together, deviates significantly from the rest of the data, even if the individual data points are not outliers.
* We have to study it carefully. Removing such data points may degrade the system, so we shall focus on improving the system rather than discarding the data.
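One simple way to spot a collective outlier is to look at windows of consecutive points instead of single values. The sketch below flags windows whose spread is unusually small, such as a signal that normally oscillates but goes flat for a while. The function name `flat_windows`, the window size, and the threshold are assumptions for illustration; other data would need a different group-level test.

```python
from statistics import pstdev

def flat_windows(series, size, min_spread):
    """Return start indices of windows whose standard deviation is below min_spread."""
    return [
        start
        for start in range(len(series) - size + 1)
        if pstdev(series[start:start + size]) < min_spread
    ]

# Made-up signal: every value lies between 5 and 7, so no single point
# is unusual, but the run of identical values in the middle is.
series = [5, 7, 5, 7, 5, 6, 6, 6, 6, 6, 7, 5, 7, 5]
print(flat_windows(series, size=4, min_spread=0.1))
```

The flagged indices mark where the group behaviour breaks down; they are a prompt to investigate the cause, not a list of points to delete.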