How to remove outliers in data

An outlier is a data point that lies far outside the range in which most of the data set falls. In plain terms, it differs markedly from the surrounding data and can cause serious problems in statistical analysis. Outliers can also distort the training of machine learning algorithms, so it is usually advisable to remove them. Before outliers can be removed, however, they have to be detected. Some methods for detecting and dealing with outliers are as follows:

  • Univariate Method: The box plot is the most widely used way to detect outliers in a single variable. It is built on the median and the lower and upper quartiles. A value is treated as an outlier when it lies sufficiently far from the bulk of the data. A common convention (Tukey's rule) flags values that fall more than 1.5 times the interquartile range below the lower quartile or above the upper quartile.
  • Multivariate Method: The univariate method has a drawback. Outliers need not be extreme values, yet the univariate method sometimes flags extreme values as outliers. The multivariate method tries to overcome this drawback by using a neural network. We train a neural network on the data to obtain a predictive model, and then run a linear regression of the predicted values against the actual values. When the predicted values are plotted against the actual values, any points that lie significantly far from the fitted line are outliers.
  • Minkowski error: This method works differently from the previous two. Its aim is not to remove the outliers, but to reduce their impact on the analysis. The Minkowski error is a loss index that is less sensitive to outliers than the standard mean squared error. The mean squared error raises each instance error to the power of 2, so outliers make a very large contribution to the total error. The Minkowski error instead raises each instance error to a power smaller than 2, for instance 1.5, which reduces the contribution of outliers to the total error. For example, if an outlier has an error of 10, its squared error is 100, while its Minkowski error is 10^1.5 ≈ 31.62.
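
As an illustration of the univariate method, here is a minimal Python sketch of box-plot style detection using the quartiles. The function name `iqr_outlier_mask`, the multiplier `k=1.5`, and the sample values are assumptions made for this example and are not prescribed by the method itself.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask, True where a value lies outside the box-plot fences.

    k=1.5 is the conventional multiplier; it is an assumption, not a fixed rule.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # lower and upper quartiles
    iqr = q3 - q1                             # interquartile range
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return (values < lower_fence) | (values > upper_fence)

# Hypothetical sample: six typical readings and one suspicious value
data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
mask = iqr_outlier_mask(data)
cleaned = data[~mask]  # keep only the values inside the fences

print("outliers:", data[mask])
print("cleaned: ", cleaned)
```

A larger `k` makes the rule more tolerant, so fewer values are flagged. The right setting depends on the data.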
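
The idea behind the multivariate method, comparing model predictions with actual values and flagging the instances that disagree most, can be sketched as follows. To keep the example short and self-contained, an ordinary least-squares model stands in for the neural network, so this is a simplification and not the method as described. The toy data, the corrupted instance, and the cut-off of 3 standard deviations are likewise assumptions.

```python
import numpy as np

# Hypothetical toy data: two input features on a 5x5 grid and a target that
# is exactly linear in them, except for one corrupted instance.
x1, x2 = np.meshgrid(np.arange(5.0), np.arange(5.0))
features = np.column_stack([x1.ravel(), x2.ravel()])
target = 3.0 * features[:, 0] - 2.0 * features[:, 1] + 1.0
target[12] += 29.0  # corrupt the instance at the grid centre (2, 2)

# Stand-in predictive model: ordinary least squares with an intercept column
# (the article uses a neural network here; this swap is an assumption).
design = np.column_stack([features, np.ones(len(features))])
coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
predicted = design @ coeffs

# Compare actual and predicted values, then flag large disagreements.
residuals = target - predicted
z_scores = (residuals - residuals.mean()) / residuals.std()
outlier_mask = np.abs(z_scores) > 3.0  # assumed cut-off

print("flagged instances:", np.flatnonzero(outlier_mask))
```

Note that the corrupted instance has ordinary feature values (2, 2), so inspecting each input variable on its own would not reveal it. It only stands out once the prediction is compared with the actual target.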
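
The arithmetic behind the Minkowski error can be checked with a few lines of Python. This is a sketch only: the helper name `instance_errors` and the sample error values are assumptions, and the totals are left as plain sums without dividing by the number of instances.

```python
import numpy as np

def instance_errors(errors, power):
    """Raise the absolute value of each instance error to the given power."""
    return np.abs(np.asarray(errors, dtype=float)) ** power

# Hypothetical instance errors: three small ones and one outlier with error 10
errors = np.array([1.0, 1.0, 1.0, 10.0])

squared = instance_errors(errors, 2.0)    # terms used by the squared error
minkowski = instance_errors(errors, 1.5)  # terms used by the Minkowski error

# Share of the total error that comes from the outlier alone
squared_share = squared[-1] / squared.sum()
minkowski_share = minkowski[-1] / minkowski.sum()

print("outlier term, squared:  ", squared[-1])
print("outlier term, Minkowski:", round(minkowski[-1], 2))
print("outlier share of total: ", round(squared_share, 3), "vs", round(minkowski_share, 3))
```

The outlier still dominates the total under the Minkowski error, but less strongly than under the squared error, which is the effect described above.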