Detecting and Fixing Outliers in Data

Mon 31 March 2014

While working with any data, I spend a good amount of time cleaning it so that it can be useful for subsequent analysis. Some of the data I work with has inherent noise at unspecified intervals, making it difficult to visualize and analyze. For example, take this constructed sample -

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Generate random data
data = np.random.poisson(5, 100)

# Set a couple of values as 'outliers'
data[45] = 1000
data[89] = 670

# Plot the data
plt.scatter(range(1, 101), data)
plt.show()

Noise in data

As you can see, the erroneous outliers make it nearly impossible to focus on the real data. Moreover, if left uncorrected, these values may have a disproportionate influence on, say, an OLS regression due to their high leverage.
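To get a feel for that influence, here is a minimal sketch that fits a straight line by least squares with and without a single corrupted point. The flat series and the use of `np.polyfit` as the OLS fit are my own illustrative assumptions, not the data plotted above.

```python
import numpy as np

# Deterministic toy series: a flat signal at 5 with one corrupted point.
# (Made-up values for illustration; this is not the random sample above.)
x = np.arange(1, 101, dtype=float)
clean = np.full(100, 5.0)
noisy = clean.copy()
noisy[89] = 670.0  # a single outlier far from the middle of the x range

# Least-squares straight-line fits; np.polyfit returns [slope, intercept].
slope_clean, intercept_clean = np.polyfit(x, clean, 1)
slope_noisy, intercept_noisy = np.polyfit(x, noisy, 1)

print("slope without outlier:", slope_clean)
print("slope with outlier:   ", slope_noisy)
```

One bad value out of a hundred is enough to turn a flat line into a clearly upward-sloping one.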

While there are various methods to deal with such outliers, including manually removing them, I wanted a method that would 1) preserve the real data and 2) make a reasonable assumption about the real value of the outlier.

So I coded up a simple function that I call the Selective Median Filter. The filter slides a window (whose width is set by the kernel size) across the data and checks whether the point at the center of the window is an outlier. It does so by comparing that value to the median of the values in the window: if the value lies more than a given number of window standard deviations away from the median, it is replaced by the median. The sensitivity of classifying a value as an outlier can be controlled using the threshold parameter.

Here's what the code looks like -

def selective_median_filter(data, kernel=31, threshold=2):
    """Return a copy of data with outliers set to the median of the
        surrounding window. Outliers are values that lie more than
        'threshold' standard deviations (of the window) from the
        window median."""
    if kernel % 2 == 0:
        raise ValueError("Kernel needs to be odd.")
    n = len(data)
    half = kernel // 2  # integer division so the slice bounds stay ints
    res = list(data)
    for i in range(n):
        # Window centered on i, truncated at the edges of the data
        seg = res[max(0, i - half):min(n, i + half + 1)]
        mn = np.median(seg)
        if abs(res[i] - mn) > threshold * np.std(seg):
            res[i] = mn
    return res
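To make the outlier test concrete, here is the same check applied by hand to a single window. The seven values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# One hypothetical 7-value window with a spike in the middle.
window = [4, 6, 5, 1000, 3, 7, 5]
threshold = 2

median = np.median(window)  # robust center of the window
spread = np.std(window)     # note: inflated by the spike itself

# Flag every value that sits more than `threshold` standard deviations
# away from the window median.
flags = [bool(abs(value - median) > threshold * spread) for value in window]

for value, flag in zip(window, flags):
    print(value, "outlier" if flag else "ok")
```

Only the spike is flagged. Because the standard deviation is computed from the window including the spike, a very small kernel or a large threshold can let an outlier hide itself, which is worth keeping in mind when picking the parameters.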

The underlying assumption here is that the values in a relatively small neighborhood will not vary too much. Here's what the result looks like when you pass the data above through this filter -

data = selective_median_filter(data, 7, 2)

plt.scatter(range(1, 101), data)
plt.show()

print(data[45], data[89])
# e.g. 5.0 7.0 (exact values depend on the random draw)

Cleaned data

Much better!

While you are free to use the code above in your own analysis, the usual disclaimers apply: know your data, and make sure applying this filter makes sense (sometimes simply removing the outliers is the better choice).
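For the case where removal is the better fit, a rough sketch might look like the following. It reuses the median-and-standard-deviation test, but applies it once over the whole series instead of per window; that global rule and the sample values are assumptions made for illustration.

```python
import numpy as np

# Hypothetical series: 20 small readings, two of them corrupted.
data = np.tile([4.0, 6.0, 5.0, 3.0, 7.0], 4)
data[7] = 1000.0
data[15] = 670.0
threshold = 2

# Same kind of test as the filter, applied once over the whole series.
center = np.median(data)
keep = np.abs(data - center) <= threshold * np.std(data)

cleaned = data[keep]                  # outliers dropped, series gets shorter
dropped_idx = np.flatnonzero(~keep)   # positions of the removed points

print(dropped_idx, len(cleaned))
```

Keep in mind that dropping points changes the length of the series, so any x values or timestamps have to be filtered with the same mask.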

The code and example above can be found in this IPython Notebook.