Say you have some data that looks like this -
It is hard to see any trends in the data because of its high variance. A common technique for removing this noise so that patterns become more apparent is referred to as 'curve smoothing'. Various algorithms are available for this, and this post will talk about three of them: rolling means, local regressions and smoothing splines.
Rolling Means
This is the simplest of the smoothing algorithms. The basic premise is that taking averages tends to reduce the variance in a data set and thus dampens extreme values. A rolling mean calculates each value by taking the average of the last n values. So n = 10 would calculate the value by taking the average of the current value and the previous 9 values. Here's how the original curve looks when the rolling mean algorithm is applied to it.
- n = 10
- n = 20
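In R, the zoo package provides a convenient rollmean function that takes the size of the rolling window as a parameter. Note that rollmean centers the window on each point by default; pass align = "right" to average the current value with the previous n - 1 values.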
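
As a minimal sketch of a trailing rolling mean, the snippet below uses a synthetic noisy sine wave, which is an assumption standing in for the data in this post. It computes the mean with base R's filter function so that it runs without extra packages, and makes the equivalent zoo call only if zoo happens to be installed.

```r
# Synthetic noisy series (an assumption; the post's own data is not shown).
set.seed(42)
x <- 1:200
y <- sin(x / 20) + rnorm(length(x), sd = 0.5)

n <- 10  # size of the rolling window

# Trailing rolling mean in base R: sides = 1 averages the current value
# and the previous n - 1 values. The first n - 1 results are NA.
rolled <- as.numeric(stats::filter(y, rep(1 / n, n), sides = 1))

# The equivalent zoo call, run only if the package is available.
# align = "right" makes the window trail the current point.
if (requireNamespace("zoo", quietly = TRUE)) {
  rolled_zoo <- zoo::rollmean(y, k = n, fill = NA, align = "right")
}

# Raw points in grey, smoothed curve in red.
plot(x, y, col = "grey", pch = 16)
lines(x, rolled, col = "red", lwd = 2)
```

Increasing n widens the window, which gives a smoother curve at the cost of more leading NA values and a larger lag behind the raw data.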
Local Regressions
In simple terms, this algorithm calculates a least squares fit around each point, using a set of nearby points chosen with the nearest neighbors algorithm. The fraction of the data points used in each fit is controlled by the α parameter. The algorithm is also referred to as loess for brevity. Here's how the original curve looks when the local regression algorithm is applied to it.
- α = 0.1
- α = 0.6
In R, the loess function in the stats package (which ships with R) provides a good implementation. The span parameter controls the α value. Note that you need to feed the model generated by the loess function to predict to get the resulting y values.
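
A minimal sketch of the loess and predict workflow follows, again assuming a synthetic noisy sine wave in place of the data in this post. The variable names are illustrative only.

```r
# Synthetic noisy series (an assumption; the post's own data is not shown).
set.seed(42)
x <- 1:200
y <- sin(x / 20) + rnorm(length(x), sd = 0.5)

# span is the fraction of the data used in each local fit.
fit_wiggly <- loess(y ~ x, span = 0.1)
fit_smooth <- loess(y ~ x, span = 0.6)

# Feed each model to predict() to get the smoothed y values.
y_wiggly <- predict(fit_wiggly, newdata = data.frame(x = x))
y_smooth <- predict(fit_smooth, newdata = data.frame(x = x))

# Raw points in grey, the two smoothed curves on top.
plot(x, y, col = "grey", pch = 16)
lines(x, y_wiggly, col = "blue", lwd = 2)
lines(x, y_smooth, col = "red", lwd = 2)
```

A small span follows the local wiggles of the data closely, while a larger span averages over more neighbors and gives a flatter curve.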
Smoothing Splines
This algorithm fits a spline function to the data, trading off closeness to the points against the roughness of the resulting curve. The trade-off is controlled by the λ parameter, with larger values giving a smoother curve. Here's how the original curve looks when the smoothing spline algorithm is applied to it.
- λ = 0.1
- λ = 0.6
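In R, the smooth.spline function in the stats package (which ships with R) provides a good implementation. The spar parameter controls the λ value: λ is computed as an increasing function of spar rather than being set to it directly.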
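
A minimal sketch of smooth.spline with two spar values follows, assuming the same kind of synthetic noisy sine wave in place of the data in this post. The variable names are illustrative only.

```r
# Synthetic noisy series (an assumption; the post's own data is not shown).
set.seed(42)
x <- 1:200
y <- sin(x / 20) + rnorm(length(x), sd = 0.5)

# Larger spar means a larger lambda and hence a smoother curve.
fit_wiggly <- smooth.spline(x, y, spar = 0.1)
fit_smooth <- smooth.spline(x, y, spar = 0.6)

# The lambda derived from spar is stored on the fitted object.
lambda_wiggly <- fit_wiggly$lambda
lambda_smooth <- fit_smooth$lambda

# predict() returns a list; the smoothed values are in its y component.
y_wiggly <- predict(fit_wiggly, x)$y
y_smooth <- predict(fit_smooth, x)$y

# Raw points in grey, the two smoothed curves on top.
plot(x, y, col = "grey", pch = 16)
lines(x, y_wiggly, col = "blue", lwd = 2)
lines(x, y_smooth, col = "red", lwd = 2)
```

If spar is left out, smooth.spline picks the amount of smoothing itself by cross-validation, which can be a reasonable starting point before tuning by eye.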
Of course, there are many more ways you could smooth the data and I would encourage you to find the one that makes the most sense for the problem domain you are working in.