
There is a fairly standard technique for removing outliers from a sample using the standard deviation.

Specifically, the technique is: remove from the sample any points that lie more than 1 (or 2, or 3) standard deviations (the usual unbiased sample standard deviation) away from the sample mean. Is it possible with this technique that one ends up removing all points from the dataset? Or is there a property of the sample standard deviation that prevents this from happening?
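For concreteness, here is a minimal sketch of the procedure in NumPy (the function name and the data are illustrative, not from any particular library; `ddof=1` selects the usual unbiased estimator):

```python
import numpy as np

def remove_outliers(x, k=1):
    """Drop points lying more than k sample standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    s = x.std(ddof=1)  # unbiased (Bessel-corrected) sample standard deviation
    return x[np.abs(x - mu) <= k * s]

data = np.array([1.0, 2.0, 2.5, 3.0, 50.0])  # illustrative data
print(remove_outliers(data, k=1))  # 50.0 is more than one s from the mean and is dropped
```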

  • Since the standard deviation is in a sense a measure of how far from the mean the typical observation is (in a root mean square sense), throwing out observations which are one standard deviation away would be throwing out typical observations rather than outliers. – Henry Feb 28 '22 at 09:34

2 Answers


It's quite simple to show that there is at least one data point in any sample lying within one standard deviation of the mean.

Proof:

Assume every data point is more than one standard deviation away from the mean. That is,

$|x_i-\mu| > \sigma$, for all $1 \le i \le n$

Then we have,

$\sum\limits_{i=1}^n(x_i-\mu)^2 > n\sigma^2$

which contradicts the definition of the (unbiased) sample standard deviation $\sigma$,

$(n-1)\sigma^2 = \sum\limits_{i=1}^n(x_i-\mu)^2,$

since together they would force $(n-1)\sigma^2 > n\sigma^2$, i.e. $\sigma^2 < 0$.
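As a quick empirical sanity check of the claim (a sketch, not part of the proof), one can draw many random samples and verify that the smallest deviation never exceeds the sample standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(10_000):
    x = rng.normal(size=rng.integers(2, 50))
    mu, s = x.mean(), x.std(ddof=1)  # unbiased sample standard deviation
    # At least one observation lies within one sample stdev of the mean.
    assert np.abs(x - mu).min() <= s
print("no counterexample found")
```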


Interesting question.

For $k>1$, Chebyshev's inequality (which also holds for the empirical distribution of a sample) guarantees that a non-trivial proportion of the data always remains: at most a fraction $1/k^2$ of the points can lie $k$ or more standard deviations from the mean. So if we apply this procedure a finite number of times, some data points will always be left.

We could also ask about iterating the procedure indefinitely. Once fewer than $N=k^2$ points remain, Chebyshev bounds the number of removable points by $N/k^2 < 1$, i.e. zero; equivalently, $\frac{N-1}{N} < 1 - \frac{1}{k^2}$, so removing even a single point would violate the bound. Hence the procedure reaches a fixed point and cannot empty the dataset.
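To illustrate (a sketch, assuming a heavy-tailed sample; the divide-by-$n$ SD is used here to match the sample version of Chebyshev's inequality), iterating the trim always reaches a fixed point with data left over when $k>1$:

```python
import numpy as np

def trim_once(x, k):
    mu, s = x.mean(), x.std(ddof=0)    # divide-by-n SD, matching the Chebyshev bound
    return x[np.abs(x - mu) <= k * s]  # keep points within k SDs of the mean

rng = np.random.default_rng(1)
x = rng.standard_cauchy(1000)  # heavy tails, so trimming removes points at first
k = 2.0
while True:
    trimmed = trim_once(x, k)
    if trimmed.size == x.size:  # fixed point: nothing more to remove
        break
    x = trimmed
print(x.size, "points remain")  # strictly positive for k > 1
```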

For $k=1$, note that the uniform distribution on 2 points results in both points lying exactly 1 SD from the mean (using the divide-by-$n$ SD; see the comments below). Arguably, you'd want to throw these out, and hence we are left with 0 points.

For $k<1$, the uniform distribution on 2 points results in both points being thrown out. Of course, several other distributions work too, especially those that are very heavy in the tails. Note that Chebyshev's inequality becomes trivial for $k \le 1$, and hence gives no protection in this case.
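To make the two-point example explicit (a worked check of the boundary case): take the sample $\{0, 1\}$, so $\bar x = \tfrac12$ and each point deviates by $\tfrac12$. The divide-by-$n$ SD is $\sqrt{\tfrac12\left(\tfrac14+\tfrac14\right)} = \tfrac12$, so both points sit exactly one SD from the mean and are removed by a rule that discards points at least $k=1$ SDs away. With Bessel's correction, $s = \sqrt{\tfrac14+\tfrac14} = \tfrac{1}{\sqrt{2}} \approx 0.707 > \tfrac12$, so both points lie strictly inside one $s$, consistent with the proof in the other answer.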

Calvin Lin
  • Doesn't Chebyshev's inequality deal with populations? My question is about samples. – user869081 Jun 26 '13 at 08:32
  • The main difference would be that your unbiased estimator of population sd is slightly larger than the actual sample sd. The above applies to the actual sample sd, and you can do the same with it, accounting for the small change. So yes, it's conceivable that if your sample data is very heavy in the tails (e.g. uniform distribution on 2 points), then due to the larger sd used, you start to eliminate all the data. – Calvin Lin Jun 26 '13 at 14:41
  • There will always be at least one observation no more than one sample standard deviation away from the sample average, even without using Bessel's correction. If you do use Bessel's correction, the sample standard deviation will become larger so there will always be at least one observation strictly less than one standard deviation away from the sample average (noting that the sample average and standard deviation of the remaining data will change as you remove "outliers") – Henry Feb 28 '22 at 09:40