Magdalena Konkiewicz

Why does computing standard deviation in pandas and NumPy yield different results?


Image by Gerd Altmann from Pixabay

How many of you have noticed that when you compute a standard deviation using pandas and compare it to the result of the NumPy function, you get different numbers?

I bet some of you did not realize this. And even if you did, you may be asking: why?

In this short article, we will:

demonstrate that the standard deviation results from the two libraries are indeed different (at least at first glance),

discuss why that is so (focusing on populations, samples, and how they affect the calculation of standard deviation in each library),

and finally, show you how to obtain the same results using pandas and NumPy (in the end, they should agree on a computation as simple as standard deviation).

Let’s get started.



Standard deviation in NumPy and pandas

Let’s start by creating a simple data frame with weights and heights that we can use for standard deviation calculations later on.



import pandas as pd
df = pd.DataFrame({'height' : [161, 156, 172], 
                   'weight': [67, 65, 89]})
df.head()
                   

This is a data frame with just two columns and three rows. We will focus on one column, weight, and compare the standard deviation results from pandas and NumPy for this particular column.

Let’s start with pandas first:



df.weight.std()
13.316656236958787

And now let us do the same using NumPy:



import numpy as np
np.std(df.weight)
10.873004286866728

We get about 13.32 and 10.87. These are quite different numbers, so why is that?


Population standard deviation

The reason for the difference in the numbers above is that the two packages use different equations to compute the standard deviation by default. The most commonly known equation for standard deviation is:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

Where:

σ = population standard deviation

N = size of the population

xi = each value from the population

µ = population mean

This equation refers to the population standard deviation and this is the one that NumPy uses by default.
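
As a quick check, the population formula can be written out by hand and compared with NumPy's default. This is a minimal sketch that reuses the three weight values from the example above; the variable names are only illustrative.

```python
import numpy as np

# The weight column from the example data frame
weights = np.array([67, 65, 89])

n = weights.size                       # N, size of the population
mu = weights.sum() / n                 # µ, the mean
squared_devs = (weights - mu) ** 2     # (xi - µ)^2 for each value
population_std = np.sqrt(squared_devs.sum() / n)  # divide by N

print(population_std)    # about 10.873
print(np.std(weights))   # NumPy's default gives the same value
```

Both lines print the same number, which matches the NumPy result shown earlier.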

When we collect data, it is actually quite rare that we work with whole populations. It is more likely that we will be working with samples drawn from a population.



Sample standard deviation

When we are working with samples rather than populations, the calculation changes a bit. The formula for the standard deviation becomes:

$$\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu)^2}$$

Where:

σ = sample standard deviation

N = size of the sample

xi = each value from the sample

µ = sample mean

This equation refers to the sample standard deviation and this is the one that pandas uses by default.
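
The sample formula can be checked the same way. The sketch below computes it step by step for the three weight values used earlier and compares the result with the pandas default; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# The weight column from the example data frame
weights = pd.Series([67, 65, 89])

n = len(weights)                            # N, size of the sample
mean = weights.sum() / n                    # sample mean
squared_devs = (weights - mean) ** 2        # squared deviation of each value
sample_std = np.sqrt(squared_devs.sum() / (n - 1))  # divide by N - 1

print(sample_std)      # about 13.317
print(weights.std())   # the pandas default gives the same value
```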



Difference between population and a sample

As you may have noticed, the difference is in the denominator of the equation. When we compute the sample standard deviation, we divide by N-1 instead of N, which is what we use for the population standard deviation.
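
Because the two formulas differ only in the denominator, the two results always differ by a factor of sqrt(N / (N-1)). The short sketch below checks this on the example weights and shows how the factor shrinks as N grows; the sizes in the loop are arbitrary values chosen for illustration.

```python
import numpy as np

weights = np.array([67, 65, 89])
n = weights.size

pop_std = np.std(weights, ddof=0)      # divides by N
sample_std = np.std(weights, ddof=1)   # divides by N - 1

ratio = sample_std / pop_std
expected_ratio = np.sqrt(n / (n - 1))
print(ratio, expected_ratio)           # both about 1.2247 for N = 3

# The factor approaches 1 as the number of observations grows
for size in (3, 30, 3000):
    print(size, np.sqrt(size / (size - 1)))
```

With only three rows the gap is large, which is why the example makes the difference so visible. On large data sets the two results are nearly identical.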

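The reason is statistical: when we estimate the spread of a population from a sample, dividing by N tends to underestimate it, because the deviations are measured from the sample mean rather than the true population mean. Dividing by N-1 instead (known as Bessel’s correction) makes the sample variance an unbiased estimator of the population variance. Its square root is the usual estimate of the population standard deviation, although strictly speaking it is still slightly biased. Subtracting 1 reflects the one degree of freedom used up by estimating the mean from the same data.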
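
The effect of dividing by N-1 is easiest to see on the variance (the square of the standard deviation). The simulation below is a rough sketch under assumed settings: a normal population with variance 4, samples of size 5, and a fixed random seed. None of these values come from the example data; they are chosen only to make the bias visible.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

true_variance = 4.0              # assumed population variance (std = 2)
sample_size = 5                  # small samples make the bias easy to see
n_samples = 100_000              # number of independent samples to draw

# Each row is one sample drawn from the population
samples = rng.normal(loc=0.0, scale=np.sqrt(true_variance),
                     size=(n_samples, sample_size))

# Average the per-sample variance estimates for both denominators
mean_var_n = samples.var(axis=1, ddof=0).mean()          # divide by N
mean_var_n_minus_1 = samples.var(axis=1, ddof=1).mean()  # divide by N - 1

print(mean_var_n)           # noticeably below the true variance of 4
print(mean_var_n_minus_1)   # close to the true variance of 4
```

On average, the N version lands near 4 × (5-1)/5 = 3.2, while the N-1 version lands near 4.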

I will not discuss the details of why we subtract one degree of freedom, as it is quite a complicated concept. If you want, you can watch this video to get a better understanding.



So pandas standard deviation is the correct one?

So I have told you that you should divide by N-1 in order to get the unbiased estimator. This is usually the right choice, as most of the time you will be dealing with samples, not entire populations. This is why pandas computes standard deviation with one degree of freedom subtracted (ddof=1) by default.

This may not always be the case, however, so be sure what your data represents before you use one or the other. Also, if you want to use a specific library to compute either version, you can use the ddof parameter (delta degrees of freedom, the value subtracted from N in the denominator) to control this in both packages.

Let’s have a look at the earlier example, where we were getting σ = 13.32 using pandas and σ = 10.87 using NumPy.



df.weight.std()
13.316656236958787
import numpy as np

np.std(df.weight)
10.873004286866728

You can make NumPy compute the sample standard deviation instead by setting the ddof parameter to 1:



import numpy as np
np.std(df.weight, ddof=1)
13.316656236958787

You can see that now the result is the same as the default standard deviation given by pandas calculation.

Similarly, you can make pandas compute the population standard deviation by setting ddof to 0:



df.weight.std(ddof=0)
10.873004286866728
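
The same parameter works when you compute the standard deviation of every column at once. The sketch below rebuilds the example data frame and checks that pandas and NumPy agree for both ddof values. Converting with to_numpy() and using axis=0 is just one way to do the NumPy side.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'height': [161, 156, 172],
                   'weight': [67, 65, 89]})

results = {}
for ddof in (0, 1):
    pandas_std = df.std(ddof=ddof)                     # Series, one value per column
    numpy_std = df.to_numpy().std(axis=0, ddof=ddof)   # array, one value per column
    # Record whether the two libraries agree for this ddof
    results[ddof] = np.allclose(pandas_std.to_numpy(), numpy_std)

print(results)   # True for both ddof values
```

Passing ddof explicitly in both libraries avoids relying on their different defaults.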



Summary

In this article, we have discussed calculating the standard deviation for samples and populations and touched on the idea of degrees of freedom in statistics.

We have demonstrated how to calculate standard deviation in pandas and NumPy and how to control degrees of freedom in both packages.

I hope this satisfies the initial curiosity and explains why the standard deviation results seem to differ at first when using one library or the other.
