How to use groupby() and aggregate functions in pandas for quick data analysis
One of the first things you should learn when you start doing data analysis in pandas is how to use the groupby() function and how to combine its result with aggregate functions. This is relatively simple and lets you do powerful analysis quickly.
In this article, we will explain:
What is the groupby() function and how does it work?
What are aggregate functions and how do they work?
How to use groupby and aggregate functions together
At the end of this article, you should be able to apply this knowledge to analyze a data set of your choice.
Let’s get started. We will use the iris data set here, so let’s start by loading it into pandas.
Load iris data set
You can use the code below to load the iris data set and inspect its first few rows:
from sklearn import datasets
import pandas as pd

data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)
df.head()
Just by eyeballing the data set, you can see that the only categorical variable we have is the target. You can use the unique() function to check what values it takes:
df.target.unique()
array([0, 1, 2])
We can see that it takes three values: 0, 1, and 2.
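Besides listing the distinct values, it can help to know how many rows fall under each one. The sketch below is illustrative only: the small target Series is a hypothetical stand-in for the real column, and nunique() and value_counts() are standard pandas Series methods.

```python
import pandas as pd

# Hypothetical stand-in for the target column (not the real iris labels).
target = pd.Series([0, 0, 1, 2, 2, 2], name="target")

# Distinct values, in order of first appearance.
distinct_values = target.unique()

# How many distinct values there are.
n_distinct = target.nunique()

# Number of rows per value, sorted by the value itself.
counts = target.value_counts().sort_index()

print(distinct_values)
print(n_distinct)
print(counts)
```

On the real data set, the same calls on df.target would show whether the three groups are of similar size.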
Now that you have checked that the target column is categorical and seen what values it takes, you can try the groupby() function. As the name suggests, it splits your data into groups. In this case, it will create three groups representing the different flower species (our target values).
df.groupby(df.target)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1150a5150>
As you can see, the groupby() function returns a DataFrameGroupBy object. That is not very useful at first glance, which is why you will need aggregate functions.
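Although the groupby object does not print anything informative, it can still be inspected. The following is a minimal sketch, assuming a tiny hypothetical data frame in place of iris. It groups by the column name "target", which should behave the same as passing df.target as in the article.

```python
import pandas as pd

# Hypothetical miniature data set: one measurement plus a group label.
df = pd.DataFrame({
    "petal length (cm)": [1.4, 1.5, 4.7, 4.5, 6.0, 5.1],
    "target": [0, 0, 1, 1, 2, 2],
})

grouped = df.groupby("target")

# Number of distinct groups found in the target column.
n_groups = grouped.ngroups

# Row count per group, returned as a Series indexed by group label.
group_sizes = grouped.size()

# Iterating over the object yields (label, sub-DataFrame) pairs.
group_labels = [label for label, _ in grouped]

print(n_groups)
print(group_sizes)
print(group_labels)
```

Seen this way, the object is a lazy description of how the rows are split. Nothing is computed per group until you ask for it.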
What are the aggregate functions?
Aggregate functions are functions that take a series of entries and return one value that summarizes them in some way.
Good examples are mean(), sum(), count(), min(), max(), and std().
As mentioned, you can use them on a series of entries. Let’s apply mean() to one of the columns of our data set:
df['sepal length (cm)'].mean()
5.843333333333334
We can see that the mean of the sepal length column is 5.84.
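Other aggregate functions are called in the same way. The sketch below uses a short hypothetical Series rather than the real column, so the numbers are illustrative only. Each call reduces the whole Series to a single value.

```python
import pandas as pd

# Hypothetical values standing in for one numeric column.
lengths = pd.Series([5.0, 6.0, 7.0, 6.0], name="sepal length (cm)")

# Each aggregate function collapses the Series into one number.
summary = {
    "count": lengths.count(),
    "sum": lengths.sum(),
    "mean": lengths.mean(),
    "min": lengths.min(),
    "max": lengths.max(),
}

print(summary)
```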
Let’s now use the describe() function, which gives us more summary statistics.
df['sepal length (cm)'].describe()
count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: sepal length (cm), dtype: float64
You can see that it gives the same mean as the previous function plus some additional information: count, standard deviation, min, max, and the quartiles (25%, 50%, and 75%).
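To make the percentage rows concrete, the sketch below checks them against quantile() and median() on a hypothetical five-value Series. It also derives the interquartile range, which describe() does not report directly but which follows from the 25% and 75% rows.

```python
import pandas as pd

# Hypothetical, evenly spaced values so the quartiles are easy to verify by hand.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

stats = values.describe()

# The 25%, 50%, and 75% rows of describe() are quantiles of the data.
q1 = values.quantile(0.25)
median = values.median()
q3 = values.quantile(0.75)

# The interquartile range is the distance between the outer quartiles.
iqr = q3 - q1

print(stats)
print(q1, median, q3, iqr)
```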
Using groupby() and aggregate functions together
Now it is time to combine what you have learned. The good news is that you can call aggregate functions on a groupby object, and that way you obtain the results for each group.
Let’s demonstrate this with the iris data set again by calling mean() on the groupby object, that is, df.groupby(df.target).mean(). In the result, the index column now gives the group name (0, 1, and 2 in our case), and each row holds the mean value of every column for that group.
You can see that the average petal length for group 0 (1.46 cm) is much smaller than the average petal length of the two other groups: 1 (4.26 cm) and 2 (5.55 cm). It looks like this could be an important difference between the flower species being analyzed.
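The shape of such a per-group table can be seen in a small self-contained sketch. The data frame below is hypothetical (six made-up rows, not the iris measurements), so only the structure of the result carries over, not the values.

```python
import pandas as pd

# Hypothetical data: two numeric columns and a group label.
df = pd.DataFrame({
    "petal length (cm)": [1.0, 2.0, 4.0, 5.0, 5.0, 7.0],
    "petal width (cm)": [0.25, 0.75, 1.0, 1.5, 2.0, 2.5],
    "target": [0, 0, 1, 1, 2, 2],
})

# One row per group, one column per numeric column, each cell a group mean.
group_means = df.groupby("target").mean()

print(group_means)
```

The grouping column becomes the index of the result, and every remaining column is averaged within each group.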
You can also call describe() on the groupby object, df.groupby(df.target).describe(), to get all the descriptive statistics for our groups. The resulting table gives you descriptive statistics for all groups and all columns. If it is hard to read, you can also focus on one column at a time:
df.groupby(df.target)['sepal length (cm)'].describe()
As you can see in these examples, it is easy and straightforward to use groupby and aggregate functions together.
The rule is to use the groupby function to create a groupby object first and then call an aggregate function to compute information for each group.
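The two-step pattern can be sketched as follows, again on a hypothetical six-row frame. The agg() call in the last step is not used elsewhere in this article. It is a standard pandas method that applies several aggregate functions in one go, and it is shown here only as one possible extension.

```python
import pandas as pd

# Hypothetical data: one numeric column and a group label.
df = pd.DataFrame({
    "sepal length (cm)": [5.0, 5.5, 6.0, 6.5, 7.0, 8.0],
    "target": [0, 0, 1, 1, 2, 2],
})

# Step 1: create the groupby object.
grouped = df.groupby("target")

# Step 2: call an aggregate function to get one value per group.
max_per_group = grouped["sepal length (cm)"].max()

# Optional: several aggregate functions at once, one result column each.
summary = grouped["sepal length (cm)"].agg(["min", "mean", "max"])

print(max_per_group)
print(summary)
```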
In this article, you have learned about the groupby function and how to use it effectively in pandas in combination with aggregate functions.
I hope that you will be able to apply what you have learned to do some quick analysis of the data set of your choice.