Data visualization with the Seaborn library
Introduction
If you have not used Seaborn for data exploration yet, this is a perfect time to learn a few basic plots. In this article, we will go through a few different types of graphs that you can use in Seaborn: countplot, barplot, histogram, pairplot, jointplot, boxplot, and violin plot.
We will illustrate how to use them on the famous Iris data set.
Why should you use Seaborn?
The reason to use Seaborn is well described by this sentence from the library's site:
“Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.”
I would like to emphasize the “high-level interface” and the adjective “attractive”. The combination of the two is what makes Seaborn so appealing: it lets you create beautiful graphs with minimal effort.
Installing Seaborn is very simple. Just run this command from the command line and you should be ready to go.
pip install seaborn
Load data set
As we will be showing how to use the graphs with a real data set, we should load it first. The code below loads the famous Iris data set into a data frame using Seaborn's load_dataset function.
import seaborn as sns
df = sns.load_dataset("iris")
df.head()
Let’s call the info() function to get some additional information.
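The call itself is not shown here; presumably it is simply the data frame's info() method. A minimal sketch, assuming the same df as loaded in the previous step:

```python
import seaborn as sns

# Load the Iris data set exactly as in the previous step
df = sns.load_dataset("iris")

# Print column names, non-null counts, and the dtype of each column
df.info()
```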
This data set has five columns and 150 entries. Each entry is a flower. The first four columns are floats describing the flower's dimensions, and the last column is a string that indicates the flower's classification. We should have three different categories of Iris species.
We can use countplot to visualize species distribution across this data set.
Countplot is a perfect selection for visualizing value counts for categorical variables.
sns.countplot(x='species', data=df)
As we have mentioned, there are three categories: setosa, versicolor, and virginica. We can also see that this is a balanced data set: each flower category has 50 examples.
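As a quick numeric cross-check of the plot, the same counts can be read directly from the data frame. This is only a sketch using pandas' value_counts, which is not part of the original walkthrough:

```python
import seaborn as sns

df = sns.load_dataset("iris")

# Count how many rows belong to each species
counts = df["species"].value_counts()
print(counts)
```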
Barplot is a graph that is widely used to compare means for continuous variables. Let’s have a look at how we can apply it to the Iris dataset.
import matplotlib.pyplot as plt
sns.barplot(data=df)
plt.xticks(rotation=45)
Note that the graph automatically excluded the categorical variable. For the numeric columns, it is now easy to read the mean value of each measurement. Additionally, each bar has a black error bar (the black line on top), which by default shows a 95% confidence interval for the mean. This type of graph can be customized by changing the x, y, and hue parameters. So play with it!
I also had to add a matplotlib function at the end of the code, as otherwise the labels on the x-axis were overlapping.
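To illustrate the x and y parameters mentioned above, here is one possible variation. The choice of columns is only an example: it compares the mean sepal length across the three species.

```python
import seaborn as sns

df = sns.load_dataset("iris")

# One bar per species; bar height is the mean sepal length for that species
ax = sns.barplot(data=df, x="species", y="sepal_length")
```

Because there are only three short labels on the x-axis here, no tick rotation is needed.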
If you would like to plot a histogram in Seaborn, note that the call looks a little different from the previous graphs in this article. Instead of the whole data frame, you pass it just one data series, e.g. one column from a data frame.
Let’s try it on the sepal length column.
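The original call is not shown here. A minimal sketch, assuming a recent Seaborn release (0.11 or later) where histplot with kde=True draws a histogram together with a density curve; older releases offered distplot for this, which is now deprecated:

```python
import seaborn as sns

df = sns.load_dataset("iris")

# Pass a single column (a pandas Series); kde=True overlays the density curve
ax = sns.histplot(df["sepal_length"], kde=True)
```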
As you can see, the result is a histogram. The blue line that accompanies it is a kernel density estimate (kde), which gives us a bit more information about the distribution than the histogram itself.
But how do I plot histograms for the whole data frame? Do I need a for loop? You can achieve this with a for loop, or even better, you can use a pairplot.
Pairplot will create histograms for all continuous variables and visualize correlations between all the pairs. Let’s have a look at how to use it.
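The call is not shown here; based on the description that follows, it was probably close to this sketch, which passes the whole data frame and asks for regression fits:

```python
import seaborn as sns

df = sns.load_dataset("iris")

# Histograms on the diagonal, scatter plots with linear fits elsewhere;
# the non-numeric species column is left out of the grid
grid = sns.pairplot(df, kind="reg")
```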
That looks very useful. Note that the pairplot takes the data frame as its input.
I have also added the kind parameter and set it to 'reg' (kind='reg'). This adds a linear regression fit to the correlation plots, which I think helps with the visualization.
Pairplot also has a hue parameter that is very useful for data exploration. Hue should be a categorical variable that allows you to partition a data set into smaller groups. In our case, we have only one categorical variable, which is 'species'. Let’s see how we can use it with a pairplot.
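A minimal sketch of such a call, assuming the same regression-style pairplot as before with hue added (the exact original call is not shown):

```python
import seaborn as sns

df = sns.load_dataset("iris")

# Colour every panel by species so each group gets its own points and fit
grid = sns.pairplot(df, kind="reg", hue="species")
```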
Great! Now you can see the correlations between variables broken down by species.
Jointplot
If you work with a data set that has lots of variables, or just want to explore one pair of variables in more depth, you can use a jointplot. Let’s see the correlation between sepal length and sepal width.
sns.jointplot(data=df, x='sepal_length', y='sepal_width', kind='reg')
As you can see, this is very informative again. We have histograms and kernel density estimates on the sides of the graph for each variable, and the main section shows us the individual points with a linear regression line.
Another useful visualization that allows us to inspect the distribution of continuous variables is a boxplot. It gives us interquartile range information and allows us to see the outliers. Let’s see how we can use it to visualize our continuous variables.
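A minimal sketch of the call, assuming that, as with the barplot earlier, the whole data frame is passed and only the numeric columns are drawn:

```python
import seaborn as sns

df = sns.load_dataset("iris")

# One box per numeric column; the species column is skipped
ax = sns.boxplot(data=df)
```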
As you can see, this gives a quick overview of the distribution of our continuous variables. Only sepal width has outliers, and they lie beyond both whiskers (upper and lower).
You can use the x, y, and hue parameters to further customize boxplots. For example, you can look at the sepal length distribution across different flower species by adding the x and y parameters to the previous piece of code.
sns.boxplot(data=df, x='species', y='sepal_length')
Violin plots are similar to boxplots, but they also show the shape of each distribution by drawing its kernel density estimate.
Let’s have a look at violin plots for our data set.
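A minimal sketch of the call, assuming the whole data frame is passed in the same way as for the boxplot:

```python
import seaborn as sns

df = sns.load_dataset("iris")

# One violin per numeric column, each showing that column's density estimate
ax = sns.violinplot(data=df)
```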
We can see that the kde for petal_width has two peaks. My suspicion here is that different flower species contribute to different peaks. You can further customize the violin plot to check that.
sns.violinplot(data=df, x='petal_width', y='species')
It looks like my suspicion was correct. Each flower species has a different distribution of the petal width variable.
As you can see, it is very simple to use Seaborn to make effective and beautiful visualizations. Most of the code that we have used here was no more than one line! The library is very intuitive and user friendly, even for people who have not fully mastered Python yet.
I suggest that you now choose a data set and try some of the plots yourself!