Magdalena Konkiewicz

Pandas profiling and exploratory data analysis with one line of code!


Image by Colin Behrens from Pixabay

Introduction

If you are already familiar with the pandas profiling package, you will not learn anything new from this article, so you can skip it.

However, if you have never heard of it, this may be one of the best productivity tips regarding data analysis you have been given so far, so hang on.



Pandas profiling

Pandas profiling is a package that lets you create an exploratory data analysis report with minimal effort: a single line of code.

Therefore, if you are a Data Scientist or Analyst who has been doing exploratory data analysis manually, using pandas profiling will save you a lot of time, effort, and typing. Do you remember all the repetitive code you use when doing exploratory data analysis, such as:

info(),

describe(),

isnull(),

corr(),

etc.

You will not have to do it anymore. The pandas profiling package will do it for you and will create a full summary report of your data.
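For comparison, the manual route looks roughly like the sketch below. The small data frame and its column names (age, chol, sex) are made up for illustration and are not the heart disease data used later in this article.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, None],
    "chol": [233, 250, 204, 236, 354],
    "sex": ["M", "M", "F", "M", "F"],
})

# Column types and non-null counts (printed to the console).
df.info()

# Count, mean, standard deviation, and quartiles for the numeric columns.
summary = df.describe()

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Pairwise Pearson correlation between the numeric columns only.
correlations = df.select_dtypes(include="number").corr()
```

Each call answers one question, and you have to repeat and combine them for every new data set. That is the typing the report saves you.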

So let’s get started!



How to install pandas profiling package

The installation of pandas profiling is very easy. You can use the standard pip command.



pip install pandas-profiling

It should take a minute or two to install the package, after which you are ready to use pandas profiling in Python.



How to create a profiling report

To create a report, load a data set with the standard read_csv() function, which stores the data in a pandas data frame.

Then call the ProfileReport initializer and pass it the data frame you have just created as a parameter.

You can use the to_file() function to export the report so you can inspect it.



import pandas as pd
from pandas_profiling import ProfileReport

# data_file_name is a placeholder: set it to the path of your CSV file
df = pd.read_csv(data_file_name)
report = ProfileReport(df)
report.to_file(output_file='output.html')


Let’s see a real example.

We are going to create a report for a real data set. We have chosen a data set on heart disease that can be downloaded from here. This data set is 33 KB in size, has 14 columns, and 303 observations.

Let’s create a report for this data set using pandas profiling.



import pandas as pd
from pandas_profiling import ProfileReport

# heart.csv is the heart disease data set, saved in the working directory
df = pd.read_csv('heart.csv')
report = ProfileReport(df)
report.to_file(output_file='output.html')

Once you run this code, you should see progress bars for the report generation, and within a few seconds you should be able to view the full report by opening the output.html file in your browser.

Yes, that was that easy! The report is ready and you can view it!

Note: output.html is a relative path, so the report is saved in the current working directory. That is the same folder as the data file if you ran the code from there.



Report structure

Let’s see what is contained in the pandas profiling report.

  • Overview

In the overview section, we should see three tabs: Overview, Reproduction, and Warnings.

The Overview tab gives basic information about data such as the number of columns and rows, data size, percentage of missing values, data types, etc.

The Reproduction tab contains information about report creation.

And the Warnings tab includes warnings that have been triggered while producing the report.


  • Variables

This section focuses on a detailed analysis of each variable.

If the variable is continuous, the report displays a histogram, and if it is categorical, it shows a bar chart of the value distribution.

You can see the percentage of missing values for each variable as well.
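The per-variable summaries can be approximated by hand, which helps when reading the report. The sketch below uses a made-up data frame (the columns age and sex only mimic the heart disease data, and the bin edges are arbitrary) and plain pandas calls, so it shows the idea rather than how pandas profiling computes things internally.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    "age": [29, 35, 41, 47, 52, 58, 63, None],
    "sex": ["F", "M", "M", "F", "M", "M", None, "M"],
})

# Continuous variable: count the values per bin, which is what a histogram draws.
age_bins = pd.cut(df["age"], bins=[20, 40, 60, 80])
age_histogram = age_bins.value_counts().sort_index()

# Categorical variable: count the values per category, which is what a bar chart draws.
sex_counts = df["sex"].value_counts()

# Percentage of missing values for each variable.
missing_percent = df.isnull().mean() * 100
```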

The picture below shows the analysis for the age and sex variables from the heart disease data set.


  • Interactions

The Interactions section focuses on bivariate relationships between numerical variables. You can use the tabs to choose the pair of variables you want to examine. The picture below shows the relationship between age and cholesterol.


  • Correlations

This section shows different types of correlations. You can see Pearson’s, Spearman’s, Kendall’s, and Phik correlations for numerical variables and Cramér’s V correlation for categorical variables.
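To get a feel for why the report offers more than one coefficient, here is a minimal sketch with plain pandas. The data is made up: y always grows with x, but not in a straight line. Only the Pearson and Spearman methods of DataFrame.corr are shown; Phik and Cramér's V are not among the methods built into DataFrame.corr, so they are left out here.

```python
import pandas as pd

# Hypothetical toy data: y grows with x, but not linearly.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

# Pearson measures how close the relationship is to a straight line.
pearson = df.corr(method="pearson")

# Spearman works on ranks, so it measures any monotonic relationship.
spearman = df.corr(method="spearman")

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```

Here the Spearman coefficient is 1 because the relationship is perfectly monotonic, while the Pearson coefficient is lower because the relationship is not linear.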



  • Missing values

This section shows the missing values in the data set, broken down by column. We can see that our data set has no missing values in any of the columns.


  • Sample

This section replaces the head() and tail() functions from manual data analysis. You can see the first and last 10 rows of the data set.


  • Duplicate rows

This section shows you if there are duplicate rows in the data set. There is actually one duplicate entry in the heart disease data set and its details are shown in the screenshot below.
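If you want to check duplicates yourself, pandas offers a few calls for it. The sketch below uses a made-up data frame in which the last row repeats the second one; it is not the actual duplicate from the heart disease data set.

```python
import pandas as pd

# Hypothetical toy data; the last row repeats the second one.
df = pd.DataFrame({
    "age": [63, 38, 41, 38],
    "chol": [233, 175, 204, 175],
})

# Number of rows that repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Every copy of a repeated row, including the first occurrence.
duplicate_rows = df[df.duplicated(keep=False)]

# The data with only the first copy of each row kept.
deduplicated = df.drop_duplicates()
```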



Disadvantages

In this article, we have talked a lot about the advantages of the pandas profiling package, but are there any disadvantages? Yes, let’s mention some.

If your data set is very big, creating a report can take a very long time (possibly hours in extreme cases).
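One common way to soften this is to profile a random sample instead of the full data, and to ask for a lighter report. The sketch below is one possible approach, not a recommendation from the package authors. The helper names sample_for_profiling and quick_report and the 10,000-row default are made up for illustration, and the minimal=True option is assumed to be available in your installed version of pandas profiling.

```python
import pandas as pd


def sample_for_profiling(df, max_rows=10_000, seed=0):
    """Return df unchanged if it is small, otherwise a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def quick_report(df, output_file="output.html", max_rows=10_000):
    """Write a lighter report for a sample of df (hypothetical helper)."""
    # Imported here so the sampling helper also works without the package installed.
    from pandas_profiling import ProfileReport

    sample = sample_for_profiling(df, max_rows=max_rows)
    # Assumption: minimal=True turns off the most expensive computations.
    report = ProfileReport(sample, minimal=True)
    report.to_file(output_file=output_file)
```

Calling quick_report(df) would then replace the ProfileReport and to_file lines used earlier. Keep in mind that a sample can miss rare values and duplicates, so treat the result as a first look.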

We have done some basic EDA using a profiling package, and it is a good start for data analysis, but it is definitely not a complete exploration. Normally we would look at more graph types, such as boxplots and more detailed bar charts, along with other visualization and exploration techniques that would reveal the quirks of the particular data set.

Additionally, if you are just starting your data science journey it may be worth learning how to gather the information included in the report using pandas itself. This is so you can practice coding and manipulating data!

Otherwise, I think it is a great and very useful package!



Summary

In this article, we have shown you how to install and use pandas profiling. We even showed you a quick interpretation of the results.

Download the heart disease data set and try it yourself.
