• Magdalena Konkiewicz

What are categorical variables and how to encode them?


Image by alan9187 from Pixabay

Introduction

In this article, we will explain what categorical variables are and learn the difference between their two types: nominal categorical variables and ordinal categorical variables.

Finally, we will look at the best methods for encoding each type of categorical variable, with examples. We will cover one-hot encoding and integer encoding.

Let’s start with some simple definitions.



What are categorical variables?

To understand categorical variables, it helps to define continuous variables first. A continuous variable can take any value within a range. Good examples are weight and height: in theory, both can take any value.

Categorical variables, unlike continuous variables, take values from a finite set. An example is the grades that a teacher gives students for assignments (A, B, C, D, E, and F).

Another example of a categorical variable is the color of a jersey that a college sells. Imagine that the jersey comes only in green, blue, and black. Jersey color would then be a categorical variable with three possible values.

In a dataset, categorical variables are often stored as strings. That is the case in both examples above: grade and color values are strings.

However, we need to be careful, as integer columns can hide categorical variables as well. It is therefore important to check how many unique values an integer variable has before deciding whether it is continuous or categorical.
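As a minimal sketch of that check, the snippet below counts the distinct values in each column of a small made-up data frame. The column names weight_kg and shirt_size_code and all the values are hypothetical, chosen only to illustrate an integer column that is really categorical.

```python
import pandas as pd

# Hypothetical data: 'shirt_size_code' is stored as integers
# but only ever takes a handful of distinct values.
toy = pd.DataFrame({
    "weight_kg": [61.2, 75.0, 68.4, 80.1, 59.9, 72.3],
    "shirt_size_code": [1, 2, 2, 3, 1, 2],
})

# Number of distinct values per column
unique_counts = toy.nunique()
print(unique_counts)

# Look at the distinct values of the suspicious integer column
size_codes = sorted(toy["shirt_size_code"].unique().tolist())
print(size_codes)
```

A column with very few distinct values relative to the number of rows is a candidate for being treated as categorical, although the final decision depends on what the numbers mean.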



Nominal categorical variables

The two examples of categorical variables given above are not of the same kind. As your intuition may suggest, there is a difference between jersey color and grades. The jersey colors (green, blue, and black) have no ordering among themselves. When the values of a categorical variable have no logical ordering, we call it a nominal variable.



Ordinal categorical variables

Ordinal categorical variables are categorical variables whose values have a logical ordering, as in our grades example. Remember that the grades ranged from A to F and had an ordered relationship (A > B > C > D > E > F). Another example of an ordinal categorical variable is ski track classification: easy, medium, and hard.
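One way to make such an ordering explicit in pandas is an ordered categorical type. The sketch below is only an illustration and is not the encoding method used later in this article; the variable names are assumptions.

```python
import pandas as pd

grades = pd.Series(["A", "C", "A", "B", "F"])

# Categories are listed from worst to best so that A compares as the highest grade
grade_dtype = pd.CategoricalDtype(
    categories=["F", "E", "D", "C", "B", "A"], ordered=True
)
ordered_grades = grades.astype(grade_dtype)

# Order-aware operations now work on the string labels
best = ordered_grades.max()
worst = ordered_grades.min()
above_c = ordered_grades > "C"
print(best, worst)
print(above_c.tolist())
```

A nominal variable such as jersey color would not support these comparisons in a meaningful way, which is exactly the difference between the two types.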



The need for encoding

Why do we need to encode categorical variables?

The reason is very simple: most machine learning algorithms accept features only in numerical form. This means features need to be floats or integers; strings are not allowed. As mentioned before, categorical features are most often strings, so we need to encode them as numbers.

The way to do this differs a bit between nominal and ordinal categorical variables, and we will explain the difference in the following sections.



One-hot encoding: best for nominal categorical variables

The first method we are going to learn is called one-hot encoding, and it is best suited for nominal variables. With one-hot encoding, we create a new variable for each value of the original variable.

Let’s go back to our jersey color example. We had three color values: green, blue, and black. We would therefore create three new variables, one for each color, and assign each a binary value of 0 or 1: 1 means that the jersey is of that color, and 0 means that it is not. To apply one-hot encoding, we can use the get_dummies() function in pandas.

Let’s demonstrate it with a concrete example. Imagine that we have a data set that holds information about students: their names, their grades, and the jersey colors they opted for when they signed up for college. I am going to create a DataFrame with five students that holds this information.
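To make the idea concrete before turning to the pandas helper, here is a minimal hand-rolled sketch that builds the three binary variables directly. The name manual_one_hot is an assumption made for illustration.

```python
import pandas as pd

jersey = pd.Series(["green", "green", "blue", "green", "black"], name="jersey")

# One new 0/1 column per color: 1 where the jersey has that color, 0 otherwise
manual_one_hot = pd.DataFrame(
    {color: (jersey == color).astype(int) for color in ["green", "blue", "black"]}
)
print(manual_one_hot)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.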



import pandas as pd

# Five students with their grade and the jersey color they chose
student_dictionary = {'name': ['Michael', 'Ana', 'Sean', 'Carl', 'Bob'],
                      'grade': ['A', 'C', 'A', 'B', 'F'],
                      'jersey': ['green', 'green', 'blue', 'green', 'black']}
df = pd.DataFrame(student_dictionary)
df.head()

As we can see, this is a data frame with only five student entries and three columns: name, grade, and jersey. Jersey here takes only three values: green, blue, and black.

There is no logical order between them, so we can apply one-hot encoding. We are going to do it by using the get_dummies() function from pandas.



pd.get_dummies(df.jersey)

These are the dummy variables we have created. As you can see, the categorical variable jersey, which took three distinct values, is now described by three binary variables: black, blue, and green.

If we were to replace the jersey variable with its dummies and feed them to a machine learning model, we should also make sure that we drop one of the binary variables. The reason for this is to avoid a perfect correlation between the dummy variables: the value of any one of them is fully determined by the others. You can easily drop the first binary variable by setting the drop_first parameter to True when using the get_dummies() function.



pd.get_dummies(df.jersey, drop_first=True)

As we can see, the first binary variable (black) is now excluded from the result. The two remaining variables, blue and green, are ready to be passed to the machine learning algorithm.
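The sketch below puts these two points together. It first checks that the full set of dummies always sums to 1 per row, which is why one of them is redundant, and then replaces the jersey column inside the data frame in one call. The columns and dtype arguments are standard get_dummies() parameters, but using them here is my choice rather than something the steps above require; dtype=int is passed so that the result holds 0/1 rather than booleans.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Michael", "Ana", "Sean", "Carl", "Bob"],
    "grade": ["A", "C", "A", "B", "F"],
    "jersey": ["green", "green", "blue", "green", "black"],
})

# Full set of dummies: exactly one column is 1 in every row,
# so any one column can be recovered from the other two.
full = pd.get_dummies(df["jersey"], dtype=int)
row_sums = full.sum(axis=1)

# Replace 'jersey' in the frame with its dummies, dropping the first one (black)
encoded = pd.get_dummies(df, columns=["jersey"], drop_first=True, dtype=int)
print(encoded)
```

When the columns argument is used, pandas prefixes the new column names with the original column name, so the result contains jersey_blue and jersey_green next to name and grade.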



Integer encoding: best for ordinal categorical variables

To encode ordinal categorical variables, we could use one-hot encoding in the same manner as we did with nominal variables. This, however, would not be the best choice, as we would lose an important piece of information about the variable: the ordering of its values.

A better approach is to use integer encoding. Every value is replaced by a corresponding integer, which preserves the order. We could therefore encode the grades from our sample data frame in the following manner:

A -> 1

B -> 2

C -> 3

D -> 4

E -> 5

F -> 6

To do this with pandas, we can create a dictionary with the mapping and use the map() function:



mapping_dictionary = {'A': 1,
                      'B': 2,
                      'C': 3,
                      'D': 4,
                      'E': 5,
                      'F': 6,
                     }
df.grade.map(mapping_dictionary)
0    1
1    3
2    1
3    2
4    6
Name: grade, dtype: int64


As you can see, the map() function has returned a transformed Series with the mapping applied. This could now be added to the data frame and used as a feature in a machine learning model.

There are other ways to apply integer encoding (also called label encoding), but using the map() function with a dictionary is one of my favorites because it gives us control over the values that are assigned.

Imagine that, besides the standard grades, we would like to add a new value, “not even attempted”, and that we think not attempting the test is much worse than failing it with the lowest grade. In this case, we could map “not even attempted” to a value of 10, signalling that it is much worse than the worst possible mark, F, which had a value of 6 in the mapping.
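A minimal sketch of that last step follows. The column name grade_encoded is an assumption chosen for illustration. Because map() returns NaN for any value missing from the dictionary, the snippet also counts unmapped entries.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Michael", "Ana", "Sean", "Carl", "Bob"],
    "grade": ["A", "C", "A", "B", "F"],
    "jersey": ["green", "green", "blue", "green", "black"],
})
mapping_dictionary = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}

# Store the encoded grades as a new column next to the original one
df["grade_encoded"] = df["grade"].map(mapping_dictionary)

# Values absent from the dictionary would become NaN, so count them
unmapped = int(df["grade_encoded"].isna().sum())
print(df)
print("unmapped values:", unmapped)
```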
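A short sketch of such an extended mapping is shown below. The sample grades and the name extended_mapping are made up for illustration.

```python
import pandas as pd

# Hypothetical grades, including one student who did not attempt the test
grades = pd.Series(["A", "F", "not even attempted", "C"])

extended_mapping = {
    "A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6,
    # Deliberately far from F to express "much worse than failing"
    "not even attempted": 10,
}

encoded_grades = grades.map(extended_mapping)
print(encoded_grades.tolist())
```

The gap between 6 and 10 is a modelling choice: it tells a model that treats the feature numerically that the distance from F to "not even attempted" is larger than the distance between neighboring grades.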



Summary

In this article, we have learned what categorical variables are. We have discussed ordinal and nominal categorical variables and have shown the best ways to encode them for machine learning models.

I hope you now know more about categorical variables and that you will be able to apply this knowledge when developing your first machine learning models.
