Magdalena Konkiewicz

A quick overview of 5 scikit-learn classification algorithms


Image by Gabriele M. Reinhardt (LILO) from Pixabay

Introduction

In this article, I will show you how to build quick models with scikit-learn for classification purposes.

We will use the Iris data set with three different target values, but you should be able to use the same code for any other multiclass or binary classification problem.

You will learn how to split the data for the model, fit five different types of models to the data, and then briefly evaluate the results with a classification report.

The algorithms that we will be using here are:

Logistic regression

KNN

Decision tree

Random forest

Gradient boosting

It’s time to get started!



Loading data and quick data exploration

Let’s load the Iris data set using the following code:



from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['species'] = data['target']
df.head()


There are only five columns in this data set. The last column, species, is what we will be trying to predict, and we will call it the target. All other columns will act as features, and we will use them to make our predictions.

It is important that you clearly identify what the target is and what the features are in your own data set.

Let’s call the info() function to learn a bit more about our data:



df.info()

As you can see, there are only 150 entries and no missing values in any of the columns. Also, all values are either floats or integers.

However, from the data set description I know that species is not a continuous variable but a categorical one (hence classification rather than regression).

We can check this, and additionally see how the target values are distributed, with the value_counts() function:



df.species.value_counts()
2    50
1    50
0    50
Name: species, dtype: int64

We can see that the species column takes only three values: 0, 1, and 2, and all of the classes have an equal number of examples: 50 each. This is a perfectly balanced data set in terms of target value distribution.

As our data is “extremely clean”, has no missing values or categorical variables as features, and is well balanced in terms of the target, we can proceed straight to the modeling part.

*** If your data had missing values, you would have to deal with them by dropping them or replacing them with some approximated value. Additionally, if you had categorical variables as features, one-hot encoding would be required.
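The Iris data does not need any of this, so here is a minimal sketch on a made-up table. The toy DataFrame, its column names, and the choice of the median as the replacement value are all my assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric feature with a gap, one categorical feature
toy = pd.DataFrame({
    "petal_length": [1.4, np.nan, 5.1, 4.7],
    "color": ["white", "purple", "purple", "blue"],
})

# Replace the missing value with the column median (one option among several;
# dropping the row with toy.dropna() would be another)
toy["petal_length"] = toy["petal_length"].fillna(toy["petal_length"].median())

# One-hot encode the categorical column: one indicator column per category
toy_encoded = pd.get_dummies(toy, columns=["color"])
print(toy_encoded)
```

After these two steps every column is numeric and complete, which is the form the scikit-learn models in this article expect.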



Dividing data set into train and test

Before fitting the model, it is necessary to divide the data into a train part and a test part. This is important because you should not test your model on the same data it was trained on.

Luckily, this is very easy with the train_test_split function from scikit-learn:



from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    df.drop('species', axis=1), df.species, test_size=0.2, random_state=13)

As a result of this function, we now have the data divided into four parts:

x_train

x_test

y_train

y_test

The x-prefixed parts hold the feature columns and the y-prefixed parts hold the target. You can learn more about how this function works from the article I have written before.

From now on I will fit the models on the train part (x_train and y_train) and test them on the test part (x_test and y_test).
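If you want to double-check the split, a quick look at the shapes is usually enough. The sketch below repeats the loading and splitting steps so it runs on its own; with 150 rows and test_size=0.2, the test part should hold 30 rows and the train part 120:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['species'] = data['target']

x_train, x_test, y_train, y_test = train_test_split(
    df.drop('species', axis=1), df.species, test_size=0.2, random_state=13)

# Features keep their four columns; the target is a single column
print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)

# How many examples of each class ended up in the test part
print(y_test.value_counts())
```

The class counts in the test part do not have to be equal, because the split is random. If that matters for your own data, train_test_split also accepts a stratify argument.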

We can now fit our first model.



Logistic regression

Let’s start with the first model, logistic regression. We will import the model from the scikit-learn linear_model package, use the fit() function to train it, and then use the predict() function to make predictions on the test set:



from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(x_train, y_train)
predictions = clf.predict(x_test)

As you can see, I have used the train part of the data with the fit() function (both the x and y parts), and to make predictions I have used x_test only. I can now compare the predictions with the actual target values (y_test).

So far I do not know whether my model is making correct predictions. In order to evaluate this, I will use classification_report from scikit-learn:



from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))


The classification report compares the predictions we have made for the target variable with the real classes. The metric that I would like you to primarily focus on is accuracy. In this case, we have predicted 97% of the classes correctly (29 of the 30 test samples), not bad.

In this article, I am not going to explain the classification report in detail, but I would like to emphasize that it is important to look at precision, recall, and f-score along with accuracy while comparing models. For this reason, I have printed the whole report instead of accuracy only. I think it is a convenient function that gives you all these metrics together. If you want to learn more about precision, recall, and f-score, and also learn to read a confusion matrix, check this article.
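To get a feeling for where the numbers in the report come from, here is a small sketch on hand-made labels. The two label lists are hypothetical and are not the predictions of any model in this article; they are only small enough to check by hand:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true classes and predictions for ten flowers
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 2, 1]

# Accuracy: share of predictions that match the true class (8 of 10 here)
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)  # 0.8

# Confusion matrix: rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)
print(matrix)

# Per-class precision and recall (average=None returns one value per class)
precision = precision_score(y_true, y_pred, average=None)
recall = recall_score(y_true, y_pred, average=None)
print(precision)  # class 1: 2 correct out of 3 predicted as class 1
print(recall)     # class 1: 2 found out of 3 that really are class 1
```

The diagonal of the confusion matrix counts the correct predictions, so accuracy is the sum of the diagonal divided by the number of samples.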

Back to the accuracy… It looks like simple logistic regression gets 97% of the predictions right. Definitely a good starting point; if I were satisfied with this, I could actually use it to predict flower species with high accuracy.
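Using the fitted model on a new flower could look roughly like the sketch below. The measurements of the new flower are made up, and max_iter=1000 is my own addition to avoid a possible convergence warning, so the fitted model may differ slightly from the one above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['species'] = data['target']
x_train, x_test, y_train, y_test = train_test_split(
    df.drop('species', axis=1), df.species, test_size=0.2, random_state=13)

# max_iter=1000 is an assumption, not part of the original code
clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# A hypothetical new flower (cm), in the same column order as the features
new_flower = pd.DataFrame([[5.0, 3.4, 1.5, 0.2]], columns=data['feature_names'])

# predict() returns the class number; target_names maps it to a species name
predicted_class = clf.predict(new_flower)[0]
predicted_name = data['target_names'][predicted_class]

# predict_proba() returns one probability per class
probabilities = clf.predict_proba(new_flower)[0]
print(predicted_name, probabilities.round(3))
```

The probabilities are a handy extra: they show how sure the model is, not only which class it picked.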



K-Nearest Neighbour (KNN)

Let’s now train a K-Nearest Neighbour classifier on the same data:



from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier()
neigh.fit(x_train, y_train)
predictions = neigh.predict(x_test)

We have used the default parameters for the algorithm, so we are looking at the five closest neighbors and giving them all equal weight when estimating the class prediction.
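If you are not sure what the defaults are, you can ask the estimator itself. A short sketch using get_params(), which every scikit-learn estimator provides:

```python
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier()

# get_params() returns a dictionary with all constructor parameters
params = neigh.get_params()
print(params['n_neighbors'])  # 5
print(params['weights'])      # uniform
```

The value 'uniform' means that all neighbors count equally, which is the equal weighting mentioned above.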

You can now call the classification report:



from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

KNN with default values seems to work slightly worse than logistic regression. The accuracy went down from 0.97 to 0.90 (27 of the 30 test samples instead of 29), and the average recall, precision, and f-score seem to be lower as well.

We could play with the KNN parameters to see if this could be improved. Possible improvements include changing the number of neighbors used for prediction or using a different weighting that takes neighbor proximity into account.
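A rough sketch of such an experiment is shown below. The candidate values (3, 5, and 7 neighbors) and the 5-fold cross-validation on the train part are my assumptions; I am scoring on the train data only so that the test set stays untouched:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13)

results = {}
for k in (3, 5, 7):
    # 'uniform' weights all neighbors equally, 'distance' favors closer ones
    for weights in ('uniform', 'distance'):
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        scores = cross_val_score(knn, x_train, y_train, cv=5)
        results[(k, weights)] = scores.mean()
        print(k, weights, round(scores.mean(), 3))
```

Whichever combination scores best here can then be fitted on the whole train part and checked once against the test part.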



Decision tree

Let’s have a look at another classification algorithm. I am now going to see how a decision tree with default parameters performs:



from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)
predictions = clf.predict(x_test)

The only parameter that I am supplying is a random state. This is so my results are reproducible.



from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))


As you can see, this decision tree performs perfectly on the test data set. I am predicting all the classes correctly!

It seems great at a glance; however, I am probably overfitting here. I should probably use cross-validation to tune the model parameters in order to prevent overfitting.

The code below performs 10-fold cross-validation on the train data only and prints the accuracy for each fold:



from sklearn.model_selection import cross_val_score
cross_val_score(clf, x_train, y_train, cv=10)
array([1.        , 0.92307692, 0.91666667, 0.91666667, 0.91666667,
       1.        , 0.91666667, 1.        , 1.        , 0.90909091])

By examining the output, I can see that on some data splits I am indeed getting 100% accuracy, but there are quite a few splits where the accuracy is 8 to 9 percentage points lower. The reason for this is that the data set I am using is extremely small: each fold holds only 11 to 13 samples, so misclassifying just one instance lowers the fold accuracy to 10/11 ≈ 0.909, 11/12 ≈ 0.917, or 12/13 ≈ 0.923, which are exactly the values printed above.

*** In general, it is fine to use the algorithms with their default parameters to build first models, as I am demonstrating here, but the next step should be tuning the parameters using cross-validation.
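One common way to do this in scikit-learn is GridSearchCV, which runs cross-validation for every parameter combination you list. The sketch below is only an illustration: the parameter grid (max_depth and min_samples_leaf with these particular values) is my assumption and not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13)

# Hypothetical grid: limit tree depth and require a minimum leaf size
param_grid = {
    'max_depth': [2, 3, 4, None],
    'min_samples_leaf': [1, 2, 5],
}

# 10-fold cross-validation on the train part for each combination
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=10)
search.fit(x_train, y_train)

print(search.best_params_)  # combination with the best mean fold accuracy
print(search.best_score_)   # that mean accuracy

# By default the best model is refitted on the whole train part
print(search.score(x_test, y_test))
```

The test part is used only once at the end, so the final score is still an honest estimate.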



Random forest

Let’s try the random forest classifier. You can think of a random forest as a set of decision trees: the forest is built from many decision trees whose results are then averaged. I am not going to explain the details of the algorithm here; I will simply call it with its default parameters.



from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
clf.fit(x_train, y_train)
predictions = clf.predict(x_test)

Let’s run a classification report:



from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

We have an accuracy of 93% here (28 of the 30 test samples). We have seen both better and worse results with the previous classification algorithms.

In order to tune and improve the algorithm, you could play with the number of estimators, their depth, and their structure. That would require learning more about the trees and how the algorithm itself works. There are plenty of articles on Medium about this, so I suggest that you search for them if you would like to learn more.
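As a starting point, the sketch below sets the number of trees and their maximum depth by hand and then peeks inside the fitted forest. The values n_estimators=50 and max_depth=3 are arbitrary choices of mine, picked only to show where these parameters go:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data['data'], data['target'], test_size=0.2, random_state=13)

# n_estimators: number of trees; max_depth: how deep each tree may grow
rf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
rf.fit(x_train, y_train)

# The fitted trees are stored in estimators_
print(len(rf.estimators_))  # 50
deepest = max(tree.get_depth() for tree in rf.estimators_)
print(deepest)  # never more than max_depth

# Accuracy on the test part
print(rf.score(x_test, y_test))

# How much each feature contributed to the splits, summing to 1
importances = dict(zip(data['feature_names'], rf.feature_importances_))
print(importances)
```

Changing these two numbers and re-running is a simple way to see how sensitive the forest is to its size and depth.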



Gradient boosting

Let’s try the last algorithm that we will present in this article: the gradient boosting classifier. It is another tree-based algorithm, and it has been very effective for many machine learning problems.

Let’s call it using the scikit-learn class:



from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=0)
clf.fit(x_train, y_train)
predictions = clf.predict(x_test)

And run the classification report:



from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

This is another high-scoring algorithm. We have achieved an accuracy of 97%. It looks like all our models made pretty decent predictions even with their default, untuned parameters!



Comparing classification algorithms

You can see that we have presented five algorithms and all of them have achieved high accuracy on the test set. The accuracy ranged from 90% (KNN) to 100% (decision tree).

Theoretically, any of these algorithms could be used to predict flower species with decent accuracy (90% or more).

Because we did a rather quick analysis and did not dig into the details of each implementation, it is hard to decide which algorithm is the best. This would require more analysis and tuning of each algorithm.
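A slightly fairer first comparison than a single 30-sample test set is to cross-validate all five models on the train part. The sketch below does that in a loop. The choice of 5 folds and max_iter=1000 for logistic regression are my assumptions, and on a data set this small the differences between models may still be mostly noise:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13)

models = {
    # max_iter=1000 is an assumption to avoid a possible convergence warning
    'Logistic regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'Random forest': RandomForestClassifier(random_state=0),
    'Gradient boosting': GradientBoostingClassifier(random_state=0),
}

summary = {}
for name, model in models.items():
    # Mean and spread of the accuracy over 5 folds of the train part
    scores = cross_val_score(model, x_train, y_train, cv=5)
    summary[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Looking at the spread next to the mean helps to judge whether one model is really ahead or whether the gap is within the fold-to-fold variation.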



Summary

You have been presented with an overview of five basic classification algorithms and have learned how to call them with their default parameters. After this article, you should also be able to evaluate their performance using a quick classification report. The next step should be learning more about each algorithm and tuning it to improve performance and avoid overfitting.


