• Magdalena Konkiewicz

Learn how to use grid search for parameter tuning


Image by PollyDot from Pixabay

Introduction


Once you have built a machine learning model, you will want to tune its parameters for optimal performance. The best parameters differ from one data set to another, so they need adjusting for the algorithm to reach its full potential.


I have seen many beginner data scientists doing parameter tuning by hand. This means running the model, then changing one or more parameters within the notebook, waiting for the model to run, gathering results, and then repeating the process again and again. Usually, people forget along the way which parameters were the best and need to start over.


In general, the above strategy is not the most efficient. Luckily, this process can easily be automated thanks to the authors of the scikit-learn library, who added GridSearchCV.



What is GridSearchCV?

Image by Nicolás Damián Visceglio from Pixabay

GridSearchCV is an alternative to the naive method I have described above. Instead of manually tweaking the parameters and rerunning the algorithm several times, you can list all the parameter values that you would like the algorithm to try and pass them to GridSearchCV.


GridSearchCV will try all combinations of those parameters and evaluate each one using cross-validation and the scoring metric you provide. In the end, it will give you the best parameters for your data set.
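To make the idea concrete, here is a minimal sketch of roughly what a grid search does under the hood, written as a plain loop. The decision tree, the two-parameter grid, and all variable names are my own assumptions chosen for illustration; this is not how scikit-learn implements it internally.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical grid, kept tiny so the loop is easy to follow.
example_grid = {'max_depth': [2, 4], 'min_samples_leaf': [1, 5]}

best_score, best_params = -1.0, None
names = list(example_grid)

# Try every combination of values, one candidate at a time.
for values in product(*example_grid.values()):
    params = dict(zip(names, values))
    model = DecisionTreeClassifier(random_state=0, **params)
    # Mean accuracy over five cross-validation folds for this candidate.
    score = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

GridSearchCV wraps this kind of loop for you and additionally keeps the scores of every candidate, so nothing gets lost along the way.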


GridSearchCV can be used with any supervised machine learning algorithm in the scikit-learn library. It works for both regression and classification, as long as you provide an appropriate metric.
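As a quick illustration of the regression case, the sketch below tunes a ridge regression with a regression metric. The diabetes data set, the Ridge model, and the alpha values are assumptions picked for the example and are not used elsewhere in this article.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# Hypothetical grid over the regularization strength.
ridge_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}

# For regression we pass a regression metric. scikit-learn negates error
# metrics so that a higher score is always better.
grid_search_ridge = GridSearchCV(Ridge(), param_grid=ridge_grid,
                                 scoring='neg_mean_squared_error')
grid_search_ridge.fit(X, y)

print(grid_search_ridge.best_params_)
print(grid_search_ridge.best_score_)  # negative mean squared error
```

Only the estimator and the scoring string change; the rest of the workflow is the same as for classification.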


Let's see how it works with a real example.



GridSearchCV code example


To illustrate, let's load the Iris data set. This data set has 150 examples of three different Iris species. It has no missing values, so no data cleaning is needed.



from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['species'] = data['target']
df.head()


Now let's split our data set into training and test sets.


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    df.drop('species', axis=1), df['species'], test_size=0.2, random_state=13)

Once we have divided the data set we can set up the grid search with the algorithm of our choice. In our case, we will use it to tune the random forest classifier.


from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

grid_values = {'n_estimators': [10, 30, 50, 100],
               'max_features': ['sqrt', 0.25, 0.5, 0.75, 1.0],
               'max_depth' : [4,5,6,7,8],
              }

grid_search_rfc = GridSearchCV(rfc, param_grid = grid_values, scoring = 'accuracy')
grid_search_rfc.fit(x_train, y_train)

In the code above we first set up the random forest classifier by using a constructor with no parameters. Then we define the parameters and the values to try for each of them in the grid_values variable. The grid_values variable is then passed to GridSearchCV together with the random forest object (that we created before) and the name of the scoring function (in our case 'accuracy'). Last but not least, we fit it all by calling the fit function on the grid search object.


Now in order to find the best parameters, you can use the best_params_ attribute:


grid_search_rfc.best_params_

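We are getting the highest accuracy with trees that are six levels deep, using 75% of the features for the max_features parameter and 10 estimators. Your exact result may differ from run to run, because the random forest above is created without a fixed random_state.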
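Besides best_params_, a fitted grid search object also exposes best_score_ (the mean cross-validated score of the winning combination) and best_estimator_ (the model refitted on the whole training set with those parameters). Here is a small self-contained sketch; the reduced grid and the random_state=0 seed are my assumptions, added only to keep the run short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13)

# Smaller grid than the one in the article, for illustration only.
small_grid = {'n_estimators': [10, 30], 'max_depth': [4, 6]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=small_grid, scoring='accuracy')
search.fit(x_train, y_train)

print(search.best_params_)     # winning parameter combination
print(search.best_score_)      # its mean cross-validated accuracy
print(search.best_estimator_)  # model refitted with those parameters
```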


This has been much easier than trying all parameters by hand.


Now you can use the grid search object to make new predictions using the best parameters.



predictions = grid_search_rfc.predict(x_test)


And run a classification report on the test set to see how well the model is doing on the new data.


from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))


You can see detailed results for accuracy, recall, precision, and f-score for all of the classes.


Note that we have used accuracy for tuning the model. This may not be the best choice. We can also use other metrics such as precision, recall, and f-score. So let's do that.



from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score

scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score, average = 'macro'),
           'recall': make_scorer(recall_score, average = 'macro'),
           'f1': make_scorer(f1_score, average = 'macro')}

grid_search_rfc = GridSearchCV(rfc, param_grid = grid_values, scoring = scoring, refit='f1')
grid_search_rfc.fit(x_train, y_train)

In the code above we set up four scoring metrics: accuracy, precision, recall, and f-score, and we store them in a dictionary that is later passed to grid search as the scoring parameter. We also set the refit parameter to the name of one of the scoring functions, in our case f-score (the 'f1' key). With several metrics, refit tells grid search which one to use when choosing the best parameters.


Once we run it we can get the best parameters for f-score:


grid_search_rfc.best_params_ 

Additionally, we can use the cv_results_ attribute to learn more about the grid search run, such as the scores and fit times for every parameter combination.


grid_search_rfc.cv_results_

If you want to see results for other metrics you can use cv_results_['mean_test_<metric_name>']. So in order to get the results for recall, which we set up before as one of the scoring functions, you can use:


grid_search_rfc.cv_results_['mean_test_recall']

Above we can see all recall values for grid search param combinations.
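A raw array of scores is hard to read on its own, so it can help to load cv_results_ into a pandas DataFrame and sort it. The sketch below is one way to do that; the reduced grid, the random_state=0 seed, and the built-in scorer names 'recall_macro' and 'f1_macro' (used here instead of make_scorer) are my choices for a short, self-contained example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Smaller grid than the one in the article, for illustration only.
small_grid = {'n_estimators': [10, 30], 'max_depth': [4, 6]}
scoring = {'recall': 'recall_macro', 'f1': 'f1_macro'}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=small_grid, scoring=scoring, refit='f1')
search.fit(X, y)

# One row per parameter combination, best f-score first.
results = pd.DataFrame(search.cv_results_)
summary = results[['params', 'mean_test_f1', 'mean_test_recall',
                   'rank_test_f1']].sort_values('rank_test_f1')
print(summary)
```

Each row pairs a parameter combination with its mean scores, which makes it easy to see how close the runner-up combinations are to the best one.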



GridSearchCV disadvantages


Have you noticed that the list of all recall results was quite long? It actually had 100 elements. This means the grid search tried 100 different parameter combinations (4 values of n_estimators × 5 of max_features × 5 of max_depth). This is a lot and can be very time-consuming, especially on large data sets.


In our example, grid search did five-fold cross-validation (the default) for 100 different random forest setups, which adds up to 500 model fits. Imagine if we had more parameters to tune!
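You can estimate this cost before starting a search. The sketch below counts the candidates in the article's grid with scikit-learn's ParameterGrid helper and multiplies by the number of folds; the value of five folds assumes the default cv setting of recent scikit-learn versions.

```python
from sklearn.model_selection import ParameterGrid

# The same grid as in the random forest example above.
grid_values = {'n_estimators': [10, 30, 50, 100],
               'max_features': ['sqrt', 0.25, 0.5, 0.75, 1.0],
               'max_depth': [4, 5, 6, 7, 8]}

n_candidates = len(ParameterGrid(grid_values))  # 4 * 5 * 5 combinations
n_folds = 5  # assumption: default cv of recent scikit-learn versions
total_fits = n_candidates * n_folds

print(n_candidates)  # 100
print(total_fits)    # 500
```

Adding one more parameter with five values would multiply both numbers by five, which is why grids grow expensive so quickly.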


There is an alternative to GridSearchCV called RandomizedSearchCV. Instead of trying all parameter combinations, it samples only a fixed number of them from the lists or distributions you provide, so it can be much faster while often finding parameters that are nearly as good.
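Here is a minimal sketch of how a randomized search could look for the same random forest. The distributions, the n_iter=10 budget, and the random_state seeds are assumptions chosen for illustration, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13)

# Integer parameters are drawn from distributions, lists are sampled uniformly.
param_distributions = {'n_estimators': randint(10, 101),  # 10 to 100
                       'max_depth': randint(4, 9),        # 4 to 8
                       'max_features': ['sqrt', 0.25, 0.5, 0.75, 1.0]}

# n_iter caps the number of sampled combinations: 10 instead of 100.
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                                   param_distributions=param_distributions,
                                   n_iter=10, scoring='accuracy',
                                   random_state=0)
random_search.fit(x_train, y_train)

print(random_search.best_params_)
```

The fitted object offers the same best_params_, cv_results_, and predict interface as GridSearchCV, so the rest of the workflow stays unchanged.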



Summary


In this article, you have learned how to use grid search to automate your parameter tuning. It is time to try your newly acquired skill on a different data set and with a model other than random forest. Happy coding!


