50 scikit-learn tips

Sharpen your Machine Learning skills with this FREE 3-hour course!

What will I learn in this course?

Here are a few of the things you'll learn in these 50 short video lessons:

  • How to build, evaluate, and tune a Pipeline

  • Two easy ways to visualize a decision tree

  • How to benefit from missing values using a "missing indicator"

  • How to plot an ROC curve in one line of code

  • How to speed up a grid search

  • How to add feature selection to a Pipeline

  • Why you should use scikit-learn (not pandas) for preprocessing

  • How to create an interactive diagram of a Pipeline

  • How to save your best Pipeline for future predictions

  • Why dropping a level when one-hot encoding is usually a bad idea

  • How to create custom transformers for feature engineering

  • Why you should use stratified sampling with train/test split

  • How to build and tune an ensemble of models

  • Why you should try ordinal encoding with tree-based models

  • And much, much more!
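To give a flavor of how a few of these topics fit together, here is a minimal sketch that combines a stratified train/test split, a missing indicator, and cross-validation of a Pipeline. It is not taken from the course notebooks: the synthetic data, the variable names, and the choice of LogisticRegression are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for this sketch): 200 rows, 3 numeric features
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Replace roughly 10% of the values with NaN to simulate missing data
X[rng.rand(200, 3) < 0.1] = np.nan

# Stratified sampling keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# add_indicator=True appends binary "was missing" columns after imputation,
# so the model can use missingness itself as a feature
pipe = make_pipeline(SimpleImputer(add_indicator=True), LogisticRegression())

# Cross-validate the whole Pipeline so imputation is refit inside each fold
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
print("Mean cross-validated accuracy:", scores.mean())

# Fit on all training data, then evaluate once on the held-out test set
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Because the imputer lives inside the Pipeline, it only ever learns from the training portion of each fold, which is one reason to do preprocessing in scikit-learn rather than beforehand in pandas.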

Who should take this course?

This is the perfect course for you if:

  • You've already completed my introductory ML course

  • You want to work more efficiently in scikit-learn

  • You want to learn best practices for Machine Learning code

  • You want to keep up-to-date with scikit-learn's latest features

Join 3,000+ happy students...

Neil Dias (ML Engineer)

Your new videos are great! I find them as excellent and concise refreshers on ML implementation topics.

Beltran Rovira (Master's student)

Thanks so much for your videos! They have helped me to optimize the Machine Learning workflow and to understand what’s going on underneath the hood!

Lautaro Cisterna (Data Scientist)

This course is a great guide and resource, where you can come back and check how something was done in a very clear and easy way.

Michael Reinhard (Tutor)

I don't feel I really understand something until Kevin explains it to me. He breaks things down very clearly, making no assumptions about what you know.

Levon

I really love your videos, they are just right, concise and informative, no unnecessary fluff.

Varun K.

This series of videos is absolutely awesome!

J. K.

A treasure trove of extremely useful scikit-learn tips.

Satya T.

Thank you so much, Kevin. You are making everything easy.

Aren't these videos already on YouTube?

I uploaded this series to YouTube in 2020 and 2021, and it has since gotten more than 450,000 views.

Here's why you'll have a better learning experience by taking the course here:

  1. You can watch the videos without ads

  2. You can save your progress and return later to the same spot

  3. You can download the course notebooks

  4. You can run the notebooks online using Binder or Colab

  5. You can access relevant links about each tip

  6. You can post your own questions, and I'll do my best to respond

  7. After completing the course, you'll receive a certificate of completion

FAQs

How do I know if I'm ready for the course?

You're ready for this course if you understand the basics of scikit-learn and Machine Learning. If you're brand new to Machine Learning, you should start with my free introductory course.

How long is the course?

The course includes 50 videos that range from 2 to 8 minutes in length. The total course length is 3 hours.

What software do I need to install?

Every tip includes a link to run the code in your browser using Binder or Google Colab, so you don't need to install anything! But if you want to run the code on your local machine, you'll need to install scikit-learn (version 0.23 or later), pandas, and matplotlib.
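If you do go the local route, a quick check like the one below can confirm what is installed. This is only a sketch: the helper names `leading_ints` and `check_requirements` are hypothetical, and the comparison looks only at the major and minor version numbers.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

# Minimum scikit-learn version stated in the FAQ above
MIN_SKLEARN = (0, 23)


def leading_ints(version_string, count=2):
    """Return the first `count` numeric components, e.g. '0.23rc1' -> (0, 23)."""
    parts = []
    for piece in version_string.split(".")[:count]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def check_requirements():
    """Map each required package to its installed version (None if missing)."""
    report = {}
    for package in ("scikit-learn", "pandas", "matplotlib"):
        try:
            report[package] = version(package)
        except PackageNotFoundError:
            report[package] = None
    return report


for package, installed in check_requirements().items():
    if installed is None:
        print(f"{package}: not installed")
    elif package == "scikit-learn" and leading_ints(installed) < MIN_SKLEARN:
        print(f"{package}: {installed} (older than 0.23, please upgrade)")
    else:
        print(f"{package}: {installed}")
```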

What if I need help during the course?

You can post a question below any video, and I'll do my best to respond!

How do I earn a certificate of completion?

Once you have watched all of the lessons, you can request a certificate of completion.

How long will I have access to the course?

You will have lifetime access to the course.

What course should I take after this one?

You should take my follow-up course, Master Machine Learning with scikit-learn.

Course Outline

Introduction

Welcome to the course!
Download the course notebooks

Data Preprocessing

1. Use ColumnTransformer to apply different preprocessing to different columns
2. Seven ways to select columns using ColumnTransformer
3. What is the difference between "fit" and "transform"?
4. Use "fit_transform" on training data, but "transform" (only) on testing/new data
38. Get the feature names output by a ColumnTransformer
42. Passthrough some columns and drop others in a ColumnTransformer

Using pandas

5. Four reasons to use scikit-learn (not pandas) for ML preprocessing
35. Don't use .values when passing a pandas object to scikit-learn
39. Load a toy dataset into a DataFrame

Categorical Features

6. Encode categorical features using OneHotEncoder or OrdinalEncoder
7. Handle unknown categories with OneHotEncoder by encoding them as zeros
15. Three reasons not to use drop='first' with OneHotEncoder
41. Drop the first category from binary features (only) with OneHotEncoder
43. Use OrdinalEncoder instead of OneHotEncoder with tree-based models

Missing Values

9. Add a missing indicator to encode "missingness" as a feature
11. Impute missing values using KNNImputer or IterativeImputer
14. HistGradientBoostingClassifier natively supports missing values
27. Two ways to impute missing values for a categorical feature

Pipelines

8. Use Pipeline to chain together multiple steps
12. What is the difference between Pipeline and make_pipeline?
13. Examine the intermediate steps in a Pipeline
22. Use the correct methods for each type of Pipeline
28. Save a model or Pipeline using joblib
30. Four ways to examine the steps of a Pipeline
34. Add feature selection to a Pipeline
37. Create an interactive diagram of a Pipeline in Jupyter
48. Access part of a Pipeline using slicing
50. Adapt this pattern to solve many Machine Learning problems

Intermission

Can I ask you a quick favor?

Parameter Tuning

16. Use cross_val_score and GridSearchCV on a Pipeline
17. Try RandomizedSearchCV if GridSearchCV is taking too long
18. Display GridSearchCV or RandomizedSearchCV results in a DataFrame
19. Important tuning parameters for LogisticRegression
25. Prune a decision tree to avoid overfitting
40. Estimators only print parameters that have been changed
44. Speed up GridSearchCV using parallel processing
49. Tune multiple models simultaneously with GridSearchCV

Model Evaluation

20. Plot a confusion matrix
21. Compare multiple ROC curves in a single plot
26. Use stratified sampling with train_test_split
31. Shuffle your dataset when using cross_val_score
32. Use AUC to evaluate multiclass problems

Model Inspection

23. Display the intercept and coefficients for a linear model
24. Visualize a decision tree two different ways

Model Ensembling

46. Ensemble multiple models using VotingClassifier or VotingRegressor
47. Tune the parameters of a VotingClassifier or VotingRegressor

Feature Engineering

29. Vectorize two text columns in a ColumnTransformer
33. Use FunctionTransformer to convert functions into transformers
45. Create feature interactions using PolynomialFeatures

Coding Practices

10. Set a "random_state" to make your code reproducible
36. Most parameters should be passed as keyword arguments

Conclusion

Can I ask you a quick favor?
Request your certificate of completion
Take another course from Data School!
Earn money by promoting Data School's courses!

👋 Welcome to Data School!

My name is Kevin, and I've taught Data Science in Python to over a million students.

My courses explain data science topics in a clear, thorough, and step-by-step manner.

I'd love to teach you, regardless of your educational background or professional experience.

Thanks for joining me! 🙌