50 scikit-learn tips
Introduction
Welcome to the course!
Download the course notebooks
Data Preprocessing
1. Use ColumnTransformer to apply different preprocessing to different columns
2. Seven ways to select columns using ColumnTransformer
3. What is the difference between "fit" and "transform"?
4. Use "fit_transform" on training data, but "transform" (only) on testing/new data
38. Get the feature names output by a ColumnTransformer
42. Passthrough some columns and drop others in a ColumnTransformer
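The Data Preprocessing tips above all center on ColumnTransformer. Here is a minimal sketch (not the course's notebook code) using an invented DataFrame; the column names and values are made up, and get_feature_names_out requires scikit-learn 1.0 or newer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data: one categorical column and two numeric columns
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                   'size': [1.0, 2.5, 3.0, 0.5],
                   'weight': [10, 20, 15, 12]})

# Apply different preprocessing to different columns (tip 1);
# unlisted columns are dropped, or kept with remainder='passthrough' (tip 42)
ct = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(), ['color']),
                  ('num', StandardScaler(), ['size', 'weight'])],
    remainder='drop')

X = ct.fit_transform(df)           # fit learns categories/statistics, transform applies them (tips 3-4)
print(ct.get_feature_names_out())  # feature names output by the ColumnTransformer (tip 38)
```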
Using pandas
5. Four reasons to use scikit-learn (not pandas) for ML preprocessing
35. Don't use .values when passing a pandas object to scikit-learn
39. Load a toy dataset into a DataFrame
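A quick sketch of the pandas-related tips, using scikit-learn's built-in iris data rather than any course dataset; as_frame requires scikit-learn 0.23+ and feature_names_in_ requires 1.0+.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load a toy dataset directly into a DataFrame (tip 39)
X, y = load_iris(return_X_y=True, as_frame=True)

# Pass the pandas objects as-is rather than X.values (tip 35):
# keeping the DataFrame lets scikit-learn retain the column names
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.feature_names_in_)  # column names preserved because a DataFrame was passed
```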
Categorical Features
6. Encode categorical features using OneHotEncoder or OrdinalEncoder
7. Handle unknown categories with OneHotEncoder by encoding them as zeros
15. Three reasons not to use drop='first' with OneHotEncoder
41. Drop the first category from binary features (only) with OneHotEncoder
43. Use OrdinalEncoder instead of OneHotEncoder with tree-based models
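A rough illustration of the encoding tips above; the encoders and the handle_unknown='ignore' option are real scikit-learn API, but the categories here are invented.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X_train = pd.DataFrame({'shape': ['square', 'circle', 'square'],
                        'color': ['red', 'blue', 'red']})
X_new = pd.DataFrame({'shape': ['triangle'],   # category never seen during fit
                      'color': ['blue']})

# OneHotEncoder: unknown categories are encoded as all zeros (tip 7)
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(X_train)
print(ohe.transform(X_new).toarray())

# OrdinalEncoder: one integer column per feature, often preferable
# with tree-based models (tip 43)
print(OrdinalEncoder().fit_transform(X_train))
```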
Missing Values
9. Add a missing indicator to encode "missingness" as a feature
11. Impute missing values using KNNImputer or IterativeImputer
14. HistGradientBoostingClassifier natively supports missing values
27. Two ways to impute missing values for a categorical feature
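A minimal sketch of the imputation options with a tiny invented array (not the course's examples).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: required before importing IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan],
              [5.0, 6.0]])

# add_indicator=True appends binary "missingness" columns to the output (tip 9)
print(SimpleImputer(strategy='mean', add_indicator=True).fit_transform(X))

# Model-based imputation (tip 11)
print(KNNImputer(n_neighbors=2).fit_transform(X))
print(IterativeImputer(random_state=0).fit_transform(X))
```

For categorical columns, SimpleImputer(strategy='most_frequent') or strategy='constant' with a fill_value are the usual choices (tip 27).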
Pipelines
8. Use Pipeline to chain together multiple steps
12. What is the difference between Pipeline and make_pipeline?
13. Examine the intermediate steps in a Pipeline
22. Use the correct methods for each type of Pipeline
28. Save a model or Pipeline using joblib
30. Four ways to examine the steps of a Pipeline
34. Add feature selection to a Pipeline
37. Create an interactive diagram of a Pipeline in Jupyter
48. Access part of a Pipeline using slicing
50. Adapt this pattern to solve many Machine Learning problems
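A compact sketch of the Pipeline tips, assuming the iris toy data; the steps chosen here are arbitrary and just stand in for whatever preprocessing a real problem needs.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)

# make_pipeline names each step after its class; Pipeline requires
# explicit (name, estimator) tuples (tips 8 and 12)
pipe = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

print(pipe.named_steps['standardscaler'].mean_)  # examine an intermediate step (tips 13 and 30)
print(pipe[:-1])                                 # slicing returns the Pipeline without the final model (tip 48)
# In Jupyter, sklearn.set_config(display='diagram') renders the Pipeline as an interactive diagram (tip 37)

joblib.dump(pipe, 'pipe.joblib')                 # persist the fitted Pipeline (tip 28)
```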
Intermission
Can I ask you a quick favor?
Parameter Tuning
16. Use cross_val_score and GridSearchCV on a Pipeline
17. Try RandomizedSearchCV if GridSearchCV is taking too long
18. Display GridSearchCV or RandomizedSearchCV results in a DataFrame
19. Important tuning parameters for LogisticRegression
25. Prune a decision tree to avoid overfitting
40. Estimators only print parameters that have been changed
44. Speed up GridSearchCV using parallel processing
49. Tune multiple models simultaneously with GridSearchCV
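A sketch of the tuning workflow on a toy Pipeline; the parameter grid is arbitrary and only meant to show the step__parameter naming.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Cross-validate the whole Pipeline, then tune it (tip 16)
print(cross_val_score(pipe, X, y, cv=5).mean())

# Pipeline parameters use the step__parameter naming convention;
# n_jobs=-1 uses all CPU cores (tip 44), and RandomizedSearchCV is a
# drop-in alternative when the grid gets large (tip 17)
params = {'logisticregression__C': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(pipe, params, cv=5, n_jobs=-1)
grid.fit(X, y)

# Inspect every candidate's results as a DataFrame (tip 18)
print(pd.DataFrame(grid.cv_results_).sort_values('rank_test_score').head())
```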
Model Evaluation
20. Plot a confusion matrix
21. Compare multiple ROC curves in a single plot
26. Use stratified sampling with train_test_split
31. Shuffle your dataset when using cross_val_score
32. Use AUC to evaluate multiclass problems
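A sketch of the evaluation tips on the iris data; ConfusionMatrixDisplay.from_estimator requires scikit-learn 1.0+ and assumes matplotlib is installed.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# stratify=y keeps class proportions identical in both splits (tip 26)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Confusion matrix plotted straight from the fitted model (tip 20)
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)

# Multiclass AUC via one-vs-rest on predicted probabilities (tip 32)
print(roc_auc_score(y_test, clf.predict_proba(X_test), multi_class='ovr'))
```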
Model Inspection
23. Display the intercept and coefficients for a linear model
24. Visualize a decision tree two different ways
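A short sketch of the inspection tips; plot_tree assumes matplotlib is installed.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

X, y = load_iris(return_X_y=True, as_frame=True)

# Intercept and coefficients of a fitted linear model (tip 23)
lr = LogisticRegression(max_iter=1000).fit(X, y)
print(lr.intercept_)
print(lr.coef_)

# Two ways to visualize a decision tree (tip 24): text rules or a plot
dt = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(dt, feature_names=list(X.columns)))
plot_tree(dt, feature_names=list(X.columns), filled=True)
```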
Model Ensembling
46. Ensemble multiple models using VotingClassifier or VotingRegressor
47. Tune the parameters of a VotingClassifier or VotingRegressor
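A minimal VotingClassifier sketch with two arbitrary base models; soft voting is used so the models' predicted probabilities can be averaged.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True, as_frame=True)

# Combine two different models by soft-voting on predicted probabilities (tip 46)
vc = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(random_state=0))],
    voting='soft')

# Tune parameters of the underlying models with the name__parameter convention (tip 47)
params = {'lr__C': [0.1, 1, 10], 'rf__n_estimators': [100, 200]}
grid = GridSearchCV(vc, params, cv=5).fit(X, y)
print(grid.best_params_)
```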
Feature Engineering
29. Vectorize two text columns in a ColumnTransformer
33. Use FunctionTransformer to convert functions into transformers
45. Create feature interactions using PolynomialFeatures
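A sketch combining the three feature-engineering tips in one ColumnTransformer, with invented text and numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

df = pd.DataFrame({'title': ['red shoes', 'blue hat'],
                   'summary': ['very red', 'quite blue'],
                   'price': [10.0, 20.0]})

# CountVectorizer expects 1-D text, so each text column gets its own
# vectorizer and is selected by name (a string, not a list) (tip 29)
ct = ColumnTransformer([
    ('title', CountVectorizer(), 'title'),
    ('summary', CountVectorizer(), 'summary'),
    ('log_price', FunctionTransformer(np.log1p), ['price']),               # custom function as a transformer (tip 33)
    ('poly', PolynomialFeatures(degree=2, include_bias=False), ['price'])  # polynomial/interaction terms (tip 45)
])
print(ct.fit_transform(df).shape)
```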
Coding Practices
10. Set a "random_state" to make your code reproducible
36. Most parameters should be passed as keyword arguments
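These two practices are easiest to see in a single call; the values below are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# random_state makes the split reproducible across runs (tip 10), and passing
# everything except X and y as keyword arguments keeps the call readable and
# robust to signature changes (tip 36)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```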
Conclusion
Can I ask you a quick favor?
Request your certificate of completion
Take another course from Data School!
Earn money by promoting Data School's courses!