About Data School
FAQ
ML courses
FREE courses
Login
Master Machine Learning with scikit-learn
Chapter 1: Introduction
1.1 Course overview
1.2 scikit-learn vs Deep Learning
1.3 Prerequisite skills
1.4 Course setup and software versions
1.5 Course outline
1.6 Course datasets
1.7 Meet your instructor
Download the course files
List of all lessons
Chapter 2: Review of the Machine Learning workflow
2.1 Loading and exploring a dataset
2.2 Building and evaluating a model
2.3 Using the model to make predictions
2.4 Q&A: How do I adapt this workflow to a regression problem?
2.5 Q&A: How do I adapt this workflow to a multiclass problem?
2.6 Q&A: Why should I select a Series for the target?
2.7 Q&A: How do I add the model's predictions to a DataFrame?
2.8 Q&A: How do I determine the confidence level of each prediction?
2.9 Q&A: How do I check the accuracy of the model's predictions?
2.10 Q&A: What do the "solver" and "random_state" parameters do?
2.11 Q&A: How do I show all of the model parameters?
2.12 Q&A: Should I shuffle the samples when using cross-validation?
Chapter 2 Quiz
Chapter 2 Quiz Discussion
Chapter 3: Encoding categorical features
3.1 Introduction to one-hot encoding
3.2 Transformer methods: fit, transform, fit_transform
3.3 One-hot encoding of multiple features
3.4 Q&A: When should I use transform instead of fit_transform?
3.5 Q&A: What happens if the testing data includes a new category?
3.6 Q&A: Should I drop one of the one-hot encoded categories?
3.7 Q&A: How do I encode an ordinal feature?
3.8 Q&A: What's the difference between OrdinalEncoder and LabelEncoder?
3.9 Q&A: Should I encode numeric features as ordinal features?
Chapter 3 Quiz
Chapter 3 Quiz Discussion
Chapter 4: Improving your workflow with ColumnTransformer and Pipeline
4.1 Preprocessing features with ColumnTransformer
4.2 Chaining steps with Pipeline
4.3 Using the Pipeline to make predictions
4.4 Q&A: How do I drop some columns and passthrough others?
4.5 Q&A: How do I transform the unspecified columns?
4.6 Q&A: How do I select columns from a NumPy array?
4.7 Q&A: How do I select columns by data type?
4.8 Q&A: How do I select columns by column name pattern?
4.9 Q&A: Should I use ColumnTransformer or make_column_transformer?
4.10 Q&A: Should I use Pipeline or make_pipeline?
4.11 Q&A: How do I examine the steps of a Pipeline?
Chapter 4 Quiz
Chapter 4 Quiz Discussion
Chapter 5: Workflow review #1
5.1 Recap of our workflow
5.2 Comparing ColumnTransformer and Pipeline
5.3 Creating a Pipeline diagram
Chapter 5 Quiz
Chapter 5 Quiz Discussion
Chapter 6: Encoding text data
6.1 Vectorizing text
6.2 Including text data in the model
6.3 Q&A: Why is the document-term matrix stored as a sparse matrix?
6.4 Q&A: What happens if the testing data includes new words?
6.5 Q&A: How do I vectorize multiple columns of text?
6.6 Q&A: Should I one-hot encode or vectorize categorical features?
Chapter 6 Quiz
Chapter 6 Quiz Discussion
Chapter 7: Handling missing values
7.1 Introduction to missing values
7.2 Three ways to handle missing values
7.3 Missing value imputation
7.4 Using "missingness" as a feature
7.5 Q&A: How do I perform multivariate imputation?
7.6 Q&A: What are the best practices for missing value imputation?
7.7 Q&A: What's the difference between ColumnTransformer and FeatureUnion?
Chapter 7 Quiz
Chapter 7 Quiz Discussion
Chapter 8: Fixing common workflow problems
8.1 Two new problems
8.2 Problem 1: Missing values in a categorical feature
8.3 Problem 2: Missing values in the new data
8.4 Q&A: How do I see the feature names output by the ColumnTransformer?
8.5 Q&A: Why did we create a Pipeline inside of the ColumnTransformer?
8.6 Q&A: Which imputation strategy should I use with categorical features?
8.7 Q&A: Should I impute missing values before all other transformations?
8.8 Q&A: What methods can I use with a Pipeline?
Chapter 8 Quiz
Chapter 8 Quiz Discussion
Chapter 9: Workflow review #2
9.1 Recap of our workflow
9.2 Comparing ColumnTransformer and Pipeline
9.3 Why not use pandas for transformations?
9.4 Preventing data leakage
Chapter 9 Quiz
Chapter 9 Quiz Discussion
Intermission
Can I ask you a quick favor?
Chapter 10: Evaluating and tuning a Pipeline
10.1 Evaluating a Pipeline with cross-validation
10.2 Tuning a Pipeline with grid search
10.3 Tuning the model
10.4 Tuning the transformers
10.5 Using the best Pipeline to make predictions
10.6 Q&A: How do I save the best Pipeline for future use?
10.7 Q&A: How do I speed up a grid search?
10.8 Q&A: How do I tune a Pipeline with randomized search?
10.9 Q&A: What's the target accuracy we are trying to achieve?
10.10 Q&A: Is it okay that our model includes thousands of features?
10.11 Q&A: How do I examine the coefficients of a Pipeline?
10.12 Q&A: Should I split the dataset before tuning the Pipeline?
10.13 Q&A: What is regularization?
Chapter 10 Quiz
Chapter 10 Quiz Discussion
Chapter 11: Comparing linear and non-linear models
11.1 Trying a random forest model
11.2 Tuning random forests with randomized search
11.3 Further tuning with grid search
11.4 Q&A: How do I tune two models with a single grid search?
11.5 Q&A: How do I tune two models with a single randomized search?
Chapter 11 Quiz
Chapter 11 Quiz Discussion
Chapter 12: Ensembling multiple models
12.1 Introduction to ensembling
12.2 Ensembling logistic regression and random forests
12.3 Combining predicted probabilities
12.4 Combining class predictions
12.5 Choosing a voting strategy
12.6 Tuning an ensemble with grid search
12.7 Q&A: When should I use ensembling?
12.8 Q&A: How do I apply different weights to the models in an ensemble?
Chapter 12 Quiz
Chapter 12 Quiz Discussion
Chapter 13: Feature selection
13.1 Introduction to feature selection
13.2 Intrinsic methods: L1 regularization
13.3 Filter methods: Statistical test-based scoring
13.4 Filter methods: Model-based scoring
13.5 Filter methods: Summary
13.6 Wrapper methods: Recursive feature elimination
13.7 Q&A: How do I see which features were selected?
13.8 Q&A: Are the selected features the "most important" features?
13.9 Q&A: Is it okay for feature selection to remove one-hot encoded categories?
Chapter 13 Quiz
Chapter 13 Quiz Discussion
Chapter 14: Feature standardization
14.1 Standardizing numerical features
14.2 Standardizing all features
14.3 Q&A: How do I see what scaling was applied to each feature?
14.4 Q&A: How do I turn off feature standardization within a grid search?
14.5 Q&A: Which models benefit from standardization?
Chapter 14 Quiz
Chapter 14 Quiz Discussion
Chapter 15: Feature engineering with custom transformers
15.1 Why not use pandas for feature engineering?
15.2 Transformer 1: Rounding numerical values
15.3 Transformer 2: Clipping numerical values
15.4 Transformer 3: Extracting string values
15.5 Rules for transformer functions
15.6 Transformer 4: Combining two features
15.7 Revising the transformers
15.8 Q&A: How do I fix incorrect data types within a Pipeline?
15.9 Q&A: How do I create features from datetime data?
15.10 Q&A: How do I create feature interactions?
15.11 Q&A: How do I save a Pipeline with custom transformers?
15.12 Q&A: Can FunctionTransformer be used with any transformation?
Chapter 15 Quiz
Chapter 15 Quiz Discussion
Chapter 16: Workflow review #3
16.1 Recap of our workflow
16.2 What's the role of pandas?
Chapter 17: High-cardinality categorical features
17.1 Recap of nominal and ordinal features
17.2 Preparing the census dataset
17.3 Setting up the encoders
17.4 Encoding nominal features for a linear model
17.5 Encoding nominal features for a non-linear model
17.6 Combining the encodings
17.7 Best practices for encoding
Chapter 17 Quiz
Chapter 17 Quiz Discussion
Chapter 18: Class imbalance
18.1 Introduction to class imbalance
18.2 Preparing the mammography dataset
18.3 Evaluating a model with train/test split
18.4 Exploring the results with a confusion matrix
18.5 Calculating rates from a confusion matrix
18.6 Using AUC as the evaluation metric
18.7 Cost-sensitive learning
18.8 Tuning the decision threshold
Chapter 18 Quiz
Chapter 18 Quiz Discussion
Chapter 19: Class imbalance walkthrough
19.1 Best practices for class imbalance
19.2 Step 1: Splitting the dataset
19.3 Step 2: Optimizing the model on the training set
19.4 Step 3: Evaluating the model on the testing set
19.5 Step 4: Tuning the decision threshold
19.6 Step 5: Retraining the model and making predictions
19.7 Q&A: Should I use an ROC curve or a precision-recall curve?
19.8 Q&A: Can I use a different metric such as F1 score?
19.9 Q&A: Should I use resampling to fix class imbalance?
Chapter 19 Quiz
Chapter 19 Quiz Discussion
Chapter 20: Going further
20.1 Q&A: How do I read the scikit-learn documentation?
20.2 Q&A: How do I stay up-to-date with new scikit-learn features?
20.3 Q&A: How do I improve my Machine Learning skills?
20.4 Q&A: How do I learn Deep Learning?
Chapter 20 Quiz
Chapter 20 Quiz Discussion
Conclusion
Can I ask you a quick favor?
Request your certificate of completion
Take another course from Data School!
Earn money by promoting Data School's courses!
13.5 Filter methods: Summary
Lesson unavailable
Please log in to your account or buy the course.