3.9 Q&A: Should I encode numeric features as ordinal features?

Transcript:

Normally, when you have a continuous numeric feature such as Fare, you pass that feature directly to your machine learning model. However, one strategy that is sometimes used with numeric features is to "discretize" or "bin" them into categorical features. In scikit-learn, we can do this using KBinsDiscretizer.

When creating an instance of KBinsDiscretizer, you define the number of bins, the binning strategy, and the method used for encoding the result. Here's the output when we pass the Fare feature to the fit_transform method.
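Here's a minimal sketch of that code, assuming the Fare values live in a pandas DataFrame (the sample values below and the quantile strategy are illustrative assumptions, not taken from the transcript):

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Assumed: a DataFrame with a numeric Fare column (Titanic-style data)
df = pd.DataFrame({'Fare': [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08]})

# 3 bins, ordinal encoding (one bin index per sample), quantile binning strategy
kb = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')

# fit_transform expects 2D input, so we pass a one-column DataFrame
fare_binned = kb.fit_transform(df[['Fare']])
print(fare_binned)    # each row is 0.0, 1.0, or 2.0
print(kb.bin_edges_)  # the learned bin boundaries for Fare
```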

Because we specified 3 bins, every sample has been assigned to bin 0, 1, or 2. The smallest values were assigned to bin 0, the largest values to bin 2, and the values in between to bin 1. Thus, we've taken a continuous numeric feature and encoded it as an ordinal feature, which could be passed to the model in place of the original numeric feature.
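As a sketch of what "passing the ordinal feature to the model" could look like in practice, here's one way to bin Fare inside a pipeline so the model receives the bin index instead of the raw values. The extra Pclass column, the target values, and the LogisticRegression model are assumptions for illustration only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative data: Fare plus one passthrough column and a made-up binary target
X = pd.DataFrame({'Fare': [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08],
                  'Pclass': [3, 1, 3, 1, 3, 3, 1, 2]})
y = [0, 1, 1, 1, 0, 0, 0, 1]

# Bin Fare into 3 ordinal bins; pass the remaining columns through unchanged
preprocessor = ColumnTransformer(
    [('fare_bins', KBinsDiscretizer(n_bins=3, encode='ordinal'), ['Fare'])],
    remainder='passthrough')

# The model sees the bin index (0, 1, or 2) in place of the raw Fare values
pipe = Pipeline([('prep', preprocessor), ('model', LogisticRegression())])
pipe.fit(X, y)
```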

The obvious follow-up question is: Should we discretize our numeric features? Theoretically, discretization can benefit linear models by helping them to learn non-linear trends, since a binned feature (especially when one-hot encoded) lets the model fit a step function rather than a single straight-line relationship. However, my general recommendation is to not use discretization, for three main reasons.

First, discretization removes nuance from the data: every value within a bin is treated as identical, which makes it harder for a model to learn the actual trends that are present in the data.

Second, discretization reduces the variation in the data, which makes it easier to find trends that don't actually exist.

Third, any possible benefits of discretization are highly dependent on the parameters used with KBinsDiscretizer. Making those decisions by hand creates a risk of overfitting the training data, while making them during a tuning process adds both complexity and processing time, so neither option is particularly attractive to me.
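For context on that last point, here's a rough sketch of what tuning the number of bins during a search might look like (the pipeline layout, grid values, and scoring choice are assumptions for illustration, not taken from the transcript):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Same pattern as the earlier sketch: bin Fare, pass other columns through, fit a model
pipe = Pipeline([
    ('prep', ColumnTransformer(
        [('fare_bins', KBinsDiscretizer(n_bins=3, encode='ordinal'), ['Fare'])],
        remainder='passthrough')),
    ('model', LogisticRegression())])

# Each candidate value of n_bins multiplies the number of cross-validated fits,
# which is the added complexity and processing time mentioned above
param_grid = {'prep__fare_bins__n_bins': [3, 4, 5, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
# grid.fit(X, y) would run 4 parameter settings x 5 folds = 20 model fits
```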