In Part I, we discussed the advantages of ordinal encoding over One-hot encoding for categorical variables of any type when they are used with tree-based algorithms. We also introduced the Rainbow method for identifying the most appropriate ordinal encoding for different types of categorical variables.

Here in Part II, we continue exploring the Rainbow method, this time from an empirical standpoint. We will illustrate its effectiveness using a real project developed in the Data Science organization at MassMutual - a life insurance company with over 170 years of history. MassMutual proudly invests in a large team of data scientists, engineers, and technologists to inform critical business decisions.

Business Use Case

The task is to predict one of five Mindset Segments for each prospective customer. Essentially, it is a multiclass classification problem.

Figure 1

This segmentation framework represents five classes that reflect a person's age, financial stability, and attitude towards financial decisions. The predicted segments are used by the MassMutual marketing team in different types of campaigns for targeting and customization.

For example, Mindset A customers would value more independence and autonomy in deciding on buying a life insurance policy whereas Mindset B customers would value having guidance and thorough explanations of different financial products by a dedicated advisor.

We have a small set of labeled individuals (17.5K persons). The labels are provided by a MassMutual vendor who designed the segment assignment rules. We first attached columns from our main prospect database to this data. The goal is to learn the best model using these target labels and available features and then predict segments for all other (unlabeled) prospective customers.

The main database for this project is provided by Acxiom and covers about 300 columns representing a rich set of demographic characteristics, like composition of the household, propensities of income and net worth, predictions of financial behavior and digital savviness.

Using Acxiom data and the Mindset Segmentation project, we will compare conventional One-hot encoding with ordinal encoding via the Rainbow method. For this 5-class classification task, we will report a few standard metrics - Accuracy, Cohen's Kappa, Macro Avg F1 score, and Macro Avg AUC ROC. Accuracy is included mainly for ease of interpretation and comparison, while the other metrics are better suited to unbalanced multiclass classification problems.
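To see why Accuracy alone can mislead on unbalanced classes, here is a minimal pure-Python sketch of Cohen's Kappa and Macro Avg F1. It is an illustration only, not the code used in the project, and the toy labels are invented. The "model" always predicts the majority class.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance.
    Assumes chance agreement is below 1 (at least two classes in play)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal class frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented toy data: 8 rows of class "A", 2 rows of class "B",
# and a model that predicts "A" for everyone.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
kappa = cohen_kappa(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(accuracy, kappa, f1)
```

On this toy data Accuracy looks respectable even though the model never finds class "B", while Kappa and Macro Avg F1 penalize it heavily.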

Categorical Variables

We took all the categorical variables in the Acxiom database - interval, ordinal, and nominal - and excluded quantitative and binary variables. The idea is to demonstrate the pure difference in model performance between the two types of encoding for the same set of categorical factors.

We then applied a target stratified 4-fold Cross-Validation split. All the data processing from this point on is done inside the cross-validation loop, i.e. the creation of One-hot features and Rainbow features is learned from each fold train set and applied to each fold validation set.
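The idea of fitting the encoding inside the cross-validation loop can be sketched as follows. This is a simplified pure-Python illustration under our own assumptions (invented toy data, a deterministic round-robin stratified split without shuffling, and the hypothetical helpers `stratified_folds`, `fit_one_hot`, and `apply_one_hot`), not the project pipeline. It shows One-hot columns being learned from the train part of each fold and then applied to the validation part, so the column set can differ slightly from fold to fold.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits=4):
    """Deal the rows of each class round-robin into folds so that every
    fold keeps roughly the same class proportions (no shuffling here)."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_splits)]
    for idxs in by_class.values():
        for pos, i in enumerate(idxs):
            folds[pos % n_splits].append(i)
    return folds

def fit_one_hot(values):
    """Learn the One-hot columns from the training values only."""
    return sorted(set(values))

def apply_one_hot(values, columns):
    """Encode values with previously learned columns; a category that was
    not seen in training becomes an all-zero row."""
    return [[int(v == c) for c in columns] for v in values]

# Invented toy data: one categorical feature and a two-class target.
colors = ["red", "blue", "red", "green", "blue", "red", "blue", "red"]
labels = ["A", "A", "B", "B", "A", "A", "B", "B"]

columns_per_fold = []
for val_idx in stratified_folds(labels, n_splits=4):
    val_set = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in val_set]
    # Fit on the train part of the fold, apply to the validation part.
    columns = fit_one_hot([colors[i] for i in train_idx])
    X_val = apply_one_hot([colors[i] for i in val_idx], columns)
    columns_per_fold.append(columns)
    print(columns, X_val)
```

In the fold where the only "green" row falls into validation, the learned column set has no "green" column, which is exactly the kind of leakage-free behavior the loop is meant to guarantee.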

The total set of 111 variables was transformed into 121 Rainbow features and, alternatively, into 2260 One-hot features (with very slight deviations in this number in 4 different folds).

Table 1

Type of variable | N raw | N Rainbow encoded | N One-hot encoded
Interval         |    64 |                64 |              1670
Ordinal          |    14 |                14 |               178
Nominal          |    33 |                43 |               412
Total            |   111 |               121 |              2260

Interval and ordinal variables have a straightforward Rainbow transformation: the 64 interval features turned into 64 Rainbows, and the 14 ordinal features turned into 14 Rainbows. The transformation of nominal variables was a bit more involved. For 23 of the 33 nominal variables we found a natural attribute Rainbow, while for each of the 10 remaining variables we created 2 artificial Rainbow features. Since we deal with 5 classes, we applied correlation ordering for one randomly chosen class and target percent ordering for another randomly chosen class (see Part I).

For example, for the original categorical variable "Financial_Cluster" we made features
Financial_Cluster_Mindset_B_correlation_rank
and
Financial_Cluster_Mindset_D_target_percent

In this way, 33 raw nominal variables turned into 43 Rainbows (23 natural plus 2 x 10 artificial). The choice between natural attribute Rainbows and artificial Rainbows is highly project- and context-specific, and is more of an art than a science. We invite you to experiment with it and to rely on model simplicity, performance, and interpretability when making the final choice.
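As an illustration of an artificial Rainbow, the sketch below orders the categories of a nominal variable by the share of rows that belong to one chosen target class. This is our reading of target percent ordering, written in plain Python on invented data; the exact definition is given in Part I, and the function name `target_percent_order`, the cluster codes, and the alphabetical tie-break are assumptions made for this example.

```python
from collections import Counter

def target_percent_order(categories, labels, target_class):
    """Map each category to a rank, ordered by the fraction of its rows
    whose label equals target_class (ties broken alphabetically)."""
    totals = Counter(categories)
    hits = Counter(c for c, y in zip(categories, labels) if y == target_class)
    share = {c: hits[c] / totals[c] for c in totals}
    ordered = sorted(totals, key=lambda c: (share[c], c))
    return {c: rank for rank, c in enumerate(ordered)}

# Invented toy data: a nominal cluster code and a Mindset label per row.
clusters = ["c1", "c1", "c2", "c2", "c2", "c3", "c3", "c3", "c3"]
segments = ["B", "D", "D", "D", "B", "A", "B", "A", "C"]

# Share of Mindset D per cluster: c3 = 0/4, c1 = 1/2, c2 = 2/3.
mapping = target_percent_order(clusters, segments, target_class="D")
encoded = [mapping[c] for c in clusters]
print(mapping, encoded)
```

Because this ordering uses the target, in a cross-validated setup the mapping would be fitted on the train part of each fold only and then applied to the validation part.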

Unlike ordinal encoding, One-hot transformation generates over two thousand features. Why do we make One-hot features for interval and ordinal variables here? Because we want to compare Rainbow with One-hot on the full continuum of possible orderings - from perfect order to fuzzy order to no order (or wrong order). Also, classifying a variable as ordinal or nominal is sometimes a subjective decision.

As we discussed in Part I, Color is considered nominal by some modelers and ordinal by others. To be fair, from a statistical standpoint Color is nominal, but machine learning scientists might use its Hue attribute to turn it into an ordinal variable for modeling purposes.

At first, we pool all the categorical variables together. Later in the article, we separate interval, ordinal, and nominal variables and analyze their outcomes individually.

We ran XGBoost models covering the following hyperparameter space:

'objective': 'multi:softprob'
'eval_metric': 'mlogloss'
'num_class': 5
'subsample': 0.8
'max_depth': [2, 3, 5]
'eta': [0.1, 0.3, 0.5]
'n_estimators': [50, 100, 200]
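To make the size of this search explicit, the sketch below enumerates every combination of the three varied hyperparameters with the fixed settings attached. It only builds the parameter dictionaries with the standard library and does not call XGBoost; the variable names `fixed`, `grid`, and `configs` are ours.

```python
from itertools import product

# Settings shared by every model.
fixed = {
    "objective": "multi:softprob",
    "eval_metric": "mlogloss",
    "num_class": 5,
    "subsample": 0.8,
}

# Settings that vary across models.
grid = {
    "max_depth": [2, 3, 5],
    "eta": [0.1, 0.3, 0.5],
    "n_estimators": [50, 100, 200],
}

names = list(grid)
configs = [
    {**fixed, **dict(zip(names, values))}
    for values in product(*(grid[name] for name in names))
]

print(len(configs))   # number of hyperparameter combinations
print(configs[0])
```

The grid has 3 x 3 x 3 = 27 combinations. If each one is fitted on 4 folds for both encodings, that amounts to 27 x 4 x 2 = 216 model fits.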

We don't try maximum depth values higher than 5 because of the relatively small data size. The training data is further reduced by the cross-validation split and by the XGBoost subsample parameter. If we expect at least 100 samples at the end of each branch, the max depth should be limited to about 5-6. Setting it higher would make the model considerably more complex and prone to overfitting. And, especially for the goals of this analysis, we'd rather err on the side of simplicity.
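The depth limit can be checked with a back-of-the-envelope calculation. The sketch below assumes a perfectly balanced tree whose leaves share the rows evenly, which is a simplification of ours (real trees are rarely balanced); the row count, the 4-fold split, the subsample rate, and the 100-sample target come from the text above.

```python
n_labeled = 17_500        # labeled individuals
train_fraction = 3 / 4    # 4-fold CV: three folds are used for training
subsample = 0.8           # XGBoost row subsampling per tree
min_leaf = 100            # desired minimum number of samples per leaf

# Rows that a single tree sees on average.
rows_per_tree = n_labeled * train_fraction * subsample

# A balanced binary tree of depth d has 2**d leaves; find the largest d
# that still leaves at least min_leaf rows in each leaf.
depth = 0
while rows_per_tree / 2 ** (depth + 1) >= min_leaf:
    depth += 1

print(round(rows_per_tree), depth, rows_per_tree / 2 ** depth)
```

Under these assumptions each tree sees about 10,500 rows, depth 6 leaves roughly 164 rows per leaf, and depth 7 would drop to about 82, which is consistent with a limit of about 5-6.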

All the results below represent cross-validation average metrics.

Aggregate Results

Let us start with the overall averages across all runs. Clearly, the average metrics across all models are higher for Rainbow ordinal encoding. The overall difference is a few percentage points.

Figure 2

Hyperparameters

The following plots show metric dynamics for each hyperparameter keeping all other hyperparameters constant. These plots also clearly demonstrate that Rainbow outcomes exceed One-hot outcomes for each hyperparameter and each metric.

Figure 3

Runtime

Next, let's compare the runtime for each method.

One-hot: 65.059 s
Rainbow:  5.491 s

On average, a single Rainbow model runs almost 12 times faster than a single One-hot model (65.059 s / 5.491 s ≈ 11.8)! So, in addition to notably increasing model performance metrics, the Rainbow method can also save data scientists a huge amount of time.

Interval, Ordinal, and Nominal

Next, we ran the models that covered the bundles of interval, ordinal, and nominal features separately. Below are the results.

Figure 5

These results demonstrate again that Rainbow is preferred to One-hot. As expected, interval and ordinal features gain the most from Rainbow encoding - less so for nominal variables.

Clearly, the more clearly defined the category order, the greater the benefit of preferring Rainbow to One-hot. For nominal variables, model performance with Rainbow is about the same as, or negligibly lower than, with One-hot. However, even in that case, the same performance is achieved in considerably less time and with substantially less space, and the resulting model is significantly simpler.

Feature Selection

Finally, to make a comparison fairer in terms of dimensionality, we picked the top 10, top 50, and top 100 features from each feature set (Rainbow and One-hot). We used the feature importance attribute of the XGBoost model and aggregated feature importance scores for 4 cross-validation folds on the best hyperparameter set for each encoding type. Below are the results.
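The aggregation step can be sketched as follows. We assume here that the per-fold importance scores are summed and that a feature missing from a fold contributes zero; the text above does not specify the aggregation, so treat this as one reasonable choice. The feature names, the scores, and the helper `top_k_features` are invented for illustration.

```python
from collections import defaultdict

def top_k_features(importances_per_fold, k):
    """Sum each feature's importance over the folds and return the k
    features with the highest total (ties broken alphabetically)."""
    totals = defaultdict(float)
    for fold_scores in importances_per_fold:
        for feature, score in fold_scores.items():
            totals[feature] += score
    ranked = sorted(totals, key=lambda f: (-totals[f], f))
    return ranked[:k]

# Invented importance scores for 4 cross-validation folds.
importances_per_fold = [
    {"age_band": 0.5, "income_band": 0.3, "region": 0.2},
    {"age_band": 0.4, "income_band": 0.4, "region": 0.2},
    {"age_band": 0.6, "income_band": 0.3, "hobby": 0.1},
    {"age_band": 0.5, "income_band": 0.2, "region": 0.3},
]

top_2 = top_k_features(importances_per_fold, k=2)
print(top_2)
```

The same helper can be called with k=10, k=50, and k=100 on each encoding's importance scores to build the reduced feature sets.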

Figure 6

The Rainbow method easily outperforms One-hot, especially for a small number of features. Rainbow also reaches peak performance more quickly, using fewer features: it is already near its peak with only 10 features, while One-hot needs 50-100 features to reach a similar level!

Rainbow even shows better results for 50 features than One-hot does for 100. Also note that when dropping from 50 to 10 features, the reduction in Macro-F1 if you use One-hot is 6 times that of the Rainbow method (3 times for Kappa and Accuracy, 2 times for Macro-AUC).

Conclusion

Using the example of the Mindset Segmentation model at MassMutual, we have illustrated that ordinal encoding via the Rainbow method outperforms One-hot encoding. It saves modelers a great amount of time, substantially reduces dimensionality, and provides an organic framework for feature selection. If the chosen Rainbow order agrees with the data generating process, this encoding also contributes to notable improvements in model performance metrics.

In Part III, we provide a mathematical justification for these empirical results and formally demonstrate the advantages of the Rainbow ordinal encoding method over One-hot for tree-based algorithms.