# Encoding Categorical Variables. The Rainbow Method

When the nature gives you a rainbow, take it.

"I have 2000 features and I need to get to 50 features and keep model as good or better" was how this all started. This is familiar to anyone in institutional data science where you need to carefully vet every model and look at each feature to make sure it is fine from the standpoint of regulation. What is valuable about this story is that it is based on a real production model that was developed by Anna Arakelyan at Mass Mutual and a method suggested by Dmytro Karabash. We will get to detailed results at the end but here's a preview:

The rainbow method outperforms the one-hot easily. In fact, the rainbow method even shows better result for 50 features than one-hot does for 100. Also note that when droping from 50 to 10 features, the reduction in macro-f1 if you use one-hot is 6 times that of the rainbow method (3 times for kappa and accuracy, 2 times for macro-auc).

Both models actually keep improving as the number of features goes up. Though the improvement is substantially faster and bigger for the rainbow method.

Data encoding is the crucial part of any data science pipeline. No matter which machine learning algorithm is used - there is no avoiding of data cleaning and encoding. Especially with dirty and complex real data sets, ensuring the most appropriate and efficient way of feature encoding and engineering is a nontrivial problem. While encoding quantitative and binary columns is usually a straightforward task, the encoding of categorical variables deserves a deeper look.

One of the most popular encoding methods for categorical variables has been the One-Hot procedure. It generates an indicator variable for every category and, thus, creates a set of K new features, where K is the number of categories.
One significant implication of this method is the dimensionality increase. Depending on the magnitude of K, it can have various undesirable technical consequences: substantial raise of computational complexity, loss of degrees of freedom (N-K, where N is the number of samples), multicollinearity, and, ultimately, an overfitted model. Non-technical consequences include unnecessary complication of the model that should be communicated to final users and that contradict the law of parsimony aka Occam's razor.

We call it rainbow from a very simple analogy: if you have a categorical feature with names "Red", "Orange", "Yellow", "Green", "Blue", "Indigo" and "Violet", in other words - colors of the rainbow, instead of one hot encoding, you can simply create a one feature with encoding:

0 = Red

1 = Orange

2 = Yellow

3 = Green

4 = Blue

5 = Indigo

6 = Violet

This would replace seven one-hot features with one rainbow.

Most of the nominal variables, from the first glance, seem like they cannot be converted to a quantitative scale. This is where we suggest **finding a rainbow**.

The seemingly unordered categories acquire a perfect order if we find a scale, where each category fits into its own unique place. With the color the natural scale is Hue. But it is not the only one. We can think of, say, brightness, saturation, color temperature, etc. We invite you to experiment with a few different rainbows that might capture different nuances of the categorical quality.

You can actually use two or more rainbows on a features depending on the number of categories K and the context.

We don't recommend to use much more than $\sqrt{K}$ rainbows as we explain below.

Why is rainbow better than any other choice? It is first thing that comes to mind and you do not have to think about it. In many cases it is not really that necessary which rainbow you choose (and by this we mean ordering of these colors), most of the would be better than using 1-hot, there are just some natural ones, which are more likely to be better than other (and easier to remember).

Hence our motto:"When a nature gives you a rainbow, take it..."

Important note here: this method is highly efficient in conjunction with the models that rely on variables as ranks rather than exact values. For example, decision trees, random forest, gradient boosting - these algorithms will output the same result if the variable *Number of Children* is coded as [1, 2, 3, 4, 5, ...] or as [-100, -85, 0, 10, 44, ...] as long as the categories have a correct order, and we use correct interpretation of non-meaningful values.
The application of our method for other algorithms such as Linear Regression, Logistic Regression is out of scope of this article. We expect that such way of feature engineering would be still beneficial, but this is a subject of a different investigation.

Note that by using a rainbow rather than One-Hot we are not losing any of the signal, because, even if the categories are wildly different and each one introduces a substantial gain to the model, the algorithm such as *xgboost* would capture that by making all the needed splits when introducing new tree nodes.

The application of our method depends highly on the level of measurement. While **quantitative** variables have a *ratio* scale, i.e. they have a meaningful 0, ordered values, and equal distances between values; **categorical** variables usually have either *interval*, or *ordinal*, or *nominal* scales.
Let us illustrate our method for each of these types of categorical variables.

**Interval** variables have ordered values, equal distances between values, but the value itself is not necessarily meaningful, for example, 0 does not always mean a complete absence of a quality. The common examples of interval variables are Likert scales:

"How likely is the person to buy a smartphone mobile phone?"

1 = Very Unlikely

2 = Somewhat Unlikely

3 = Neither Likely, Nor Unlikely

4 = Somewhat Likely

5 = Very Likely

In a straightforward way, if we simply use the raw values 1 through 5, that will save us dimensionality without losing a single bit of information. Indeed, the algorithm such as *xgboost* will make splits on this scale to introduce a new node any number of times it finds the best, instead of being forced to use the splits predetermined by One-Hot. With a large K, the algorithm would be forced into using big number of predermined splits which might be simply an overfit.

**Ordinal** variables have ordered values that are meaningless, and the distances between values are also not equal or not even explainable. An example:

"What is the highest level of Education completed by the person?"

1 = No High School

2 = High School

3 = Associate Degree

4 = Bachelor's Degree

5 = Master's Degree

6 = Doctoral Degree

Similar to interval variables, ordinal numeric codes can be used in the model without introducing any challenges, as long as the order is correct. In some cases, a variable is intrinsically ordinal, but the given numeric codes are not in the correct order - in these situations we could simply reorder categories and then use this updated variable as a quantitative feature.

Interval and ordinal categorical variables are probably not raising concerns as they are clearly perfect alternative to One-Hot for tree-based algorithms. The more complicated and non-obvious question is how to treat nominal variables.

**Nominal** variables have no obvious order between categories, and are almost always encoded with One-Hot method. A nice example of a nominal variable is

"Color"

(For instance, which color marker a person have chosen in some psychological state)

A = Red

B = Blue

C = Green

D = Yellow

From the first glance it seems like we can't convert this variable to a quantitative scale. This is where our proposed method will help! We suggest **finding a rainbow**.

These seemingly unordered categories acquire a perfect order if we find a certain scale where each category fits into a unique place. With the color, such scale could be the Hue! In other words, the wavelength of the light is the factor the helps us order the colors into a perfect scale - a rainbow! So, following that logic, we would engineer a new feature:

1 = Blue

2 = Green

3 = Yellow

4 = Red

The Hue is not the only way to order the colors. We could think of a few different scales, such as brightness, saturation, color temperature, lightness, etc. We invite you to experiment with a few different "rainbows" that might capture different nuances of the categorical quality.

Note that by using a rainbow rather than One-Hot we are not losing any of the signal, because, even if the categories are wildly different and each one introduces a substantial gain to the model, the algorithm such as *xgboost* would capture that by making all the needed splits when introducing new tree nodes.

Lets us show a few other examples of creating rainbows.

"Marital Status"

A = Married

B = Single

C = Inferred Married

D = Inferred Single

If we think about Single and Married as the two ends of the spectrum, then Inferred Single could be between the two, closer to Single while Inferred Married between the two closer to Married. That would make sense because being Inferred holds some degree of uncertainty. Thus, the following order would be reasonable:

1 = Single

2 = Inferred Single

3 = Inferred Married

4 = Married

In case there are any missing values here, the Unknown category fits exactly in the middle between Single and Married as there is no reason to prefer one end over the other. So the modified scale could look like this:

1 = Single

2 = Inferred Single

3 = Unknown

3 = Inferred Married

4 = Married

Consider another variable:

"Occupation"

1 = Professional/Technical

2 = Administration/Managerial

3 = Sales/Service

4 = Clerical/White Collar

5 = Craftsman/Blue Collar

6 = Student

7 = Homemaker

8 = Retired

9 = Farmer

A = Military

B = Religious

C = Self Employed

D = Other

Finding a rainbow here might be harder, but here are a few ways to do it - we could order occupations by average annual salary, by its prevalence in the geographic area of interest, or by information from some other dataset. That might involve calling a Census API or some other data source, and might be complicated by the fact that these values are not static, but they are still viable solutions.

In some situations though we cannot find a logical order for the rainbow because the variable itself is not interpretable. An example could be a black box column made by a third party:

"Financial Cluster"

1 = Market Watchers

2 = Conservative Wealth

3 = Specific Savers

4 = Tried and True

5 = Trendy Inclinations

6 = Current Consumers

7 = Rural Trust

8 = City Spotlight

9 = Career Conscious

10 = Digital Financiers

11 = Financial Futures

12 = Stable Influentials

13 = Conservatively Rural

In this case we might not have a clear idea how to order categories due to lack of knowledge of what each category entails. What to do with such variables?

We recommend creating an artificial rainbow by looking at how each category is related to the target variable. In case of a binary classification problem we have a binary target variable, and we could construct a rainbow at least two different ways.

First way is to place categories in the order of correlation with target variable. So the category with the highest value of correlation with the dependent variable would acquire numeric code 1, and the category with the lowest correlation would acquire numeric code 13. So our rainbow in this case would mean the relationship between the cluster and the target variable.

Second way is similar to the first one, but instead of correlation, we could look at the percent of target variable taking value of 1 (positive) given each of the categories. Suppose, among Market Watchers percent of positive targets is 0.67, while for Conservative Wealth it is 0.45. In that case, Market Watchers will be ordered higher than Conservative Wealth (or lower, if the target percent scale is ascending). In other words, this rainbow would reflect the prevalence of positive target inside each category.

In case of multiclass classification, we could create rainbows for each class to represent relationship between categories and each class separately. In the case of regression and continuous target, we could rely on the correlation method mostly.

We briefly described the Rainbow method, and below we provide theoretical justification for it and then an empirical application that illustrates its superiority over One-Hot.

We will illustrate the effectiveness of the rainbow encoding method using the real data science project developed in the Customer Journey domain of Data Science group at MassMutual - a life insurance company with a team of over 200 top-notch data scientists, engineers, and technologists.

In a nutshell, the data task is a multiclass classification problem that aims to predict one of the five Mindset Segments for each of the prospective customers.

Segment | Description |
---|---|

Self-Assured | Confident, in control, and middle-aged, often with families |

Juggler | (>= 35 years old) Younger families, less confident in their finances |

Starter | (< 35 years old) Youngest segment, with lower levels of financial confidence due to lack of experience |

Day to Day | Older, worried, and pessimistic about their finances |

Well-Established | In control, and highly satisfied with financial situation |

The segmentation framework represents five classes that reflect a person's age, financial stability, and attitude towards financial decisions. The predicted segments are then used by marketers in different types of campaigns for targeting and customization. For example, Self-Assured customers would value more independence and autonomy in making decision of buying a life insurance policy whereas Day to Day customers would value having a guidance and a long and thorough explanations of different financial products by a dedicated advisor.

The true segment labels come from MassMutual vendors that ran the mindset survey in 2017, 2018, and 2020. The total size of data is 17.5K rows. The main database we use for this problem is provided by Acxiom and covers about 300 columns representing rich set of demographic characteristics, composition of the household, income and net worth, financial behavior, and digital savvyness.

Using Acxiom data and the Mindset Segmentation prediction task, we will compare the conventional One-Hot encoding with the Rainbow method. For a task of 5-class classification we will demonstrate the following metrics - accuracy, Cohen's kappa, and a few other standard metrics

Cohen's Kappa is one the best metrics for an unbalanced multiclass classification problem. Accuracy is used for a simple interpretation purposes.

First of all, we took all existing categorical variables - interval, ordinal, and nominal, and excluded any other variables - quantitative and binary. We would like to compare the model performance using the two types of encoding for the same set of catogorical factors.

We then applied a target stratified 4-fold Cross Validation split. All the data processing from this point on is done inside the cross validation loop, i.e. the creation of One-Hot features and Rainbow features is learned from each fold train set and applied to each fold validation set.

The total set of 111 variables was transformed into **201** Rainbow features and, alternatively, into **2260** One-Hot features (with very slight deviations in N in 4 different folds).

Type of variable | N raw | N Rainbow encoded | N One-Hot encoded |
---|---|---|---|

Interval | 64 | 64 | 1670 |

Ordinal | 14 | 14 | 178 |

Nominal | 33 | 123 | 412 |

Total | 111 | 201 |
2260 |

One can notice that number of raw and Rainbow encoded features is the same (64 and 14), while 33 raw nominal features turned into 123 Rainbow features. That is because interval and ordinal features have a straightforward rainbow tranformation whereas there were two kinds of nominal variables. Out of 33 nominal variables, for 23 we found a natural rainbow, while for 10 variables we applied correlation ordering and target percent ordering. Since we deal with 5 classes, we made 10 new features for each of these variables. For example, given the name of feature "Financial_Cluster" and 5 segment names, we made features

- Financial_Cluster_Self-Assured_correlation_rank
- Financial_Cluster_Juggler_correlation_rank
- Financial_Cluster_Starter_correlation_rank
- Financial_Cluster_Day to Day_correlation_rank
- Financial_Cluster_Well-Establishes_correlation_rank

and - Financial_Cluster_Self-Assured_target_percent
- Financial_Cluster_Juggler_target_percent
- Financial_Cluster_Starter_target_percent
- Financial_Cluster_Day to Day_target_percent
- Financial_Cluster_Well-Establishes_target_percent

In this way, 33 raw nominal variables turned into 123 Rainbows.

It is important to note that the search for natural or non-natural Rainbows is highly project and context specific, and is more of an art than a craft. For instance, for a binary classification problem, there probably would have been only one or two Rainbows for each categorical feature given a single target class.

Let us start from overall averages for all runs:

Clearly, the average metrics across all models are notably higher for Rainbow encoding. The following plots show dynamic of metrics depending on every hyperparameter. These plots also clearly demonstrate the superiority of Rainbow method for every hyperparameter and every metric.

Next, let's compare the runtime for each method. Average model running time is about 9 times faster for Rainbow method than for One-Hot. So, in addition to a substantial increase in model performance metrics, Rainbow method can save data scientists huge amount of time.

Below are the results of the models that applied two types of encoding to interval, ordinal, and nominal features separately.

As expected, interval and ordinal features gain the most from Rainbow encoding, while nominal variables - less so.

Finally, to make a comparison more fair in terms of dimensionality, we picked top 10, top 50, and top 100 features from each feature set - Rainbow and One-Hot. We used feature importance attribute of the XGBoost model and aggregated feature importance scores for 4 cross validation folds on the best hyperparameter set for each encoding type. Below are the results.

*weighted in the table above is class-weighted.

Kappa is significantly higher for the Rainbow method, especially with the lower number of selected features. As mentioned before rainbow with 50 features is better than one-hot with 100 features, we just see this that is true in all the regularly used metrics. Note the drop for macro-f1 score drop from 50 to 10 features in one-hot versus rainbow.

As shown above, the Rainbow method is an elegant and simple way to encode categorical variables, that will significantly reduce the data dimensionality without losing any part of valuable signal, that will likely cause substantial improvements in model performance metrics (or, at the very least, will not cause any reductions in metrics), and that will save great amount of time for modelers.

Finally, we should note that this article just opens the conversation about the Rainbow method, and by no means exhausts the topic. In the potential future investigations, we could explore some other aspects. To name a few: binary and continuous target variables; comparison with other dimensionality reduction methods, such as PCA; how missing values fit the Rainbow framework; is there any limit and implications of combining seemingly unrelated features into a single Rainbow. We hope to open the gate for further questions and feedback on this method.

Aside from reducing number of variables which was our main goal, we would also like to understand why do rainbows produce better results.

We wouldn't completely satisfy Einstein here, but base our foundation on assumption that pruning of the trees that maximize unregularized log-loss producing supperior values. In fact regulariation of trees, pruning of trees and various early stopping rules are all trying to achieve the same goal, which is often summarized in "avoid overfit", yet let us examine what that really means in our case by case basis.

Imagine you have K categories for each the optimal values to be added to the nodes are 1, 2, ..., K in some order not necessarily random. If that was the case both on-hot encoding and rainbow method would require K-1 comparisons. We have a tie. That would be the case where all categories should be distinquished with different values at the end-nodes, which is rarely the case.

To make things simple lets now assume we have 4 features and the optimal values at the end nodes are 1, 1.01, 2, 2.01 Assume further that $\epsilon$ 0.01 is low enough so that the first and second pair of of categories should not be distinguished (or the nodes that distinguished should be pruned).

That is the whole definition of "avoid overfit", the point here is that at some point signal to uncertainty ratio is to low to make a distinction and that is why xgboost has at least 7 hyperparameters that should help in not distinquishing these pairs: max_depth, gamma, min_child_weight, lambda, alpha, min_split_loss, max_leaves

Definitions of these hyperparameters can be found here https://xgboost.readthedocs.io/en/latest/parameter.html

Now to explaination how each one of these would prevent between split of first and second categories:

Taking the above 1, 1.01, 2, 2.01 example let us look at how each of these 7 hyp

max_depth: disallows too many splits so that on each split only best splits would be chosen, so given your max_depth<K-2, you would not get to split the first and second category into different branches (those a next tree might).

gamma, if this one is $<0.01$, it would not split as the difference is smaller than gamma.

min_child_weight, if each category has less children that min_child_weight it would not split down to one category,

lambda: these will help not assign the tree that does split a very small weight

alpha: similar as lambda

min_split_loss: similar to gamma except it would truncate later

max_leaves: similar to max_depth

Above is just one example of how overfit might come into question, there are many other ways and for this particular reasn there are a few more parameters hat are part of xgboost parameters (rather than tree building hyperparameters described above). These are num_boost_round and early_stopping_rounds, which both determine num of trees which will be in the final model. To keep it simple let us stick with num_boost_round and for simplicity let us also make max_depth=1.

So in this case it makes sense to analyse the number of trees you are going to use (sing max_depth=1) it is a simple logistic model based on a number of comparisons. In this case rainbow method would only require one which will split inbetween 2nd and 3rd category, while one-hot would require 2.

What if these were randomly permuted: then one-hot would always require 2 (it actually doesn't care about permutation), and rainbow would depend on the permutation (I will leave only integer parts):

for 1,1,2,2 and 2,2,1,1 it would be 1;

for 1,2,2,1 and 2,1,1,2 it would require 2

and for 1,2,1,2 and 2,1,2,1 it would require 3

So it seems in third of the cases we are better, in third same and in a third worse. So our hope is that natural rainbows are close to 1,1,2,2 situation rather than 1,2,1,2 one. However if we had two rainbows that would list (1,2,3,4) and (1,3,2,4) cagories we would always be able to reduce it to 1 or two comparisons.

The reason we suggest not to use more than $log_2 K$ rainbows is not to surpass a number of encodings in what is called binary-one-hot-encoding. One easy way to create $log_2 K$ out of any rainbow is by rotating binary digits of your first rainbow; note rotating not permuting as if you would permute you would get $(log_2 K)!$ which is greater than K for K>8.

So let us come back to a general situation of K classes and only one rainbow. In case K classes should be split into $g$ groups in terms of tree structure, than one-hot would need $K - G_{max}$ comparisons, where $G_{max}$ is the size of the largest group and rainbow would need the number of consequitive pairs in distinct groups, lets notate it as D. Expected value value of D then can be expressed (here expectation is always with respect to a random permutation of end-values of classes:

\begin{align} \mathbb E[D] &= (K - 1) * (1-\frac{ \sum_i G_i*(G_i-1)} {K(K-1)}) \notag \\ &= (K-1)-\frac{ \sum G_i*(G_i-1)} {K} \notag \end{align}where $G_i$ is the size of the $i^{th}$ group and summation is over all groups. One can in fact show that $K - G_{max} \leq \mathbb E[D]$ via the following inequality:

\begin{align} \frac{ \sum G_i*(G_i-1)} {K} &\leq \frac{ \sum G_i*(G_{max}-1)} {K} \notag \\ & = ( \frac{ \sum G_i} {K} ) * (G_{max}-1)=G_{max}-1 \tag{1} \end{align}hence continuing expression of $\mathbb E[D]$ from before we get:

\begin{align} \mathbb E[D] &= (K-1)-\frac{ \sum G_i*(G_i-1)} {K} \notag\\ &\geq (K-1)-G_{max}-1=K-G_{max} \tag{2} \end{align}so one random rainbow in this sense is worse than one-hot encodings but that is not a surpise as you have 1-feature versus K features. Now this is when rainbow was chosen completely at random, it we got lucky we would only need $D_{min}=g-1$ and in this case we are always better off:

\begin{align} D_{min} & = g-1 = (\sum_{i=1}^{g} 1) - 1 \notag \\ &\leq (\sum_{i=1}^{g} G_i) - G_{max} = K - G_{max} \tag{3} \end{align}We also see that $G_{max}-(\mathbb E [C]+1)$ is greatests when $G_{i}$'s and $G_{max}$ are far apart as the only part in the above equation where we had non strict equality was where we subsituted $G_i$ by $G_{max}$.

note:the number of samples N is not in the formulas (and shouldn't be) it is all about K, the number of classes here.

Let us call this number of rainbows $R$ and assume it is not too large, then corresponding $D$ can on average be

\begin{align}
\mathbb E[D] & = (K - 1) * (1-\frac{ \sum_i G_i*(G_i-1)} {K(K-1)})^R \notag\\
& \approx (K-1)- R* \frac{ \sum G_i*(G_i-1)} {K} \tag{4}
\end{align}
So from (1) we get that

$$\mathbb E[D] \leq K-G_{max}$$ if we let $R$ be greater than $\frac{G_{max}}{G_i}$ for all $i$ or in other words if: $$ R \geq \frac{G_{max}}{G_{min}} $$ This is of course less than $G_{max}$. Similar computatoin can also show that it also has to be less than $g$, but then from $g*G_{max} \leq K$ we can show that $R \leq \sqrt{K}$. In fact with a bit more detailed calculation one can show $R \lessapprox \frac {\sqrt{K}}{2}$ as $K$ gets larger and this actually would imply that $R$ is not too large in the approximation we used above.

Hence we showed that single perfect rainbow outperforms, $K$ one-hot encoding and that $\frac{\sqrt{K}}{2}$ random rainbows would outperform $K$ one-hot encodings.

While the topic of one-hot and binary-one-hot encoding has been touched by many authors:

- https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
- https://towardsdatascience.com/stop-one-hot-encoding-your-categorical-variables-bbb0fba89809

The approach presented here is rarely used in industry or academia, while has shown superior results and by definition creates less features.

Before proceeding with models, we applied basic feature selection to each feature set:

- Dropped variables with all values missing
- No imputations was done – missingness is handled by XGBoost (see XGBoost is not black magic)
- Droped variables with variance being exactly 0
- Dropped perfect duplicates
- Dropped perfect rank duplicates

These procedures reduced our Rainbow feature set and One-Hot feauture set to **183** and **2095** respectively.

We ran all possible XGBoost multiclass classification models covering this space:

```
'objective': 'multi:softprob'
'eval_metric': 'mlogloss'
'num_class': 5
'subsample': 0.8
'max_depth': [2, 3, 5]
'eta': [0.1, 0.3, 0.5]
'n_estimators': [50, 100, 200]
```

Thus, we ran 3 max_depth * 3 eta * 3 n_estimators * 4 folds * 2 encoding methods = 216 models. Below we report average cross validation metrics for both encoding methods.