Handling Categorical Data [ML - Python]


The main goal of this article is to introduce non-technical readers and beginners to categorical variables and to show how we can handle them for use in machine learning algorithms.

In statistics, categorical variables are those whose values can be separated into groups. They are useful when the main interest is to classify the data qualitatively rather than quantitatively.

In a nutshell, a categorical variable captures information that we can observe about our object of study but cannot measure with a numerical metric.

For instance, hair can be described with categorical variables. We could label it with discrete categories such as its length (short, medium, long), color (red, brown, black, blonde), type (straight, curly, wavy), and so on.

Categorical features are either ordinal or nominal. Ordinal features are those that have some sort of natural ordering. Taking hair as an example once again, length is an ordinal feature, since it is possible to order hair as long, medium, and short: long is greater than medium, and medium is greater than short.

The same doesn't make sense for color, which is a nominal feature because it doesn't imply any order. In other words, you cannot say that brown is greater than black.

Let's create a pandas DataFrame as an example.

Pandas DataFrame
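A minimal sketch of such a DataFrame. The column names `length`, `color`, and `type` follow the hair features described above, but the rows themselves are illustrative assumptions rather than data from a real study:

```python
import pandas as pd

# Illustrative rows only (assumed values): one column per hair feature.
df = pd.DataFrame(
    {
        "length": ["Long", "Medium", "Short", "Medium"],  # ordinal feature
        "color": ["Brown", "Black", "Red", "Blonde"],     # nominal feature
        "type": ["Curly", "Straight", "Wavy", "Curly"],   # nominal feature
    }
)

print(df)
```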

When we talk about machine learning, it is important to keep in mind that algorithms are driven by numbers. So, in order to build a model using categorical data, we need to wrangle the features in a way that makes it possible for the algorithm to understand and measure the data. To do that, we need to convert the categorical features into numbers.

This process of converting data into a new format according to a defined scheme is called encoding, and it must be reversible.

It is good to know that encoding is not a way to secure data; for that, we use encryption. Decoding (the reverse process of encoding) is meant to be easy.

Since ordinal variables can take countless forms, no single function can work out the right order for every case. We therefore need to define the mapping manually.

For length, let's use integer encoding, which means assigning each category a number from 1 to n (where n is the total number of unique values) while respecting the order of the lengths. This results in Short as 1, Medium as 2, and Long as 3.

Integer encoding.
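A minimal sketch of this manual integer encoding, assuming a `length` column with illustrative values. The `length_mapping` dictionary holds the Short = 1, Medium = 2, Long = 3 assignment described above:

```python
import pandas as pd

# Illustrative data (assumed values).
df = pd.DataFrame({"length": ["Long", "Medium", "Short", "Medium"]})

# Manual mapping that respects the natural order Short < Medium < Long.
length_mapping = {"Short": 1, "Medium": 2, "Long": 3}

# Replace each label with its integer code.
df["length"] = df["length"].map(length_mapping)

print(df)
```

Any label missing from the dictionary would become NaN after `map`, so it is worth checking that the mapping covers every unique value in the column.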

If you want to reverse the encoding, all you have to do is invert the key-value pairs in the length_mapping dictionary and map the column again with the reversed dictionary. You can do it manually by writing a new dictionary yourself, or you can use a dictionary comprehension.

Decoding mapping with dictionary comprehension.
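A minimal sketch of the decoding step, assuming the `length` column already holds the integer codes. The name `inverse_length_mapping` is a hypothetical choice for the reversed dictionary:

```python
import pandas as pd

length_mapping = {"Short": 1, "Medium": 2, "Long": 3}

# Illustrative, already-encoded data (assumed values).
df = pd.DataFrame({"length": [3, 2, 1, 2]})

# Swap keys and values with a dictionary comprehension: {1: "Short", ...}.
inverse_length_mapping = {code: label for label, code in length_mapping.items()}

# Map the integer codes back to their original labels.
df["length"] = df["length"].map(inverse_length_mapping)

print(df)
```

Inverting the dictionary this way only works because every label has a distinct code; if two labels shared a code, one of them would be lost.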

As you can see from the example above, we chose the integers not at random but so that Long is greater than Medium, and Medium is greater than Short.

When it comes to nominal features, it doesn't matter which value we assign to each category. The important thing is to make sure that the encoding doesn't imply any kind of numerical order.

Otherwise, since algorithms are always trying to find patterns in the data, your encoding could mislead the algorithm and make it think that Brown is greater than Black, for example.

One easy way to encode nominal variables is through the get_dummies function implemented in pandas.

This function performs one-hot encoding: it returns new columns in which each category of the encoded feature is represented by its own Boolean variable (1 if the row belongs to that category and 0 otherwise).

One-hot encoding.
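A minimal sketch of one-hot encoding with `pd.get_dummies`, assuming illustrative rows and a `length` column that has already been integer-encoded. Passing `dtype=int` is a choice made here so the new columns show 1 and 0; recent pandas versions return True and False by default:

```python
import pandas as pd

# Illustrative data (assumed values); length is already integer-encoded.
df = pd.DataFrame(
    {
        "length": [3, 2, 1, 2],
        "color": ["Brown", "Black", "Red", "Blonde"],
        "type": ["Curly", "Straight", "Wavy", "Curly"],
    }
)

# Encode only the nominal columns; each category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["color", "type"], dtype=int)

print(encoded)
```

The new columns are named after the original column and the category, for example `color_Brown` or `type_Wavy`, while `length` is left untouched.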

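It is important to bear in mind that when we use one-hot encoding, we add multicollinearity to our dataset. Multicollinearity occurs when two or more independent features in the dataset are correlated with each other. Here, the new columns created for a feature always add up to 1, so any one of them can be derived from the others.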
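One way to see this dependency is to check that the one-hot columns of a single feature sum to 1 in every row, which means any one column can be rebuilt from the rest. A small sketch, assuming an illustrative `color` Series:

```python
import pandas as pd

# Illustrative data (assumed values).
colors = pd.Series(["Brown", "Black", "Red", "Blonde"], name="color")

# One 0/1 column per color: Black, Blonde, Brown, Red.
dummies = pd.get_dummies(colors, dtype=int)

# Every row belongs to exactly one category, so each row sums to 1.
row_sums = dummies.sum(axis=1)

# The Black column is fully determined by the other three columns.
rebuilt_black = 1 - dummies[["Blonde", "Brown", "Red"]].sum(axis=1)

print(row_sums.tolist())
print((rebuilt_black == dummies["Black"]).all())
```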

In order to decrease the multicollinearity among variables, we can remove one of the columns that this method created for each encoded feature.

One-hot encoding with drop_first.
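A minimal sketch of the same encoding with `drop_first=True`, again assuming illustrative rows. With this option, `pd.get_dummies` drops the first category of each encoded column, which for these values is Black for color and Curly for type:

```python
import pandas as pd

# Illustrative data (assumed values); length is already integer-encoded.
df = pd.DataFrame(
    {
        "length": [3, 2, 1, 2],
        "color": ["Brown", "Black", "Red", "Blonde"],
        "type": ["Curly", "Straight", "Wavy", "Curly"],
    }
)

# drop_first=True removes one column per encoded feature.
encoded = pd.get_dummies(
    df, columns=["color", "type"], drop_first=True, dtype=int
)

print(encoded)
```

In this sketch, the second row (Black, Straight) has 0 in every remaining color column, and the first row (Brown, Curly) has 0 in every remaining type column.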

As you can see, the color Black and the type Curly were dropped. At first, you may think that we lost data by doing that, but we did not: if all color columns are 0, the color is Black, and likewise, if all type columns are 0, the type is Curly.

There are multiple approaches to encoding categorical variables. To choose the right technique, you should take some criteria into consideration, such as the size of the dataset or the type of information you want to extract.

If you want to go for an in-depth analysis, I recommend this post. There you can find explanations of other encoding techniques such as Frequency, Backward Difference, Mean Encoding, and many others.

Good luck with your studies!

Petroleum Engineer enthusiastic in Data Science.