Handling Categorical Data

More often than not, a dataset contains both numeric and categorical columns. The supervisor divergence functions can handle both, but they need to know which columns are categorical in order to treat them properly. This notebook shows how to specify them when using the supervisor divergence package.

Dataset with Mixed Data Types

Create a dataset

To demonstrate, we will create a simple dataset with a mix of categorical and numeric columns.

[1]:
import pandas as pd
import numpy as np


size = 100000

data = pd.DataFrame()
data['latitude'] = np.random.randint(0, 360, size=size)
data['fruit'] = np.random.choice(a=['apple', 'orange', 'plum', 'raspberry', 'blueberry'],
                          p=[0.1,      0.3,      0.3,    0.25,          0.05], size=size)
data['temp'] = np.random.randint(-10, 120, size=size)
data['city'] = np.random.choice(a=['London', 'Paris', 'Newport', 'Bradfield', 'Coldport', 'Filly Downs'],
                          p=[0.15,      0.2,      0.1,      0.1,         0.3,         0.15], size=size)


data['longitude'] = np.random.randint(0, 360, size=size)

data.head(5)
[1]:
latitude fruit temp city longitude
0 239 apple 104 Filly Downs 257
1 181 apple 11 Coldport 303
2 246 raspberry 99 Filly Downs 60
3 187 raspberry 91 Coldport 90
4 97 raspberry 26 Filly Downs 108

In the dataset, the fruit and city columns are categorical, while latitude, temp and longitude are numeric.
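
When the column types are not known in advance, pandas can suggest which columns are non-numeric. The sketch below uses a small stand-in frame (the name `sample` is illustrative) with the same column layout as the dataset above. It assumes that every non-numeric column is categorical; integer-coded categories would be missed and would have to be listed by hand.

```python
import pandas as pd

# Small stand-in frame with the same column layout as the dataset above.
sample = pd.DataFrame({
    "latitude": [239, 181, 246],
    "fruit": ["apple", "apple", "raspberry"],
    "temp": [104, 11, 99],
    "city": ["Filly Downs", "Coldport", "Filly Downs"],
    "longitude": [257, 303, 60],
})

# Assumption: any column that is not numeric is treated as categorical.
categorical_names = sample.select_dtypes(exclude="number").columns.tolist()
numeric_names = sample.select_dtypes(include="number").columns.tolist()

print(categorical_names)
print(numeric_names)
```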

Create a comparison dataset

We will create a comparison dataset by copying the original dataset and modifying some of its values. In this case, we set two columns to a constant value, so the new dataset follows a different distribution from the original.

[2]:
data_shifted = data.copy()
# Overwrite one numeric and one categorical column with a constant value.
data_shifted['temp'] = 1
data_shifted['fruit'] = 'apple'
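
Before computing a divergence, it can help to confirm that the modification really changed the distribution. The following is a minimal sketch on a smaller stand-in dataset; the names `original` and `shifted`, the seed, and the sample size are illustrative choices and not part of the notebook above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only to make this sketch repeatable

original = pd.DataFrame({
    "fruit": rng.choice(["apple", "orange", "plum"], size=1000),
    "temp": rng.integers(-10, 120, size=1000),
})

# Copy first so that the original frame is left untouched.
shifted = original.copy()
shifted["temp"] = 1
shifted["fruit"] = "apple"

# The original columns take many values; the shifted ones take exactly one.
print(original.nunique())
print(shifted.nunique())

# Relative frequency of each fruit before and after the change.
print(original["fruit"].value_counts(normalize=True))
print(shifted["fruit"].value_counts(normalize=True))
```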

Calculating Divergence

[3]:
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    from mvtk.supervisor.divergence import calc_tv_knn

The divergence functions take a parameter called categorical_columns, which you use to specify which columns are not numeric. The functions will throw an error if the data contains categorical columns that are not listed in this parameter.

So, if you know which columns are categorical, pass a list of their column indexes. Both the a and b datasets must have their columns in the same order. In the dataset above, indexes 1 and 3 correspond to the fruit and city columns.

[4]:
calc_tv_knn(data, data_shifted, categorical_columns=[1,3])
[4]:
0.8506579001037404
[5]:
calc_tv_knn(data, data, categorical_columns=[1,3])
[5]:
0.2598375876037403
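
Because the categorical columns are identified by position, the requirement that both datasets share the same column order is worth checking before calling a divergence function. The sketch below uses plain pandas; `align_columns` is a hypothetical helper written for this illustration and is not part of mvtk.

```python
import pandas as pd


def align_columns(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Return b with its columns reordered to match a (hypothetical helper)."""
    if set(a.columns) != set(b.columns):
        raise ValueError("datasets do not share the same set of columns")
    return b[list(a.columns)]


a = pd.DataFrame({"latitude": [1, 2], "fruit": ["apple", "plum"]})
b = pd.DataFrame({"fruit": ["plum", "apple"], "latitude": [3, 4]})

# b holds the same columns as a, but in a different order.
b_aligned = align_columns(a, b)
print(list(b_aligned.columns))
```

After alignment, a positional index such as 1 refers to the same column in both frames.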

mvtk.supervisor.utils.column_indexes

With the utility function column_indexes, you can get the list of indexes of the categorical columns in the dataframe from their names.

[6]:
from mvtk.supervisor.utils import column_indexes

column_indexes(data, cols=['fruit', 'city'])
[6]:
[1, 3]

You can also call the column_indexes function inline and pass its result directly as the categorical_columns parameter.

[7]:
calc_tv_knn(data, data,
            categorical_columns=column_indexes(data, cols=['fruit', 'city']))
[7]:
0.25967482718707363
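
To illustrate what a name-to-position lookup like this involves, here is a pandas-only sketch. `positions_of` is a hypothetical stand-in written for this example. It is not the mvtk implementation, and its error handling for missing names is an assumption.

```python
import pandas as pd

# Empty frame with the same column order as the dataset above.
frame = pd.DataFrame(columns=["latitude", "fruit", "temp", "city", "longitude"])


def positions_of(df, names):
    """Map column names to integer positions (hypothetical stand-in)."""
    positions = df.columns.get_indexer(names)  # -1 marks a name that is absent
    if (positions < 0).any():
        missing = [n for n, p in zip(names, positions) if p < 0]
        raise KeyError(f"columns not found: {missing}")
    return positions.tolist()


print(positions_of(frame, ["fruit", "city"]))
```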