Handling Categorical Data

More often than not, a dataset contains both numeric and categorical columns. The supervisor divergence functions can handle both, but they need to know which columns are categorical so that they can treat them properly. This notebook shows you how to specify them when using the supervisor divergence package.

Dataset with Mixed Data Types

Create a dataset

To demonstrate, we will create a simple dataset with a mix of categorical and numeric columns.

[1]:

import pandas as pd
import numpy as np


size = 100000

data = pd.DataFrame()
data['latitude'] = np.random.randint(0, 360, size=size)
data['fruit'] = np.random.choice(a=['apple', 'orange', 'plum', 'raspberry', 'blueberry'],
                          p=[0.1,      0.3,      0.3,    0.25,          0.05], size=size)
data['temp'] = np.random.randint(-10, 120, size=size)
data['city'] = np.random.choice(a=['London', 'Paris', 'Newport', 'Bradfield', 'Coldport', 'Filly Downs'],
                          p=[0.15,      0.2,      0.1,      0.1,         0.3,         0.15], size=size)


data['longitude'] = np.random.randint(0, 360, size=size)

data.head(5)

[1]:

	latitude	fruit	temp	city	longitude
0	239	apple	104	Filly Downs	257
1	181	apple	11	Coldport	303
2	246	raspberry	99	Filly Downs	60
3	187	raspberry	91	Coldport	90
4	97	raspberry	26	Filly Downs	108

In the dataset, the fruit and city columns are categorical, while latitude, temp and longitude are numeric.
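If you are unsure which columns are categorical, one option is to inspect the dtypes with plain pandas. The sketch below uses a small stand-in frame (the name sample and its values are illustrative, not part of the notebook) and assumes that every non-numeric column should be treated as categorical, which may not hold for your own data.

```python
import numpy as np
import pandas as pd

# Small stand-in with the same column layout as the dataset above.
rng = np.random.default_rng(0)
n = 1000
sample = pd.DataFrame({
    'latitude': rng.integers(0, 360, size=n),
    'fruit': rng.choice(['apple', 'orange', 'plum'], size=n),
    'temp': rng.integers(-10, 120, size=n),
    'city': rng.choice(['London', 'Paris', 'Newport'], size=n),
    'longitude': rng.integers(0, 360, size=n),
})

# Assumption: any column without a numeric dtype is categorical.
categorical_names = sample.select_dtypes(exclude='number').columns.tolist()
numeric_names = sample.select_dtypes(include='number').columns.tolist()

print(categorical_names)  # ['fruit', 'city']
print(numeric_names)      # ['latitude', 'temp', 'longitude']
```

Integer-coded categories (for example, a numeric store ID) would not be caught by this check, so a manual review of the column list is still worthwhile.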

Create a comparison dataset

We will create a comparison dataset by copying the original dataset and modifying some of its values. In this case, we set two columns to a constant value, so the new dataset follows a different distribution from the original.

[2]:

data_shifted = data.copy()
data_shifted['temp'] = 1
data_shifted['fruit'] = 'apple'
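To get a feel for how large this shift is, here is a rough sanity check on the fruit column alone. It assumes that the "tv" in calc_tv_knn stands for total variation, and it applies the textbook formula for discrete distributions to the sampling probabilities used above. It is not what the library computes over the full dataset, only a one-column illustration.

```python
import pandas as pd

fruits = ['apple', 'orange', 'plum', 'raspberry', 'blueberry']

# Sampling probabilities used for the original fruit column.
p = pd.Series([0.1, 0.3, 0.3, 0.25, 0.05], index=fruits)
# In the shifted dataset every row is 'apple'.
q = pd.Series([1.0, 0.0, 0.0, 0.0, 0.0], index=fruits)

# Total variation distance between two discrete distributions:
# half the sum of absolute differences in probability.
tv_fruit = float(0.5 * (p - q).abs().sum())
print(round(tv_fruit, 3))  # 0.9
```

A value close to 1 means the two distributions barely overlap, which is what we expect after collapsing the column to a single value.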

Calculating Divergence

[3]:

import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    from mvtk.supervisor.divergence import calc_tv_knn

The divergence functions have a parameter called categorical_columns, which you use to specify which columns are not numeric. The functions will throw an error if the data contains categorical columns that are not listed in this parameter.

So, if you know which columns are categorical, pass a list of their column indexes. Both the a and b datasets must have their columns in exactly the same order, because the indexes are applied to both.
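Because the indexes are positional, it can help to confirm that both frames share the same column order before computing a divergence. The following is a minimal sketch using plain pandas; the frames a and b are tiny illustrative examples, not the notebook's data.

```python
import pandas as pd

a = pd.DataFrame({
    'latitude': [10, 20],
    'fruit': ['apple', 'plum'],
    'temp': [5, 7],
    'city': ['Paris', 'London'],
})
# Same columns as a, but in a different order.
b = a[['fruit', 'latitude', 'temp', 'city']]

same_order = list(a.columns) == list(b.columns)
print(same_order)  # False

# Reorder b so that each positional index refers to the same column in both.
b_aligned = b[a.columns]
aligned_order = list(a.columns) == list(b_aligned.columns)
print(aligned_order)  # True
```

After alignment, an index such as 1 refers to fruit in both frames, so a single categorical_columns list is valid for both.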

[4]:

calc_tv_knn(data, data_shifted, categorical_columns=[1,3])

[4]:

0.8506579001037404

[5]:

calc_tv_knn(data, data, categorical_columns=[1,3])

[5]:

0.2598375876037403

mvtk.supervisor.utils.column_indexes

With the utility function column_indexes, you can look up the indexes of the categorical columns in the dataframe by name.

[6]:

from mvtk.supervisor.utils import column_indexes

column_indexes(data, cols=['fruit', 'city'])

[6]:

[1, 3]

You can also call column_indexes inline, passing its result directly as the categorical_columns argument.

[7]:

calc_tv_knn(data, data,
            categorical_columns=column_indexes(data, cols=['fruit', 'city']))

[7]:

0.25967482718707363