Handling Categorical Data
More often than not, a dataset contains both numeric and categorical columns. The supervisor divergence functions can handle both, but they need to know which columns are categorical so that they can treat them correctly. This notebook shows how to specify them when using the supervisor divergence package.
Dataset with Mixed Data Types
Create a dataset
To demonstrate, we will create a simple dataset with a mix of categorical and numeric columns.
[1]:
import pandas as pd
import numpy as np
size = 100000
data = pd.DataFrame()
data['latitude'] = np.random.randint(0, 360, size=size)
data['fruit'] = np.random.choice(a=['apple', 'orange', 'plum', 'raspberry', 'blueberry'],
                                 p=[0.1, 0.3, 0.3, 0.25, 0.05], size=size)
data['temp'] = np.random.randint(-10, 120, size=size)
data['city'] = np.random.choice(a=['London', 'Paris', 'Newport', 'Bradfield', 'Coldport', 'Filly Downs'],
                                p=[0.15, 0.2, 0.1, 0.1, 0.3, 0.15], size=size)
data['longitude'] = np.random.randint(0, 360, size=size)
data.head(5)
[1]:
| | latitude | fruit | temp | city | longitude |
---|---|---|---|---|---|
0 | 239 | apple | 104 | Filly Downs | 257 |
1 | 181 | apple | 11 | Coldport | 303 |
2 | 246 | raspberry | 99 | Filly Downs | 60 |
3 | 187 | raspberry | 91 | Coldport | 90 |
4 | 97 | raspberry | 26 | Filly Downs | 108 |
In the dataset, the fruit and city columns are categorical, while latitude, temp and longitude are numeric.
Create a comparison dataset
We will create a comparison dataset by copying the original dataset and modifying some of its values. In this case, we set two columns to a constant value, so the new dataset follows a different distribution from the original.
[2]:
data_shifted = data.copy()
data_shifted['temp'] = 1
data_shifted.fruit = 'apple'
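To get a feel for how large this shift is, the sketch below computes the textbook total variation distance for the fruit column alone, using the sampling probabilities from the first cell rather than the sampled data. It assumes that "tv" in calc_tv_knn stands for total variation, as the name suggests. The helper total_variation is hypothetical and is not part of mvtk, which estimates the divergence from samples across all columns, so its result will differ from this single-column figure.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    p and q map category -> probability; a missing category counts as 0.
    Hypothetical helper for illustration, not part of mvtk.
    """
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Sampling probabilities used when the original dataset was created
fruit_original = {'apple': 0.1, 'orange': 0.3, 'plum': 0.3,
                  'raspberry': 0.25, 'blueberry': 0.05}
# In the shifted dataset every row is 'apple'
fruit_shifted = {'apple': 1.0}

tv_fruit = total_variation(fruit_original, fruit_shifted)
print(round(tv_fruit, 3))  # 0.9
```

A value of 0 means the two distributions are identical and 1 means they do not overlap at all, so 0.9 indicates a very large shift in this one column.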
Calculating Divergence
[3]:
import warnings
with warnings.catch_warnings():
    # Suppress warnings emitted while importing the package
    warnings.simplefilter("ignore")
    from mvtk.supervisor.divergence import calc_tv_knn
The divergence functions have a parameter called categorical_columns, which you use to specify which columns are not numeric. The functions will throw an error if the data contains categorical columns that are not listed in this parameter.
So, if you know which columns are categorical, pass a list of their column indexes. Both the a and b datasets must have their columns in exactly the same order.
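Because categorical_columns holds positions rather than names, a difference in column order between the two datasets would point the indexes at the wrong columns. A minimal sketch of a pre-check is shown here on plain lists of column names. The helper assert_same_column_order is hypothetical and is not part of mvtk.

```python
def assert_same_column_order(cols_a, cols_b):
    """Raise ValueError if the two column-name sequences differ in content or order.

    Hypothetical helper for illustration, not part of mvtk.
    """
    cols_a, cols_b = list(cols_a), list(cols_b)
    if cols_a != cols_b:
        raise ValueError(f"column mismatch: {cols_a} != {cols_b}")


# Column names of the two datasets used in this notebook
columns_a = ['latitude', 'fruit', 'temp', 'city', 'longitude']
columns_b = ['latitude', 'fruit', 'temp', 'city', 'longitude']

assert_same_column_order(columns_a, columns_b)  # passes silently
```

With the dataframes from this notebook, the same check could be written as assert_same_column_order(data.columns, data_shifted.columns).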
[4]:
calc_tv_knn(data, data_shifted, categorical_columns=[1,3])
[4]:
0.8506579001037404
[5]:
calc_tv_knn(data, data, categorical_columns=[1,3])
[5]:
0.2598375876037403
mvtk.supervisor.utils.column_indexes
With the utility function column_indexes, you can look up the positional indexes of the categorical columns in the dataframe by name.
[6]:
from mvtk.supervisor.utils import column_indexes
column_indexes(data, cols=['fruit', 'city'])
[6]:
[1, 3]
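The name-to-position lookup can be sketched in plain Python as follows. This only illustrates the idea and is not how mvtk implements column_indexes; the helper positions_of is hypothetical.

```python
def positions_of(all_columns, wanted):
    """Return the position of each name in `wanted` within `all_columns`.

    Hypothetical helper for illustration, not part of mvtk.
    """
    all_columns = list(all_columns)
    missing = [name for name in wanted if name not in all_columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return [all_columns.index(name) for name in wanted]


# Column names of the dataset created earlier in this notebook
columns = ['latitude', 'fruit', 'temp', 'city', 'longitude']
cat_idx = positions_of(columns, ['fruit', 'city'])
print(cat_idx)  # [1, 3]
```

Looking the positions up by name avoids hard-coding indexes that would go stale if columns are added or reordered.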
You can also call column_indexes inline when passing the categorical_columns parameter.
[7]:
calc_tv_knn(data, data,
categorical_columns=column_indexes(data, cols=['fruit', 'city']))
[7]:
0.25967482718707363