{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Handling Categorical Data\n", "\n", "More often than not, a dataset contains both **numeric** and **categorical** data. The supervisor divergence functions can handle both, but they need to know which columns are categorical in order to handle them properly. This notebook shows how to specify those columns when using the **supervisor** divergence package." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset with Mixed Data Types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a dataset\n", "To demonstrate, we will create a simple dataset with a mix of categorical and numeric columns." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>latitude</th><th>fruit</th><th>temp</th><th>city</th><th>longitude</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>239</td><td>apple</td><td>104</td><td>Filly Downs</td><td>257</td></tr>\n",
"<tr><th>1</th><td>181</td><td>apple</td><td>11</td><td>Coldport</td><td>303</td></tr>\n",
"<tr><th>2</th><td>246</td><td>raspberry</td><td>99</td><td>Filly Downs</td><td>60</td></tr>\n",
"<tr><th>3</th><td>187</td><td>raspberry</td><td>91</td><td>Coldport</td><td>90</td></tr>\n",
"<tr><th>4</th><td>97</td><td>raspberry</td><td>26</td><td>Filly Downs</td><td>108</td></tr>\n",
"</tbody>\n",
"</table>\n", "
" ], "text/plain": [ " latitude fruit temp city longitude\n", "0 239 apple 104 Filly Downs 257\n", "1 181 apple 11 Coldport 303\n", "2 246 raspberry 99 Filly Downs 60\n", "3 187 raspberry 91 Coldport 90\n", "4 97 raspberry 26 Filly Downs 108" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "\n", "size = 100000\n", "\n", "data = pd.DataFrame()\n", "data['latitude'] =np.random.randint(0, 360, size=size)\n", "data['fruit'] = np.random.choice(a=['apple', 'orange', 'plum', 'raspberry', 'blueberry'],\n", " p=[0.1, 0.3, 0.3, 0.25, 0.05], size=size)\n", "data['temp'] =np.random.randint(-10, 120, size=size)\n", "data['city'] = np.random.choice(a=['London', 'Paris', 'Newport', 'Bradfield', 'Coldport', 'Filly Downs'],\n", " p=[0.15, 0.2, 0.1, 0.1, 0.3, 0.15], size=size)\n", "\n", "\n", "data['longitude'] = np.random.randint(0, 360, size=size)\n", "\n", "data.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the dataset, the **fruit** and **city** columns are *categorical*, while **latitude**, **temp** and **longitude** are *numeric*. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a comparison dataset\n", "We will create a dataset to compare by taking the original dataset and modify some of the values. In this case, we will set a couple of columns to a constant value, which would result in the new dataset being of a different distribution from the original dataset." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "data_shifted = data.copy()\n", "# Set one numeric and one categorical column to a constant value\n", "data_shifted['temp'] = 1\n", "data_shifted['fruit'] = 'apple'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating Divergence" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "with warnings.catch_warnings():\n", "    warnings.simplefilter(\"ignore\")\n", "    from mvtk.supervisor.divergence import calc_tv_knn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The divergence functions have a parameter called **categorical_columns**, which you must use to specify which columns are not numeric. The functions will throw an error if the data contains categorical columns that are not listed in this parameter.\n", "\n", "If you know which columns are categorical, pass a list of their column indexes. Both datasets (a and b) must have their columns in exactly the same order." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8506579001037404" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "calc_tv_knn(data, data_shifted, categorical_columns=[1, 3])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.2598375876037403" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "calc_tv_knn(data, data, categorical_columns=[1, 3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## mvtk.supervisor.utils.column_indexes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the utility function **column_indexes**, you can get the list of indexes of the categorical columns in the dataframe from their names."
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 3]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mvtk.supervisor.utils import column_indexes\n", "\n", "column_indexes(data, cols=['fruit', 'city'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also call **column_indexes** inline and pass its result directly as the **categorical_columns** argument." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.25967482718707363" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "calc_tv_knn(data, data,\n", "            categorical_columns=column_indexes(data, cols=['fruit', 'city']))" ] } ], "metadata": { "kernelspec": { "display_name": "supervisor", "language": "python", "name": "supervisor" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }