{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Handling Categorical Data\n",
    "\n",
    "More often than not a dataset is comprised of both **numeric**, and **categorical** data types. The supervisor divergence functions can handle both, but it needs to know which columns are categorical so that it can handle it properly. This notebook shows you how to do so when using the **supervisor** divergence package."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset with Mixed Data Types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a dataset\n",
    "To demonstrate, we will create a simple dataset with a mix of categorical and numeric columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>latitude</th>\n",
       "      <th>fruit</th>\n",
       "      <th>temp</th>\n",
       "      <th>city</th>\n",
       "      <th>longitude</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>239</td>\n",
       "      <td>apple</td>\n",
       "      <td>104</td>\n",
       "      <td>Filly Downs</td>\n",
       "      <td>257</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>181</td>\n",
       "      <td>apple</td>\n",
       "      <td>11</td>\n",
       "      <td>Coldport</td>\n",
       "      <td>303</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>246</td>\n",
       "      <td>raspberry</td>\n",
       "      <td>99</td>\n",
       "      <td>Filly Downs</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>187</td>\n",
       "      <td>raspberry</td>\n",
       "      <td>91</td>\n",
       "      <td>Coldport</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>97</td>\n",
       "      <td>raspberry</td>\n",
       "      <td>26</td>\n",
       "      <td>Filly Downs</td>\n",
       "      <td>108</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   latitude      fruit  temp         city  longitude\n",
       "0       239      apple   104  Filly Downs        257\n",
       "1       181      apple    11     Coldport        303\n",
       "2       246  raspberry    99  Filly Downs         60\n",
       "3       187  raspberry    91     Coldport         90\n",
       "4        97  raspberry    26  Filly Downs        108"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "size = 100000\n",
    "\n",
    "data = pd.DataFrame()\n",
    "data['latitude'] =np.random.randint(0, 360, size=size)\n",
    "data['fruit'] = np.random.choice(a=['apple', 'orange', 'plum', 'raspberry', 'blueberry'],\n",
    "                          p=[0.1,      0.3,      0.3,    0.25,          0.05], size=size)\n",
    "data['temp'] =np.random.randint(-10, 120, size=size)\n",
    "data['city'] = np.random.choice(a=['London', 'Paris', 'Newport', 'Bradfield', 'Coldport', 'Filly Downs'],\n",
    "                          p=[0.15,      0.2,      0.1,      0.1,         0.3,         0.15], size=size)\n",
    "\n",
    "\n",
    "data['longitude'] = np.random.randint(0, 360, size=size)\n",
    "\n",
    "data.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the dataset, the **fruit** and **city** columns are *categorical*, while **latitude**, **temp** and **longitude** are *numeric*. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a comparison dataset\n",
    "We will create a dataset to compare by taking the original dataset and modify some of the values. In this case, we will set a couple of columns to a constant value, which would result in the new dataset being of a different distribution from the original dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_shifted = data.copy()\n",
    "data_shifted['temp'] = 1\n",
    "data_shifted.fruit = 'apple'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculating Divergence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "with warnings.catch_warnings():\n",
    "    warnings.simplefilter(\"ignore\")\n",
    "    from mvtk.supervisor.divergence import calc_tv_knn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The divergence functions have a parameter called **categorical_columns** which you need to use to specify which columns are not numeric. The functions will throw an error if categorical columns are passed but not specified.\n",
    "\n",
    "So, if you know which columns are categorical, then you need to pass a list of the column indexes. Both the a and b datasets should have the columns in the exact order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8506579001037404"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "calc_tv_knn(data, data_shifted, categorical_columns=[1,3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.2598375876037403"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "calc_tv_knn(data, data, categorical_columns=[1,3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## mvtk.supervisor.utils.column_indexes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the utility function **column_indexes** you can get a list of the ccategorical columns in the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 3]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from mvtk.supervisor.utils import column_indexes\n",
    "\n",
    "column_indexes(data, cols=['fruit', 'city'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also run the **column_indexes** function inline as a function parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.25967482718707363"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "calc_tv_knn(data, data, \n",
    "            categorical_columns=column_indexes(data, cols=['fruit', 'city']))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "supervisor",
   "language": "python",
   "name": "supervisor"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}