Airlines Dataset: Divergence Applications

In this example, we will demonstrate change point detection and clustering on the public Airlines dataset (flights departing New York City airports in 2013).

[1]:
import pandas
pandas.options.display.max_rows=5 # restrict to 5 rows on display

df = pandas.read_csv("https://raw.githubusercontent.com/Devvrat53/Flight-Delay-Prediction/master/Data/flight_data.csv")
df['date'] = pandas.to_datetime(df[['year', 'month', 'day']])
df['day_index'] = (df['date'] - df['date'].min()).dt.days
df['DayOfWeek'] = df['date'].dt.day_name()
df['Month'] = df['date'].dt.month_name()
df
[1]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... dest air_time distance hour minute time_hour date day_index DayOfWeek Month
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA ... IAH 227.0 1400 5 15 1/1/2013 5:00 2013-01-01 0 Tuesday January
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA ... IAH 227.0 1416 5 29 1/1/2013 5:00 2013-01-01 0 Tuesday January
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336774 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ ... CLE NaN 419 11 59 30-09-2013 11:00 2013-09-30 272 Monday September
336775 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ ... RDU NaN 431 8 40 30-09-2013 08:00 2013-09-30 272 Monday September

336776 rows × 23 columns

Destination Airport

Let’s start by looking at how the dest feature changes from month to month, using total variation as the measure of drift. Since dest is a categorical variable, it is natural to use a density-estimator-based approach to computing statistical divergences such as total variation. We use the calc_tv_mle function for this, which computes maximum likelihood density estimates (histograms) of each sample and plugs them into the formula for total variation.
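For two discrete distributions estimated from samples, the total variation distance is half the sum of the absolute differences between their category probabilities. The snippet below is a minimal hand-rolled sketch of that quantity on two toy samples (the airport codes are placeholders); it illustrates what calc_tv_mle estimates, not how the library implements it.

import numpy

# Toy categorical samples over the same set of destination codes
sample_a = ['IAH', 'IAH', 'ORD', 'MIA', 'ORD', 'ORD']
sample_b = ['IAH', 'MIA', 'MIA', 'MIA', 'ORD', 'ORD']
categories = sorted(set(sample_a) | set(sample_b))

# Maximum likelihood (histogram) estimates of the category probabilities
p = numpy.array([sample_a.count(c) for c in categories]) / len(sample_a)
q = numpy.array([sample_b.count(c) for c in categories]) / len(sample_b)

# Total variation distance: half the L1 distance between the histograms
print(0.5 * numpy.abs(p - q).sum())  # 0.333...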

[2]:
import numpy

from mvtk.supervisor.divergence import calc_tv_mle

covariate = ['dest']

# Group the flights by month and compute the total variation distance
# between each month's destinations and the previous month's.
grouped = df.groupby('Month')
batches = [group[covariate].values.reshape(-1, 1) for _, group in grouped]
max_likelihood_drift_series = []
for old_batch, new_batch in zip(batches[:-1], batches[1:]):
    max_likelihood_drift_series.append(calc_tv_mle([old_batch], [new_batch]))
[3]:
df[covariate].nunique()
[3]:
dest    105
dtype: int64

Density vs Variational

Now let’s compare these results to those produced by variational estimators. Variational estimation of statistical divergences is an elegant approach well suited to high dimensional floating point data. However, categorical data must be one-hot encoded to be compatible with the variational approach. Since dest has 105 unique values, this turns a one dimensional dataset into a 105 dimensional one. We can increase the batch size and number of epochs to compensate for the added complexity. As we shall see, with that adjustment the variational estimates track the histogram based ones closely, although accuracy generally degrades as the number of categories grows.
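To make the dimensionality point concrete, here is a minimal sketch of what one-hot encoding does to a single categorical column (toy airport codes; with dest's 105 unique values each row would instead become a 105 dimensional vector):

from sklearn.preprocessing import OneHotEncoder

toy = [['IAH'], ['ORD'], ['IAH'], ['MIA']]
# One column per unique category; exactly one entry per row is 1
print(OneHotEncoder(sparse=False).fit_transform(toy))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]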

[4]:
from sklearn.preprocessing import OneHotEncoder
from mvtk.supervisor.divergence import calc_tv

covariate = ['dest']

# One-hot encode the categorical dest column (one column per unique airport)
encoder = OneHotEncoder(sparse=False)
encoder.fit(df[covariate].values)

# Total variation between consecutive months, estimated variationally
variational_drift_series = []
previous_ohc = None
months = []
for month, group in df.groupby('Month'):
    ohc = [encoder.transform(group[covariate])]
    if previous_ohc is not None:
        variational_drift_series.append(
            calc_tv(ohc, previous_ohc, num_epochs=16, num_batches=32, batch_size=512))
        months.append(month)
    previous_ohc = ohc
[5]:
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

plt.rcParams['figure.figsize'] = [10, 5]
fig, ax = plt.subplots()

plt.plot(max_likelihood_drift_series, label='Maximum Likelihood')
plt.plot(variational_drift_series, label='Variational Approximation')

# Label each integer tick with the corresponding month name
ax.xaxis.set_major_formatter(
    lambda x_val, x_pos: months[int(x_val)] if 0 <= x_val < len(months) else '')
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
plt.legend()
plt.show()
[Figure: month-over-month total variation of dest, Maximum Likelihood vs Variational Approximation]

Generally, the variational estimator degrades in accuracy for categorical data with large numbers of categories. For this example, however, it agrees quite well with the maximum likelihood estimator once the batch size and number of epochs are increased, as the correlation between the two series shows.

[6]:
numpy.corrcoef(variational_drift_series, max_likelihood_drift_series)[1, 0]
[6]:
0.9831698719629802

Clustering

Now let’s construct a distance matrix between data aggregated by each day of the week and cluster it.

[7]:
import matplotlib.pyplot as plt
import seaborn as sns

from mvtk.supervisor.divergence import calc_tv_mle
from mvtk.supervisor.utils import compute_divergence_crosstabs

# Pairwise total variation between the (origin, dest) distributions of each day of the week
feature_names = ['origin', 'dest']
features = df[feature_names + ['DayOfWeek']]
distance_matrix = compute_divergence_crosstabs(features, 'DayOfWeek', divergence=calc_tv_mle)
sns.heatmap(distance_matrix, cmap='coolwarm', linewidths=0.30, annot=True)
plt.show()
[Figure: heatmap of pairwise total variation between days of the week]
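Conceptually, this distance matrix just evaluates the divergence between every pair of day-of-week groups. A rough manual equivalent (a sketch of the idea, not compute_divergence_crosstabs' actual implementation) would look like:

import numpy

# One batch of (origin, dest) rows per day of the week
groups = {day: g[feature_names].values for day, g in features.groupby('DayOfWeek')}
days = list(groups)
manual = numpy.array([[calc_tv_mle([groups[a]], [groups[b]]) for b in days] for a in days])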

As we might expect, it seems Saturday, and to a lesser extent Sunday, are somewhat different from the other days of the week.

[8]:
from sklearn.cluster import AgglomerativeClustering

def agglomerative_clustering(distance_matrix, n_clusters, affinity='precomputed', linkage='complete', **kwargs):
    return AgglomerativeClustering(n_clusters=n_clusters, **kwargs).fit_predict(distance_matrix)

print('Two clusters', agglomerative_clustering(distance_matrix, n_clusters=2, linkage='complete'))
print('Three clusters', agglomerative_clustering(distance_matrix, n_clusters=3, linkage='complete'))
Two clusters [0 0 1 0 0 0 0]
Three clusters [0 0 1 2 0 0 0]
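The label arrays above follow the row order of the distance matrix, so they can be mapped back to day names. A small usage sketch, assuming compute_divergence_crosstabs returns a pandas DataFrame indexed by day of week:

labels = agglomerative_clustering(distance_matrix, n_clusters=2, linkage='complete')
# e.g. {'Friday': 0, 'Monday': 0, 'Saturday': 1, ...}
print(dict(zip(distance_matrix.index, labels)))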

More Data

Finally, let’s build the same kind of distance matrix at daily resolution by grouping on day_index instead of DayOfWeek.

[9]:
# Keep only the columns needed for the day-by-day comparison
# (copy to avoid pandas' SettingWithCopyWarning when sorting in place below)
df = df[['origin', 'dest', 'day_index']].copy()
[10]:
from mvtk.supervisor.divergence import calc_tv_mle

# Pairwise total variation between the (origin, dest) distributions of each day of the year
df.sort_values('day_index', inplace=True)
features = df[feature_names + ['day_index']]
distance_matrix = compute_divergence_crosstabs(features, 'day_index', divergence=calc_tv_mle)
[11]:
import matplotlib.pyplot as plt

plt.imshow(distance_matrix)
plt.show()
[Figure: day-by-day total variation distance matrix]
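The introduction mentioned change point detection; one simple way to look for change points in this day-by-day matrix is to plot the divergence between each day and the next and watch for spikes. A minimal sketch, assuming distance_matrix has its rows and columns ordered by day_index:

import numpy
import matplotlib.pyplot as plt

# First off-diagonal: total variation between each day and the following day
consecutive = numpy.diag(numpy.asarray(distance_matrix), k=1)
plt.plot(consecutive)
plt.xlabel('day_index')
plt.ylabel('total variation vs next day')
plt.show()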