Training Dataset Drift Detection

In this example, we demonstrate how to detect drift away from a training data set, using the public Airlines data set.

[1]:

import pandas
pandas.options.display.max_rows=5 # restrict to 5 rows on display

df = pandas.read_csv("https://raw.githubusercontent.com/Devvrat53/Flight-Delay-Prediction/master/Data/flight_data.csv")
df['date'] = pandas.to_datetime(df[['year', 'month', 'day']])
df['day_index'] = (df['date'] - df['date'].min()).dt.days
df['DayOfWeek'] = df['date'].dt.day_name()
df['Month'] = df['date'].dt.month_name()
df

[1]:

	year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	...	dest	air_time	distance	hour	minute	time_hour	date	day_index	DayOfWeek	Month
0	2013	1	1	517.0	515	2.0	830.0	819	11.0	UA	...	IAH	227.0	1400	5	15	1/1/2013 5:00	2013-01-01	0	Tuesday	January
1	2013	1	1	533.0	529	4.0	850.0	830	20.0	UA	...	IAH	227.0	1416	5	29	1/1/2013 5:00	2013-01-01	0	Tuesday	January
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336774	2013	9	30	NaN	1159	NaN	NaN	1344	NaN	MQ	...	CLE	NaN	419	11	59	30-09-2013 11:00	2013-09-30	272	Monday	September
336775	2013	9	30	NaN	840	NaN	NaN	1020	NaN	MQ	...	RDU	NaN	431	8	40	30-09-2013 08:00	2013-09-30	272	Monday	September

336776 rows × 23 columns

Data Splitting

Let’s assume that we trained our model on the January through November data and want to run it on the data we receive each day in December. To set this up, we split the data into two data frames.

[2]:

df_train = df[df['month'] <= 11]
df_daily = df[df['month'] > 11]

Drift Detection

Now we want to compare our training data set to each set of daily data that we plan to feed to our model. Let’s start with the categorical features.

[3]:

from mvtk.supervisor.divergence import calc_tv_mle

columns = ['carrier', 'origin', 'dest']

df_train_transformed = df_train[columns]

grouped = df_daily.groupby('day')
batches = [g[1][columns] for g in grouped]

categorical_drift_series = []
for (day, _), batch in zip(grouped, batches):
    categorical_drift_series.append(calc_tv_mle([df_train_transformed], [batch]))

WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

[4]:

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 5]
fig, ax = plt.subplots()

days = range(1, len(categorical_drift_series) + 1)

plt.plot(days, categorical_drift_series, label='Maximum Likelihood No Drift')

plt.legend()
plt.show()

[Figure: divergence between the training set and each December day for the categorical features, no drift]

As the plot above shows, there is little difference between the training set and each day’s data: the divergence stays between roughly 0.1 and 0.2.
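For intuition about the scale of these values, total variation between two categorical distributions is half the sum of the absolute differences between their category probabilities, so it ranges from 0 (identical) to 1 (no overlap). The snippet below is a minimal single-column sketch of that idea using empirical frequencies. The helper name `empirical_tv` and the toy airport values are assumptions made for illustration; this is not how `calc_tv_mle` is implemented, and it will not reproduce the numbers in the plot.

```python
import pandas as pd

def empirical_tv(reference: pd.Series, batch: pd.Series) -> float:
    """Total variation between the empirical category frequencies of two samples."""
    p = reference.value_counts(normalize=True)
    q = batch.value_counts(normalize=True)
    # Align both distributions on the union of categories; unseen ones get probability 0.
    categories = p.index.union(q.index)
    p = p.reindex(categories, fill_value=0.0)
    q = q.reindex(categories, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())

# Toy data (hypothetical, not taken from the flights data set).
train_origin = pd.Series(["JFK", "JFK", "LGA", "EWR"])
day_origin = pd.Series(["JFK", "LGA", "LGA", "LGA"])

tv_same = empirical_tv(train_origin, train_origin)
tv_shifted = empirical_tv(train_origin, day_origin)
print(tv_same, tv_shifted)
```

A sample compared with itself gives 0, and the value grows as the category mix of the batch moves away from the reference.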

Let’s now assume that, starting on day 25, a growing share of flights is routed to LAX each day, with all flights going to LAX from day 29 onward. We then compute the divergences again and plot the results.

[5]:

df_daily = df_daily.reset_index(drop=True)

[6]:

x = 5

for i in range(25, 32):
    df_daily.loc[(df_daily['day'] == i) & (df_daily.index % x == 0), ['dest']] = 'LAX'
    if x > 1:
        x -= 1

Below is the percentage of flights whose destination is LAX on each day of our updated daily data set:

[7]:

# groupby yields (day, frame) pairs, so no manual day counter is needed
for day, df_day in df_daily.groupby('day'):
    day_pct = df_day['dest'].value_counts(normalize=True) * 100
    # .get avoids positional [0] indexing and returns 0 if LAX is absent that day
    print(f"Day {day}: {round(day_pct.get('LAX', 0))}%")

Day 1: 5%
Day 2: 5%
Day 3: 5%
Day 4: 5%
Day 5: 5%
Day 6: 5%
Day 7: 5%
Day 8: 5%
Day 9: 5%
Day 10: 5%
Day 11: 5%
Day 12: 5%
Day 13: 5%
Day 14: 5%
Day 15: 5%
Day 16: 5%
Day 17: 5%
Day 18: 5%
Day 19: 5%
Day 20: 5%
Day 21: 5%
Day 22: 5%
Day 23: 5%
Day 24: 6%
Day 25: 24%
Day 26: 29%
Day 27: 37%
Day 28: 53%
Day 29: 100%
Day 30: 100%
Day 31: 100%
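The jump from about 5% to 24%, 29%, 37%, 53% and then 100% follows from the overwrite schedule: on day 25 roughly every 5th row is set to LAX, then every 4th, 3rd, 2nd, and finally every row. The sketch below works through that arithmetic. It assumes a baseline LAX share of 5% (read off the earlier days in the output above) and that exactly 1/x of each day's rows are overwritten, which is only approximately true because the selection depends on the row index.

```python
baseline = 0.05  # assumed LAX share before any rows are overwritten

x = 5
expected_share = {}
for day in range(25, 32):
    overwritten = 1 / x  # fraction of the day's rows forced to LAX
    # Rows left untouched still contain the baseline share of LAX flights.
    expected_share[day] = overwritten + (1 - overwritten) * baseline
    if x > 1:
        x -= 1

for day, share in expected_share.items():
    print(f"Day {day}: about {share * 100:.1f}%")
```

The computed shares line up with the printed percentages to within rounding, which is a quick way to confirm that the injected drift behaves as intended.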

[8]:

grouped = df_daily.groupby('day')
batches = [g[1][columns] for g in grouped]

categorical_drift_series_updated = []
for (day, _), batch in zip(grouped, batches):
    categorical_drift_series_updated.append(calc_tv_mle([df_train_transformed], [batch]))

[9]:

plt.rcParams['figure.figsize'] = [10, 5]
fig, ax = plt.subplots()

plt.plot(days, categorical_drift_series, label='Maximum Likelihood No Drift')
plt.plot(days, categorical_drift_series_updated, label='Maximum Likelihood With Drift')

plt.legend()
plt.show()

[Figure: divergence between the training set and each December day for the categorical features, with and without drift]

As we can see above, the divergence rises starting on day 25, alerting us to the drift.
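One way to turn such a series into an automatic alert is to compare each day's divergence against a threshold derived from a period believed to be free of drift. The snippet below is a minimal sketch of that idea, not part of mvtk: the function name `flag_drift`, the mean-plus-three-standard-deviations rule, and the example values are all assumptions chosen for illustration.

```python
import numpy as np

def flag_drift(series, baseline_days, n_sigmas=3.0):
    """Return 1-based positions whose value exceeds mean + n_sigmas * std of the baseline."""
    values = np.asarray(series, dtype=float)
    baseline = values[:baseline_days]
    threshold = baseline.mean() + n_sigmas * baseline.std()
    return [i + 1 for i, v in enumerate(values) if v > threshold]

# Made-up divergence values: five quiet days followed by two drifting days.
example_series = [0.10, 0.12, 0.11, 0.13, 0.12, 0.30, 0.45]
flagged = flag_drift(example_series, baseline_days=5)
print(flagged)
```

In practice the baseline window and the multiplier would need tuning to the noise level of the real divergence series.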

Now let’s try the same exercise on the continuous features.

[10]:

from mvtk.supervisor.divergence import calc_tv_knn

columns = ['dep_time', 'sched_dep_time', 'dep_delay', 'arr_time', 'sched_arr_time', 'arr_delay', 'air_time', 'distance', 'hour', 'minute']

df_train_transformed = df_train[columns].fillna(0)

grouped = df_daily.groupby('day')
batches = [g[1][columns].fillna(0) for g in grouped]

continuous_drift_series = []
for (day, _), batch in zip(grouped, batches):
    continuous_drift_series.append(calc_tv_knn([df_train_transformed], [batch], k=20))

[11]:

plt.rcParams['figure.figsize'] = [10, 5]
fig, ax = plt.subplots()

plt.plot(days, continuous_drift_series, label='Total Variation KNN No Drift')

plt.legend()
plt.show()

[Figure: divergence between the training set and each December day for the continuous features, no drift]

Once again, there is little difference between the training set and each day’s data: the divergence does not exceed 0.5.

Let’s now assume that, starting on day 25, a growing share of flights is delayed each day, with all flights delayed from day 29 onward (ignoring null values). We then compute the divergences again and plot the results.

[12]:

x = 5

for i in range(25, 32):
    df_daily.loc[(df_daily['day'] == i) & (df_daily.index % x == 0), ['dep_delay']] += 1000
    if x > 1:
        x -= 1

Below is the percentage of departure delays that are 100 minutes or more on each day of our updated daily data set:

[13]:

# groupby yields (day, frame) pairs, so no manual day counter is needed
for day, df_day in df_daily.groupby('day'):
    # Null delays compare as False but still count in the denominator
    pct_delayed = (df_day['dep_delay'] >= 100).mean() * 100
    print(f'Day {day}: {round(pct_delayed)}%')

Day 1: 1%
Day 2: 2%
Day 3: 2%
Day 4: 1%
Day 5: 18%
Day 6: 4%
Day 7: 1%
Day 8: 6%
Day 9: 11%
Day 10: 6%
Day 11: 1%
Day 12: 1%
Day 13: 1%
Day 14: 8%
Day 15: 4%
Day 16: 3%
Day 17: 11%
Day 18: 2%
Day 19: 3%
Day 20: 3%
Day 21: 5%
Day 22: 7%
Day 23: 10%
Day 24: 1%
Day 25: 22%
Day 26: 27%
Day 27: 35%
Day 28: 51%
Day 29: 98%
Day 30: 99%
Day 31: 98%
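The last three days stop just short of 100% because rows with a null `dep_delay` stay null after 1000 is added: they never satisfy the `>= 100` test, yet they still count toward the day's total. The sketch below illustrates this with a small made-up series (the values are hypothetical, not drawn from the flights data).

```python
import numpy as np
import pandas as pd

# Four hypothetical delays, one of them missing.
delays = pd.Series([5.0, np.nan, 12.0, -3.0])

shifted = delays + 1000  # NaN + 1000 is still NaN

# Comparisons against NaN are False, so the missing row is never counted
# as delayed but remains in the denominator of the mean.
pct_delayed = (shifted >= 100).mean() * 100
n_missing = int(shifted.isna().sum())
print(pct_delayed, n_missing)
```

Here three of the four rows end up at or above 100, so the share is 75% rather than 100%, mirroring the small gap seen on days 29 through 31.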

[14]:

grouped = df_daily.groupby('day')
batches = [g[1][columns].fillna(0) for g in grouped]

continuous_drift_series_updated = []
for (day, _), batch in zip(grouped, batches):
    continuous_drift_series_updated.append(calc_tv_knn([df_train_transformed], [batch], k=20))

[15]:

plt.rcParams['figure.figsize'] = [10, 5]
fig, ax = plt.subplots()

plt.plot(days, continuous_drift_series, label='Total Variation KNN No Drift')
plt.plot(days, continuous_drift_series_updated, label='Total Variation KNN With Drift')

plt.legend()
plt.show()

[Figure: divergence between the training set and each December day for the continuous features, with and without drift]

As we can see above, this time the divergence rises starting on day 28, alerting us to the drift.