Dataset Bug Detection

In this example, we demonstrate how to detect bugs in a dataset, using the public Airlines dataset.

[ ]:
# Since we use the category_encoders library to perform binary encoding on some of the features in this demo,
# we'll need to install it.
!pip install category_encoders
[1]:
import pandas
pandas.options.display.max_rows=5 # restrict to 5 rows on display

df = pandas.read_csv("https://raw.githubusercontent.com/Devvrat53/Flight-Delay-Prediction/master/Data/flight_data.csv")
df['date'] = pandas.to_datetime(df[['year', 'month', 'day']])
df['day_index'] = (df['date'] - df['date'].min()).dt.days
df['DayOfWeek'] = df['date'].dt.day_name()
df['Month'] = df['date'].dt.month_name()
df
[1]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... dest air_time distance hour minute time_hour date day_index DayOfWeek Month
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA ... IAH 227.0 1400 5 15 1/1/2013 5:00 2013-01-01 0 Tuesday January
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA ... IAH 227.0 1416 5 29 1/1/2013 5:00 2013-01-01 0 Tuesday January
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336774 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ ... CLE NaN 419 11 59 30-09-2013 11:00 2013-09-30 272 Monday September
336775 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ ... RDU NaN 431 8 40 30-09-2013 08:00 2013-09-30 272 Monday September

336776 rows × 23 columns

Prepare daily data

Let’s assume that we run new data through our model each day. For simplicity, we will look only at the last 11 days of data (December 21 through 31).

[2]:
df_daily = df[df['month'] > 11]
[3]:
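# .copy() gives us an independent frame, so the in-place assignment later on
# does not act on a slice of df
df_daily = df_daily[df_daily['day'] > 20].copy()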
[4]:
df_daily
[4]:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... dest air_time distance hour minute time_hour date day_index DayOfWeek Month
101780 2013 12 21 2.0 2359 3.0 445.0 445 0.0 B6 ... PSE 206.0 1617 23 59 21-12-2013 23:00 2013-12-21 354 Saturday December
101781 2013 12 21 29.0 2040 229.0 138.0 2220 198.0 WN ... MDW 117.0 725 20 40 21-12-2013 20:00 2013-12-21 354 Saturday December
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
111294 2013 12 31 NaN 600 NaN NaN 735 NaN UA ... ORD NaN 719 6 0 31-12-2013 06:00 2013-12-31 364 Tuesday December
111295 2013 12 31 NaN 830 NaN NaN 1154 NaN UA ... LAX NaN 2475 8 30 31-12-2013 08:00 2013-12-31 364 Tuesday December

9516 rows × 23 columns
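The two filters above rely on knowing that the data ends on December 31. As a minimal sketch of a more general approach, the helper below selects the trailing window from the date column itself. The name last_n_days and the synthetic toy frame are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def last_n_days(df, n, datecol="date"):
    """Return the rows whose date falls within the last n calendar days of df."""
    # The window is inclusive of the most recent date, hence n - 1.
    cutoff = df[datecol].max() - pd.Timedelta(days=n - 1)
    return df[df[datecol] >= cutoff].copy()

# Synthetic stand-in for the flight data: one row per day of 2013.
dates = pd.date_range("2013-01-01", "2013-12-31", freq="D")
toy = pd.DataFrame({"date": dates, "day": dates.day, "month": dates.month})

recent = last_n_days(toy, 11)
print(recent["day"].tolist())
```

On the real frame, the equivalent call would be last_n_days(df, 11), assuming the date column built in the first cell.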

Bug Detection

Now we want to find any bugs in the daily sets of data that we feed to our model. Note that we apply binary encoding to the categorical columns (carrier, origin, and dest) so that we can pass the data directly to the divergence estimation function (calc_tv_knn in the cell below). We do this for performance reasons, instead of using the hybrid estimation, and because binary encoding strikes a balance between plain index encoding and one-hot encoding.
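To illustrate why binary encoding sits between index encoding and one-hot encoding, here is a hand-rolled sketch that writes each category's integer code as a row of bits. It is not the category_encoders implementation, and the helper name binary_encode is hypothetical; it only shows the idea that the number of columns grows roughly with the logarithm of the number of categories.

```python
import numpy as np
import pandas as pd

def binary_encode(series):
    """Encode a categorical series as the bits of a 1-based integer code."""
    # Assumption for this sketch: codes start at 1, so 0 stays free for missing values.
    codes = pd.Categorical(series).codes.astype(np.int64) + 1
    n_bits = max(1, int(codes.max()).bit_length())
    # Shift each code right by every bit position (most significant first) and mask.
    shifts = np.arange(n_bits - 1, -1, -1)
    bits = (codes[:, None] >> shifts) & 1
    cols = [f"{series.name}_{i}" for i in range(n_bits)]
    return pd.DataFrame(bits, columns=cols, index=series.index)

carrier = pd.Series(["UA", "B6", "WN", "MQ", "UA"], name="carrier")
encoded = binary_encode(carrier)
one_hot = pd.get_dummies(carrier)

print(encoded)
print("binary columns:", encoded.shape[1], "one-hot columns:", one_hot.shape[1])
```

With four distinct carriers the sketch needs three bit columns where one-hot encoding needs four; the gap widens as the number of categories grows, which matters for a column like dest.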

[5]:
import category_encoders as ce
from mvtk.supervisor.utils import compute_divergence_crosstabs
from mvtk.supervisor.divergence import calc_tv_knn

columns = ['dep_time', 'sched_dep_time', 'dep_delay', 'arr_time', 'sched_arr_time', 'arr_delay', 'air_time', 'distance', 'hour', 'minute', 'carrier', 'origin', 'dest']

encoder = ce.BinaryEncoder(cols=['carrier', 'origin', 'dest'])
encoder.fit(df_daily[columns + ['day']])
df_daily_encoded = encoder.transform(df_daily[columns + ['day']].fillna(0))

f = lambda x, y: calc_tv_knn(x, y, k = 26)
result = compute_divergence_crosstabs(df_daily_encoded, datecol='day', divergence=f)
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[6]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(result, cmap='coolwarm', linewidths=0.30, annot=True)
plt.show()
[Figure: heatmap of the pairwise divergence between days 21 through 31]

As you can see from the heatmap above, although there are some divergences between the days, there is nothing that is too alarming.
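To make the matrix behind the heatmap concrete, here is a minimal sketch of a pairwise divergence table between per-day subsets. It uses a simple histogram-based total variation on a single feature as a stand-in for calc_tv_knn, and the helpers tv_hist and pairwise_divergence are hypothetical names; the actual mvtk functions may compute and arrange the values differently.

```python
import numpy as np
import pandas as pd

def tv_hist(x, y, bins=20):
    """Total variation distance between two samples, estimated from shared histograms."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    return 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()

def pairwise_divergence(df, datecol, feature, divergence):
    """Build a square table with one divergence per pair of groups in datecol."""
    groups = {key: g[feature].to_numpy() for key, g in df.groupby(datecol)}
    keys = sorted(groups)
    out = pd.DataFrame(0.0, index=keys, columns=keys)
    for a in keys:
        for b in keys:
            if a != b:
                out.loc[a, b] = divergence(groups[a], groups[b])
    return out

# Synthetic data: days 1 and 2 share a distribution, day 3 is deliberately degenerate.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "day": np.repeat([1, 2, 3], 500),
    "x": np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 1, 500), np.zeros(500)]),
})
matrix = pairwise_divergence(toy, "day", "x", tv_hist)
print(matrix.round(2))
```

In this toy table, the entries between days 1 and 2 stay small because both are drawn from the same distribution, while every entry involving day 3 is large. Small nonzero values between similar days are expected from sampling noise alone.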

Let’s now update our data set to contain a “bug” in the “sched_dep_time” feature. For day 30, we set every value of that feature to null, which the fillna(0) call in the encoding step then translates to 0.

[7]:
df_daily.loc[df_daily['day'] == 30, ['sched_dep_time']] = None

Below is the percentage of scheduled departure times that are empty, per day, in our updated daily data set:

[8]:
# groupby yields (day, sub-frame) pairs, so no manual day counter is needed
for day, df_day in df_daily.groupby('day'):
    # share of missing scheduled departure times for this day, as a percentage
    pct = df_day['sched_dep_time'].isnull().mean() * 100
    print(f'Day {day}: {pct:.0f}%')
Day 21: 0%
Day 22: 0%
Day 23: 0%
Day 24: 0%
Day 25: 0%
Day 26: 0%
Day 27: 0%
Day 28: 0%
Day 29: 0%
Day 30: 100%
Day 31: 0%
[9]:
from mvtk.supervisor.divergence import calc_tv_knn

encoder = ce.BinaryEncoder(cols=['carrier', 'origin', 'dest'])
encoder.fit(df_daily[columns + ['day']])
df_daily_encoded = encoder.transform(df_daily[columns + ['day']].fillna(0))

f = lambda x, y: calc_tv_knn(x, y, k = 26)
result = compute_divergence_crosstabs(df_daily_encoded, datecol='day', divergence=f)
[10]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(result, cmap='coolwarm', linewidths=0.30, annot=True)
plt.show()
[Figure: heatmap of the pairwise divergence between days 21 through 31, after nulling sched_dep_time for day 30]

As we can see above, our heatmap now clearly shows that we have a “bug” in our day 30 dataset.
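Reading the heatmap by eye works for a handful of days. As a rough sketch of how the same judgment could be automated, the helper below scores each day by its mean divergence from all other days and reports the highest-scoring one. The function name most_divergent_day is hypothetical, and the matrix values are made up for illustration; they are not the numbers produced by the cells above.

```python
import numpy as np
import pandas as pd

def most_divergent_day(matrix):
    """Return the label with the highest mean off-diagonal divergence, plus all scores."""
    values = matrix.to_numpy(dtype=float)
    off_diag = ~np.eye(len(matrix), dtype=bool)
    # Mask the diagonal so a day's zero divergence from itself does not dilute its score.
    masked = np.where(off_diag, values, np.nan)
    scores = pd.Series(np.nanmean(masked, axis=1), index=matrix.index)
    return scores.idxmax(), scores

# Illustrative values only (assumed, not taken from the notebook output).
days = [28, 29, 30, 31]
toy_matrix = pd.DataFrame(
    [[0.00, 0.05, 0.90, 0.07],
     [0.05, 0.00, 0.92, 0.06],
     [0.90, 0.92, 0.00, 0.91],
     [0.07, 0.06, 0.91, 0.00]],
    index=days, columns=days,
)

flagged, scores = most_divergent_day(toy_matrix)
print("most divergent day:", flagged)
```

A fixed threshold on the score, tuned on days known to be healthy, would be one way to turn this into an alert; the right value depends on the data and the divergence estimator in use.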