{ "cells": [ { "cell_type": "markdown", "id": "440be274-51e1-447f-aa70-ef832e5f9449", "metadata": {}, "source": [ "# Bias-Variance Visualizations\n", "\n", "In this example, we look at four models that cover the four possible combinations of high and low bias and variance. For each model, we plot histograms of the prediction error over five iterations of training and testing, then calculate the average loss, average bias, average variance, and net variance over 100 iterations of training and testing." ] }, { "cell_type": "code", "execution_count": 1, "id": "84f10ff2", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", "I0000 00:00:1701450524.267373 1 tfrt_cpu_pjrt_client.cc:349] TfrtCpuClient created.\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "import torch\n", "import torch.nn as nn\n", "\n", "from sklearn.datasets import fetch_california_housing\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.utils import resample\n", "\n", "from mvtk.bias_variance import bias_variance_compute, bias_variance_mse, bootstrap_train_and_predict\n", "from mvtk.bias_variance.estimators import EstimatorWrapper, PyTorchEstimatorWrapper, SciKitLearnEstimatorWrapper" ] }, { "cell_type": "code", "execution_count": 2, "id": "bbebb612-f174-43f4-ba3e-248162bdf145", "metadata": {}, "outputs": [], "source": [ "random_state=123\n", "trials_graph=5\n", "trials_full=100\n", "bins=20" ] }, { "cell_type": "markdown", "id": "056ddfd2", "metadata": {}, "source": [ "## Load the example dataset and create helper functions" ] }, { "cell_type": "code", "execution_count": 3, "id": "1b6763e6", "metadata": {}, "outputs": [], "source": [ "housing = fetch_california_housing()\n", "X = pd.DataFrame(housing.data, 
columns=housing.feature_names)\n", "y = housing.target\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_state)" ] }, { "cell_type": "code", "execution_count": 4, "id": "ddf47d23-ec6d-4870-b77d-4269347bb86e", "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "def predict_trials(estimator, X_train, y_train, X_test, iterations, random_state, fit_kwargs=None, predict_kwargs=None):\n", "    # Silence warnings raised while the estimator is refit on each trial\n", "    with warnings.catch_warnings():\n", "        warnings.simplefilter('ignore')\n", "        # One row of test-set predictions per trial\n", "        predictions = np.zeros((iterations, X_test.shape[0]), dtype=np.float64)\n", "\n", "        for i in range(iterations):\n", "            predictions[i] = bootstrap_train_and_predict(estimator, X_train, y_train, X_test, random_state=random_state,\n", "                                                         fit_kwargs=fit_kwargs, predict_kwargs=predict_kwargs)\n", "\n", "    return predictions" ] }, { "cell_type": "code", "execution_count": 5, "id": "5f1b9ca5-1ea6-4b89-af8c-f8386f2a8220", "metadata": {}, "outputs": [], "source": [ "def graph_trials(predictions, y_test, bins):\n", "    # Signed error (prediction - label), transposed so each trial is one histogram series\n", "    error_graph = np.swapaxes(predictions - y_test, 0, 1)\n", "\n", "    plt.hist(error_graph, bins, density=True, label=[f'Trial {x}' for x in range(1, predictions.shape[0] + 1)])\n", "    plt.xlabel('prediction error')\n", "    plt.legend()" ] }, { "cell_type": "markdown", "id": "f2e1d891-fde6-4635-997b-346b385d4560", "metadata": {}, "source": [ "## Label Distribution\n", "\n", "First, let's take a look at the distribution of the labels. Notice that most label values lie around 1 and 2, with far fewer around 5." ] }, { "cell_type": "code", "execution_count": 6, "id": "0197d6ae-7ec3-400e-8540-efbd7da1d08f", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.hist(y, density=True)\n", "plt.savefig('bias_variance_label_distribution.png')" ] }, { "cell_type": "markdown", "id": "73a13343", "metadata": {}, "source": [ "## High Bias Low Variance Example\n", "\n", "We will introduce an artificial bias to a Scikit-Learn Linear Regression model by adding 10 to every label in the training set. Because values greater than 5 would be outliers in the full label set, every shifted training label is an outlier, so we are fitting the model against outliers." ] }, { "cell_type": "code", "execution_count": 7, "id": "5ac18d40-42f4-45a3-854f-31311e39eab0", "metadata": {}, "outputs": [], "source": [ "model_bias = LinearRegression()\n", "model_bias_wrapped = SciKitLearnEstimatorWrapper(model_bias)\n", "\n", "# add artificial bias to training labels\n", "y_train_bias = y_train + 10" ] }, { "cell_type": "code", "execution_count": 8, "id": "8b5a704a-fb61-4adf-9790-28730205a2d0", "metadata": {}, "outputs": [], "source": [ "pred_bias = predict_trials(model_bias_wrapped, X_train, y_train_bias, X_test, trials_graph, random_state)" ] }, { "cell_type": "code", "execution_count": 9, "id": "1a6d9156-a75e-419d-8f18-8dd7606e9ecf", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "graph_trials(pred_bias, y_test, bins)\n", "plt.savefig('high_bias_low_variance.png')" ] }, { "cell_type": "markdown", "id": "f4b6fc0e-30a3-4add-b32c-adb7f5f4eefd", "metadata": {}, "source": [ "Notice in the figure above that the model error is very consistent among the trials and is not centered around 0.\n", "\n", "Next we calculate the values over 100 trials." ] }, { "cell_type": "code", "execution_count": 10, "id": "f304dcb0-6d5d-4331-b01a-cd42c4ef72d3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "average loss: 100.73667218\n", "average bias: 100.64990963\n", "average variance: 0.08676256\n", "net variance: 0.08676256\n" ] } ], "source": [ "avg_loss, avg_bias, avg_var, net_var = bias_variance_compute(model_bias_wrapped, X_train, y_train_bias, X_test, y_test, \n", " iterations=trials_full, random_state=random_state, \n", " decomp_fn=bias_variance_mse)\n", "\n", "print(f'average loss: {avg_loss:10.8f}')\n", "print(f'average bias: {avg_bias:10.8f}')\n", "print(f'average variance: {avg_var:10.8f}')\n", "print(f'net variance: {net_var:10.8f}')" ] }, { "cell_type": "markdown", "id": "436f0166-7c55-4687-a747-cafa9ac0e591", "metadata": {}, "source": [ "## Low Bias High Variance Example\n", "\n", "To simulate a higher variance, we will introduce 8 random \"noise\" features to the data set. We will also reduce the size of the training set and train a PyTorch neural network over a low number of epochs." 
] }, { "cell_type": "code", "execution_count": 11, "id": "e0dc50a3-3b03-4c28-85f1-f9d603e5d6f3", "metadata": {}, "outputs": [], "source": [ "class ModelPyTorch(nn.Module):\n", "    def __init__(self):\n", "        super().__init__()\n", "        self.linear1 = nn.Linear(16, 64)\n", "        self.linear2 = nn.Linear(64, 32)\n", "        self.linear3 = nn.Linear(32, 16)\n", "        self.linear4 = nn.Linear(16, 8)\n", "        self.linear5 = nn.Linear(8, 1)\n", "\n", "    def forward(self, x):\n", "        x = self.linear1(x)\n", "        x = self.linear2(x)\n", "        x = self.linear3(x)\n", "        x = self.linear4(x)\n", "        x = self.linear5(x)\n", "        return x" ] }, { "cell_type": "code", "execution_count": 12, "id": "3ecc2c3e-cddc-4a40-ae0c-c3870d180abc", "metadata": {}, "outputs": [], "source": [ "model_variance = ModelPyTorch()\n", "optimizer = torch.optim.Adam(model_variance.parameters(), lr=0.001)\n", "loss_fn = nn.MSELoss()" ] }, { "cell_type": "code", "execution_count": 13, "id": "fd0a853d-e74a-43e4-8d50-ba286323f13c", "metadata": {}, "outputs": [], "source": [ "def optimizer_generator(x):\n", "    return torch.optim.Adam(x.parameters(), lr=0.001)\n", "\n", "model_variance_wrapped = PyTorchEstimatorWrapper(model_variance, optimizer_generator, loss_fn)" ] }, { "cell_type": "code", "execution_count": 14, "id": "88c4a23c-f44f-4c93-9664-7b828d2a3200", "metadata": {}, "outputs": [], "source": [ "X_train_torch = torch.FloatTensor(np.append(X_train.values[:100], 1000 * np.random.random_sample((100, 8)), axis=1))\n", "X_test_torch = torch.FloatTensor(np.append(X_test.values, 1000 * np.random.random_sample((6192, 8)), axis=1))\n", "y_train_torch = torch.FloatTensor(y_train[:100]).reshape(-1, 1)\n", "y_test_torch = torch.FloatTensor(y_test).reshape(-1, 1)" ] }, { "cell_type": "code", "execution_count": 15, "id": "036c2e2d-245c-4ab6-92fd-da72ba3dc5c2", "metadata": {}, "outputs": [], "source": [ "pred_variance = predict_trials(model_variance_wrapped, X_train_torch, y_train_torch, X_test_torch, trials_graph, random_state,\n", " 
fit_kwargs={'epochs': 20})" ] }, { "cell_type": "code", "execution_count": 16, "id": "3cb83233-f8aa-4053-84a9-417325cc89ba", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "graph_trials(pred_variance, y_test, bins)\n", "plt.savefig('low_bias_high_variance.png')" ] }, { "cell_type": "markdown", "id": "8eff5c60-1cd6-45a3-bac1-cd9fe55d9486", "metadata": {}, "source": [ "Notice in the figure above that the model error has different distributions among the trials and centers mainly around 0.\n", "\n", "Next we calculate the values over 100 trials." ] }, { "cell_type": "code", "execution_count": 17, "id": "a4b27030-2e59-4209-8ec1-bf9635abea73", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "average loss: 95.16939162\n", "average bias: 2.09532483\n", "average variance: 93.07406679\n", "net variance: 93.07406679\n" ] } ], "source": [ "avg_loss, avg_bias, avg_var, net_var = bias_variance_compute(model_variance_wrapped, X_train_torch, y_train_torch, \n", " X_test_torch, y_test, iterations=trials_full, \n", " random_state=random_state, decomp_fn=bias_variance_mse, \n", " fit_kwargs={'epochs': 20})\n", "\n", "print(f'average loss: {avg_loss:10.8f}')\n", "print(f'average bias: {avg_bias:10.8f}')\n", "print(f'average variance: {avg_var:10.8f}')\n", "print(f'net variance: {net_var:10.8f}')" ] }, { "cell_type": "markdown", "id": "35824e2b-59f0-476e-a9df-7f2ecd168771", "metadata": {}, "source": [ "## High Bias High Variance Example\n", "\n", "We will perform a combination of the techniques from the high bias low variance example and the low bias high variance example and train another PyTorch neural network." 
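, "\n", "\n", "A note on reading the printed values: the results of the two examples above are consistent with the standard squared-error decomposition, in which average loss is the sum of average bias and average variance. As a sketch for a single test point, assuming $y$ is the label, $\\hat{y}_t$ the prediction from trial $t$ of $T$, and $\\bar{y}$ the mean prediction over trials (this notation is ours for illustration, not the library's):\n", "\n", "$$\\frac{1}{T}\\sum_{t=1}^{T}(\\hat{y}_t - y)^2 = (\\bar{y} - y)^2 + \\frac{1}{T}\\sum_{t=1}^{T}(\\hat{y}_t - \\bar{y})^2$$\n", "\n", "For instance, the low bias high variance results satisfy $2.09532483 + 93.07406679 = 95.16939162$. Shifting the training labels should mainly raise the first term on the right, while the noise features and the small training set should mainly raise the second."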
] }, { "cell_type": "code", "execution_count": 18, "id": "91897cf7-c5ca-4a1f-8d72-7e50e96d96a6", "metadata": {}, "outputs": [], "source": [ "model_bias_variance = ModelPyTorch()\n", "optimizer = torch.optim.Adam(model_bias_variance.parameters(), lr=0.001)\n", "loss_fn = nn.MSELoss()" ] }, { "cell_type": "code", "execution_count": 19, "id": "633a2bfa-58b5-4f9b-8a67-93ff1cfd8c9b", "metadata": {}, "outputs": [], "source": [ "# Add artificial bias to the training labels\n", "y_train_torch_bias_variance = torch.FloatTensor(y_train[:100] + 10).reshape(-1, 1)" ] }, { "cell_type": "code", "execution_count": 20, "id": "65aab378-ce2f-45e4-b209-3d45fd7c998e", "metadata": {}, "outputs": [], "source": [ "def optimizer_generator(x):\n", "    return torch.optim.Adam(x.parameters(), lr=0.001)\n", "\n", "model_bias_variance_wrapped = PyTorchEstimatorWrapper(model_bias_variance, optimizer_generator, loss_fn)" ] }, { "cell_type": "code", "execution_count": 21, "id": "209816f1-8a37-4261-84d5-5c4e4678e696", "metadata": {}, "outputs": [], "source": [ "pred_bias_variance = predict_trials(model_bias_variance_wrapped, X_train_torch, y_train_torch_bias_variance, \n", " X_test_torch, trials_graph, random_state,\n", " fit_kwargs={'epochs': 20})" ] }, { "cell_type": "code", "execution_count": 22, "id": "0606bb52-1847-4a67-9056-94ca07edb175", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "graph_trials(pred_bias_variance, y_test, bins)\n", "plt.savefig('high_bias_high_variance.png')" ] }, { "cell_type": "markdown", "id": "0ca02163-0d18-48ca-892d-8bea81e10669", "metadata": {}, "source": [ "Notice in the figure above that the model error has different distributions among the trials and is not centered around 0.\n", "\n", "Next we calculate the values over 100 trials." ] }, { "cell_type": "code", "execution_count": 23, "id": "8c63ab2a-8f56-4f5d-afb0-f9bfa31dec4d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "average loss: 172.72894068\n", "average bias: 79.38170440\n", "average variance: 93.34723628\n", "net variance: 93.34723628\n" ] } ], "source": [ "avg_loss, avg_bias, avg_var, net_var = bias_variance_compute(model_bias_variance_wrapped, X_train_torch, y_train_torch_bias_variance, \n", " X_test_torch, y_test, iterations=trials_full, \n", " random_state=random_state, decomp_fn=bias_variance_mse, \n", " fit_kwargs={'epochs': 20})\n", "\n", "print(f'average loss: {avg_loss:10.8f}')\n", "print(f'average bias: {avg_bias:10.8f}')\n", "print(f'average variance: {avg_var:10.8f}')\n", "print(f'net variance: {net_var:10.8f}')" ] }, { "cell_type": "markdown", "id": "d044dbcd-8325-4c24-a4a3-83b73c61ef01", "metadata": {}, "source": [ "## Low Bias Low Variance Example\n", "\n", "Now we will train a Scikit-Learn Linear Regression model with no artificial bias." 
] }, { "cell_type": "code", "execution_count": 24, "id": "78403d2e-2655-40c0-9357-837a2ed1e252", "metadata": {}, "outputs": [], "source": [ "# Low bias low variance\n", "model = LinearRegression()\n", "\n", "model_wrapped = SciKitLearnEstimatorWrapper(model)" ] }, { "cell_type": "code", "execution_count": 25, "id": "2cae33f2-6961-415a-aa0c-5406f7be8e92", "metadata": {}, "outputs": [], "source": [ "pred = predict_trials(model_wrapped, X_train, y_train, X_test, trials_graph, random_state)" ] }, { "cell_type": "code", "execution_count": 26, "id": "21673f12-c5e5-4d59-8497-f76c6ab74d84", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "graph_trials(pred, y_test, bins)\n", "plt.savefig('low_bias_low_variance.png')" ] }, { "cell_type": "markdown", "id": "e71ab6f6-fb30-4f77-b9b7-ecb3dcf294b8", "metadata": {}, "source": [ "Notice in the figure above that the model error is very consistent among the trials and centers mainly around 0.\n", "\n", "Next we calculate the values over 100 trials." ] }, { "cell_type": "code", "execution_count": 27, "id": "590ad983-c81a-426f-b709-5d6c7642d0f6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "average loss: 0.60725048\n", "average bias: 0.52048793\n", "average variance: 0.08676256\n", "net variance: 0.08676256\n" ] } ], "source": [ "avg_loss, avg_bias, avg_var, net_var = bias_variance_compute(model_wrapped, X_train, y_train, X_test, y_test, iterations=trials_full, \n", " random_state=random_state, decomp_fn=bias_variance_mse)\n", "\n", "print(f'average loss: {avg_loss:10.8f}')\n", "print(f'average bias: {avg_bias:10.8f}')\n", "print(f'average variance: {avg_var:10.8f}')\n", "print(f'net variance: {net_var:10.8f}')" ] }, { "cell_type": "code", "execution_count": null, "id": "b5091f45-727c-4ae2-bcc7-a2af108dadc1", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }