{ "cells": [ { "cell_type": "markdown", "id": "f71242e0", "metadata": {}, "source": [ "# Countering Sample Bias" ] }, { "cell_type": "markdown", "id": "87174b32", "metadata": {}, "source": [ "## Introduction\n", "\n", "This notebook illustrates how to estimate performance using a biased sample. Here, _biased_ carries its statistical meaning: the sample is not representative of the data at large. For example, if you are aggregating several surveys from different regions of a country and want to draw conclusions about the population of the entire country, you may wish to assign different weights to different surveys depending on the proportion of the population living in the region each survey addressed and the number of people who answered it." ] }, { "cell_type": "markdown", "id": "c01837a8", "metadata": {}, "source": [ "## Example Data\n", "\n", "Let's say we have two regions and two surveys. Both surveys asked the same questions:\n", "\n", "1. What is your age?\n", "2. How many years of education do you have?\n", "\n", "We want to build a model that predicts the number of years of education given age for people throughout the country. However, the country's population is split as follows:\n", "\n", "* Region 1: 80%\n", "* Region 2: 20%\n", "\n", "Each survey drew a different proportion of its respondents from each region and had a different total number of respondents. We generally can't just combine the survey results, because the pooled sample would not be representative of the population as a whole. However, we can assign _weights_ to respondents so that the weighted sample simulates a representative population, given the proportion of each survey's respondents that came from each region." ] }, { "cell_type": "code", "execution_count": 1, "id": "dc6517e8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | age | \n", "education | \n", "region | \n", "
---|---|---|---|
0 | \n", "22.028976 | \n", "0.0 | \n", "1 | \n", "
1 | \n", "44.912114 | \n", "8.0 | \n", "1 | \n", "
2 | \n", "26.045553 | \n", "3.0 | \n", "1 | \n", "
3 | \n", "31.072699 | \n", "2.0 | \n", "1 | \n", "
4 | \n", "43.903140 | \n", "12.0 | \n", "1 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "
799995 | \n", "80.995531 | \n", "1.0 | \n", "2 | \n", "
799996 | \n", "54.082933 | \n", "3.0 | \n", "2 | \n", "
799997 | \n", "35.139835 | \n", "0.0 | \n", "2 | \n", "
799998 | \n", "80.768232 | \n", "3.0 | \n", "2 | \n", "
799999 | \n", "60.227129 | \n", "0.0 | \n", "2 | \n", "
4000000 rows × 3 columns
\n", "\n", " | age | \n", "education | \n", "region | \n", "survey | \n", "
---|---|---|---|---|
570896 | \n", "90.032094 | \n", "10.0 | \n", "2 | \n", "1 | \n", "
1908479 | \n", "40.606486 | \n", "14.0 | \n", "1 | \n", "2 | \n", "
587338 | \n", "40.794540 | \n", "2.0 | \n", "2 | \n", "2 | \n", "
1564287 | \n", "26.087168 | \n", "0.0 | \n", "1 | \n", "1 | \n", "
728612 | \n", "95.867544 | \n", "2.0 | \n", "2 | \n", "1 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
750767 | \n", "67.466088 | \n", "6.0 | \n", "2 | \n", "1 | \n", "
671308 | \n", "29.609484 | \n", "1.0 | \n", "2 | \n", "1 | \n", "
229642 | \n", "24.379894 | \n", "0.0 | \n", "2 | \n", "1 | \n", "
2139865 | \n", "18.579931 | \n", "1.0 | \n", "1 | \n", "2 | \n", "
632943 | \n", "47.067253 | \n", "2.0 | \n", "2 | \n", "1 | \n", "
29845 rows × 4 columns
\n", "\n", " | General Population | \n", "Surveyed Population | \n", "
---|---|---|
1 | \n", "0.8 | \n", "0.300385 | \n", "
2 | \n", "0.2 | \n", "0.699615 | \n", "
\n", "