Assessing Credibility From Sample Size

The goal of this notebook is to be able to answer two types of questions:

  1. For my KPI (precision, recall, true positive rate, false negative rate, ROC AUC, etc), what is the probability that after many more measurements, it could in fact be very different from what I am currently reporting?

  2. How can I assign and interpret error bars that quantify the expected variation of my KPIs (precision, recall, true positive rate, false negative rate, ROC AUC, etc) given the size of the dataset I am using?

Throughout this tutorial, you should keep in mind the differences between “experimental probability” and “theoretical probability” and how experimental probability will approach some theoretical value (though not necessarily your theoretical value!) as you gather more and more measurements.

Quantifying Credibility of Recall

Recall is defined as the proportion of positive instances that your binary classifier labeled as positive. This could be an important metric for something like medical diagnosis of a rare condition, since there would likely be great concern that the model is catching the vast majority of the positive cases. Suppose a data scientist that you are working with says the model has a recall of 97%, meaning it catches 97% of cases that would be labeled as potential positive diagnoses at this phase of a pipeline. Since recall is concerned only with positive instances, you should be wary of whether they had enough positive examples to make that claim with any degree of certainty. You might ask your data scientist, “So what was your sample size?”. And your data scientist might confidently tout that they used over 100,000 instances! Wow, that’s a lot! But since you are a smart engineer, you follow up, “So how many of those were positive?” because again, recall is only concerned with positive instances. They will probably report a much smaller number: “Well, since positive cases are so rare, we only had 100 examples in our dataset”. This would be typical.

We have two possibilities for every positive instance: either the model gives it a score above the data scientist’s threshold and it is reported as positive, or it does not. 97% recall means that 97% of the 100 positive examples were reported as positive. You can ask your data scientist to confirm that the model reported 97 true positives and 3 false negatives. The question is now: if we’ve seen 97 true positives and 3 false negatives, what is the probability that after an infinite number of samples the proportions would look very different? You might intuitively understand that we cannot be sure a coin is fair (would have an equal number of heads and tails after an infinite number of flips) after 3 flips. This tutorial is about going a step further and assigning a probability that after an infinite number of flips the coin would have any given amount of bias (e.g. that only one fourth of the flips would be heads).

So going back to recall, let’s say you asked what proportion of missed positive instances would be unacceptably high, and you were informed that more than 5% would not be acceptable. This means that a recall lower than 95% would be unacceptable. The question is, given 97 correctly identified positive instances and 3 false negatives, what is the probability that after an infinite number of samples, the recall would fall below 95% rather than hold near the current estimate of 97%?

The Beta Distribution

A very popular and successful approach to answering this question is to query the beta distribution. Much as a Gaussian distribution (bell curve) is characterized by its mean and standard deviation, the beta distribution describes the probability that a coin has a given degree of bias, and is characterized by the number of heads and tails we have seen so far (rather than a mean and standard deviation). In this case, recall is like our “coin” and each “flip” is either a true positive or a false negative. The beta distribution is implemented in scipy.stats, and in the code below, we will ask it what the probability is that the recall is less than 95% given 97 true positives and 3 false negatives.

Recall

What is the probability that recall is unacceptably low?

[1]:
from mvtk import credibility

true_positives = 97
false_negatives = 3
cutoff = 95/100
credibility.prob_below(true_positives, false_negatives, cutoff) * 100
[1]:
25.085987521946958

Interpreting the Result

Inside of the credibility.prob_below function, there is a call to beta that takes the following as arguments:

  1. The number of instances of heads (true positives)

  2. The number of instances of tails (false negatives)

We add 1 to each count to encode a belief that both outcomes are at least possible, which is a common rule of thumb in practice. This creates an instance of a beta distribution specified by the number of true positives (heads if this were a coin) and false negatives (tails if this were a coin). .cdf calculates the probability that after an infinite number of samples, the proportion of true positives (heads) would be less than our desired cutoff of 95/100 = 0.95.

We multiply the resulting probability by 100 to report the answer as a percentage. Based on the above readout, there is a ~25% chance that after collecting more and more data, the proportion of true_positives would ultimately fall below the cutoff of 95%.
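
Since the text above spells out what credibility.prob_below does internally, the same number can be sketched directly with scipy.stats. The snippet below mirrors that description (add 1 to each count, then evaluate the CDF at the cutoff); treat it as an illustration rather than the library’s exact code.

from scipy.stats import beta

true_positives = 97
false_negatives = 3
cutoff = 95 / 100

# Add 1 to each count so both outcomes are treated as at least possible,
# then evaluate the cumulative distribution function at the cutoff.
print(beta(1 + true_positives, 1 + false_negatives).cdf(cutoff) * 100)  # ~25.1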

Contrary to common belief, there is nothing special about 95% in any of this analysis. The cutoff and how the resulting probability should be handled is entirely up to the experts in the field that you are predicting classifications for. All you can do is ask “What would be unacceptable?” and report a probability that after gathering more data a given KPI would in fact be unacceptable. It is up to those experts to request more data be gathered or the model be retrained.

Credible Intervals of Recall

Your manager is surprised at how likely it is that recall is unacceptably low. It raises a new question: If there is a 25% chance that recall is below 95%, could we prescribe an interpretable notion of margin of error to the original recall estimate? That is, could we say something like, “While the best estimate of recall is 97%, there is a high probability that the true value is between this lower bound and this upper bound”. In general, there are many such intervals. We could, for example, say there is a 50% chance the true recall will be between 0 and the median of beta(1 + true_positives, 1 + false_negatives). We could just as accurately say there is a 50% chance the true recall will be between the median and 100%. However, it should be somewhat intuitive that the shortest interval that correctly answers the question “For what interval is there a 50% chance that the true value lies within it?” is probably most convenient to work with.

The following snippet can be used to find the shortest interval for a given “credibility level” (such as 50% for the examples mentioned so far) for a beta distribution.

[2]:
# this is where we say "we want the shortest interval with a 50% chance of containing the true recall"
lower_bound, upper_bound = credibility.credible_interval(true_positives, false_negatives, 0.5)
print(lower_bound * 100, upper_bound * 100)
95.84026690789888 98.20175262774434

In this instance, there is a 50% probability the true recall lies between 95.8% and 98.2%.
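
If you are curious how such a shortest interval can be found, here is a rough sketch of one approach: sweep over candidate intervals of the beta distribution that each contain the requested probability mass and keep the narrowest one. The shortest_credible_interval helper below is hypothetical and not necessarily how credibility.credible_interval is implemented.

import numpy as np
from scipy.stats import beta

def shortest_credible_interval(heads, tails, credibility_level, grid=10_000):
    # Beta distribution with the same add-one rule described earlier.
    dist = beta(1 + heads, 1 + tails)
    # Every interval [ppf(p), ppf(p + credibility_level)] contains the requested
    # probability mass; sweep p and keep the narrowest such interval.
    p = np.linspace(0, 1 - credibility_level, grid)
    lower = dist.ppf(p)
    upper = dist.ppf(p + credibility_level)
    best = np.argmin(upper - lower)
    return lower[best], upper[best]

print(shortest_credible_interval(97, 3, 0.5))  # roughly (0.958, 0.982)

With a fine enough grid, this should land close to the 95.8% and 98.2% bounds reported above.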

ROC AUC

What is the probability that ROC AUC is unacceptably low?

As you may recall, ROC AUC is a very important KPI for binary classifiers. It is calculated by computing the area under a curve (AUC) that plots the true positive rate (proportion of positive instances reported by the model as being positive) against the false positive rate (proportion of negative instances reported as being positive). This curve is traced out over different thresholds (as you may remember, we can adjust our thresholds to balance false positives and false negatives). However, for mathematical reasons we will not delve into further, ROC AUC turns out to be equal to the probability that a randomly selected positive instance would be scored higher than a randomly selected negative instance. Therefore, our “how fair is this coin?” analysis still holds, but with a twist.

In this example, your data scientist reports a 92% ROC AUC with a validation set that consists of 10 positive instances and 90 negative instances. They say they would have disqualified the model with an AUC of less than 90%. You can again apply beta distributions to determine the probability that given an infinite validation set, the AUC would be less than 90%.

Since the AUC is identical to the probability that a randomly chosen positive instance would be scored higher than a randomly chosen negative instance, we could equivalently compute it by comparing every positive instance against every negative instance and counting the proportion of pairs in which the positive instance is scored higher by our model.
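
As a quick toy illustration of that pairwise counting view (using made-up scores, not your data scientist’s actual model), the computation fits in a few lines of numpy:

import numpy as np

rng = np.random.default_rng(0)
positive_scores = rng.normal(1.0, 1.0, size=10)   # hypothetical model scores for positives
negative_scores = rng.normal(0.0, 1.0, size=90)   # hypothetical model scores for negatives

# Proportion of positive/negative pairs in which the positive instance is
# scored higher than the negative one (ties ignored for simplicity).
pairwise_auc = (positive_scores[:, None] > negative_scores[None, :]).mean()
print(pairwise_auc)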

[3]:
positives = 10
negatives = 90
roc_auc = 0.92
auc_positives, auc_negatives = credibility.roc_auc_preprocess(positives, negatives, roc_auc)
cutoff = 90/100
credibility.prob_below(auc_positives, auc_negatives, cutoff) * 100
[3]:
2.2701308278261356

Remember, ROC AUC is effectively a ranking metric. It may be calculated by taking an area under a curve, but it reports the percentage of combinations of positive and negative instances that the model has correctly ranked (with the positive instance ranked higher than the negative one). With 10 positive instances and 90 negative ones, that means we are computing this proportion from 10 x 90 = 900 unique combinations. If the roc_auc = 0.92, then 92% of those combinations will be correctly ranked when scored by the model (that is, with the positive instance scored higher than the negative one). We can then compute the probability that after an infinite number of samples (and therefore an infinite number of unique combinations of positive and negative instances) the proportion of correctly ranked pairs would be less than 90%.

In this case, there is a ~2.3% chance of the AUC ending up below 90% after an infinite number of samples.
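
The explanation above also suggests a back-of-the-envelope sketch of the counts involved: treat each of the 900 pairs as a “flip”. The snippet below rebuilds the calculation from those pairs directly; the exact bookkeeping inside credibility.roc_auc_preprocess may differ, so treat this as an illustration rather than the library’s implementation.

from scipy.stats import beta

positives = 10
negatives = 90
roc_auc = 0.92

pairs = positives * negatives                  # 10 * 90 = 900 pairwise comparisons
correctly_ranked = roc_auc * pairs             # 0.92 * 900 = 828 correctly ranked pairs ("heads")
incorrectly_ranked = (1 - roc_auc) * pairs     # the remaining 72 pairs ("tails")

# Same treatment as for recall: add 1 to each count and evaluate the CDF at the cutoff.
print(beta(1 + correctly_ranked, 1 + incorrectly_ranked).cdf(90 / 100) * 100)  # close to the ~2.3% above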

Credible Intervals of ROC AUC

We will now demonstrate how to compute credible intervals of ROC AUC in a similar fashion to the credible interval constructed for recall.

[4]:
# this is where we say "we want the shortest interval with a 90% chance of containing the true ROC AUC"
lower_bound, upper_bound = credibility.credible_interval(auc_positives, auc_negatives, 0.9)
print(lower_bound * 100, upper_bound * 100)
90.5853410758263 93.62928468861222

In this instance, while our best estimate of the ROC AUC was 92%, there is a 90% probability that the true ROC AUC lies between 90.6% and 93.6%.
