divergence
- Approximator(input_size, depth=3, width=None, output_size=1, linear=jax.example_libraries.stax.Dense, residual=True, activation=<identity>, rng=jax.random.PRNGKey(0))[source]
Basic neural-network-based \(\mathbb{R}^N\rightarrow\mathbb{R}^M\) function approximator.
- Parameters:
input_size (int) – Size of input dimension.
depth (int, optional) – Depth of network. Defaults to 3.
width (int, optional) – Width of network. Defaults to input_size + 1.
output_size (int, optional) – Number of outputs. Defaults to 1.
linear (optional) – Linear layer drop-in replacement. Defaults to jax.example_libraries.stax.Dense.
residual (bool, optional) – Turn on ResNet blocks. Defaults to True.
activation (optional) – A map from \((-\infty, \infty)\) to an appropriate domain (such as the domain of a convex conjugate). Defaults to the identity.
rng (optional) – Jax PRNGKey. Defaults to jax.random.PRNGKey(0).
- Returns:
initial parameter values, neural network function
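A minimal usage sketch, assuming the package is importable as supervisor.divergence and that the two returned values are, as documented, the initial parameters and the network function:

    import jax.numpy as jnp
    from jax import random
    from supervisor.divergence import Approximator

    # Build a depth-3 residual MLP mapping R^4 -> R^1 (the documented defaults).
    params, net = Approximator(input_size=4, rng=random.PRNGKey(42))

    # Apply the network to a batch of 8 four-dimensional inputs.
    # Assumed call convention: the function maps (parameters, inputs) to outputs.
    x = jnp.ones((8, 4))
    y = net(params, x)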
- NormalizedLinear(out_dim, W_init=<variance scaling initializer>, b_init=<normal initializer>)[source]
Linear layer with positive weights whose columns sum to one.
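The constraint can be pictured with plain jax.numpy. This is an independent sketch of the column normalization described above (positive weights, columns summing to one), not the layer's actual implementation:

    import jax.numpy as jnp

    # Start from an arbitrary weight matrix.
    W = jnp.array([[ 1.0, -2.0],
                   [ 3.0,  0.5],
                   [-1.0,  4.0]])

    # Make the weights positive, then rescale each column to sum to one.
    # With positive columns of unit L1 norm, the layer's L1 operator norm
    # is bounded by one (see the discussion under calc_em below).
    W_pos = jnp.abs(W)
    W_norm = W_pos / W_pos.sum(axis=0, keepdims=True)

    print(W_norm.sum(axis=0))  # ~[1. 1.]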
- arrayify(item)[source]
Convert the input to an array of at least dimension 3. If the input is a dataframe, it is converted to a list of values.
- Parameters:
item – An ndarray or a list of ndarrays, or a dataframe, a series, or a list of dataframes or series.
- Returns:
a list of dataframes/series or array of dim 3
- balanced_binary_cross_entropy(y_true, y_pred)[source]
Compute cross entropy loss while compensating for class imbalance.
- Parameters:
y_true (array) – Ground truth, binary or soft labels.
y_pred (array) – Array of model scores.
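One common way to compensate for class imbalance is to weight each class inversely to its frequency. The sketch below shows that idea in plain numpy and is only illustrative; the exact weighting convention used by balanced_binary_cross_entropy is not specified here, and the scores are assumed to be probabilities.

    import numpy as np

    def balanced_bce_sketch(y_true, y_pred, eps=1e-7):
        """Cross entropy with each class weighted inversely to its frequency.

        Illustrative only; the library's balanced_binary_cross_entropy may use
        a different weighting convention or accept logits instead of probabilities.
        """
        y_pred = np.clip(y_pred, eps, 1 - eps)
        pos_frac = np.clip(np.mean(y_true), eps, 1 - eps)
        w_pos, w_neg = 0.5 / pos_frac, 0.5 / (1 - pos_frac)
        losses = -(w_pos * y_true * np.log(y_pred)
                   + w_neg * (1 - y_true) * np.log(1 - y_pred))
        return losses.mean()

    # With balanced weights, an uninformative score of 0.5 gives log(2) loss
    # no matter how imbalanced the labels are.
    y = np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
    print(balanced_bce_sketch(y, np.full_like(y, 0.5)))  # ~0.693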
- cal_div_knn(divergence, sample_distribution_p, sample_distribution_q, bias=<function <lambda>>, num_samples=2048, categorical_columns=(), nprng=<numpy RandomState>, k=128)[source]
\(f\)-divergence from knn density estimators
- Parameters:
divergence – \(f\) that defines the \(f\)-divergence.
sample_distribution_p (list) – A numpy array, pandas dataframe, or pandas series, or a list of numpy arrays, dataframes, or series. If it is a list, samples are drawn from each element of the list. For example, given [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]] and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function simulates a system in which a latent N=3-sided die roll determines whether to draw a sample from \(\frac{p_1 + p_2 + p_3}{3}\), \(\frac{p_4 + p_5}{2}\), or \(\frac{p_6 + p_7}{2}\). The outermost list is typically a singleton.
sample_distribution_q (list) – Same format as sample_distribution_p.
bias (function) – function of the number of samples and number of nearest neighbors that compensates for expected bias of plugin estimator.
num_samples (int, optional) – Number of subsamples to take from each distribution on which to construct kd-trees and otherwise make computations. Defaults to 2048.
k (int, optional) – Number of nearest neighbors. As a rule of thumb, you should multiply this by two with every dimension past one. Defaults to 128.
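A usage sketch with numpy arrays. The divergence argument supplies the generator \(f\); here \(f(t) = \tfrac{1}{2}\vert t - 1\vert\) is passed, which generates total variation under the convention \(D_f(p\,\Vert\,q) = \int q(x)\,f\!\left(\frac{p(x)}{q(x)}\right)dx\). The exact convention the estimator assumes for \(f\) is not documented here, so treat this as an assumption.

    import numpy as np
    from supervisor.divergence import cal_div_knn

    nprng = np.random.RandomState(0)
    sample_p = nprng.normal(0.0, 1.0, size=(4096, 2))
    sample_q = nprng.normal(0.5, 1.0, size=(4096, 2))

    # f(t) = |t - 1| / 2 generates total variation under the convention
    # D_f(p||q) = E_q[f(p/q)]; the library's convention for f is assumed here.
    tv_estimate = cal_div_knn(lambda t: 0.5 * np.abs(t - 1.0),
                              sample_p, sample_q,
                              num_samples=2048, k=128, nprng=nprng)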
- calc_div_variational(data_stream, loss, model_generator=<function Approximator>, summary='')[source]
Calculate an \(f\)-divergence or integral probability metric using a variational or hybrid variational estimator.
Variational estimates will generally (but not always, thanks to the training procedure!) be a lower bound on the true value.
- Parameters:
data_stream (generator) – The data stream generator.
loss (function) – Loss function that takes as arguments the model outputs. Returns a scalar.
model_generator – A function that takes a Jax PRNGKey and the number of dimensions of the support, and returns a Jax model to be used for variational approximations. The function this model is trained to approximate is sometimes known as the witness function, especially when dealing with integral probability metrics. Specifically, model_generator returns a tuple that contains the initial parameters and a function that maps those parameters and the model inputs to the model outputs. Defaults to supervisor.divergence.Approximator().
summary (string) – Summary of divergence to appear in the docstring.
- Returns:
function for computing divergence
- calc_em(*sample_distributions, categorical_columns=(), model_generator_kwargs={}, loss_kwargs={}, nprng=None, batch_size=16, num_batches=128, num_epochs=4, effective_sample_size=None, train_test_split=0.75, step_size=0.0125)
Wasserstein-1 (Earth Mover’s) metric: the infimum over couplings \(\gamma\) of
\(\int dx\,dx^\prime\, d(x, x^\prime)\,\gamma(x, x^\prime)\),
with
\(d(x, x^\prime)=\|x - x^\prime\|_1\),
subject to the marginal constraints
\(\int dx^\prime\,\gamma(x, x^\prime) = p(x)\) and
\(\int dx\,\gamma(x, x^\prime) = q(x^\prime)\).
Via Kantorovich-Rubinstein duality, this is equivalent to
\(\sup\limits_{f \in \mathcal{F}} \vert \mathbb{E}_{x \sim p}\left[f(x)\right] - \mathbb{E}_{x^\prime \sim q}\left[f(x^\prime)\right] \vert\)
, with \(\mathcal{F} = \{f: \|f(x) - f(x^\prime)\|_1 \le \|x - x^\prime\|_1 \}\)
According to Joel Tropp’s thesis, section 4.3.1, the operator norm of a linear transformation from an \(L^1\) metric space to an \(L^1\) metric space is bounded above by the maximum of the \(L^1\) norms of its columns. This is realized by normalizing the weight columns to unit \(L^1\) norm and excluding residual connections before applying them.
- Args:
- *sample_distributions (list): Sample distributions. A numpy array, pandas dataframe, or pandas series, or a list of numpy arrays, dataframes, or series. If it is a list, samples are drawn from each element of the list. For example, given [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]] and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function simulates a system in which a latent N=3-sided die roll determines whether to draw a sample from \(\frac{p_1 + p_2 + p_3}{3}\), \(\frac{p_4 + p_5}{2}\), or \(\frac{p_6 + p_7}{2}\). The outermost list is typically a singleton.
- model_generator_kwargs (optional): Dictionary of optional kwargs to pass to model_generator. width and depth are useful. See supervisor.divergence.Approximator() for more details.
- loss_kwargs (optional): Dictionary of optional kwargs to pass to the loss function. weights is commonly used for reweighting expectations. See hybrid estimation for details.
- categorical_columns (optional): List of indices of columns which should be treated as categorical.
- nprng (optional): Numpy RandomState.
- batch_size (int): Mini-batch size. Defaults to 16.
- num_batches (int): Number of batches per epoch. Defaults to 128.
- num_epochs (int): Number of epochs to train for. Defaults to 4.
- effective_sample_size (optional): Size of subsample over which epoch losses are computed. This determines how large a sample the divergence is computed over.
- train_test_split (optional): If not None, specifies the proportion of samples devoted to training as opposed to validation. If None, no split is used. Defaults to 0.75.
- step_size (float): Step size for the Adam optimizer. Defaults to 0.0125.
- Returns:
Estimate of divergence.
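A minimal usage sketch for calc_em; calc_hl, calc_js, and calc_tv below share the same calling convention. The two positional arguments are the two sample distributions, and the nested-list form is only needed when pooling several batches per distribution:

    import numpy as np
    from supervisor.divergence import calc_em

    nprng = np.random.RandomState(0)
    sample_p = nprng.normal(0.0, 1.0, size=(4096, 3))
    sample_q = nprng.normal(1.0, 1.0, size=(4096, 3))

    # Train the variational witness for a few epochs and return the estimate.
    # model_generator_kwargs is forwarded to Approximator (width, depth, ...).
    w1_estimate = calc_em(sample_p, sample_q,
                          nprng=nprng,
                          model_generator_kwargs={"width": 8, "depth": 3},
                          num_epochs=4)
    print(w1_estimate)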
- calc_hl(*sample_distributions, categorical_columns=(), model_generator_kwargs={}, loss_kwargs={}, nprng=None, batch_size=16, num_batches=128, num_epochs=4, effective_sample_size=None, train_test_split=0.75, step_size=0.0125)
Hellinger distance calculator
\(f(x) = 1 - \sqrt{x}\)
\(f^{*}(y) = \sup\limits_x\left[xy - f(x)\right]\)
\(\frac{d}{dx}\left[xy - f(x)\right] = 0\)
\(x = \frac{1}{4y ^ 2}\)
\(f^{*}(y) = \frac{1}{2\vert y \vert} + \frac{1}{4y} - 1\)
Since the Fenchel–Moreau theorem requires the convex conjugate to be lower semicontinuous for biconjugacy to hold, we take \(y < 0\).
This in turn simplifies the expression of \(f^{*}\) to
\(f^{*}(y) = -\frac{1}{4y} - 1\)
- Args:
- *sample_distributions (list): Sample distributions. A numpy array, pandas dataframe, or pandas series, or a list of numpy arrays, dataframes, or series. If it is a list, samples are drawn from each element of the list. For example, given [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]] and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function simulates a system in which a latent N=3-sided die roll determines whether to draw a sample from \(\frac{p_1 + p_2 + p_3}{3}\), \(\frac{p_4 + p_5}{2}\), or \(\frac{p_6 + p_7}{2}\). The outermost list is typically a singleton.
- model_generator_kwargs (optional): Dictionary of optional kwargs to pass to model_generator. width and depth are useful. See supervisor.divergence.Approximator() for more details.
- loss_kwargs (optional): Dictionary of optional kwargs to pass to the loss function. weights is commonly used for reweighting expectations. See hybrid estimation for details.
- categorical_columns (optional): List of indices of columns which should be treated as categorical.
- nprng (optional): Numpy RandomState.
- batch_size (int): Mini-batch size. Defaults to 16.
- num_batches (int): Number of batches per epoch. Defaults to 128.
- num_epochs (int): Number of epochs to train for. Defaults to 4.
- effective_sample_size (optional): Size of subsample over which epoch losses are computed. This determines how large a sample the divergence is computed over.
- train_test_split (optional): If not None, specifies the proportion of samples devoted to training as opposed to validation. If None, no split is used. Defaults to 0.75.
- step_size (float): Step size for the Adam optimizer. Defaults to 0.0125.
- Returns:
Estimate of divergence.
- calc_hl_density(density_p, density_q)[source]
Hellinger distance calculated from histograms.
Hellinger distance is defined as
\(\sqrt{\frac{1}{2}\sum\limits_{x\in\mathcal{X}}\left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2}\).
- Parameters:
density_p (list) – probability mass function of p
density_q (list) – probability mass function of q
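The formula above maps directly to numpy. This independent sketch assumes density_p and density_q are aligned probability mass functions (same support order, each summing to one):

    import numpy as np

    density_p = np.array([0.5, 0.3, 0.2])
    density_q = np.array([0.4, 0.4, 0.2])

    # Hellinger distance: sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))
    hl = np.sqrt(0.5 * np.sum((np.sqrt(density_p) - np.sqrt(density_q)) ** 2))
    print(hl)  # ~0.08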
- calc_hl_mle(sample_distribution_p, sample_distribution_q)[source]
Hellinger distance calculated via histogram based density estimators.
Hellinger distance is defined as
\(\sqrt{\frac{1}{2}\sum\limits_{x\in\mathcal{X}}\left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2}\).
- Parameters:
sample_distribution_p (list) – A numpy array, pandas dataframe, or pandas series, or a list of numpy arrays, dataframes, or series. If it is a list, samples are drawn from each element of the list. For example, given [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]] and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function simulates a system in which a latent N=3-sided die roll determines whether to draw a sample from \(\frac{p_1 + p_2 + p_3}{3}\), \(\frac{p_4 + p_5}{2}\), or \(\frac{p_6 + p_7}{2}\). The outermost list is typically a singleton.
sample_distribution_q (list) – Same format as sample_distribution_p.
- calc_js(*sample_distributions, categorical_columns=(), model_generator_kwargs={}, loss_kwargs={}, nprng=None, batch_size=16, num_batches=128, num_epochs=4, effective_sample_size=None, train_test_split=0.75, step_size=0.0125)
Jensen-Shannon divergence calculator
\(f(x) = -\log_2(x)\)
\(f^{*}(y) = \sup\limits_x \left[xy - f(x)\right]\)
\(\frac{d}{dx}\left[xy - f(x)\right] = 0\)
\(x = \frac{-1}{y\log(2)}\)
\(f^{*}(y) = -\frac{\log\left(-y\log(2)\right) + 1}{\log(2)}\)
Note that the domain of this function (when assumed to be real valued) is naturally \(y < 0\).
- Args:
- *sample_distributions (list): Sample distributions. A numpy array, pandas dataframe, or pandas series, or a list of numpy arrays, dataframes, or series. If it is a list, samples are drawn from each element of the list. For example, given [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]] and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function simulates a system in which a latent N=3-sided die roll determines whether to draw a sample from \(\frac{p_1 + p_2 + p_3}{3}\), \(\frac{p_4 + p_5}{2}\), or \(\frac{p_6 + p_7}{2}\). The outermost list is typically a singleton.
- model_generator_kwargs (optional): Dictionary of optional kwargs to pass to model_generator. width and depth are useful. See supervisor.divergence.Approximator() for more details.
- loss_kwargs (optional): Dictionary of optional kwargs to pass to the loss function. weights is commonly used for reweighting expectations. See hybrid estimation for details.
- categorical_columns (optional): List of indices of columns which should be treated as categorical.
- nprng (optional): Numpy RandomState.
- batch_size (int): Mini-batch size. Defaults to 16.
- num_batches (int): Number of batches per epoch. Defaults to 128.
- num_epochs (int): Number of epochs to train for. Defaults to 4.
- effective_sample_size (optional): Size of subsample over which epoch losses are computed. This determines how large a sample the divergence is computed over.
- train_test_split (optional): If not None, specifies the proportion of samples devoted to training as opposed to validation. If None, no split is used. Defaults to 0.75.
- step_size (float): Step size for the Adam optimizer. Defaults to 0.0125.
- Returns:
Estimate of divergence.
- calc_js_density(*densities)[source]
Jensen-Shannon divergence calculated from histograms.
For two distributions, \(p\) and \(q\), defined over the same probability space, \(\mathcal{X}\), the Jensen-Shannon divergence is defined as the average of the KL divergences between each probability mass function and the average of all probability mass functions being compared. This is well defined for more than two probability masses; it is zero when all probability masses are identical, and equals 1 when they have disjoint support and the KL divergences are taken using a logarithmic base equal to the number of probability masses being compared. Typically, there will be only two probability mass functions, and the logarithmic base is therefore taken to be 2.
- Parameters:
*densities (list) – probability mass functions
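A sketch of the definition above for aligned probability mass functions, using a logarithmic base equal to the number of densities so the result lies in [0, 1]:

    import numpy as np

    def js_divergence_sketch(*densities):
        """Average KL divergence of each pmf from the mean pmf, in log base len(densities)."""
        densities = [np.asarray(d, dtype=float) for d in densities]
        m = np.mean(densities, axis=0)
        base = len(densities)
        kls = []
        for d in densities:
            mask = d > 0  # skip zero-probability terms, which contribute nothing
            kls.append(np.sum(d[mask] * np.log(d[mask] / m[mask]) / np.log(base)))
        return np.mean(kls)

    # Disjoint supports give the maximum value of 1; identical pmfs give 0.
    print(js_divergence_sketch(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 1.0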
- calc_js_mle(*sample_distributions)[source]
Jensen-Shannon divergences calculated via histogram based density estimators.
For two distributions, \(p\) and \(q\), defined over the same probability space, \(\mathcal{X}\), the Jensen-Shannon divergence is defined as the average of the KL divergences between each probability mass function and the average of all probability mass functions being compared. This is well defined for more than two probability masses; it is zero when all probability masses are identical, and equals 1 when they have disjoint support and the KL divergences are taken using a logarithmic base equal to the number of probability masses being compared. Typically, there will be only two probability mass functions, and the logarithmic base is therefore taken to be 2.
- Parameters:
*sample_distributions (list) – A numpy array, pandas dataframe, or pandas series, or a list of numpy arrays, dataframes, or series. If it is a list, samples are drawn from each element of the list. For example, given [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]] and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function simulates a system in which a latent N=3-sided die roll determines whether to draw a sample from \(\frac{p_1 + p_2 + p_3}{3}\), \(\frac{p_4 + p_5}{2}\), or \(\frac{p_6 + p_7}{2}\). The outermost list is typically a singleton.
- calc_kl_density(density_p, density_q)[source]
Kullback–Leibler (KL) divergence calculated from histograms.
For two distributions, \(p\) and \(q\), defined over the same probability space, \(\mathcal{X}\), the KL divergence is defined as
\(\sum\limits_{x\in\mathcal{X}}p(x)\log\left(\frac{p(x)}{q(x)}\right)\).
- Parameters:
density_p (list) – probability mass function of \(p\)
density_q (list) – probability mass function of \(q\)
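The same pattern for the KL formula above; note the plug-in estimate is only finite when the support of \(p\) is contained in the support of \(q\):

    import numpy as np

    density_p = np.array([0.5, 0.3, 0.2])
    density_q = np.array([0.4, 0.4, 0.2])

    # KL(p || q) = sum_x p(x) * log(p(x) / q(x)), skipping terms where p(x) = 0.
    mask = density_p > 0
    kl = np.sum(density_p[mask] * np.log(density_p[mask] / density_q[mask]))
    print(kl)  # ~0.025 nats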
- calc_kl_mle(sample_distribution_p, sample_distribution_q)[source]
Kullback–Leibler (KL) divergence calculated via histogram based density estimators.
For two distributions, \(p\) and \(q\), defined over the same probability space, \(\mathcal{X}\), the KL divergence is defined as
\(\sum\limits_{x\in\mathcal{X}}p(x)\log\left(\frac{p(x)}{q(x)}\right)\).
- Parameters:
sample_distribution_p (list) – A numpy array, pandas dataframe, or pandas series, or a list of numpy arrays, dataframes, or series. If it is a list, samples are drawn from each element of the list. For example, given [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]] and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function simulates a system in which a latent N=3-sided die roll determines whether to draw a sample from \(\frac{p_1 + p_2 + p_3}{3}\), \(\frac{p_4 + p_5}{2}\), or \(\frac{p_6 + p_7}{2}\). The outermost list is typically a singleton.
sample_distribution_q (list) – Same format as sample_distribution_p.
- calc_tv(*sample_distributions, categorical_columns=(), model_generator_kwargs={}, loss_kwargs={}, nprng=None, batch_size=16, num_batches=128, num_epochs=4, effective_sample_size=None, train_test_split=0.75, step_size=0.0125)
Total variation - \(f\)-divergence form
\(\frac{1}{2}\int dx \vert p\left(x\right) - q\left(x\right) \vert = \sup\limits_{f : \|f\|_\infty \le \frac{1}{2}} \mathbb{E}_{x \sim p}\left[f(x)\right] - \mathbb{E}_{x^\prime \sim q}\left[f(x^\prime)\right]\)
https://arxiv.org/abs/1606.00709
- Args:
- *sample_distributions (list): Sample distributions. A numpy array, pandas dataframe, or pandas series, or a list of numpy arrays, dataframes, or series. If it is a list, samples are drawn from each element of the list. For example, given [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]] and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function simulates a system in which a latent N=3-sided die roll determines whether to draw a sample from \(\frac{p_1 + p_2 + p_3}{3}\), \(\frac{p_4 + p_5}{2}\), or \(\frac{p_6 + p_7}{2}\). The outermost list is typically a singleton.
- model_generator_kwargs (optional): Dictionary of optional kwargs to pass to model_generator. width and depth are useful. See supervisor.divergence.Approximator() for more details.
- loss_kwargs (optional): Dictionary of optional kwargs to pass to the loss function. weights is commonly used for reweighting expectations. See hybrid estimation for details.
- categorical_columns (optional): List of indices of columns which should be treated as categorical.
- nprng (optional): Numpy RandomState.
- batch_size (int): Mini-batch size. Defaults to 16.
- num_batches (int): Number of batches per epoch. Defaults to 128.
- num_epochs (int): Number of epochs to train for. Defaults to 4.
- effective_sample_size (optional): Size of subsample over which epoch losses are computed. This determines how large a sample the divergence is computed over.
- train_test_split (optional): If not None, specifies the proportion of samples devoted to training as opposed to validation. If None, no split is used. Defaults to 0.75.
- step_size (float): Step size for the Adam optimizer. Defaults to 0.0125.
- Returns:
Estimate of divergence.
- calc_tv_density(density_p, density_q)[source]
Total variation calculated from histograms.
For two distributions, \(p\) and \(q\) defined over the same probability space, \(\mathcal{X}\), the total variation is defined as
\(\frac{1}{2}\sum\limits_{x\in\mathcal{X}}\vert p(x) - q(x)\vert\).
- Parameters:
density_p (list) – probability mass function of p
density_q (list) – probability mass function of q
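And the corresponding one-liner for the total variation formula above:

    import numpy as np

    density_p = np.array([0.5, 0.3, 0.2])
    density_q = np.array([0.4, 0.4, 0.2])

    # Total variation: half the L1 distance between the two pmfs.
    tv = 0.5 * np.sum(np.abs(density_p - density_q))
    print(tv)  # 0.1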
- calc_tv_knn(sample_distribution_p, sample_distribution_q, **kwargs)[source]
Total variation from knn density estimators
- Parameters:
sample_distribution_p (list) – A numpy array, pandas dataframe, or pandas series, or a list of numpy arrays, dataframes, or series. If it is a list, samples are drawn from each element of the list. For example, given [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]] and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function simulates a system in which a latent N=3-sided die roll determines whether to draw a sample from \(\frac{p_1 + p_2 + p_3}{3}\), \(\frac{p_4 + p_5}{2}\), or \(\frac{p_6 + p_7}{2}\). The outermost list is typically a singleton.
sample_distribution_q (list) – Same format as sample_distribution_p.
bias (function) – function of the number of samples and number of nearest neighbors that compensates for expected bias of plugin estimator.
num_samples (int, optional) – Number of subsamples to take from each distribution on which to construct kd-trees and otherwise make computations. Defaults to 2048.
k (int, optional) – Number of nearest neighbors. As a rule of thumb, you should multiply this by two with every dimension past one. Defaults to 128.
- calc_tv_lower_bound(log_loss)[source]
Lower bound of total variation. A model (not provided) must be trained to classify data as belonging to one of two datasets using log loss, ideally compensating for class imbalance during training. This function computes a lower bound on the total variation between the two datasets the model was trained to distinguish, using the loss from the validation set.
- Parameters:
log_loss (float) – Binary cross entropy loss with class imbalance compensated.
- calc_tv_mle(sample_distribution_p, sample_distribution_q)[source]
Total variation calculated via histogram based density estimators. All columns are assumed to be categorical.
For two distributions, \(p\) and \(q\), defined over the same probability space, \(\mathcal{X}\), the total variation is defined as
\(\frac{1}{2}\sum\limits_{x\in\mathcal{X}}\vert p(x) - q(x)\vert\).
- Parameters:
sample_distribution_p (list) – A numpy array, pandas dataframe, or pandas series, or a list of numpy arrays, dataframes, or series. If it is a list, samples are drawn from each element of the list. For example, given [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]] and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function simulates a system in which a latent N=3-sided die roll determines whether to draw a sample from \(\frac{p_1 + p_2 + p_3}{3}\), \(\frac{p_4 + p_5}{2}\), or \(\frac{p_6 + p_7}{2}\). The outermost list is typically a singleton.
sample_distribution_q (list) – Same format as sample_distribution_p.
- fdiv_data_stream(nprng, batch_size, sample_distributions, categorical_columns=())[source]
Data stream generator for f-divergence.
- Parameters:
nprng – Numpy RandomState used to generate random samples.
batch_size – Size of batch.
sample_distributions – List of lists of samples to compare for each partition of the data. For example, [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]].
categorical_columns (tuple) – List or tuple of column indices that are considered categorical.
- Returns:
The output of this function will be N samples of size batch_size, where N = len(sample_distributions). Following the example above, and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function will output a tuple of N = 3 samples of size batch_size, where the first sample is drawn from \(\frac{p_1 + p_2 + p_3}{3}\), the second from \(\frac{p_4 + p_5}{2}\), and the third from \(\frac{p_6 + p_7}{2}\).
- fdiv_loss(convex_conjugate)[source]
General template for \(f\)-divergence losses given convex conjugate.
- Parameters:
convex_conjugate – The convex conjugate of the function, \(f\).
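For example, the Hellinger conjugate derived above, \(f^{*}(y) = -\frac{1}{4y} - 1\) for \(y < 0\), could be plugged in as follows. This is a sketch that assumes the conjugate is passed as a plain callable:

    from supervisor.divergence import fdiv_loss

    # Convex conjugate of f(x) = 1 - sqrt(x), restricted to y < 0
    # (see the derivation under calc_hl above).
    def hellinger_conjugate(y):
        return -1.0 / (4.0 * y) - 1.0

    loss = fdiv_loss(hellinger_conjugate)
    # `loss` can then serve as the `loss` argument of calc_div_variational.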
- js_data_stream(nprng, batch_size, sample_distributions, categorical_columns=())[source]
Data stream generator for Jensen-Shannon divergence of N distributions. Jensen-Shannon divergence measures the information gained by knowing which of those N distributions a sample will be drawn from before it is drawn. So if we rolled a fair N-sided die to determine which distribution we will draw a sample from, JS divergence reports how many bits of information will be revealed by the die. This scenario is ultimately simulated in this function. However, in real life, we may only have examples of samples from each distribution we wish to compare. In the most general case, each distribution we wish to compare is represented by M batches of samples (with potentially different sizes) from M similar distributions whose average is most interesting. Just as we might simulate sampling from a single distribution by randomly sampling a batch of examples with replacement, we can effectively sample from an average of distributions by randomly choosing a batch (which may be representative of a single distribution), then randomly sampling elements of the chosen batch. This can ultimately be thought of as a more data-efficient means to the same end as downsampling large batch sizes.
- Parameters:
nprng – Numpy RandomState used to generate random samples.
batch_size – Size of batch.
*sample_distributions – List of lists of samples to compare. For example, given [[batch1, batch2, batch3], [batch4, batch5], [batch6, batch7]] and assuming batch1 came from distribution \(p_1\), batch2 from \(p_2\), etc., this function simulates a system in which a latent N=3-sided die roll determines whether to draw a sample from \(\frac{p_1 + p_2 + p_3}{3}\), \(\frac{p_4 + p_5}{2}\), or \(\frac{p_6 + p_7}{2}\).
categorical_columns (tuple) – List or tuple of column indices that are considered categorical.
- Returns:
The output of this function will be two samples of size batch_size, with samples, \(x\), drawn from batch_size rolls, \(z\), of our \(N\)-sided die. Following the example above, for which \(N=3\), the first of these two output samples will be of the form \((x, z)\), where \(x\) is the sample drawn and \(z\) is the die roll. The second of these two samples will be of the form \((x, z^{\prime})\), where \(x\) is the same sample as before, but \(z^\prime\) is a new set of otherwise unrelated rolls of the same \(N=3\)-sided die.