An Introduction to Sample Bias Correction in Data Science


In applied data science work, a classic problem is to draw conclusions about a population using a dataset which is not representative of that population. A common example is political polling by landline, where respondents frequently skew older and female. Depending upon how age and gender relate to the response in question, this sampling bias can skew both summary statistics like the overall support for a particular candidate (such as in the Literary Digest Debacle of 1936) and predictive models that try to separate supporters of one candidate from the other.

To draw an example from the commercial world, modelling potential customers from current customers can present a comparable issue. Many companies seek to expand their customer base to a demographic segment in which they are not currently popular. Basing a statistical analysis for the purpose of growing the business on the existing customers will produce results which describe only the existing customers, and may be misleading regarding the new desired audience. Additionally, models built on the existing data would naturally be optimised to perform best over the demographic distribution on which they were trained rather than the demographic distribution to which they will be ultimately applied.

The following figure, which will form a running example, illustrates the effect that sample bias can have on the decision boundary between two classes with predictors X1 and X2. The distribution in each figure is biased relative to the other.

[Figure: two samples of the two classes plotted over X1 and X2, each sample biased relative to the other]

In this blog post, we will discuss modelling people’s political preference for one of two candidates from a non-representative training set using the above example data. We’ll also touch on how to correct a summary statistic like the overall proportion of support for a candidate. 

Training a Classifier on Biased Data

Suppose that for everyone in our database we have data on two variables X1 and X2. We want to create a model to estimate preference for Candidate A over Candidate B using X1 and X2. Additionally, the model should perform well on a particular group of interest, such as likely voters.

To create this model we need to obtain candidate preferences for a number of people so that we can estimate the relationship between X1, X2, and candidate preference. The dataset that we use to fit this model will be called the training set. We will refer to the population of interest as the test set.

If we could directly get candidate preferences for a representative sample of the test population then this would be easy, but unfortunately that is often not practical. Perhaps people in the test set don’t respond well to surveys, or maybe the survey was already completed before we knew that we were particularly interested in this group of people. Either way, we will need to find a way to construct a classifier on the non-representative training set that performs well on the test set.

The plot below again shows the training and test sets, and adds the classifier we want to learn. The labels for the test set would be unknown in practice. 

[Figure: the training and test sets, with the classifier we want to learn]


We can see that the differences between these datasets make the classifier that’s best on the training set perform suboptimally on the test set. What can be done about this?

A common approach is to modify the training of the classifier, using individual sample weights that alter the penalty imposed by the loss function when samples are misclassified. The classifier will suffer an increased penalty when up-weighted samples are misclassified and a lower penalty when down-weighted samples are misclassified. In this manner, the classifier will ‘see’ the training set as if it were properly distributed over X1 and X2, and learn a classification scheme which is optimal for the test set distribution. 
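
As a rough illustration of what sample weights do to the loss, the sketch below computes a weighted log loss in plain Python. The function name and the toy numbers are illustrative assumptions rather than part of the analysis in this post; many library classifiers expose the same idea through a per-sample weight argument at fit time.

```python
import math

def weighted_log_loss(y_true, p_pred, weights):
    """Weighted average of per-sample log losses (labels are 0 or 1)."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)

# Toy data (hypothetical): the third sample is badly predicted.
y_true = [1, 0, 1]
p_pred = [0.9, 0.2, 0.3]

uniform = weighted_log_loss(y_true, p_pred, [1.0, 1.0, 1.0])
upweighted = weighted_log_loss(y_true, p_pred, [1.0, 1.0, 3.0])
```

Up-weighting the misclassified sample raises the loss, so a classifier minimising the weighted loss has more incentive to get that region of covariate space right.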

Perhaps the most straightforward approach to this problem is to separately estimate the joint covariate distributions for the test set and the training set, and then assign each training point in covariate space a weight given by the ratio of its test set density to its training set density.

So how do we do this here? In general, it usually works best to model the joint distributions using a nonparametric approach, and below we will discuss the Iterative Proportional Fitting Procedure (IPFP) technique for this purpose. However, our simple example lends itself to modelling the covariates of the training and test sets as bivariate Gaussians, so we just need to estimate the mean vector and covariance matrix of each. Each training point’s weight is then the ratio of the fitted test density to the fitted training density at that point, and we refit the classifier on the training set with these weights. The result is in the next figure.

[Figure: decision boundary of the classifier refit on the training set with the estimated weights]

The outcome is that the decision boundary is far closer to the ideal boundary for the test classifier than where we started.
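
The Gaussian weighting used in this section can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the data are synthetic stand-ins rather than the dataset shown in the figures.

```python
import math
import random

def fit_gaussian_2d(points):
    """Return the sample mean and covariance entries (s11, s12, s22) of 2-D points."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s11 = sum((p[0] - m1) ** 2 for p in points) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in points) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / (n - 1)
    return (m1, m2), (s11, s12, s22)

def gaussian_pdf_2d(x, mean, cov):
    """Density of a bivariate Gaussian, using the closed-form 2x2 inverse."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 ** 2
    d1, d2 = x[0] - mean[0], x[1] - mean[1]
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def density_ratio_weights(train, test):
    """Weight each training point by fitted test density / fitted training density."""
    train_fit = fit_gaussian_2d(train)
    test_fit = fit_gaussian_2d(test)
    return [gaussian_pdf_2d(x, *test_fit) / gaussian_pdf_2d(x, *train_fit)
            for x in train]

# Synthetic stand-in data (an assumption): the test covariates are shifted.
rng = random.Random(0)
train = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]
test = [(rng.gauss(1.0, 1.0), rng.gauss(0.5, 1.0)) for _ in range(500)]
weights = density_ratio_weights(train, test)
```

The resulting weights would then be passed to whatever weighted fitting routine the classifier provides; training points that look more like the test set receive weights above one.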

Estimating Weights with IPFP

It won’t be true in general that the covariate distributions can be well-approximated through a simple parametric model, so in the more general case the binning-based Iterative Proportional Fitting Procedure (IPFP) can be used to arrive at weights.

IPFP is an iterative procedure in which every sample is first assigned an initial weight close to one. We then cycle over lower-dimensional marginal (binned) distributions. Within each marginal, we calculate the sample weights that would be necessary for each bin to have exactly the same probability in the weighted training set as in the test set. We then update each sample’s weight in proportion to the difference between its current weight and the optimal weight for this marginal, according to some learning rate. We iterate until an error tolerance condition on the difference between the population distribution and the weighted sample distribution is met.
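
The procedure can be sketched roughly as follows for one-dimensional binned marginals. The function name, argument layout, and toy bins are assumptions made for illustration; with rate=1.0 each step fully matches the current marginal, which is the classical form of the update, and the sketch assumes every sample's bin appears in the targets.

```python
def ipfp_weights(samples, targets, rate=1.0, tol=1e-6, max_iter=1000):
    """Adjust sample weights so weighted bin shares match target marginals.

    samples: list of dicts mapping feature name -> bin label
    targets: dict mapping feature name -> {bin label: target proportion}
    """
    weights = [1.0] * len(samples)
    for _ in range(max_iter):
        worst_gap = 0.0
        for feature, target in targets.items():
            total = sum(weights)
            # Current weighted share of each bin for this marginal.
            share = {b: 0.0 for b in target}
            for s, w in zip(samples, weights):
                share[s[feature]] += w / total
            gap = max(abs(share[b] - target[b]) for b in target)
            worst_gap = max(worst_gap, gap)
            # Move each weight towards the value that would match this marginal.
            for i, s in enumerate(samples):
                b = s[feature]
                optimal = weights[i] * target[b] / share[b]
                weights[i] += rate * (optimal - weights[i])
        if worst_gap < tol:
            break
    return weights

# Toy sample (hypothetical) that skews older and female.
samples = (
    [{"age": "young", "sex": "f"}] * 10
    + [{"age": "young", "sex": "m"}] * 10
    + [{"age": "old", "sex": "f"}] * 50
    + [{"age": "old", "sex": "m"}] * 30
)
targets = {"age": {"young": 0.5, "old": 0.5}, "sex": {"f": 0.5, "m": 0.5}}
weights = ipfp_weights(samples, targets)
```

Only the marginals are matched here, not the full joint distribution, which is what keeps the procedure cheap.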

This algorithm has in its favour extreme simplicity, numerical stability and a long history of publications demonstrating its algorithmic properties, such as conditions for convergence. Nevertheless, authors have erred on the side of modesty, restricting themselves to corrections over a small number of features.

Perhaps that is wise. An intrepid data scientist who uses IPFP to correct over a large feature set will be implicitly dealing with a very noisy high-dimensional PDF and will find their diligence punished by an enormous penalty in standard error. The data scientist may want to try limiting the noise through a dimensionality reduction algorithm, arriving at approximately two orthogonal feature embeddings to correct over.


Pitfalls of Bias Correction with Weights

Deriving explicit weights has its limitations:

  • High-dimensional joint density estimation is a very hard problem in general and can introduce a lot of uncertainty into the point estimates unless we restrict ourselves to considering only a few features (but if we have to first select those features then we still have added uncertainty). 
  • It may not be very natural to handle a mix of continuous and discrete variables without discretizing the continuous variables.
  • It is possible to derive extremely large weights that give certain observations too much influence, severely harming the effective sample size and causing an explosion in model variance. We find that better results are often obtained by shrinking weights towards uniformity.
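
To make the last point concrete, the sketch below computes Kish's approximate effective sample size and applies one simple form of shrinkage. The linear blend towards the mean weight is an assumption chosen for illustration, not necessarily the shrinkage used for the results in this post.

```python
def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def shrink_weights(weights, lam):
    """Blend each weight with the mean weight.

    lam=0 leaves the weights unchanged; lam=1 makes them uniform.
    """
    mean_w = sum(weights) / len(weights)
    return [(1.0 - lam) * w + lam * mean_w for w in weights]

# Hypothetical weights in which one observation dominates.
weights = [0.2, 0.5, 1.0, 1.3, 12.0]
shrunk = shrink_weights(weights, lam=0.5)

ess_before = effective_sample_size(weights)
ess_after = effective_sample_size(shrunk)
```

Shrinking trades some residual bias for a larger effective sample size; the blend keeps the total weight unchanged.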

Finally, when all we care about is the ratio of two densities, it seems inefficient to separately estimate two joint densities as a precursor. Indeed, there are other techniques such as kernel mean matching that directly approximate the quantity of interest.


Necessary Assumptions for Bias Correction

More generally, when does correction work and when doesn’t it work? There are two assumptions that need to mostly hold for our approaches to be successful. 

  1. Nested supports: every element of the population of interest needs to have a non-zero probability of appearing in the training set.
  2. Covariate shift: conditioned upon all of our covariates, the responses are independent of the sampling bias.

Nested supports are necessary for our weights to be well-defined, because otherwise the denominator in our ratio of densities will be zero in some places. Covariate shift is more subtle: in our example, if sampling were biased by X1 and X2 but not by the actual response, then this assumption would be satisfied. Because the response may also be correlated with X1 and X2, we may find that the response is marginally associated with the sampling bias, but covariate shift is only violated if the sampling bias is correlated with the response even after we’ve controlled for the predictors. This assumption is important, because otherwise the optimal correcting weights would require knowledge of the test set labels, which we don’t have.
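
One way to see why covariate shift removes the need for test set labels is to write out the ideal weight on a labelled point. The notation is an assumption introduced here: x stands for the covariates (X1, X2), y for the response, and p_train and p_test for the training and test distributions. Under covariate shift the conditional distribution of y given x is the same in both, so it cancels:

```latex
\frac{p_{\mathrm{test}}(x, y)}{p_{\mathrm{train}}(x, y)}
  = \frac{p_{\mathrm{test}}(y \mid x)\, p_{\mathrm{test}}(x)}
         {p_{\mathrm{train}}(y \mid x)\, p_{\mathrm{train}}(x)}
  = \frac{p_{\mathrm{test}}(x)}{p_{\mathrm{train}}(x)}
```

The ratio on the right involves only the covariates, which we observe in both sets. Nested supports are what keep its denominator away from zero wherever the test density is positive.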


Correcting a Statistic

What if instead of a corrected classifier, we just want to correct our estimate of the proportion of Candidate A supporters? In the survey sampling literature, a popular technique for this is multilevel regression and poststratification (MRP). If we were to use this technique, we would first discretize our sample bias correction features to form a grid. We would then use a multilevel logistic regression to assign a modelled level of support to each cell of the grid. The final result would come from an average of these support levels, weighted according to each cell’s probability under the test distribution. In the standard MRP setting, the test distribution must be externally known, because no individual observations are available from the population of interest. Often, the external distribution is derived from a census.
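
The final poststratification step is just a weighted average over cells, which the sketch below illustrates. The cell labels and numbers are hypothetical, and the multilevel regression that would produce the per-cell support levels is not shown; they are taken as given.

```python
def poststratify(cell_support, cell_share):
    """Average modelled support over cells, weighted by each cell's population share."""
    total_share = sum(cell_share.values())
    weighted = sum(cell_support[c] * cell_share[c] for c in cell_share)
    return weighted / total_share

# Hypothetical modelled support for Candidate A in each cell of the grid.
cell_support = {
    ("young", "f"): 0.60,
    ("young", "m"): 0.50,
    ("old", "f"): 0.40,
    ("old", "m"): 0.30,
}
# Hypothetical cell probabilities under the test distribution (e.g. from a census).
cell_share = {
    ("young", "f"): 0.30,
    ("young", "m"): 0.30,
    ("old", "f"): 0.20,
    ("old", "m"): 0.20,
}
estimate = poststratify(cell_support, cell_share)
```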

In our case, however, we have individual-level data from the population of interest. This allows us to present a slightly modified approach, which does not depend upon sourcing population marginal or joint distributions and does not require binning of the correction features.

Under covariate shift, the conditional probability that a person prefers Candidate A over B given their values of X1 and X2 does not depend on the sampling bias. We can therefore estimate these probabilities using an uncorrected logistic regression on the training set. Now we can take a large number of samples with replacement from the test set, compute the predicted conditional probability for each, and then average them. Provided that our test set is sizeable enough that these bootstrapped samples closely follow the true underlying distribution, this average converges to the marginal probability of support for Candidate A in the test set. This relies upon the additional assumption that the logistic regression predictions are consistent. This consistency assumption is extremely unlikely to fully hold in practice, but if the model performs decently then this will provide a nice and simple way to approximately correct this statistic, without needing any external distribution information.
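
A minimal sketch of this estimator is given below, in standard-library Python. Everything in it is an assumption for illustration: the data are simulated so that covariate shift holds by construction, the logistic regression is fit by plain gradient descent, and the predictions are averaged directly over the test rows, which is what the bootstrap average approaches as the number of resamples grows.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, n_iter=400):
    """Unweighted logistic regression (intercept + 2 slopes) by batch gradient descent."""
    b = [0.0, 0.0, 0.0]
    n = len(X)
    for _ in range(n_iter):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), label in zip(X, y):
            err = sigmoid(b[0] + b[1] * x1 + b[2] * x2) - label
            grad[0] += err
            grad[1] += err * x1
            grad[2] += err * x2
        b = [bj - lr * g / n for bj, g in zip(b, grad)]
    return b

def mean_predicted_support(b, X):
    """Average the predicted probabilities over a set of covariate rows."""
    return sum(sigmoid(b[0] + b[1] * x1 + b[2] * x2) for x1, x2 in X) / len(X)

def simulate(rng, mean1, n):
    """Simulated data: the same P(y | x) everywhere, only the X1 mean differs."""
    X = [(rng.gauss(mean1, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    y = [1 if rng.random() < sigmoid(1.5 * x1 - 0.5 * x2) else 0 for x1, x2 in X]
    return X, y

rng = random.Random(1)
X_train, y_train = simulate(rng, -0.7, 1000)  # biased training sample
X_test, y_test = simulate(rng, 0.0, 1000)     # labels used only to check the result

coef = fit_logistic(X_train, y_train)
naive = sum(y_train) / len(y_train)            # uncorrected training proportion
corrected = mean_predicted_support(coef, X_test)
```

In practice the test labels would be unavailable; they are simulated here only so that the corrected estimate can be compared against the uncorrected training proportion.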

Looking at our example, the sample proportion of Candidate A supporters is 49.9% in the test set, but only 36.8% in the training set. Using the above approximation, we get a point estimate of 50.7%, a marked improvement.

Note that we have not touched on how to compute correct standard errors. Frequently, the required covariance computations become intractable with estimated weights because the weights cannot sensibly be assumed to be independent. Indeed, the need for proper inference is a big part of the reason for the difference between the approaches in the machine learning community, such as kernel mean matching, and approaches like MRP that are used in a more inferential survey sampling context.

In the next post we will take a deeper look at sample bias correction techniques, such as kernel mean matching, and show how they all appeal to the same fundamental correcting factor: the Radon-Nikodym derivative.