
Imagine that every month, I have a "virgin" data set consisting of data points (e.g. people that have stopped paying a subscription) with certain features (e.g. geo-demographic information and payment history for a subscription) and which have been clustered into, say, 3 groups.

To this virgin data set I apply a model (e.g. logistic regression) that gives each data point a score estimating "success" (e.g. the probability that a person will renew the subscription if I send them a personalised email). An allocation algorithm then chooses, say, the top 950 people by score to receive personalised emails. I also choose 50 data points at random from the "virgin" data set (i.e. ignoring the scores) to receive personalised emails. I then collect data on how successful those emails were at getting people to renew the subscription. I do this every month for a year.
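To make the monthly procedure concrete, here is a minimal sketch of the allocation step. The function name `allocate` and its parameters are hypothetical, and it assumes the random group is drawn first and excluded from the top-scored group so that nobody is emailed twice; my actual process may differ in that detail.

```python
import random

def allocate(scores, n_top=950, n_random=50, seed=0):
    """Split one month's population into a model-selected and a random group.

    scores: list of model scores, one per person (list index = person id).
    Returns (model_ids, random_ids).
    """
    rng = random.Random(seed)
    ids = list(range(len(scores)))
    # Random group: drawn uniformly from the whole population, ignoring scores.
    random_ids = rng.sample(ids, n_random)
    taken = set(random_ids)
    # Model group: the highest-scoring people not already in the random group
    # (assumption: the two groups are kept disjoint).
    remaining = [i for i in ids if i not in taken]
    remaining.sort(key=lambda i: scores[i], reverse=True)
    model_ids = remaining[:n_top]
    return model_ids, random_ids
```

Repeating this for 12 months gives the 12 × 950 = 11,400 "model" rows and 12 × 50 = 600 "random" rows described in the next paragraph.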

At the end of the year, I have two new data sets: a "model" data set (with 11,400 data points) and a "random" data set (with 600 data points); both of these have the original features and, in addition, the information on the outcome of receiving a personalised email.

The "random" data set has fewer data points than the "model" data set, but it has the same distribution as the "virgin" data set (e.g. if the 3 clusters make up 30%, 30% and 40% of the "virgin" data, the same holds for the "random" data set, up to sampling noise, since it is a random sample). The "random" data set also gives an unbiased estimate of the average true effect of the personalised email. The "model" data set is much larger, but it is biased in two ways: the scoring model and allocation algorithm favour both the best-performing clusters and the best-performing members within each cluster. This would mean that:

- if, say, the best-performing cluster is number 1, it would make up, say, 70% of the data points in the "model" data set, as opposed to 30% in the "virgin" data set; and
- if, say, people in cluster 1 have a "true" average renewal probability of 2% after receiving a personalised email, as the "random" data set would show, then in the "model" data set they would show a renewal probability of, e.g., 12%.

**My question is: is it possible to reconstruct an unbiased data set from the "model" data set, somehow using the "random" data set?**

One way I can think of is "brute force": compare the distribution of outcomes in each cluster between the "model" and the "random" data sets, and assign an importance weight to each data point in the "model" data set, possibly combined with over- and undersampling. The aim would be to recover both the "correct" distribution of outcomes and the "correct" distribution of clusters.
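As a rough sketch of the weighting idea, the snippet below computes one weight per cluster as the cluster's share in the "random" data set divided by its share in the "model" data set. The function names `cluster_weights` and `weighted_rate` are hypothetical. Note that this only corrects the cluster mix; it does nothing about the selection of the best members within each cluster, which is the harder part of my question.

```python
from collections import Counter

def cluster_weights(model_clusters, random_clusters):
    """Per-cluster weight = share in the random sample / share in the model sample."""
    m = Counter(model_clusters)
    r = Counter(random_clusters)
    n_m, n_r = len(model_clusters), len(random_clusters)
    # Clusters that never appear in the model data cannot be reweighted
    # and are left out.
    return {c: (r[c] / n_r) / (m[c] / n_m) for c in m}

def weighted_rate(outcomes, clusters, weights):
    """Weighted mean of 0/1 outcomes, using each row's cluster weight."""
    w = [weights[c] for c in clusters]
    return sum(wi * y for wi, y in zip(w, outcomes)) / sum(w)
```

With the example figures above (cluster 1 at 70% in the "model" data and 30% in the "random" data), every cluster 1 row would get a weight of 0.3 / 0.7, so the reweighted "model" data would again contain 30% cluster 1.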

**Are there any established methods, statistical or otherwise, for doing this?** Thanks for any tips!