Statistical Test for Two Populations in a Before/After, Control/Test Design

We often run into a problem where we run an A/B test over a population of users, but the control and test populations sometimes seem to differ even before the test starts.
So let's say I have the KPI per group (control and test) over multiple days. How would you test for the significance of the treatment?

This is like a two-way design, because I have both a before/after factor (relative to the test start) and a control/test factor.

The data can be at the user level or aggregated.

The data looks like this: kpi_group(t), where kpi is the KPI value, group is control or test, and t is the day; t<start is before the test and t>start is after it.
In some cases the before and after observations come from the same users, so maybe a paired test is appropriate?

Currently what I do is:
1. If the data is paired (same users): I compute the per-user difference avg_kpi(after)-avg_kpi(before), and then use a Mann–Whitney U test to check whether the differences in test are larger than in control.
2. If the data is not paired: I compute the daily difference between test and control, kpi_test(t)-kpi_control(t), and then run a Mann–Whitney U test comparing the after days against the before days.
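
To make the two procedures above concrete, here is a minimal Python sketch using scipy.stats.mannwhitneyu. The function names, the array layout, and the convention that `start` is the index of the first treatment day are assumptions made for illustration. The one-sided alternative in the paired case follows "larger than in control"; the two-sided alternative in the unpaired case is a guess, since no direction is stated there.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def paired_user_test(before_test, after_test, before_control, after_control):
    """Case 1: the same users are observed before and after.

    Each argument is a 1-D array of per-user average KPI. Within a group,
    the before and after arrays are assumed to be aligned by user.
    """
    diff_test = np.asarray(after_test) - np.asarray(before_test)
    diff_control = np.asarray(after_control) - np.asarray(before_control)
    # One-sided: are the per-user changes in test larger than in control?
    return mannwhitneyu(diff_test, diff_control, alternative="greater")


def daily_gap_test(kpi_test, kpi_control, start):
    """Case 2: unpaired data, one aggregated KPI value per group per day.

    kpi_test and kpi_control are 1-D arrays indexed by day; `start` is the
    (assumed) index of the first day on which the treatment is active.
    """
    gap = np.asarray(kpi_test) - np.asarray(kpi_control)
    # Compare the daily test-minus-control gap after the start vs. before it.
    return mannwhitneyu(gap[start:], gap[:start], alternative="two-sided")
```

Both functions return SciPy's result object, so the p-value is available as `.pvalue` and the U statistic as `.statistic`.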

Maybe a Bayesian test is appropriate? (Though I hate the fact that I need to come up with arbitrary priors.)

See the attached screenshot, which shows that a Wilcoxon test on the period before the test starts already shows a significant difference between control and test.
Screen Shot 2019-07-23 at 17.25.07.png

