I have a question but maybe it's the wrong question so I'll state the task first...
I want to make data that looks like the data I'm working with without actually being the data itself. So I want to preserve as much structure as possible and generate an n-row data set with similar correlations between variables and similar distributional shapes in each column. Let's say the data I have is the data frame built in the first code block below (in R).
Is there a way to figure out the distribution and parameters of each column in order to generate a new, similar data set? I was thinking you could use the Kolmogorov-Smirnov test to compare each column against ten or so common distributions and select the one with the highest p-value. But I realized I'd have to know the parameters of each distribution in advance.
(The ks.test() calls I tried, and the output they yield, are in the code blocks below.)
So I want to create data that looks similar to the data I already have, with a similar correlation matrix and similar distributions.
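To make "similar" a bit more concrete, here is a rough sketch of one way to score a synthetic data set against the real one. The function name `compare_datasets` and the toy data frames are made up for illustration; the choice of the largest correlation gap plus a per-column two-sample KS statistic is just one reasonable pair of summaries, not a standard.

```r
# Hypothetical helper: summarise how close a synthetic data set is to the real one
compare_datasets <- function(real, synth) {
  stopifnot(identical(names(real), names(synth)))
  # Largest absolute difference between the two correlation matrices
  cor_gap <- max(abs(cor(real) - cor(synth)))
  # Two-sample KS statistic per column: near 0 means similar ECDFs, near 1 very different
  ks_d <- sapply(names(real), function(v) {
    unname(suppressWarnings(ks.test(real[[v]], synth[[v]]))$statistic)
  })
  list(max_cor_gap = cor_gap, ks_statistic = ks_d)
}

set.seed(1)
real       <- data.frame(a = rnorm(200), b = runif(200))
synth_good <- data.frame(a = rnorm(200), b = runif(200))  # same shapes as real
synth_bad  <- data.frame(a = rnorm(200), b = rnorm(200))  # column b has the wrong shape

res_good <- compare_datasets(real, synth_good)
res_bad  <- compare_datasets(real, synth_bad)
res_good$ks_statistic
res_bad$ks_statistic
```

The KS statistic for column `b` should come out clearly larger for `synth_bad`, since a normal sample is being compared with a uniform one.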
If it were all normal, generating similar data would be pretty easy using something like the mvrnormR function shown in the last code block.
But it'd be better if we could mimic something like a uniform or Poisson distribution more closely, if that's what the data is actually shaped like.
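One possible way to get non-normal shapes without naming a distribution for each column is a Gaussian copula with empirical marginals: draw correlated normals, turn them into correlated uniforms with `pnorm()`, then push each uniform through that column's empirical quantile function. This is a swapped-in technique, not something from the question itself. The function name `simulate_like` and the three stand-in columns are hypothetical, and the sketch assumes the adjusted correlation matrix is positive definite (otherwise `chol()` will fail).

```r
set.seed(10)
# Small stand-in for the real data (hypothetical columns)
dat <- data.frame(
  counts = rpois(100, 10),
  unif   = runif(100),
  skewed = rchisq(100, 10)
)
dat$skewed <- dat$skewed + 0.5 * dat$counts  # induce some correlation

simulate_like <- function(dat, n) {
  p <- ncol(dat)
  # Spearman correlation survives the monotone transforms below;
  # 2 * sin(pi * rho / 6) converts it to the matching normal correlation
  rho <- 2 * sin(pi * cor(dat, method = "spearman") / 6)
  # Step 1: correlated standard normals via the Cholesky factor
  z <- matrix(rnorm(n * p), ncol = p) %*% chol(rho)
  # Step 2: correlated uniforms on (0, 1)
  u <- pnorm(z)
  # Step 3: empirical quantiles; type = 1 returns observed values only,
  # so count columns stay integer-valued
  out <- lapply(seq_len(p), function(j) {
    quantile(dat[[j]], probs = u[, j], type = 1, names = FALSE)
  })
  names(out) <- names(dat)
  as.data.frame(out)
}

sim <- simulate_like(dat, 500)
round(cor(dat, method = "spearman"), 2)
round(cor(sim, method = "spearman"), 2)
```

With `type = 1` every simulated value is a value that occurs in the original column (only the row combinations are new). If that is too close to the real data for continuous columns, an interpolating quantile type such as `type = 7` may be a better fit there.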
Code:
set.seed(10)  # fixed seed so the example is reproducible
# Eight columns of 100 draws each, from deliberately different distributions
dat <- data.frame(
  pois_10     = rpois(100, 10),
  binom_5_.2  = rbinom(100, 5, .2),
  binom_1_.2  = rbinom(100, 1, .2),
  runif_0_1   = runif(100),
  chisq_30    = rchisq(100, 30),
  chisq_10    = rchisq(100, 10),
  logistic_0  = rlogis(100),
  logistic_10 = rlogis(100, 10)
)
head(dat)
  pois_10 binom_5_.2 binom_1_.2 runif_0_1 chisq_30  chisq_10 logistic_0 logistic_10
1      10          0          1 0.3791907 30.72111 11.386525 -0.0951142    9.329344
2       9          2          1 0.9144744 50.45788 16.955645  3.9631567    9.288006
3       5          1          0 0.4774175 41.51492  8.591872 -1.1165473    7.216612
4       8          1          1 0.2141185 23.79510 15.054063  4.1400485    6.733500
5       9          0          1 0.7683779 25.35576  8.666049 -2.5407420    9.550825
6      10          2          1 0.9273926 49.16303  5.554433 -1.4603903   12.853794
Code:
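x <- rnorm(500)  # standard normal sample
y <- runif(500)  # uniform(0, 1) sample
# One-sample KS tests against fully specified reference distributions
ks.test(x, "pnorm")  # normal data vs. N(0, 1)
ks.test(y, "pnorm")  # uniform data vs. N(0, 1)
ks.test(y, "punif")  # uniform data vs. U(0, 1)
ks.test(x, "punif")  # normal data vs. U(0, 1)
ks.test(x, "pt", 4)  # normal data vs. t with 4 degrees of freedom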
Code:
> ks.test(x, "pnorm")
One-sample Kolmogorov-Smirnov test
data: x
D = 0.039414, p-value = 0.419
alternative hypothesis: two-sided
> ks.test(y, "pnorm")
One-sample Kolmogorov-Smirnov test
data: y
D = 0.50147, p-value < 0.00000000000000022
alternative hypothesis: two-sided
> ks.test(y, "punif")
One-sample Kolmogorov-Smirnov test
data: y
D = 0.031638, p-value = 0.6988
alternative hypothesis: two-sided
> ks.test(x, "punif")
One-sample Kolmogorov-Smirnov test
data: x
D = 0.476, p-value < 0.00000000000000022
alternative hypothesis: two-sided
>
> ks.test(x, "pt", 4)
One-sample Kolmogorov-Smirnov test
data: x
D = 0.055559, p-value = 0.09129
alternative hypothesis: two-sided
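On the worry about needing the parameters in advance: one option is to estimate them by maximum likelihood for each candidate family and then rank the candidates, for example by AIC rather than by KS p-value (a KS p-value is no longer trustworthy when the parameters were estimated from the same sample). The sketch below uses `fitdistr()` from the MASS package; the sample `x` and the three candidate families are assumptions picked for illustration, and this only covers a single positive-valued column.

```r
library(MASS)  # fitdistr() comes with the recommended MASS package

set.seed(1)
x <- rchisq(200, 10)  # hypothetical positive-valued column

# Candidate families, all supported by name in fitdistr()
candidates <- c("normal", "gamma", "lognormal")

# Maximum-likelihood fit for each candidate
fits <- lapply(candidates, function(d) suppressWarnings(fitdistr(x, d)))
names(fits) <- candidates

# Lower AIC means a better trade-off between fit and number of parameters
aic <- sapply(fits, AIC)
best <- names(which.min(aic))

aic
fits[[best]]$estimate  # estimated parameters of the preferred family
```

The fitted parameters can then be passed to the matching `r*` function (for example `rgamma()` with the estimated shape and rate) to draw new values for that column.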
Code:
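# Draw n rows from a multivariate normal with mean vector mu and covariance matrix sigma
mvrnormR <- function(n, mu, sigma) {
  ncols <- ncol(sigma)
  mu <- rep(mu, each = n)  ## not obliged to use a matrix (recycling)
  # chol() gives upper-triangular R with t(R) %*% R == sigma, so Z %*% R has covariance sigma
  mu + matrix(rnorm(n * ncols), ncol = ncols) %*% chol(sigma)
}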
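As a rough illustration of how the function above might be used, and of why the all-normal route falls short, the sketch below plugs in the sample means and covariance matrix of a small stand-in data frame. The two columns are a hypothetical subset, not the full example data, and `mvrnormR` is copied in only so the snippet runs on its own.

```r
# Copy of mvrnormR from above, repeated so this snippet is self-contained
mvrnormR <- function(n, mu, sigma) {
  ncols <- ncol(sigma)
  mu <- rep(mu, each = n)
  mu + matrix(rnorm(n * ncols), ncol = ncols) %*% chol(sigma)
}

set.seed(10)
# Stand-in data: one count column and one uniform column
dat <- data.frame(pois_10 = rpois(100, 10), runif_0_1 = runif(100))

# Use the sample mean vector and covariance matrix of the real data
sim <- mvrnormR(1000, colMeans(dat), cov(dat))
colnames(sim) <- names(dat)

round(colMeans(sim) - colMeans(dat), 2)  # means should be close
round(cov(sim) - cov(dat), 2)            # covariances should be close

# The shapes are not preserved: counts are no longer integers,
# and the "uniform" column can fall outside [0, 1]
all(sim[, "pois_10"] == round(sim[, "pois_10"]))
range(sim[, "runif_0_1"])
```

So the first two moments carry over, but discreteness, bounds, and skewness do not, which is the gap the question is asking about.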