Simulate a normal data set \(S = (X, X_{cat})\) that includes categorical variables.

Creates a toy data set \(S = (X, X_{cat})\). Column \(i\) of \(X\) is sampled independently from a Gaussian distribution with mean \(\mu_i\) and standard deviation \(\sigma_i\), i.e. \(N(\mu_i, \sigma_i^2)\). Each column of \(X_{cat}\) is categorical, with entries sampled with replacement from a given number of categories (indexed by integers). The resulting data set has dimension \(n \times (p_1 + p_2)\), where \(n\) is the number of data points, \(p_1\) is the number of columns in \(X\), and \(p_2\) is the number of columns in \(X_{cat}\).
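
The sampling scheme described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the name `simulate_mixed` is hypothetical, categories are assumed to be drawn with equal probability, and the result is a plain data frame rather than a tibble.

```r
# Hypothetical stand-in for illustration; not the source of generate_X_cat().
simulate_mixed <- function(n, mu, sigma, no_of_cat) {
  # Assumed checks: one standard deviation per mean, none negative.
  stopifnot(length(mu) == length(sigma), all(sigma >= 0))

  # Numeric block X: column i is drawn from N(mu[i], sigma[i]^2).
  num_cols <- lapply(seq_along(mu), function(i) {
    rnorm(n, mean = mu[i], sd = sigma[i])
  })

  # Categorical block X_cat: integer labels 1..k sampled with replacement
  # (equal probabilities are an assumption), stored as factors.
  cat_cols <- lapply(no_of_cat, function(k) {
    factor(sample.int(k, size = n, replace = TRUE), levels = seq_len(k))
  })

  # Bind both blocks and name the columns X1, X2, ... as in the printed output.
  cols <- c(num_cols, cat_cols)
  names(cols) <- paste0("X", seq_along(cols))
  as.data.frame(cols)
}

set.seed(1)
S <- simulate_mixed(n = 40, mu = 1:6, sigma = rep(1, 6), no_of_cat = c(2, 3, 5))
dim(S)  # 40 9
```

Fixing the factor levels to `seq_len(k)` keeps every category as a level even when a small sample happens to miss one of them.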

generate_X_cat(
  n = 100,
  mu = rep(0, 10),
  sigma = rep(1, 10),
  no_of_cat = c(4, 5)
)

Arguments

n	The desired number of data points in the data set.
mu	A \(p_1\)-dimensional vector of means \(\mu\), one entry per column of \(X\).
sigma	A \(p_1\)-dimensional vector of non-negative standard deviations \(\sigma\), one entry per column of \(X\).
no_of_cat	A \(p_2\)-dimensional vector whose entries give the number of categories for each column of \(X_{cat}\).
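
The argument lengths fix the shape of the result: \(p_1\) is the length of `mu` (and of `sigma`), and \(p_2\) is the length of `no_of_cat`. A small sketch of that bookkeeping follows; `expected_dim` is a hypothetical helper introduced here for illustration and is not part of the documented package.

```r
# Hypothetical helper: derive the expected dimensions from the arguments.
expected_dim <- function(n, mu, sigma, no_of_cat) {
  # Assumed requirement: mu and sigma describe the same p1 columns.
  stopifnot(length(mu) == length(sigma))
  p1 <- length(mu)         # number of Gaussian columns in X
  p2 <- length(no_of_cat)  # number of categorical columns in X_cat
  c(rows = n, cols = p1 + p2)
}

# Default arguments: 100 rows and 10 + 2 = 12 columns.
expected_dim(100, rep(0, 10), rep(1, 10), c(4, 5))

# Arguments of the second example below: 40 rows and 6 + 3 = 9 columns.
expected_dim(40, 1:6, rep(1, 6), c(2, 3, 5))
```

Both results agree with the tibble dimensions printed in the Examples section (100 × 12 and 40 × 9).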

Value

A data frame of dimension \(n \times (p_1 + p_2)\) given by \(S = (X, X_{cat})\). With the default arguments, \(n = 100\), \(p_1 = 10\) and \(p_2 = 2\): the ten columns of \(X\) are sampled from \(N(0, 1)\), and two categorical columns \(X_{cat}\) with 4 and 5 categories are appended. The columns of \(X_{cat}\) are factors.

Examples

generate_X_cat()
#> # A tibble: 100 × 12
#>        X1     X2       X3      X4     X5      X6       X7      X8      X9
#>     <dbl>  <dbl>    <dbl>   <dbl>  <dbl>   <dbl>    <dbl>   <dbl>   <dbl>
#>  1  0.109 -0.441  0.552    0.136  -0.591 -0.141   0.238    1.53    1.68  
#>  2  0.231 -2.18  -0.162   -1.04    0.857 -1.83   -1.30     0.286   1.01  
#>  3 -1.27  -0.678  1.35     1.37    0.509  0.404   1.74     0.966   0.143 
#>  4  0.621  0.729 -0.00533  0.212   0.397 -0.0138  1.46     0.0417  0.0254
#>  5 -0.642 -1.24  -2.41    -0.420   0.387 -0.102  -1.34    -0.187  -0.678 
#>  6  0.662 -0.588  0.990   -0.515   0.370  0.202  -0.00785 -1.35   -0.372 
#>  7  0.246  1.11  -1.23    -0.481   1.04   0.803  -0.709   -0.953   1.21  
#>  8 -0.848 -0.502  0.0274  -0.0229  2.36  -0.588   1.42    -1.28   -0.251 
#>  9 -0.435  0.347 -0.391   -0.236  -0.914  2.11    2.10     1.35   -0.541 
#> 10  0.883  1.75   2.40    -0.0429  1.28   0.106   0.0564   0.0875 -0.0933
#> # … with 90 more rows, and 3 more variables: X10 <dbl>, X11 <fct>, X12 <fct>

generate_X_cat(n = 40, mu = 1:6, sigma = rep(1, 6), no_of_cat = c(2,3,5))
#> # A tibble: 40 × 9
#>        X1    X2    X3    X4    X5    X6 X7    X8    X9   
#>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <fct>
#>  1  0.332 1.74   4.87  4.45  4.99  6.55 1     2     3    
#>  2 -0.995 3.76   2.00  4.51  4.07  4.89 1     3     2    
#>  3  0.643 1.74   3.28  4.05  5.83  5.28 1     1     1    
#>  4  1.59  2.53   1.97  5.39  3.40  6.96 2     1     2    
#>  5  1.05  0.836  3.00  2.71  4.86  6.96 2     1     3    
#>  6  2.54  3.45   3.02  4.65  4.96  5.40 1     3     1    
#>  7 -0.201 3.35   3.08  3.38  4.73  7.08 1     1     1    
#>  8  1.30  1.72   2.55  5.29  4.23  5.11 2     2     5    
#>  9  0.415 1.79   2.31  3.93  5.08  5.82 2     2     2    
#> 10  1.20  2.13   3.24  4.35  5.50  6.96 1     1     1    
#> # … with 30 more rows