normal_datasets.Rmd
library(stat545lamke07)
The purpose of the `stat545lamke07` package is to make it quick to test a method of interest on a toy data set. To that end, the functions starting with `generate_` create data sets based on the normal distribution.
The key component of the `stat545lamke07` package is the `generate_X()` function, which generates a data set \(S = (X)\) in which the columns of \(X\) are normally distributed. Such a data set is extremely flexible to use, as it can be transformed quickly (a small transformation example is shown below). To run `generate_X()`, all we need is the number of samples `n` and the parametrization of \(\mu\) and \(\sigma\) (the `mu` and `sigma` arguments).
df_X <- generate_X(n = 10, mu = rep(0,5), sigma = rep(2, 5))
print(head(df_X))
#> # A tibble: 6 × 5
#> X1 X2 X3 X4 X5
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.342 1.90 -1.35 0.639 -1.06
#> 2 0.235 2.72 -0.431 1.70 1.47
#> 3 -2.05 -0.672 -0.916 3.52 -2.90
#> 4 -0.419 4.95 2.13 -5.20 -3.66
#> 5 0.430 -3.76 -3.48 1.44 1.95
#> 6 -1.15 2.99 -1.19 -0.710 -1.20
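As an illustration of such a transformation (this example is not part of the package), we can exponentiate each column to turn the normal draws into log-normally distributed variables.
# Illustration only: exponentiate each column of the normal draws to
# obtain log-normally distributed variables
df_X_lognormal <- exp(df_X)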
It is then possible to perform experiments of interest, such as the eigendecomposition of the correlation matrix.
eigen(cor(df_X))
#> eigen() decomposition
#> $values
#> [1] 2.3133896 1.4371762 0.6176409 0.4409332 0.1908601
#>
#> $vectors
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.08683208 0.72001821 -0.5771387 -0.1393711 -0.34859810
#> [2,] -0.54457977 -0.05660604 0.0665676 -0.8336233 -0.02948992
#> [3,] -0.56821829 0.23252697 -0.2758458 0.3096227 0.67164131
#> [4,] 0.49154796 -0.33366594 -0.5420542 -0.3585093 0.47402108
#> [5,] 0.36249364 0.55943524 0.5408938 -0.2474931 0.44923455
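As an illustrative follow-up (not shown in the original output), the eigenvalues can be rescaled to the proportion of total variance associated with each component.
# Illustration: proportion of total variance captured by each eigenvalue
ev <- eigen(cor(df_X))$values
ev / sum(ev)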
Suppose we would like to understand the effect of including more variables in our linear model. In addition to generating \(X\) with `generate_X()`, we can specify the exact linear coefficients via the `beta_coefficients` parameter to obtain \[Y = X^T \beta,\] which leads to the data set \(S = (X, Y)\). Note that the number of columns of \(X\) must equal the number of coefficients in `beta_coefficients`. Using `generate_XY()`, we first generate the data set.
df <- generate_XY(n = 1000, mu = rep(0,10), sigma = rep(2,10), beta_coefficients = 1:10)
print(head(df))
#> # A tibble: 6 × 11
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Y
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -1.60 -2.26 0.533 1.55 -1.18 -2.56 -1.94 1.37 4.16 1.10 26.3
#> 2 0.747 4.72 -2.66 -1.79 1.13 -0.936 0.764 0.672 -0.739 1.91 18.2
#> 3 0.812 1.01 0.371 -4.52 2.37 0.318 -0.720 -3.24 0.00691 1.26 -18.7
#> 4 -3.09 1.40 0.649 -0.501 0.719 0.261 0.310 -2.60 1.01 -1.94 -24.2
#> 5 -2.92 2.92 1.08 -2.14 -0.725 -1.78 -1.56 0.315 -1.26 -2.64 -62.9
#> 6 3.69 -2.47 -2.78 -1.12 -0.327 3.86 -1.47 -1.38 3.16 -0.00195 14.5
Having generated the data set, we can now fit some linear models.
# Test a linear model with 3 variables
m1 <- lm(Y ~ X1 + X2 + X3, data = df)
summary(m1)
#>
#> Call:
#> lm(formula = Y ~ X1 + X2 + X3, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -100.930 -26.983 -1.675 25.733 158.541
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.5227 1.2157 -0.430 0.667290
#> X1 1.9452 0.6142 3.167 0.001587 **
#> X2 2.1923 0.6004 3.651 0.000274 ***
#> X3 2.1997 0.6200 3.548 0.000407 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 38.4 on 996 degrees of freedom
#> Multiple R-squared: 0.03458, Adjusted R-squared: 0.03168
#> F-statistic: 11.89 on 3 and 996 DF, p-value: 1.175e-07
# Test a linear model with 6 variables
m2 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6, data = df)
summary(m2)
#>
#> Call:
#> lm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -105.971 -23.033 0.319 21.669 121.787
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.1086 1.0528 -1.053 0.29257
#> X1 1.6671 0.5324 3.131 0.00179 **
#> X2 2.2426 0.5204 4.310 1.80e-05 ***
#> X3 2.2642 0.5369 4.217 2.70e-05 ***
#> X4 3.4446 0.5384 6.398 2.42e-10 ***
#> X5 5.5184 0.5130 10.758 < 2e-16 ***
#> X6 6.4553 0.5125 12.596 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 33.22 on 993 degrees of freedom
#> Multiple R-squared: 0.2797, Adjusted R-squared: 0.2754
#> F-statistic: 64.27 on 6 and 993 DF, p-value: < 2.2e-16
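Since `m1` is nested within `m2`, one way to quantify the effect of the three additional variables is a direct comparison of the two fits with an F-test; the sketch below uses base R's `anova()`, and its output is omitted here.
# Sketch: compare the nested models m1 and m2 with an F-test
anova(m1, m2)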
We can quickly modify our assumptions about the data set by changing the relevant parameters of the `generate_XY()` function, namely `mu`, `sigma`, and `beta_coefficients`.
df <- generate_XY(n = 1000, mu = 51:55, sigma = seq(10,15, length.out = 5), beta_coefficients = 21:25)
print(head(df))
#> # A tibble: 6 × 6
#> X1 X2 X3 X4 X5 Y
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 59.4 55.2 38.3 68.2 45.1 6108.
#> 2 56.9 47.3 43.1 77.8 25.2 5722.
#> 3 49.3 57.0 49.8 59.5 61.5 6398.
#> 4 54.9 49.4 38.3 59.9 58.9 6032.
#> 5 33.9 38.5 61.3 56.3 55.2 5701.
#> 6 52.7 44.3 50.6 47.3 33.8 5223.
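Because \(Y\) is constructed as a linear combination of the columns of \(X\), refitting a full linear model on this new data set should recover coefficients close to the specified `beta_coefficients = 21:25`. The sketch below shows one way to check this; the output is omitted.
# Sketch: the fitted coefficients should be close to beta_coefficients = 21:25
m3 <- lm(Y ~ ., data = df)
coef(m3)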
So far we have assumed that \(X\) contains only continuous variables. However, it is also possible to include categorical variables in the data set. To this end, the `generate_X_cat()` function additionally generates categorical factor columns, controlled through the `no_of_cat` parameter. For example, `no_of_cat = c(4, 5)` is a vector in which each entry gives the number of categories in the corresponding additional column; here we obtain one column with 4 categories and one with 5.
df_cat <- generate_X_cat(n = 40, mu = 1:5, sigma = rep(1, 5), no_of_cat = c(4,5))
print(head(df_cat))
#> # A tibble: 6 × 7
#> X1 X2 X3 X4 X5 X6 X7
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
#> 1 0.402 1.21 3.46 3.03 5.24 4 5
#> 2 2.48 1.38 2.30 3.99 4.30 2 5
#> 3 -0.192 3.11 3.96 3.81 6.55 3 1
#> 4 0.183 1.20 4.48 3.67 5.50 2 5
#> 5 1.32 1.01 4.80 5.41 5.10 2 1
#> 6 1.46 0.862 3.70 2.92 5.01 4 5
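As a quick check (illustration only), the appended columns `X6` and `X7` are factors whose numbers of levels should match the `no_of_cat = c(4, 5)` specification.
# Illustration: count the factor levels of the appended categorical columns
sapply(df_cat[c("X6", "X7")], nlevels)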