normal_datasets.Rmd
library(stat545lamke07)
The purpose of the `stat545lamke07` package is to make it quick to test a method of interest on a toy data set. To that end, the functions starting with `generate_` create data sets based on the normal distribution.
The key component of the `stat545lamke07` package is the `generate_X()` function, which generates a data set \(S = (X)\) in which the columns of \(X\) are normally distributed. Such a data set is extremely flexible to use, as it can be transformed quickly (a small transformation example is shown below). To run `generate_X()`, all we need is the number of samples `n` and the parametrization of \(\mu\) and \(\sigma\) (the `mu` and `sigma` arguments).
df_X <- generate_X(n = 10, mu = rep(0,5), sigma = rep(2, 5))
print(head(df_X))
#> # A tibble: 6 × 5
#> X1 X2 X3 X4 X5
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.342 1.90 -1.35 0.639 -1.06
#> 2 0.235 2.72 -0.431 1.70 1.47
#> 3 -2.05 -0.672 -0.916 3.52 -2.90
#> 4 -0.419 4.95 2.13 -5.20 -3.66
#> 5 0.430 -3.76 -3.48 1.44 1.95
#> 6 -1.15 2.99 -1.19 -0.710 -1.20
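As an illustration of such a transformation (this example is not part of the package), we can exponentiate each column to turn the normal draws into log-normally distributed variables.
# Illustration only: exponentiate each column of the normal draws to
# obtain log-normally distributed variables
df_X_lognormal <- exp(df_X)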
It is then possible to perform experiments of interest, such as the eigendecomposition of the correlation matrix.
eigen(cor(df_X))
#> eigen() decomposition
#> $values
#> [1] 2.3133896 1.4371762 0.6176409 0.4409332 0.1908601
#>
#> $vectors
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.08683208 0.72001821 -0.5771387 -0.1393711 -0.34859810
#> [2,] -0.54457977 -0.05660604 0.0665676 -0.8336233 -0.02948992
#> [3,] -0.56821829 0.23252697 -0.2758458 0.3096227 0.67164131
#> [4,] 0.49154796 -0.33366594 -0.5420542 -0.3585093 0.47402108
#> [5,] 0.36249364 0.55943524 0.5408938 -0.2474931 0.44923455
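As an illustrative follow-up (not shown in the original output), the eigenvalues can be rescaled to the proportion of total variance associated with each component.
# Illustration: proportion of total variance captured by each eigenvalue
ev <- eigen(cor(df_X))$values
ev / sum(ev)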
Suppose we would like to understand the effect of including more variables in our linear model. In addition to generating \(X\) with `generate_X()`, we can specify the exact linear coefficients via the `beta_coefficients` parameter to obtain \[Y = X^T \beta,\] which leads to the data set \(S = (X, Y)\). Note that the number of columns of \(X\) must equal the number of coefficients in `beta_coefficients`. Using `generate_XY()`, we first generate the data set.
df <- generate_XY(n = 1000, mu = rep(0,10), sigma = rep(2,10), beta_coefficients = 1:10)
print(head(df))
#> # A tibble: 6 × 11
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Y
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -1.60 -2.26 0.533 1.55 -1.18 -2.56 -1.94 1.37 4.16 1.10 26.3
#> 2 0.747 4.72 -2.66 -1.79 1.13 -0.936 0.764 0.672 -0.739 1.91 18.2
#> 3 0.812 1.01 0.371 -4.52 2.37 0.318 -0.720 -3.24 0.00691 1.26 -18.7
#> 4 -3.09 1.40 0.649 -0.501 0.719 0.261 0.310 -2.60 1.01 -1.94 -24.2
#> 5 -2.92 2.92 1.08 -2.14 -0.725 -1.78 -1.56 0.315 -1.26 -2.64 -62.9
#> 6 3.69 -2.47 -2.78 -1.12 -0.327 3.86 -1.47 -1.38 3.16 -0.00195 14.5
Having generated the data set, we can now fit some linear models.
# Test a linear model with 3 variables
m1 <- lm(Y ~ X1 + X2 + X3, data = df)
summary(m1)
#>
#> Call:
#> lm(formula = Y ~ X1 + X2 + X3, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -100.930 -26.983 -1.675 25.733 158.541
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.5227 1.2157 -0.430 0.667290
#> X1 1.9452 0.6142 3.167 0.001587 **
#> X2 2.1923 0.6004 3.651 0.000274 ***
#> X3 2.1997 0.6200 3.548 0.000407 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 38.4 on 996 degrees of freedom
#> Multiple R-squared: 0.03458, Adjusted R-squared: 0.03168
#> F-statistic: 11.89 on 3 and 996 DF, p-value: 1.175e-07
# Test a linear model with 6 variables
m2 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6, data = df)
summary(m2)
#>
#> Call:
#> lm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -105.971 -23.033 0.319 21.669 121.787
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.1086 1.0528 -1.053 0.29257
#> X1 1.6671 0.5324 3.131 0.00179 **
#> X2 2.2426 0.5204 4.310 1.80e-05 ***
#> X3 2.2642 0.5369 4.217 2.70e-05 ***
#> X4 3.4446 0.5384 6.398 2.42e-10 ***
#> X5 5.5184 0.5130 10.758 < 2e-16 ***
#> X6 6.4553 0.5125 12.596 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 33.22 on 993 degrees of freedom
#> Multiple R-squared: 0.2797, Adjusted R-squared: 0.2754
#> F-statistic: 64.27 on 6 and 993 DF, p-value: < 2.2e-16
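Since `m1` is nested within `m2`, one way to quantify the effect of the three additional variables is a direct comparison of the two fits with an F-test; the sketch below uses base R's `anova()`, and its output is omitted here.
# Sketch: compare the nested models m1 and m2 with an F-test
anova(m1, m2)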
We can quickly modify our assumptions about the data set by changing the relevant parameters of the `generate_XY()` function, namely `mu`, `sigma`, and `beta_coefficients`.
df <- generate_XY(n = 1000, mu = 51:55, sigma = seq(10,15, length.out = 5), beta_coefficients = 21:25)
print(head(df))
#> # A tibble: 6 × 6
#> X1 X2 X3 X4 X5 Y
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 59.4 55.2 38.3 68.2 45.1 6108.
#> 2 56.9 47.3 43.1 77.8 25.2 5722.
#> 3 49.3 57.0 49.8 59.5 61.5 6398.
#> 4 54.9 49.4 38.3 59.9 58.9 6032.
#> 5 33.9 38.5 61.3 56.3 55.2 5701.
#> 6 52.7 44.3 50.6 47.3 33.8 5223.
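Because \(Y\) is constructed as a linear combination of the columns of \(X\), refitting a full linear model on this new data set should recover coefficients close to the specified `beta_coefficients = 21:25`. The sketch below shows one way to check this; the output is omitted.
# Sketch: the fitted coefficients should be close to beta_coefficients = 21:25
m3 <- lm(Y ~ ., data = df)
coef(m3)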
So far we have assumed that \(X\) contains only continuous variables. However, it is also possible to include categorical variables in the data set. To this end, the `generate_X_cat()` function additionally generates categorical factor columns, controlled through the `no_of_cat` parameter. For example, `no_of_cat = c(4, 5)` is a vector in which each entry gives the number of categories in the corresponding additional column; here we obtain one column with 4 categories and one with 5.
df_cat <- generate_X_cat(n = 40, mu = 1:5, sigma = rep(1, 5), no_of_cat = c(4,5))
print(head(df_cat))
#> # A tibble: 6 × 7
#> X1 X2 X3 X4 X5 X6 X7
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
#> 1 0.402 1.21 3.46 3.03 5.24 4 5
#> 2 2.48 1.38 2.30 3.99 4.30 2 5
#> 3 -0.192 3.11 3.96 3.81 6.55 3 1
#> 4 0.183 1.20 4.48 3.67 5.50 2 5
#> 5 1.32 1.01 4.80 5.41 5.10 2 1
#> 6 1.46 0.862 3.70 2.92 5.01 4 5
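As a quick check (illustration only), the appended columns `X6` and `X7` are factors whose numbers of levels should match the `no_of_cat = c(4, 5)` specification.
# Illustration: count the factor levels of the appended categorical columns
sapply(df_cat[c("X6", "X7")], nlevels)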