Note: This package was created as part of the STAT545B Assignment 4 submission. The corresponding website can be found under this link: https://lamke07.github.io/stat545lamke07/index.html.
The goal of stat545lamke07 is to provide a collection of functions that quickly create toy data sets for testing a statistical or machine learning model. Simulated data is often useful, but the code that generates it usually has to be written in a hurry. The stat545lamke07 package aims to make this process easier and more orderly by providing simple data sets, including data sets based on the normal distribution and causal data sets.
You can install the released version of stat545lamke07 from the GitHub repository with:
devtools::install_github("lamke07/stat545lamke07")
Note: when using devtools::check(), you might need to have qpdf installed locally; otherwise you may run into a warning with the following message.
WARNING
‘qpdf’ is needed for checks on size reduction of PDFs
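A quick way to see whether this warning is likely to appear is to check whether a qpdf executable is visible from the R session. The snippet below is only a sketch using base R; the variable names are illustrative, and it only looks at the system PATH.

```r
# Look for a qpdf executable on the PATH (returns "" if none is found)
qpdf_path <- unname(Sys.which("qpdf"))
has_qpdf <- nzchar(qpdf_path)

if (has_qpdf) {
  message("qpdf found at: ", qpdf_path)
} else {
  message("qpdf not found on the PATH; devtools::check() may report the warning above.")
}
```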
This is a basic example of a common use case: the generate_XY() function creates a data set where Y is a linear combination of the columns in X. As such, a linear model fitted on the full data set is expected to give a perfect fit.
library(stat545lamke07)
# Obtain a quick data set S = (X,Y)
df <- generate_XY()
print(head(df))
# Test a linear model
m1 <- lm(Y ~ ., data = df)
summary(m1)
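To see why a perfect fit is expected, the idea can be reproduced in base R without the package. The following is a minimal sketch, not the package's implementation: the names toy, beta, and max_abs_resid are hypothetical, and it assumes Y is built from X with no added noise.

```r
set.seed(1)
n <- 100
p <- 3

# Hypothetical stand-in for generate_XY(): normal covariates, noiseless linear Y
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("X", seq_len(p))
beta <- c(2, -1, 0.5)                      # illustrative coefficients
toy <- data.frame(X, Y = as.vector(X %*% beta))

# Regress Y on all columns of X
fit <- lm(Y ~ ., data = toy)

# Residuals are zero up to floating-point error, and the slopes match beta
max_abs_resid <- max(abs(residuals(fit)))
print(coef(fit))
print(max_abs_resid)
```

With data like this, summary() on the fitted model typically warns that the fit is essentially perfect and that the summary may be unreliable, which is expected here rather than a sign of a problem.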
It is possible to specify the individual parameters of the normal distribution for the columns of X: n, mu, sigma, and beta_coefficients. Below is an example of how one could specify these parameters. Make sure that the dimensions of the arguments are consistent (in the example, mu, sigma, and beta_coefficients all have length 10).
# Obtain a quick data set S = (X,Y)
df <- generate_XY(n = 1000, mu = 1:10, sigma = 1:10, beta_coefficients = 1:10)
print(head(df))
# Test a linear model
m1 <- lm(Y ~ X1 + X2 + X3 + X4, data = df)
summary(m1)
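The effect of per-column parameters, and of fitting on only some of the columns, can be illustrated in base R. This is a sketch under stated assumptions rather than the package's code: it assumes sigma holds standard deviations, that Y has no added noise, and all object names (toy, rss_full, rss_sub) are hypothetical.

```r
set.seed(2)
n <- 1000
mu <- 1:10
sigma <- 1:10      # assumed here to be standard deviations
beta <- 1:10

# Guard against mismatched dimensions before simulating
stopifnot(length(mu) == length(sigma), length(mu) == length(beta))
p <- length(mu)

# Column j is drawn from a normal distribution with mean mu[j] and sd sigma[j]
X <- sapply(seq_len(p), function(j) rnorm(n, mean = mu[j], sd = sigma[j]))
colnames(X) <- paste0("X", seq_len(p))
toy <- data.frame(X, Y = as.vector(X %*% beta))

# Full model versus a model that only uses the first four columns
fit_full <- lm(Y ~ ., data = toy)
fit_sub <- lm(Y ~ X1 + X2 + X3 + X4, data = toy)

# Residual sums of squares: near zero for the full model, large for the subset
rss_full <- sum(residuals(fit_full)^2)
rss_sub <- sum(residuals(fit_sub)^2)
print(c(full = rss_full, subset = rss_sub))
```

Because the subset model leaves out X5 to X10, which still contribute to Y, it cannot reproduce Y exactly. This makes such a data set a convenient test case for omitted-variable behaviour.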
We have also included functions to create causal toy data sets, causal_XTY_binary() and causal_XTY_multiple(), where the treatment effect is additive and the relationship between the outcomes Y and covariates X is linear.
# Obtain a quick causal data set.
df_causal_binary <- causal_XTY_binary()
print(head(df_causal_binary))
df_causal_multiple <- causal_XTY_multiple()
print(head(df_causal_multiple))
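The structure described above (additive treatment effect, linear relationship between Y and X) can be sketched in base R as follows. This is an illustration only and does not reflect how causal_XTY_binary() is implemented: the column names, the randomised treatment assignment, the noise level, and the effect size tau are all assumptions made for the example.

```r
set.seed(3)
n <- 500

# Two normal covariates
X <- matrix(rnorm(n * 2), nrow = n, ncol = 2)
colnames(X) <- c("X1", "X2")

# Binary treatment, assigned at random (an assumption for this sketch)
treatment <- rbinom(n, size = 1, prob = 0.5)

tau <- 2             # additive treatment effect (illustrative value)
beta <- c(1, -1)     # linear effect of the covariates (illustrative values)

# Outcome: linear in X, plus the additive treatment effect, plus small noise
Y <- as.vector(X %*% beta) + tau * treatment + rnorm(n, sd = 0.1)
toy_causal <- data.frame(X, treatment = treatment, Y = Y)

# A linear model with the treatment indicator recovers tau approximately
fit <- lm(Y ~ ., data = toy_causal)
tau_hat <- unname(coef(fit)["treatment"])
print(tau_hat)
```

Since the true effect is known by construction, a data set of this kind lets one check whether an estimator recovers it before applying the estimator to real data.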