Simulating a causal data set \(S = (X,Y_i, T, Y_{obs})\) with multiple potential outcomes.

Creates a causal data set \(S = (X, Y_i, T, Y_{obs})\) for causal inference. The \(p\) columns of \(X\) are sampled independently from Gaussian distributions, where column \(j\) has mean \(\mu_j\) and standard deviation \(\sigma_j\), i.e. \(X_j \sim N(\mu_j, \sigma_j^2)\). A treatment \(T\) is sampled for each observation; more than two treatments are possible. The potential outcome \(Y_i\) is the outcome that would be observed if treatment \(i\) were applied. The base outcome \(Y = X \beta\) is assumed to depend linearly on \(X\), and the effect of treatment \(T = i\) is additive, so that \(Y_i = Y + \tau_i\), where \(\tau_i\) is the \(i\)-th entry of treatment_effect. The observed outcome \(Y_{obs}\) is the potential outcome of the treatment actually assigned. See Causality (Pearl 2009) for further details and a general introduction to causal inference.
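The generative process described above can be sketched in base R roughly as follows. This is an illustrative re-implementation under the stated assumptions, not the package's source code: the helper name `simulate_causal_sketch` is hypothetical, and it returns a plain data frame rather than a tibble.

```r
# Hypothetical base-R sketch of the data-generating process described above.
# Not the package's implementation; the function name and structure are assumptions.
simulate_causal_sketch <- function(n = 100,
                                   mu = rep(0, 3),
                                   sigma = rep(1, 3),
                                   beta_coefficients = 1:3,
                                   treatment_prob = rep(0.25, 4),
                                   treatment_effect = c(10, 20, 30, 40)) {
  p <- length(mu)
  k <- length(treatment_effect)

  # Column j of X is drawn independently from N(mu[j], sigma[j]^2).
  X <- sapply(seq_len(p), function(j) rnorm(n, mean = mu[j], sd = sigma[j]))
  X <- matrix(X, nrow = n, ncol = p)
  colnames(X) <- paste0("X", seq_len(p))

  # Base outcome: linear in X.
  Y <- as.vector(X %*% beta_coefficients)

  # Potential outcome under treatment i: base outcome plus its additive effect.
  Y_pot <- outer(Y, treatment_effect, `+`)
  colnames(Y_pot) <- paste0("Y", seq_len(k))

  # Assign one treatment per unit, then reveal only that potential outcome.
  treatment <- sample.int(k, size = n, replace = TRUE, prob = treatment_prob)
  Y_observed <- Y_pot[cbind(seq_len(n), treatment)]

  data.frame(X, Y = Y, Y_pot, treatment = treatment, Y_observed = Y_observed)
}
```

Calling `simulate_causal_sketch()` with its defaults yields 100 rows and 10 columns, laid out like the default output shown under Examples.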

causal_XTY_multiple(
  n = 100,
  mu = rep(0, 3),
  sigma = rep(1, 3),
  beta_coefficients = 1:3,
  treatment_prob = rep(0.25, 4),
  treatment_effect = c(10, 20, 30, 40)
)

Arguments

n	desired number of data points in the data set.
mu	a \(p\)-dimensional vector of means \(\mu\).
sigma	a \(p\)-dimensional vector of non-negative standard deviations \(\sigma\).
beta_coefficients	a \(p\)-dimensional vector of coefficients \(\beta\).
treatment_prob	a probability vector with weights summing to 1, giving the probability of each treatment.
treatment_effect	a vector giving the additive effect of each treatment on the outcome \(Y\).
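The argument descriptions imply a few consistency constraints: mu, sigma and beta_coefficients share the length \(p\), and treatment_prob and treatment_effect presumably have one entry per treatment. Whether the function itself enforces these is not stated here, so the helper below (`check_causal_args`, a hypothetical name) is only a sketch of how a caller might check inputs beforehand.

```r
# Hypothetical helper: checks the constraints implied by the argument
# descriptions. Whether the package performs such checks is an assumption.
check_causal_args <- function(mu, sigma, beta_coefficients,
                              treatment_prob, treatment_effect) {
  stopifnot(
    length(mu) == length(sigma),
    length(mu) == length(beta_coefficients),
    all(sigma >= 0),
    length(treatment_prob) == length(treatment_effect),
    all(treatment_prob >= 0),
    isTRUE(all.equal(sum(treatment_prob), 1))
  )
  invisible(TRUE)
}

# Passes silently for a valid argument set with p = 7 and five treatments.
check_causal_args(mu = rep(2, 7), sigma = 1:7, beta_coefficients = 1:7,
                  treatment_prob = c(0.4, 0.1, 0.1, 0.2, 0.2),
                  treatment_effect = 1:5)
```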

Value

A causal data set \(S = (X,Y_i, T, Y_{obs})\) with multiple potential outcomes. In the default case, \(n = 100\) and \(p = 3\), where \(p\) is the number of columns in \(X\). Each column of \(X\) is sampled from \(N(0,1)\), and the base outcome \(Y\) uses \(\beta\)-coefficients 1 to 3. The four default treatments are equally likely (probability 0.25 each), with additive effects 10, 20, 30 and 40.

Examples

causal_XTY_multiple()
#> # A tibble: 100 × 10
#>         X1     X2      X3       Y    Y1    Y2    Y3    Y4 treatment Y_observed
#>      <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>     <int>      <dbl>
#>  1  0.0639  0.354 -0.276  -0.0562  9.94  19.9  29.9  39.9         4       39.9
#>  2 -0.919   0.530  0.384   1.29   11.3   21.3  31.3  41.3         1       11.3
#>  3  0.901  -0.311  1.45    4.62   14.6   24.6  34.6  44.6         4       44.6
#>  4 -0.798  -0.244  1.73    3.92   13.9   23.9  33.9  43.9         4       43.9
#>  5  0.668  -0.292  0.456   1.45   11.5   21.5  31.5  41.5         3       31.5
#>  6  0.155  -1.13   0.708   0.0216 10.0   20.0  30.0  40.0         4       40.0
#>  7  0.129  -1.12   2.06    4.07   14.1   24.1  34.1  44.1         1       14.1
#>  8 -1.53    2.05   0.0239  2.65   12.6   22.6  32.6  42.6         4       42.6
#>  9  0.202  -0.910  0.246  -0.879   9.12  19.1  29.1  39.1         2       19.1
#> 10 -0.718   0.458  0.272   1.02   11.0   21.0  31.0  41.0         3       31.0
#> # … with 90 more rows

causal_XTY_multiple(n = 40, mu = rep(2, 7), sigma = 1:7,
                    beta_coefficients = 1:7,
                    treatment_prob = c(0.4, 0.1, 0.1, 0.2, 0.2),
                    treatment_effect = 1:5)
#> # A tibble: 40 × 15
#>        X1     X2     X3    X4     X5     X6     X7      Y     Y1     Y2     Y3
#>     <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1 0.294  -1.10  -0.118  7.32  1.70  -4.90   6.16   49.3   50.3   51.3   52.3 
#>  2 1.14    3.56  -3.57  -1.25 -7.32   0.370  9.76   26.5   27.5   28.5   29.5 
#>  3 1.86    4.14   1.26   9.18 -4.37   4.74  -4.60   25.0   26.0   27.0   28.0 
#>  4 1.68    1.63   0.993  4.75 -6.91   1.90   6.90   52.1   53.1   54.1   55.1 
#>  5 1.83    5.12   1.25   2.36 -0.543 -1.25  -0.973   8.22   9.22  10.2   11.2 
#>  6 0.764   1.57   3.38   3.30 -6.68   7.25  -6.00   -4.67  -3.67  -2.67  -1.67
#>  7 0.0977  3.86   0.620  2.29  2.20   6.45   3.63   93.9   94.9   95.9   96.9 
#>  8 1.91    2.82   2.18   2.93  1.38   1.01   0.675  43.5   44.5   45.5   46.5 
#>  9 2.03   -0.560 -0.228  9.95 -1.06  -2.51  -9.64  -47.8  -46.8  -45.8  -44.8 
#> 10 2.46    0.435 -4.49  -1.81  2.80  -5.52  15.4    71.4   72.4   73.4   74.4 
#> # … with 30 more rows, and 4 more variables: Y4 <dbl>, Y5 <dbl>,
#> #   treatment <int>, Y_observed <dbl>
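
As a possible follow-up, the additive treatment effects can be recovered from the observed outcome alone by regressing it on the covariates and the treatment indicator. The sketch below simulates its own data in base R to mimic the default output structure (an assumption made to keep the example self-contained; it does not call the package).

```r
# Self-contained illustration; the data are simulated here in base R to
# mimic the structure of the default output (an assumption, not package code).
set.seed(42)
n <- 200
X <- matrix(rnorm(n * 3), nrow = n, dimnames = list(NULL, c("X1", "X2", "X3")))
Y <- as.vector(X %*% 1:3)
effects <- c(10, 20, 30, 40)
treatment <- sample.int(4, n, replace = TRUE, prob = rep(0.25, 4))
d <- data.frame(X, treatment = factor(treatment),
                Y_observed = Y + effects[treatment])

# Regress the observed outcome on the covariates and the treatment indicator.
fit <- lm(Y_observed ~ X1 + X2 + X3 + treatment, data = d)

# Treatment 1 is the reference level, so the intercept estimates its effect
# and the treatment coefficients estimate contrasts against treatment 1.
round(coef(fit), 6)
```

Because this sketch adds no noise term to the outcome, the fit is exact up to floating-point error: the intercept matches the effect of treatment 1, the covariate coefficients match \(\beta\), and the remaining coefficients are the differences between each treatment effect and that of treatment 1.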