Introduction

Causal inference methods rely on assumptions that are often untestable. Matching requires the conditional independence assumption (no unobserved confounders). Difference-in-differences requires parallel trends. Instrumental variables require the exclusion restriction. Regression discontinuity requires continuity of potential outcomes at the cutoff. Synthetic control requires that a weighted combination of donor units can approximate the treated unit’s counterfactual.

Because these assumptions cannot be verified with observed data alone, sensitivity analysis asks: how robust are our conclusions to plausible violations of the identifying assumptions? A finding that crumbles under even mild violations provides much weaker evidence than one that withstands substantial departures from the ideal.

This vignette covers a comprehensive toolkit of sensitivity and robustness methods:

  1. Omitted variable bias (sensemakr): How strong must an unobserved confounder be to overturn the result?
  2. Rosenbaum bounds: How sensitive is a matched comparison to hidden bias?
  3. E-values: What is the minimum confounding strength needed to explain away an observed effect?
  4. HonestDiD: How robust is a DID estimate to violations of parallel trends?
  5. Placebo tests: Using known-null comparisons to probe assumptions.
  6. Specification curve analysis: Exploring the multiverse of plausible model choices.
  7. Bounds approaches: Partial identification when point identification fails.
  8. Multiple testing corrections: Controlling false discoveries across many tests.
  9. Randomization inference: Exact p-values without distributional assumptions.
  10. Bootstrap methods: Robust inference for clustered and complex data structures.

Throughout, we distinguish between sensitivity analyses (which formally parameterize assumption violations) and robustness checks (which informally probe the fragility of results).


Omitted Variable Bias: sensemakr

The sensemakr package (Cinelli and Hazlett, 2020) provides a formal framework for assessing sensitivity to omitted variable bias (OVB) in linear regression. The key insight is to express the strength of unobserved confounders in terms of partial R-squared values, which are directly interpretable and can be benchmarked against observed covariates.

Theory

Consider a linear model where we regress outcome Y on treatment D and observed controls X:

Y = τD + Xβ + γZ + ε

where Z is an unobserved confounder. The bias of the OLS estimate τ̂ (from the short regression without Z) relative to the true τ depends on two quantities:

  • R²_{Y~Z | D,X}: the partial R-squared of Z with Y, after partialing out D and X.
  • R²_{D~Z | X}: the partial R-squared of Z with D, after partialing out X.

The robustness value (RV) tells us the minimum strength of confounding (in terms of partial R-squared with both treatment and outcome) that would be needed to reduce the estimated effect to zero (or to make it statistically insignificant). A higher RV indicates a more robust finding.
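For intuition, the robustness value can be computed by hand from the treatment coefficient's t-statistic. The helper below is a minimal sketch of the formula in Cinelli and Hazlett (2020), not sensemakr's internal implementation; the function name is our own:

```r
# Sketch of the robustness value formula (Cinelli and Hazlett, 2020):
# f_q = q * |t| / sqrt(df), then RV_q = 0.5 * (sqrt(f_q^4 + 4 * f_q^2) - f_q^2)
robustness_value_manual <- function(t_stat, df, q = 1) {
  f_q <- q * abs(t_stat) / sqrt(df)
  0.5 * (sqrt(f_q^4 + 4 * f_q^2) - f_q^2)
}

# A t-statistic of 4 with 100 residual degrees of freedom gives RV_q ~ 0.33:
robustness_value_manual(t_stat = 4, df = 100)
```

Note that RV_q is increasing in the t-statistic: stronger evidence against the null requires a stronger confounder to overturn it.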

Implementation

library(sensemakr)

# Simulate analysis data
set.seed(100)
n_sens <- 500
analysis_data <- data.frame(
  age       = rnorm(n_sens, 40, 10),
  income    = rnorm(n_sens, 50000, 15000),
  education = rnorm(n_sens, 14, 3),
  region    = factor(sample(1:4, n_sens, replace = TRUE))
)
analysis_data$treatment <- 0.3 * scale(analysis_data$income) +
  0.2 * scale(analysis_data$age) + rnorm(n_sens)
analysis_data$treatment <- as.numeric(analysis_data$treatment > median(analysis_data$treatment))
analysis_data$outcome <- 0.5 * analysis_data$treatment +
  0.3 * scale(analysis_data$income) +
  0.2 * scale(analysis_data$age) +
  0.1 * scale(analysis_data$education) + rnorm(n_sens)

# Fit the baseline model
model <- lm(outcome ~ treatment + age + income + education + region,
            data = analysis_data)

# Run sensitivity analysis
sens <- sensemakr(
  model       = model,
  treatment   = "treatment",
  benchmark_covariates = "income",
  kd = 1:3,   # multiples of the benchmark for treatment equation
  ky = 1:3,   # multiples of the benchmark for outcome equation
  q  = 1,     # fraction of effect to be explained away (1 = full effect)
  alpha = 0.05,
  reduce = TRUE
)

# Summary
summary(sens)

The summary reports the estimated treatment effect, the robustness value RV_q, the robustness value for statistical significance RV_{q,α}, and benchmark comparisons. If RV_q = 0.15, a confounder would need to explain at least 15% of the residual variance of both treatment and outcome to bring the effect to zero.

Contour Plots

Contour plots display the adjusted estimate (or t-value) as a function of the hypothetical confounder’s partial R-squared with the treatment and the outcome.

# Contour plot of adjusted estimates
plot(sens, type = "contour")

# Contour plot of adjusted t-statistics
plot(sens, sensitivity.of = "t-value", type = "contour")

The contour lines show combinations of confounder strength that produce the same adjusted estimate. The red dashed lines mark the threshold where the adjusted effect equals zero or loses statistical significance. Benchmark points (e.g., how strong the confounder would be if it were “1x income” or “3x income”) provide calibration.

Extreme Scenarios

The extreme-scenario plot shows the worst-case bias under the assumption that the confounder explains a given fraction of the treatment variation:

# Extreme scenario plot
plot(sens, type = "extreme")

This plot addresses the question: if an unobserved confounder explained a given fraction R²_{D~Z|X} of the residual treatment variance, what is the strongest possible confounding it could produce in the outcome equation?

Multiple Treatments

When you have multiple treatment variables, you can run sensemakr for each and compare robustness values:

# Add additional treatment variables to analysis_data
set.seed(101)
analysis_data$treatment_A <- analysis_data$treatment
analysis_data$treatment_B <- as.numeric(
  0.2 * scale(analysis_data$income) + rnorm(nrow(analysis_data)) > 0)
analysis_data$treatment_C <- as.numeric(
  0.1 * scale(analysis_data$age) + rnorm(nrow(analysis_data)) > 0)

treatments <- c("treatment_A", "treatment_B", "treatment_C")

sens_results <- lapply(treatments, function(trt) {
  model <- lm(
    reformulate(c(trt, "age", "income", "education"), "outcome"),
    data = analysis_data
  )
  sensemakr(
    model = model,
    treatment = trt,
    benchmark_covariates = "income",
    kd = 1:3,
    ky = 1:3
  )
})

# Compare robustness values
sapply(sens_results, function(s) s$sensitivity_stats$rv_q)

Interpretation Guidelines

  • RV_q > 0.30: Very robust. An unobserved confounder would need to explain more than 30% of the residual variance in both treatment and outcome.
  • RV_q ∈ [0.10, 0.30]: Moderately robust. Plausible confounders should be discussed.
  • RV_q < 0.10: Fragile. The result is sensitive to relatively weak unobserved confounding.

Always benchmark against observed covariates. If the strongest observed predictor has partial R-squared of 0.05 with the treatment, and RV_q = 0.20, then the unobserved confounder would need to be four times as strong as the strongest observed predictor.


Rosenbaum Bounds

In matched observational studies, the key assumption is that treatment assignment is ignorable conditional on observed covariates. Rosenbaum bounds formalize how much hidden bias could be present before the conclusion of a significant treatment effect would be overturned.

Theory

Let π_i be the probability that unit i receives treatment. In a matched pair (i, j), the odds ratio of treatment assignment is bounded:

1/Γ ≤ [π_i(1 − π_j)] / [π_j(1 − π_i)] ≤ Γ

When Γ = 1, there is no hidden bias (treatment assignment depends only on observed covariates). As Γ increases, more hidden bias is allowed. The sensitivity analysis asks: for what value of Γ does the conclusion of a significant treatment effect first break down?
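To make the mechanics concrete, here is a minimal sketch of the worst-case p-value bound for a sign test on matched pairs (the helper name is ours, not part of rbounds): under bias Γ, the probability that the treated unit has the higher outcome in a discordant pair is at most Γ/(1 + Γ), so the worst-case p-value is a binomial tail at that probability.

```r
# Worst-case one-sided sign-test p-value under hidden bias Gamma:
# with S "treated higher" pairs out of n discordant pairs, the
# treated-wins probability is at most Gamma / (1 + Gamma)
sign_test_upper_bound <- function(S, n, gamma) {
  p_plus <- gamma / (1 + gamma)
  pbinom(S - 1, size = n, prob = p_plus, lower.tail = FALSE)
}

sign_test_upper_bound(S = 70, n = 100, gamma = 1)    # no hidden bias
sign_test_upper_bound(S = 70, n = 100, gamma = 1.5)  # worst case at Gamma = 1.5
```

As Γ grows, the upper bound on the p-value rises; the Γ at which it crosses 0.05 is the sensitivity threshold reported by the packages below.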

Implementation with rbounds

library(rbounds)
library(Matching)

# Simulate data for matching
set.seed(200)
n_rb <- 400
data <- data.frame(
  age       = rnorm(n_rb, 40, 10),
  income    = rnorm(n_rb, 50000, 15000),
  education = rnorm(n_rb, 14, 3)
)
# Treatment assignment depends on covariates
data$ps <- plogis(-2 + 0.03 * data$age + 0.00002 * data$income + 0.1 * data$education)
data$treatment <- rbinom(n_rb, 1, data$ps)
data$outcome <- 0.5 * data$treatment + 0.02 * data$age +
  0.00001 * data$income + rnorm(n_rb)

# Perform matching
match_out <- Match(
  Y  = data$outcome,
  Tr = data$treatment,
  X  = data[, c("age", "income", "education")],
  estimand = "ATT"
)

# Rosenbaum bounds for the Hodges-Lehmann point estimate
hlsens(match_out, Gamma = 3, GammaInc = 0.1)

# Rosenbaum bounds for the Wilcoxon signed-rank test
psens(match_out, Gamma = 3, GammaInc = 0.1)

Implementation with sensitivitymv

The sensitivitymv package extends Rosenbaum bounds to matched sets with variable numbers of controls:

library(sensitivitymv)

# Create simulated matched-set matrix
# Each row is a matched set; col 1 = treated, cols 2-3 = controls; NA pads shorter sets
set.seed(210)
n_sets <- 50
y_mat <- matrix(NA, nrow = n_sets, ncol = 3)
for (i in 1:n_sets) {
  y_mat[i, 1] <- rnorm(1, mean = 0.5)  # treated
  n_controls <- sample(1:2, 1)
  y_mat[i, 2:(1 + n_controls)] <- rnorm(n_controls, mean = 0)
}

senmv(y_mat, gamma = 2, method = "t")

# Sensitivity analysis across a range of Gamma values
gammas <- seq(1, 3, by = 0.1)
pvals  <- sapply(gammas, function(g) {
  senmv(y_mat, gamma = g, method = "t")$pval
})

# Plot sensitivity
plot(gammas, pvals, type = "l", lwd = 2,
     xlab = expression(Gamma), ylab = "Upper bound on p-value",
     main = "Sensitivity to Hidden Bias")
abline(h = 0.05, lty = 2, col = "red")

Interpretation

  • If the conclusion holds up to Γ = 2, an unobserved covariate would need to double the odds of treatment (within matched pairs) to explain away the result.
  • Landmark observational findings (e.g., smoking and lung cancer) withstand Γ values exceeding 5, indicating strong robustness.
  • In economics, Γ values between 1.5 and 3 are common for well-designed observational studies.

E-values

The E-value (VanderWeele and Ding, 2017) provides a different perspective on sensitivity to unmeasured confounding. It asks: what is the minimum strength of association (on the risk ratio scale) that an unmeasured confounder would need to have with both the treatment and the outcome to fully explain away the observed treatment-outcome association?

Theory

For an observed risk ratio RR, the E-value is:

E-value = RR + √(RR × (RR − 1))

The E-value has an appealing interpretation: an unmeasured confounder that could explain away the entire observed effect would need to be associated with both treatment and outcome by a risk ratio of at least the E-value, above and beyond all measured covariates.
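The formula is simple enough to compute directly. The helper below is a hand-rolled sketch for point estimates only (the EValue package used in the next section also handles confidence limits and other effect scales); by convention, risk ratios below 1 are inverted first so that RR ≥ 1.

```r
# Direct implementation of the E-value formula for a point estimate;
# risk ratios below 1 are inverted first so that RR >= 1
e_value <- function(rr) {
  if (rr < 1) rr <- 1 / rr
  rr + sqrt(rr * (rr - 1))
}

e_value(3.9)  # matches the point-estimate E-value from evalues.RR(est = 3.9, ...)
```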

Computing E-values

library(EValue)

# For a risk ratio
evalues.RR(est = 3.9, lo = 1.8, hi = 8.7)

# For an odds ratio (common outcome)
evalues.OR(est = 2.5, lo = 1.4, hi = 4.5, rare = FALSE)

# For a hazard ratio
evalues.HR(est = 1.8, lo = 1.2, hi = 2.7, rare = TRUE)

# For a standardized mean difference (Cohen's d)
evalues.MD(est = 0.5, se = 0.1)

E-values for Different Study Designs

# Convert an odds ratio to an approximate risk ratio first
# for a common outcome (prevalence > 10%): RR ~ OR / (1 - p0 + p0 * OR)
or_val  <- 2.5
p0      <- 0.3  # baseline prevalence in unexposed
rr_approx <- or_val / (1 - p0 + p0 * or_val)

# (the CI limits should be converted the same way before being passed in)
evalues.RR(est = rr_approx, lo = 1.5, hi = 3.2)

# For a linear regression coefficient (continuous outcome);
# sd is the residual standard deviation of the outcome and
# delta the contrast in the exposure of interest
evalues.OLS(est = 0.35, se = 0.08, sd = 1, delta = 1)

Visualization

# Bias plot: joint confounder strengths sufficient to explain away RR = 3.9
bias_plot(RR = 3.9, xmax = 20)

Interpretation

  • An E-value of 1.0 means no confounding is needed to explain the result (the observed effect is null).
  • E-value of 2.0: a confounder associated with both treatment and outcome by RR = 2 could explain the result.
  • E-value of 5.0: very robust; the confounder would need very strong associations.
  • E-values should be compared to the known strength of measured confounders.

HonestDiD

The HonestDiD package (Rambachan and Roth, 2023) provides formal sensitivity analysis for difference-in-differences designs, allowing researchers to relax the parallel trends assumption in a disciplined way.

Theory

Standard DID assumes that treatment and control groups would have followed identical trends absent treatment. HonestDiD considers two classes of restrictions on how the post-treatment trend could deviate from the pre-treatment trend:

  1. Smoothness restrictions (Δ^{SD}(M)): The change in the slope of the counterfactual trend between consecutive periods is bounded by M. When M = 0, this reduces to the standard parallel trends assumption.

  2. Relative magnitudes (Δ^{RM}(M̄)): The post-treatment violations of parallel trends are no larger than M̄ times the maximum pre-treatment violation.
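For intuition about the smoothness class, the hypothetical helper below (our own, not part of HonestDiD) checks whether a candidate path of trend violations lies in Δ^{SD}(M), i.e., whether its second differences are bounded by M:

```r
# Membership check for Delta^SD(M): the slope of the violation path may
# change by at most M between consecutive periods (bounded second differences)
in_delta_sd <- function(delta, M) {
  all(abs(diff(delta, differences = 2)) <= M)
}

in_delta_sd(c(0, 0.1, 0.2, 0.3), M = 0.01)  # exactly linear: TRUE for any M >= 0
in_delta_sd(c(0, 0.1, 0.4, 1.0), M = 0.01)  # accelerating: FALSE
```

A linear extrapolation of the pre-trend always satisfies Δ^{SD}(M); what M penalizes is curvature in the counterfactual trend.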

Implementation

library(HonestDiD)
library(fixest)

# Simulate panel data for an event-study design
set.seed(300)
n_units_hdid   <- 100
n_periods_hdid <- 9
treat_period_hdid <- 5  # treatment starts at period 5

panel_data <- expand.grid(unit = 1:n_units_hdid, period = 1:n_periods_hdid)
panel_data$treated <- as.integer(panel_data$unit <= 50)
panel_data$time_to_treat <- panel_data$period - treat_period_hdid
panel_data$unit_fe   <- rep(rnorm(n_units_hdid, 0, 2), each = n_periods_hdid)
panel_data$period_fe <- rep(rnorm(n_periods_hdid), times = n_units_hdid)

# True treatment effect only in post-treatment periods, increasing over time
panel_data$true_effect <- ifelse(
  panel_data$treated == 1 & panel_data$time_to_treat >= 0,
  0.5 * (1 + panel_data$time_to_treat), 0)
panel_data$outcome <- panel_data$unit_fe + panel_data$period_fe +
  panel_data$true_effect + rnorm(nrow(panel_data))

# Estimate event-study specification
es_model <- feols(
  outcome ~ i(time_to_treat, treated, ref = -1) |
    unit + period,
  data    = panel_data,
  cluster = ~unit
)

# Extract coefficients and variance-covariance matrix
betahat <- coef(es_model)
sigma   <- vcov(es_model)

# Identify pre-treatment and post-treatment coefficient indices
# (depends on your specification)
pre_indices  <- 1:3   # t = -4, -3, -2 (t = -1 is the reference period)
post_indices <- 4:8   # t = 0, 1, 2, 3, 4

Smoothness Restrictions

# Sensitivity analysis with smoothness restrictions
delta_sd_results <- createSensitivityResults(
  betahat        = betahat,
  sigma          = sigma,
  numPrePeriods  = length(pre_indices),
  numPostPeriods = length(post_indices),
  Mvec           = seq(0, 0.05, by = 0.01),
  method         = "C-LF"     # conditional least-favorable
)

# Robust confidence sets; M = 0 corresponds to standard parallel trends
print(delta_sd_results)

Relative Magnitudes

# Sensitivity analysis with relative magnitudes restrictions
delta_rm_results <- createSensitivityResults_relativeMagnitudes(
  betahat        = betahat,
  sigma          = sigma,
  numPrePeriods  = length(pre_indices),
  numPostPeriods = length(post_indices),
  Mbarvec        = seq(0, 2, by = 0.5),
  method         = "C-LF"
)

print(delta_rm_results)

Visualization

# The plotting helpers compare the robust confidence sets against the
# original (parallel-trends) confidence set from constructOriginalCS()
original_cs <- constructOriginalCS(
  betahat        = betahat,
  sigma          = sigma,
  numPrePeriods  = length(pre_indices),
  numPostPeriods = length(post_indices)
)

createSensitivityPlot(delta_sd_results, original_cs)

# For relative magnitudes
createSensitivityPlot_relativeMagnitudes(delta_rm_results, original_cs)

Interpretation

The sensitivity plots show how the confidence set for the treatment effect widens as you allow larger violations of parallel trends. If the confidence interval excludes zero even at moderately large values of M or M̄, the result is robust to plausible departures from parallel trends. If the interval includes zero at small violations, the result is fragile.


Placebo Tests

Placebo tests check whether an estimated effect appears in settings where it should not. A significant placebo effect raises concerns about the validity of the main analysis.

Types of Placebo Tests

  • Placebo outcomes: Estimate the treatment effect on outcomes that should not be affected by the treatment.
  • Placebo treatments: Use groups that should not be affected as if they were treated.
  • Placebo cutoffs (RD): Test for discontinuities at points other than the real cutoff.
  • In-time placebos (DID/SC): Use pre-treatment periods as fake treatment dates.
  • In-space placebos (SC): Apply the synthetic control method to donor units.

Placebo Outcomes

If the treatment is a job training program and the outcome is wages, a good placebo outcome might be height or eye color (which should not be affected).

set.seed(42)
n <- 500

# Simulate data
placebo_data <- data.frame(
  treatment = rbinom(n, 1, 0.5),
  x1 = rnorm(n),
  x2 = rnorm(n)
)
placebo_data$real_outcome    <- 0.5 * placebo_data$treatment +
                                 0.3 * placebo_data$x1 + rnorm(n)
placebo_data$placebo_outcome <- 0.2 * placebo_data$x1 +
                                 0.4 * placebo_data$x2 + rnorm(n)

# Real outcome: should detect effect
real_model <- lm(real_outcome ~ treatment + x1 + x2,
                 data = placebo_data)

# Placebo outcome: should NOT detect effect
placebo_model <- lm(placebo_outcome ~ treatment + x1 + x2,
                    data = placebo_data)

# Compare
results <- data.frame(
  Outcome   = c("Real Outcome", "Placebo Outcome"),
  Estimate  = c(coef(real_model)["treatment"],
                coef(placebo_model)["treatment"]),
  Std.Error = c(summary(real_model)$coefficients["treatment", "Std. Error"],
                summary(placebo_model)$coefficients["treatment", "Std. Error"]),
  P.Value   = c(summary(real_model)$coefficients["treatment", "Pr(>|t|)"],
                summary(placebo_model)$coefficients["treatment", "Pr(>|t|)"])
)
results
#>           Outcome   Estimate  Std.Error      P.Value
#> 1    Real Outcome  0.6050214 0.08834775 2.220239e-11
#> 2 Placebo Outcome -0.1349716 0.08984293 1.336544e-01

The real outcome shows a significant treatment effect, while the placebo outcome (correctly) does not.

Placebo Treatments (In-Time Placebos for DID)

set.seed(123)
n_units   <- 50
n_periods <- 10
treat_period <- 6

panel <- expand.grid(unit = 1:n_units, period = 1:n_periods)
panel$treated  <- as.integer(panel$unit <= 25)
panel$post     <- as.integer(panel$period >= treat_period)

# Unit and period fixed effects
panel$unit_fe   <- rep(rnorm(n_units, 0, 2), each = n_periods)
panel$period_fe <- rep(rnorm(n_periods), times = n_units)

# Generate outcome with true effect only in post-treatment period
panel$outcome <- panel$unit_fe + panel$period_fe +
                 1.5 * panel$treated * panel$post + rnorm(nrow(panel))

# Real DID: using correct treatment period
real_did <- lm(outcome ~ treated * post, data = panel)

# Placebo DID: pretend treatment started at period 3 (use only pre-treatment data)
panel_pre <- panel[panel$period < treat_period, ]
panel_pre$fake_post <- as.integer(panel_pre$period >= 3)
placebo_did <- lm(outcome ~ treated * fake_post, data = panel_pre)

cat("--- Real DID ---\n")
cat("Estimate:", round(coef(real_did)["treated:post"], 3),
    " p-value:", round(summary(real_did)$coefficients["treated:post", 4], 4), "\n\n")

cat("--- Placebo DID (fake treatment at period 3) ---\n")
cat("Estimate:", round(coef(placebo_did)["treated:fake_post"], 3),
    " p-value:", round(summary(placebo_did)$coefficients["treated:fake_post", 4], 4), "\n")

The real DID should recover an estimate close to the true effect of 1.5, while the placebo interaction should be small and statistically insignificant.

Placebo Cutoffs for RD

set.seed(99)
n <- 1000
rd_data <- data.frame(
  running = runif(n, -1, 1)
)
rd_data$treatment <- as.integer(rd_data$running >= 0)
rd_data$outcome   <- 0.5 * rd_data$running +
                     0.8 * rd_data$treatment + rnorm(n, 0, 0.3)

# Real RD at cutoff = 0
real_rd <- lm(outcome ~ running + treatment,
              data = rd_data[abs(rd_data$running) < 0.3, ])

# Placebo cutoffs
placebo_cutoffs <- c(-0.5, -0.3, 0.3, 0.5)

placebo_rd_results <- lapply(placebo_cutoffs, function(c0) {
  d <- rd_data[abs(rd_data$running - c0) < 0.3, ]
  d$fake_treat <- as.integer(d$running >= c0)
  d$running_c  <- d$running - c0
  m <- lm(outcome ~ running_c + fake_treat, data = d)
  data.frame(
    cutoff   = c0,
    estimate = coef(m)["fake_treat"],
    p_value  = summary(m)$coefficients["fake_treat", 4]
  )
})

rd_table <- do.call(rbind, placebo_rd_results)
rd_table <- rbind(
  data.frame(cutoff = 0,
             estimate = coef(real_rd)["treatment"],
             p_value = summary(real_rd)$coefficients["treatment", 4]),
  rd_table
)
rd_table$significant <- ifelse(rd_table$p_value < 0.05, "***", "")
rd_table
#>             cutoff    estimate      p_value significant
#> treatment      0.0  0.83629722 2.517956e-23         ***
#> fake_treat    -0.5  0.05116045 4.578775e-01            
#> fake_treat1   -0.3  0.10880620 1.264862e-01            
#> fake_treat2    0.3  0.04017363 5.772319e-01            
#> fake_treat3    0.5 -0.03659998 5.977710e-01

The real cutoff at 0 shows a significant discontinuity, while the placebo cutoffs do not, supporting the validity of the RD design.

In-Space Placebos for Synthetic Control

library(Synth)
library(ggplot2)  # for the placebo gap plot below

# Simulate panel data for synthetic control
set.seed(400)
n_sc_units <- 8
sc_years   <- 2000:2015
treated_unit_id <- 1

sc_panel <- expand.grid(unit_id = 1:n_sc_units, year = sc_years)
sc_panel$x1 <- rnorm(nrow(sc_panel))
sc_panel$x2 <- rnorm(nrow(sc_panel))
# Unit fixed effects
sc_panel$unit_fe <- rep(rnorm(n_sc_units, 0, 2), each = length(sc_years))
# Time trend
sc_panel$time_fe <- rep(seq(0, 1.5, length.out = length(sc_years)), times = n_sc_units)
# Treatment effect for unit 1 after 2009
sc_panel$treat_effect <- ifelse(
  sc_panel$unit_id == treated_unit_id & sc_panel$year > 2009,
  2.0, 0)
sc_panel$outcome <- sc_panel$unit_fe + sc_panel$time_fe +
  0.3 * sc_panel$x1 + 0.2 * sc_panel$x2 +
  sc_panel$treat_effect + rnorm(nrow(sc_panel), 0, 0.5)

# For each donor unit, pretend it is the treated unit
# and run the synthetic control method
donor_ids <- setdiff(unique(sc_panel$unit_id), treated_unit_id)

placebo_effects <- lapply(donor_ids, function(pid) {
  tryCatch({
    dp <- dataprep(
      foo        = sc_panel,
      predictors = c("x1", "x2"),
      dependent  = "outcome",
      unit.variable    = "unit_id",
      time.variable    = "year",
      treatment.identifier  = pid,
      controls.identifier   = setdiff(unique(sc_panel$unit_id), pid),
      time.predictors.prior = 2000:2009,
      time.optimize.ssr     = 2000:2009,
      time.plot             = 2000:2015
    )
    synth_out <- synth(dp, verbose = FALSE)
    gaps <- dp$Y1plot - (dp$Y0plot %*% synth_out$solution.w)
    data.frame(unit = pid, year = sc_years, gap = as.numeric(gaps))
  }, error = function(e) NULL)
})
placebo_effects <- Filter(Negate(is.null), placebo_effects)
placebo_df <- do.call(rbind, placebo_effects)

# Compute treated unit's gap
dp_treated <- dataprep(
  foo        = sc_panel,
  predictors = c("x1", "x2"),
  dependent  = "outcome",
  unit.variable    = "unit_id",
  time.variable    = "year",
  treatment.identifier  = treated_unit_id,
  controls.identifier   = donor_ids,
  time.predictors.prior = 2000:2009,
  time.optimize.ssr     = 2000:2009,
  time.plot             = 2000:2015
)
synth_treated <- synth(dp_treated, verbose = FALSE)
treated_gaps_df <- data.frame(
  year = sc_years,
  gap  = as.numeric(dp_treated$Y1plot -
    (dp_treated$Y0plot %*% synth_treated$solution.w))
)

# Plot: treated unit's gap should stand out from placebo gaps
ggplot(placebo_df, aes(x = year, y = gap, group = unit)) +
  geom_line(alpha = 0.3, color = "grey60") +
  geom_line(data = treated_gaps_df, aes(x = year, y = gap),
            color = "red", linewidth = 1.2) +
  geom_vline(xintercept = 2010, linetype = "dashed") +
  labs(title = "In-Space Placebo Test",
       x = "Year", y = "Gap (Treated - Synthetic)") +
  theme_minimal()

Specification Curve Analysis

Specification curve analysis (Simonsohn, Simmons, and Nelson, 2020) addresses the problem of researcher degrees of freedom. Instead of reporting a single specification, the researcher estimates the effect across all (or many) reasonable specifications and reports the distribution of results.

Building the Specification Space

Key dimensions of variation include:

  • Which control variables to include
  • Functional form (linear, log, polynomial)
  • Sample restrictions (e.g., dropping outliers, age ranges)
  • Fixed effects structure
  • Clustering level for standard errors
  • Outcome measure

Implementation

set.seed(2024)
n <- 800

spec_data <- data.frame(
  treatment  = rbinom(n, 1, 0.5),
  x1         = rnorm(n),
  x2         = rnorm(n),
  x3         = rnorm(n),
  x4         = rnorm(n),
  group      = sample(1:10, n, replace = TRUE)
)
spec_data$outcome <- 0.4 * spec_data$treatment +
                     0.3 * spec_data$x1 +
                     0.15 * spec_data$x2 +
                     rnorm(n, 0, 0.8)

# Define specification space
control_sets <- list(
  "None"        = "1",
  "x1"          = "x1",
  "x1 + x2"     = "x1 + x2",
  "x1 + x2 + x3" = "x1 + x2 + x3",
  "All"         = "x1 + x2 + x3 + x4"
)

sample_defs <- list(
  "Full"         = rep(TRUE, n),
  "No outliers"  = abs(spec_data$outcome) < 3
)

# Run all specifications
spec_results <- list()
counter <- 0

for (ctrl_name in names(control_sets)) {
  for (samp_name in names(sample_defs)) {
    counter <- counter + 1
    formula_str <- paste0("outcome ~ treatment + ", control_sets[[ctrl_name]])
    d <- spec_data[sample_defs[[samp_name]], ]
    m <- lm(as.formula(formula_str), data = d)

    spec_results[[counter]] <- data.frame(
      spec_id   = counter,
      controls  = ctrl_name,
      sample    = samp_name,
      estimate  = coef(m)["treatment"],
      se        = summary(m)$coefficients["treatment", "Std. Error"],
      p_value   = summary(m)$coefficients["treatment", "Pr(>|t|)"],
      n_obs     = nrow(d)
    )
  }
}

spec_df <- do.call(rbind, spec_results)
spec_df <- spec_df[order(spec_df$estimate), ]
spec_df$rank <- 1:nrow(spec_df)

# Summary statistics
cat("Median estimate:", round(median(spec_df$estimate), 3), "\n")
#> Median estimate: 0.409
cat("Range: [", round(min(spec_df$estimate), 3), ",",
    round(max(spec_df$estimate), 3), "]\n")
#> Range: [ 0.397 , 0.424 ]
cat("All significant at 0.05?", all(spec_df$p_value < 0.05), "\n")
#> All significant at 0.05? TRUE

Visualization

library(dplyr)    # select()
library(tidyr)    # pivot_longer()
library(ggplot2)

# Upper panel: coefficient estimates sorted
p1 <- ggplot(spec_df, aes(x = rank, y = estimate)) +
  geom_point(aes(color = p_value < 0.05), size = 2) +
  geom_errorbar(aes(ymin = estimate - 1.96 * se,
                     ymax = estimate + 1.96 * se,
                     color = p_value < 0.05),
                width = 0, alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  scale_color_manual(values = c("TRUE" = "steelblue", "FALSE" = "grey60"),
                     name = "p < 0.05") +
  labs(title = "Specification Curve",
       x = "Specification (sorted by estimate)",
       y = "Treatment Effect Estimate") +
  theme_minimal() +
  theme(legend.position = "bottom")

print(p1)

# Lower panel: specification indicators
indicator_df <- spec_df |>
  select(rank, controls, sample) |>
  pivot_longer(cols = c(controls, sample),
               names_to = "dimension", values_to = "choice")

p2 <- ggplot(indicator_df, aes(x = rank, y = choice)) +
  geom_point(size = 1.5, shape = 15) +
  facet_grid(dimension ~ ., scales = "free_y", space = "free_y") +
  labs(x = "Specification (sorted by estimate)", y = "") +
  theme_minimal() +
  theme(strip.text.y = element_text(angle = 0))

print(p2)

A robust finding will show consistently positive (or negative) estimates across most specifications. If the sign or significance flips depending on reasonable modeling choices, the result is fragile.


Bounds Approaches

When point identification fails (e.g., due to selection or attrition), partial identification provides informative bounds on the parameter of interest.

Lee Bounds

Lee (2009) bounds address sample selection bias. If treatment affects whether an outcome is observed (e.g., differential attrition), naive estimates are biased. Lee bounds trim the sample to restore comparability and provide sharp bounds under a monotonicity assumption.

The causalverse package provides a lee_bounds() function:

# Simulate experiment data with attrition
set.seed(500)
n_lb <- 1000
experiment_data <- data.frame(
  treatment = rbinom(n_lb, 1, 0.5)
)
# Treatment increases both the outcome and the chance of being observed
# (the indicator drawn here initially flags attrition: treated units are
#  attrited with probability 0.85 versus 0.90 for controls)
experiment_data$attrition_indicator <- rbinom(
  n_lb, 1, 0.9 - 0.05 * experiment_data$treatment)
experiment_data$outcome <- 0.5 * experiment_data$treatment + rnorm(n_lb)
# Set outcome to NA for attrited units (lee_bounds expects observed outcomes)
experiment_data$outcome[experiment_data$attrition_indicator == 1] <- NA
# Re-code: m = 1 means selected (observed), 0 means attrited
experiment_data$attrition_indicator <- as.integer(
  !is.na(experiment_data$outcome))
# Fill in outcome for all units (lee_bounds handles selection via m)
experiment_data$outcome <- ifelse(
  is.na(experiment_data$outcome),
  rnorm(sum(is.na(experiment_data$outcome))),
  experiment_data$outcome)

# Using causalverse::lee_bounds
bounds <- causalverse::lee_bounds(
  df       = experiment_data,
  d        = "treatment",
  m        = "attrition_indicator",
  y        = "outcome",
  numdraws = 1000
)
print(bounds)
#>          term   estimate std.error
#> 1 Lower bound 0.07079846 0.2127913
#> 2 Upper bound 1.68081747 0.2545297

Manual Lee Bounds

set.seed(77)
n <- 2000

lee_data <- data.frame(
  treatment = rbinom(n, 1, 0.5)
)
# Treatment increases both outcome and probability of being observed
lee_data$observed <- rbinom(n, 1, 0.85 + 0.05 * lee_data$treatment)
lee_data$outcome  <- 0.5 * lee_data$treatment + rnorm(n)

# Observed sample
obs_data <- lee_data[lee_data$observed == 1, ]

# Naive estimate (biased due to differential selection)
naive <- mean(obs_data$outcome[obs_data$treatment == 1]) -
         mean(obs_data$outcome[obs_data$treatment == 0])

# Lee bounds: trim the group with higher selection rate
p1 <- mean(lee_data$observed[lee_data$treatment == 1])
p0 <- mean(lee_data$observed[lee_data$treatment == 0])

if (p1 > p0) {
  # Trim treated group
  trim_frac <- 1 - p0 / p1
  y_treated <- sort(obs_data$outcome[obs_data$treatment == 1])
  n_treat   <- length(y_treated)
  n_trim    <- floor(n_treat * trim_frac)

  # Lower bound: trim from top
  lower_trimmed_mean <- mean(y_treated[1:(n_treat - n_trim)])
  # Upper bound: trim from bottom
  upper_trimmed_mean <- mean(y_treated[(n_trim + 1):n_treat])

  y_control_mean <- mean(obs_data$outcome[obs_data$treatment == 0])

  lower_bound <- lower_trimmed_mean - y_control_mean
  upper_bound <- upper_trimmed_mean - y_control_mean
} else {
  # Analogous trimming of control group
  trim_frac <- 1 - p1 / p0
  y_control <- sort(obs_data$outcome[obs_data$treatment == 0])
  n_ctrl    <- length(y_control)
  n_trim    <- floor(n_ctrl * trim_frac)

  lower_trimmed_mean <- mean(y_control[(n_trim + 1):n_ctrl])
  upper_trimmed_mean <- mean(y_control[1:(n_ctrl - n_trim)])

  y_treated_mean <- mean(obs_data$outcome[obs_data$treatment == 1])

  lower_bound <- y_treated_mean - upper_trimmed_mean
  upper_bound <- y_treated_mean - lower_trimmed_mean
}

cat("Observation rates: Treated =", round(p1, 3),
    "  Control =", round(p0, 3), "\n")
#> Observation rates: Treated = 0.893   Control = 0.861
cat("Naive estimate:   ", round(naive, 3), "\n")
#> Naive estimate:    0.45
cat("Lee bounds:       [", round(lower_bound, 3), ",",
    round(upper_bound, 3), "]\n")
#> Lee bounds:       [ 0.373 , 0.522 ]
cat("True effect:       0.5\n")
#> True effect:       0.5

Manski Bounds

Manski (1990) worst-case bounds make no assumptions about the selection process. If the outcome is bounded in $[y_L, y_U]$, then for the missing outcomes we substitute the extreme values:

# Using the same lee_data from above
# Suppose outcome is bounded in [-4, 4]
y_lower <- -4
y_upper <-  4

n1 <- sum(lee_data$treatment == 1)
n0 <- sum(lee_data$treatment == 0)

# For treated: observed outcomes + worst case for missing
y1_obs <- lee_data$outcome[lee_data$treatment == 1 & lee_data$observed == 1]
n1_miss <- sum(lee_data$treatment == 1 & lee_data$observed == 0)

# For control
y0_obs_real <- lee_data$outcome[lee_data$treatment == 0 & lee_data$observed == 1]
n0_miss     <- sum(lee_data$treatment == 0 & lee_data$observed == 0)

# Lower bound: treated missing get y_lower, control missing get y_upper
manski_lower <- (sum(y1_obs) + n1_miss * y_lower) / n1 -
                (sum(y0_obs_real) + n0_miss * y_upper) / n0

# Upper bound: treated missing get y_upper, control missing get y_lower
manski_upper <- (sum(y1_obs) + n1_miss * y_upper) / n1 -
                (sum(y0_obs_real) + n0_miss * y_lower) / n0

cat("Manski bounds: [", round(manski_lower, 3), ",",
    round(manski_upper, 3), "]\n")
#> Manski bounds: [ -0.581 , 1.386 ]
cat("Lee bounds:    [", round(lower_bound, 3), ",",
    round(upper_bound, 3), "]\n")
#> Lee bounds:    [ 0.373 , 0.522 ]

Manski bounds are always wider than Lee bounds because they make weaker assumptions. The trade-off between assumptions and informativeness is fundamental to partial identification.
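
Because the estimand is only set-identified, confidence intervals should cover the identified set rather than a point. Below is a minimal sketch of the Imbens and Manski (2004) interval, applied to the Lee bound estimates from above. The standard errors are assumed stand-ins for illustration; in practice, bootstrap the trimming procedure to obtain them.

```r
lb <- 0.373          # Lee lower bound (from above)
ub <- 0.522          # Lee upper bound (from above)
se_lb <- 0.07        # assumed SE of the lower bound
se_ub <- 0.07        # assumed SE of the upper bound

sigma_max <- max(se_lb, se_ub)
delta_hat <- ub - lb

# Critical value C_n solves pnorm(C + delta / sigma) - pnorm(-C) = 0.95,
# so C_n lies between 1.645 (wide identified set) and 1.96 (point identification)
cn <- uniroot(function(c) {
  pnorm(c + delta_hat / sigma_max) - pnorm(-c) - 0.95
}, interval = c(1, 3))$root

ci <- c(lb - cn * se_lb, ub + cn * se_ub)
cat("Imbens-Manski 95% CI: [", round(ci[1], 3), ",", round(ci[2], 3), "]\n")
#> Imbens-Manski 95% CI: [ 0.258 , 0.637 ]
```

Here the identified set is wide relative to the sampling noise, so $C_n \approx 1.645$: each endpoint needs only one-sided coverage.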


Multiple Testing Corrections

When testing multiple hypotheses simultaneously (e.g., treatment effects on several outcomes, subgroup analyses), the probability of at least one false positive increases rapidly. Multiple testing corrections control the overall error rate.

Family-Wise Error Rate (FWER)

The FWER is the probability of making at least one Type I error across all tests.
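
A quick calculation shows how fast the FWER grows with the number of independent tests at $\alpha = 0.05$:

```r
# With m independent tests at level alpha, the chance of at least one
# false positive is FWER = 1 - (1 - alpha)^m
alpha <- 0.05
m <- c(1, 5, 10, 20, 50)
fwer <- 1 - (1 - alpha)^m
round(setNames(fwer, paste0("m=", m)), 3)
#>   m=1   m=5  m=10  m=20  m=50 
#> 0.050 0.226 0.401 0.642 0.923
```

With 20 tests, the chance of at least one spurious "discovery" is already about 64%.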

# Simulate 20 hypothesis tests
set.seed(314)
n_tests <- 20

# 5 tests have true effects, 15 are null
true_effects <- c(rep(0.5, 5), rep(0, 15))

p_values <- sapply(true_effects, function(effect) {
  x <- rnorm(100, mean = effect)
  t.test(x)$p.value
})

# Bonferroni correction (most conservative)
p_bonferroni <- p.adjust(p_values, method = "bonferroni")

# Holm step-down procedure (uniformly more powerful than Bonferroni)
p_holm <- p.adjust(p_values, method = "holm")

# Compare
comparison <- data.frame(
  Test      = 1:n_tests,
  True_Effect = true_effects,
  Raw_p     = round(p_values, 4),
  Bonferroni = round(p_bonferroni, 4),
  Holm      = round(p_holm, 4),
  Raw_sig   = ifelse(p_values < 0.05, "*", ""),
  Bonf_sig  = ifelse(p_bonferroni < 0.05, "*", ""),
  Holm_sig  = ifelse(p_holm < 0.05, "*", "")
)

# Show results for first 10 tests
head(comparison, 10)
#>    Test True_Effect  Raw_p Bonferroni   Holm Raw_sig Bonf_sig Holm_sig
#> 1     1         0.5 0.0000     0.0001 0.0001       *        *        *
#> 2     2         0.5 0.0000     0.0000 0.0000       *        *        *
#> 3     3         0.5 0.0013     0.0260 0.0208       *        *        *
#> 4     4         0.5 0.0000     0.0005 0.0004       *        *        *
#> 5     5         0.5 0.0000     0.0001 0.0001       *        *        *
#> 6     6         0.0 0.2380     1.0000 1.0000                          
#> 7     7         0.0 0.1953     1.0000 1.0000                          
#> 8     8         0.0 0.6303     1.0000 1.0000                          
#> 9     9         0.0 0.2964     1.0000 1.0000                          
#> 10   10         0.0 0.9237     1.0000 1.0000

cat("\nRejections at alpha = 0.05:\n")
#> 
#> Rejections at alpha = 0.05:
cat("  Raw:        ", sum(p_values < 0.05), "\n")
#>   Raw:         6
cat("  Bonferroni: ", sum(p_bonferroni < 0.05), "\n")
#>   Bonferroni:  5
cat("  Holm:       ", sum(p_holm < 0.05), "\n")
#>   Holm:        5

False Discovery Rate (FDR)

The FDR controls the expected proportion of false positives among all rejections, which is a less stringent criterion than FWER.
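
To make the procedure concrete, here is a minimal sketch of the Benjamini-Hochberg step-up rule, which matches p.adjust(method = "BH"): sort the p-values, compare the k-th smallest to $k\alpha/m$, and reject every hypothesis up to the largest k satisfying the inequality. The p-values below are illustrative.

```r
# Minimal Benjamini-Hochberg step-up rule
bh_reject <- function(p, alpha = 0.05) {
  m   <- length(p)
  ord <- order(p)
  # largest k with p_(k) <= k * alpha / m (0 if none)
  k <- max(c(0, which(p[ord] <= seq_len(m) * alpha / m)))
  rejected <- logical(m)
  if (k > 0) rejected[ord[seq_len(k)]] <- TRUE
  rejected
}

p_demo <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216)
sum(bh_reject(p_demo))                # BH rejections at alpha = 0.05
#> [1] 2
sum(p.adjust(p_demo, "BH") <= 0.05)   # agrees with p.adjust
#> [1] 2
```

Note that the third-smallest p-value (0.039) is not rejected even though it is below 0.05: it exceeds its step-up threshold of 3 x 0.05 / 10 = 0.015.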

# Benjamini-Hochberg procedure
p_bh <- p.adjust(p_values, method = "BH")

# Benjamini-Yekutieli (valid under arbitrary dependence)
p_by <- p.adjust(p_values, method = "BY")

fdr_comparison <- data.frame(
  Test        = 1:n_tests,
  True_Effect = true_effects,
  Raw_p       = round(p_values, 4),
  BH_adj      = round(p_bh, 4),
  BY_adj      = round(p_by, 4)
)

cat("Rejections at alpha = 0.05:\n")
#> Rejections at alpha = 0.05:
cat("  Benjamini-Hochberg: ", sum(p_bh < 0.05), "\n")
#>   Benjamini-Hochberg:  6
cat("  Benjamini-Yekutieli:", sum(p_by < 0.05), "\n")
#>   Benjamini-Yekutieli: 5

Romano-Wolf Step-Down

The Romano-Wolf procedure controls the FWER while accounting for the dependence structure among test statistics through resampling:

library(wildrwolf)
library(fixest)

# Simulate multi-outcome data with clusters
set.seed(600)
n_rw <- 500
n_groups_rw <- 25
multi_outcome_data <- data.frame(
  group     = rep(1:n_groups_rw, each = n_rw / n_groups_rw),
  treatment = rep(rbinom(n_groups_rw, 1, 0.5), each = n_rw / n_groups_rw)
)
multi_outcome_data$outcome1 <- 0.5 * multi_outcome_data$treatment + rnorm(n_rw)
multi_outcome_data$outcome2 <- 0.3 * multi_outcome_data$treatment + rnorm(n_rw)
multi_outcome_data$outcome3 <- 0.0 * multi_outcome_data$treatment + rnorm(n_rw)
multi_outcome_data$outcome4 <- 0.4 * multi_outcome_data$treatment + rnorm(n_rw)

# Estimate multiple models
models <- feols(
  c(outcome1, outcome2, outcome3, outcome4) ~ treatment | group,
  data    = multi_outcome_data,
  cluster = ~group
)

# Romano-Wolf adjusted p-values
rw_pvals <- rwolf(
  models   = models,
  param    = "treatment",
  B        = 999,
  seed     = 42
)

print(rw_pvals)

Choosing a Correction

Method                 Controls   Assumption             Power
Bonferroni             FWER       None                   Lowest
Holm                   FWER       None                   Higher than Bonferroni
Romano-Wolf            FWER       Resampling valid       Higher (accounts for dependence)
Benjamini-Hochberg     FDR        Independence or PRDS   High
Benjamini-Yekutieli    FDR        None                   Moderate

Rule of thumb: Use Holm for FWER control. Use Benjamini-Hochberg when FDR control suffices and tests are approximately independent. Use Romano-Wolf when test statistics are correlated and FWER control is needed.


Randomization Inference

Randomization inference (RI) derives exact p-values by comparing the observed test statistic to its distribution under all possible random assignments, without relying on asymptotic approximations.

Theory

Under the sharp null hypothesis $H_0: Y_i(1) = Y_i(0)$ for all $i$, the potential outcomes are fixed and only the treatment assignment is random. The p-value is:

$$p = \frac{\#\{|\hat{\tau}_\pi| \geq |\hat{\tau}_{\text{obs}}|\}}{\text{total permutations}}$$

where $\hat{\tau}_\pi$ is the test statistic computed under permutation $\pi$.

Manual Implementation

set.seed(555)
n <- 60

ri_data <- data.frame(
  treatment = c(rep(1, 30), rep(0, 30)),
  outcome   = c(rnorm(30, mean = 0.6), rnorm(30, mean = 0))
)

# Observed test statistic
obs_stat <- mean(ri_data$outcome[ri_data$treatment == 1]) -
            mean(ri_data$outcome[ri_data$treatment == 0])

# Permutation distribution (10000 draws)
n_perms <- 10000
perm_stats <- numeric(n_perms)

for (i in 1:n_perms) {
  perm_treat <- sample(ri_data$treatment)
  perm_stats[i] <- mean(ri_data$outcome[perm_treat == 1]) -
                   mean(ri_data$outcome[perm_treat == 0])
}

# Two-sided p-value
ri_pvalue <- mean(abs(perm_stats) >= abs(obs_stat))

cat("Observed difference in means:", round(obs_stat, 3), "\n")
#> Observed difference in means: 0.607
cat("Randomization inference p-value:", ri_pvalue, "\n")
#> Randomization inference p-value: 0.0202
cat("Conventional t-test p-value:",
    round(t.test(outcome ~ treatment, data = ri_data)$p.value, 4), "\n")
#> Conventional t-test p-value: 0.0188
# Visualization
hist(perm_stats, breaks = 50, col = "grey80", border = "white",
     main = "Randomization Distribution",
     xlab = "Difference in Means Under Sharp Null")
abline(v = obs_stat, col = "red", lwd = 2)
abline(v = -obs_stat, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c("Observed statistic", "Mirror"),
       col = "red", lty = c(1, 2), lwd = 2)

ri2 Package

The ri2 package provides a convenient interface:

library(ri2)

# Simulate data for ri2
set.seed(700)
n_ri2 <- 60
ri2_data <- data.frame(
  treatment = c(rep(1, 30), rep(0, 30)),
  outcome   = c(rnorm(30, mean = 0.6), rnorm(30, mean = 0))
)

# Declare the randomization scheme
declaration <- declare_ra(N = nrow(ri2_data), m = sum(ri2_data$treatment))

# Conduct randomization inference
ri_out <- conduct_ri(
  formula          = outcome ~ treatment,
  assignment       = "treatment",   # name of the assignment variable
  declaration      = declaration,
  sharp_hypothesis = 0,
  data             = ri2_data,
  sims             = 5000
)

summary(ri_out)
plot(ri_out)

RI for Clustered Designs

library(ri2)

# Simulate cluster-randomized data
set.seed(710)
n_clust_ri <- 30
clust_size_ri <- 10
cluster_ri_data <- data.frame(
  cluster_id = rep(1:n_clust_ri, each = clust_size_ri),
  treatment  = rep(c(rep(1, 15), rep(0, 15)), each = clust_size_ri)
)
cluster_ri_data$outcome <- 0.4 * cluster_ri_data$treatment +
  rep(rnorm(n_clust_ri, 0, 0.5), each = clust_size_ri) +
  rnorm(n_clust_ri * clust_size_ri)

# Cluster-randomized design
declaration <- declare_ra(
  clusters = cluster_ri_data$cluster_id,
  m_each   = c(15, 15)   # 15 control clusters, 15 treated clusters
)

ri_cluster <- conduct_ri(
  formula          = outcome ~ treatment,
  assignment       = "treatment",
  declaration      = declaration,
  sharp_hypothesis = 0,
  data             = cluster_ri_data,
  sims             = 5000
)

summary(ri_cluster)

When to Use Randomization Inference

  • Small samples where asymptotic approximations may be poor.
  • Clustered designs with few clusters (where cluster-robust SEs are unreliable).
  • When the treatment assignment mechanism is known exactly (experiments).
  • As a complement to conventional inference for robustness.

Bootstrap Methods

Nonparametric Bootstrap

The standard nonparametric bootstrap resamples observations with replacement to approximate the sampling distribution:

set.seed(101)
n <- 200

boot_data <- data.frame(
  treatment = rbinom(n, 1, 0.5),
  x1 = rnorm(n)
)
boot_data$outcome <- 0.5 * boot_data$treatment +
                     0.3 * boot_data$x1 + rnorm(n)

# Point estimate
model_orig <- lm(outcome ~ treatment + x1, data = boot_data)
theta_hat  <- coef(model_orig)["treatment"]

# Bootstrap
B <- 2000
boot_estimates <- numeric(B)

for (b in 1:B) {
  idx <- sample(1:n, n, replace = TRUE)
  boot_model <- lm(outcome ~ treatment + x1, data = boot_data[idx, ])
  boot_estimates[b] <- coef(boot_model)["treatment"]
}

# Bootstrap confidence intervals
# Percentile method
ci_percentile <- quantile(boot_estimates, c(0.025, 0.975))

# Normal approximation
boot_se <- sd(boot_estimates)
ci_normal <- theta_hat + c(-1.96, 1.96) * boot_se

# BCa (bias-corrected and accelerated) approximation
z0 <- qnorm(mean(boot_estimates < theta_hat))
# Jackknife for acceleration
jack_estimates <- numeric(n)
for (i in 1:n) {
  jack_model <- lm(outcome ~ treatment + x1, data = boot_data[-i, ])
  jack_estimates[i] <- coef(jack_model)["treatment"]
}
jack_mean <- mean(jack_estimates)
a_hat <- sum((jack_mean - jack_estimates)^3) /
         (6 * sum((jack_mean - jack_estimates)^2)^1.5)

alpha1 <- pnorm(z0 + (z0 + qnorm(0.025)) / (1 - a_hat * (z0 + qnorm(0.025))))
alpha2 <- pnorm(z0 + (z0 + qnorm(0.975)) / (1 - a_hat * (z0 + qnorm(0.975))))
ci_bca <- quantile(boot_estimates, c(alpha1, alpha2))

cat("Point estimate:     ", round(theta_hat, 3), "\n")
#> Point estimate:      0.522
cat("Analytical SE:      ", round(summary(model_orig)$coefficients["treatment", 2], 3), "\n")
#> Analytical SE:       0.144
cat("Bootstrap SE:       ", round(boot_se, 3), "\n")
#> Bootstrap SE:        0.147
cat("Percentile CI:      [", round(ci_percentile[1], 3), ",",
    round(ci_percentile[2], 3), "]\n")
#> Percentile CI:      [ 0.247 , 0.816 ]
cat("Normal CI:          [", round(ci_normal[1], 3), ",",
    round(ci_normal[2], 3), "]\n")
#> Normal CI:          [ 0.233 , 0.811 ]
cat("BCa CI:             [", round(ci_bca[1], 3), ",",
    round(ci_bca[2], 3), "]\n")
#> BCa CI:             [ 0.254 , 0.824 ]

Pairs Cluster Bootstrap

With clustered data and few clusters, cluster-robust standard errors can be severely biased. The pairs cluster bootstrap resamples entire clusters:

set.seed(202)
n_clusters <- 30
cluster_size <- 20

cluster_data <- data.frame(
  cluster = rep(1:n_clusters, each = cluster_size),
  treatment = rep(rbinom(n_clusters, 1, 0.5), each = cluster_size)
)
cluster_data$cluster_effect <- rep(rnorm(n_clusters, 0, 1),
                                    each = cluster_size)
cluster_data$outcome <- 0.4 * cluster_data$treatment +
                        cluster_data$cluster_effect +
                        rnorm(n_clusters * cluster_size, 0, 0.5)

# Original estimate
orig_model <- lm(outcome ~ treatment, data = cluster_data)
theta_orig <- coef(orig_model)["treatment"]

# Pairs cluster bootstrap
B <- 2000
boot_cluster_est <- numeric(B)

for (b in 1:B) {
  sampled_clusters <- sample(1:n_clusters, n_clusters, replace = TRUE)
  boot_df <- do.call(rbind, lapply(seq_along(sampled_clusters), function(j) {
    d <- cluster_data[cluster_data$cluster == sampled_clusters[j], ]
    d$cluster <- j  # re-label to avoid duplicate cluster IDs
    d
  }))
  boot_model <- lm(outcome ~ treatment, data = boot_df)
  boot_cluster_est[b] <- coef(boot_model)["treatment"]
}

boot_cluster_se <- sd(boot_cluster_est)
ci_cluster <- quantile(boot_cluster_est, c(0.025, 0.975))

cat("Point estimate:          ", round(theta_orig, 3), "\n")
#> Point estimate:           -0.24
cat("Pairs cluster bootstrap SE:", round(boot_cluster_se, 3), "\n")
#> Pairs cluster bootstrap SE: 0.396
cat("95% CI:                  [", round(ci_cluster[1], 3), ",",
    round(ci_cluster[2], 3), "]\n")
#> 95% CI:                  [ -1.042 , 0.521 ]

Wild Cluster Bootstrap

The wild cluster bootstrap (Cameron, Gelbach, and Miller, 2008) is particularly useful when the number of clusters is small (fewer than 50). The fwildclusterboot package provides a fast implementation:

library(fwildclusterboot)
library(fixest)

# Simulate panel data for wild cluster bootstrap
set.seed(800)
n_wc_clusters <- 30
n_wc_per      <- 20
wcb_data <- data.frame(
  cluster_id = rep(1:n_wc_clusters, each = n_wc_per),
  treatment  = rep(rbinom(n_wc_clusters, 1, 0.5), each = n_wc_per),
  x1 = rnorm(n_wc_clusters * n_wc_per),
  x2 = rnorm(n_wc_clusters * n_wc_per)
)
wcb_data$cluster_effect <- rep(rnorm(n_wc_clusters, 0, 1), each = n_wc_per)
wcb_data$outcome <- 0.5 * wcb_data$treatment + 0.3 * wcb_data$x1 +
  0.2 * wcb_data$x2 + wcb_data$cluster_effect +
  rnorm(n_wc_clusters * n_wc_per, 0, 0.5)
wcb_data$state <- wcb_data$cluster_id  # alias for the few-clusters chunk

# Estimate model with fixest
model <- feols(outcome ~ treatment + x1 + x2 | cluster_id,
               data = wcb_data,
               cluster = ~cluster_id)

# Wild cluster bootstrap
boot_result <- boottest(
  model,
  param     = "treatment",
  clustid   = ~cluster_id,
  B         = 9999,
  type      = "rademacher",  # or "mammen", "webb", "norm"
  impose_null = TRUE         # impose null for better finite-sample properties
)

summary(boot_result)
plot(boot_result)
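
To demystify what boottest() automates, here is a minimal self-contained sketch of the wild cluster bootstrap-t with the null imposed and Rademacher weights. For brevity it studentizes with conventional OLS t-statistics, whereas boottest() uses cluster-robust ones; the simulated data here are separate from wcb_data above.

```r
set.seed(801)
G <- 20; n_per <- 15
cl  <- rep(1:G, each = n_per)
trt <- rep(rbinom(G, 1, 0.5), each = n_per)
y   <- 0.5 * trt + rep(rnorm(G), each = n_per) + rnorm(G * n_per)
dat <- data.frame(y, trt, cl)

# Observed t-statistic
t_obs <- summary(lm(y ~ trt, data = dat))$coefficients["trt", "t value"]

# Impose the null: residuals come from the restricted model without treatment
fit_r  <- lm(y ~ 1, data = dat)
u_r    <- residuals(fit_r)
yhat_r <- fitted(fit_r)

B <- 999
t_star <- numeric(B)
for (b in 1:B) {
  w <- sample(c(-1, 1), G, replace = TRUE)  # one Rademacher weight per cluster
  y_star <- yhat_r + u_r * w[cl]            # flip all of a cluster's residuals
  t_star[b] <- summary(lm(y_star ~ trt, data = dat))$coefficients["trt", "t value"]
}

p_wcb <- mean(abs(t_star) >= abs(t_obs))
cat("Wild cluster bootstrap p-value:", round(p_wcb, 3), "\n")
```

Flipping residuals cluster-by-cluster preserves the within-cluster dependence structure under the null, which is what makes the procedure valid with few clusters.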

Wild Bootstrap with Few Clusters

When the number of clusters is very small (G < 20), the Webb (2023) six-point distribution often outperforms Rademacher weights:

# For designs with very few clusters
boot_few <- boottest(
  model,
  param     = "treatment",
  clustid   = ~state,
  B         = 99999,
  type      = "webb",        # better for few clusters
  impose_null = TRUE,
  p_val_type  = "equal-tailed"
)

summary(boot_few)

Choosing a Bootstrap Method

Scenario                    Recommended Method
iid data                    Nonparametric (pairs) bootstrap
Clustered, G >= 50          Cluster-robust SEs are usually fine
Clustered, 20 <= G < 50     Wild cluster bootstrap (Rademacher)
Clustered, G < 20           Wild cluster bootstrap (Webb weights)
Heteroskedasticity          Wild bootstrap (HC)
Panel data with FE          Wild cluster bootstrap with fixest

KonFound-It Framework

The KonFound-It framework (Frank, 2000; Frank et al., 2013) quantifies how much of an estimated effect would need to be biased to invalidate a statistical inference. It provides two complementary metrics:

  • Percentage Bias to Invalidate the Inference (PBII): What percentage of the estimated effect would have to be due to bias to change the conclusion?
  • Impact Threshold for a Confounding Variable (ITCV): How strongly correlated with both the treatment and outcome must an omitted variable be to explain away the result?

The konfound package offers konfound() for fitted model objects and pkonfound() for published results (when you only have the estimate, standard error, sample size, and number of covariates).

library(konfound)

# Simulate data with a treatment effect
set.seed(42)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
treatment <- 0.4 * x1 + 0.3 * x2 + rnorm(n)
outcome <- 0.5 * treatment + 0.3 * x1 + 0.2 * x2 + rnorm(n)

df_konfound <- data.frame(outcome, treatment, x1, x2)
model_konfound <- lm(outcome ~ treatment + x1 + x2, data = df_konfound)
summary(model_konfound)
# Apply KonFound-It to the fitted model
# How much bias is needed to invalidate the treatment effect?
konfound(model_konfound, treatment)
# For published results: estimate, SE, sample size, number of covariates
# Example: coefficient = 0.5, SE = 0.06, n = 500, 3 covariates
pkonfound(est_eff = 0.5, std_err = 0.06, n_obs = 500, n_covariates = 3)
# Impact Threshold for a Confounding Variable (ITCV)
# What correlations with treatment and outcome would an omitted variable need?
pkonfound(est_eff = 0.5, std_err = 0.06, n_obs = 500, n_covariates = 3,
          index = "IT")

Interpretation: If PBII is high (e.g., >50%), the result is robust—more than half of the effect would need to be due to bias. A low ITCV means even a weakly correlated confounder could overturn the finding, while a high ITCV indicates robustness. Researchers can compare the ITCV to correlations of observed covariates as informal benchmarks.
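
The PBII arithmetic for a regression coefficient is simple enough to check by hand: the smallest estimate that would still reach significance is $t_{crit} \times SE$, so $PBII = 1 - t_{crit} \times SE / \hat{\beta}$. A back-of-the-envelope check for the published-result example above (degrees of freedom treated loosely here; konfound handles them exactly):

```r
# est = 0.5, SE = 0.06, n = 500, 3 covariates
t_crit <- qt(0.975, df = 500 - 3 - 2)   # rough df: n - covariates - 2
pbii   <- 1 - (t_crit * 0.06) / 0.5
round(pbii, 3)
#> [1] 0.764
```

About 76% of the estimated effect would have to be due to bias before the inference flipped, consistent with calling the result robust.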


Treatment Effect Sensitivity with treatSens

The treatSens package (Carnegie, Harada, & Hill, 2016) provides sensitivity analysis for causal effects estimated from observational studies with binary or continuous treatments. Unlike methods that assume selection on observables, treatSens directly models how an unobserved confounder related to both treatment assignment and outcomes could bias estimates, building on the framework of Imbens (2003) and Carnegie, Harada, & Hill (2015).

The key idea is to posit a hypothetical unobserved confounder $U$ and examine how the treatment effect estimate changes across a grid of possible relationships between $U$ and the treatment (sensitivity parameter $\zeta$) and between $U$ and the outcome (sensitivity parameter $\eta$).

library(treatSens)

# Simulate observational data with continuous treatment
set.seed(42)
n <- 400
U <- rnorm(n)  # unobserved confounder
x1 <- rnorm(n)
x2 <- rnorm(n)

# Treatment depends on observed and unobserved confounders
treatment <- 0.5 * x1 + 0.3 * x2 + 0.6 * U + rnorm(n)

# Outcome depends on treatment, covariates, and unobserved confounder
outcome <- 0.4 * treatment + 0.3 * x1 + 0.2 * x2 + 0.5 * U + rnorm(n)

df_tsens <- data.frame(outcome, treatment, x1, x2)
# Sensitivity analysis for continuous treatment
# Grid explores how an unmeasured confounder could change the estimate
sens_result <- treatSens(outcome ~ treatment + x1 + x2,
                         trt.family = gaussian,
                         nsim = 20,
                         grid.dim = c(5, 5),
                         data = df_tsens)

# Contour plot: shows how the treatment effect varies over
# (zeta, eta) = (confounder-treatment, confounder-outcome) associations
sensPlot(sens_result)

Interpretation: The contour plot shows estimated treatment effects across combinations of confounder–treatment ($\zeta$) and confounder–outcome ($\eta$) associations. The origin (0, 0) corresponds to the naive estimate. If the zero-effect contour is far from plausible confounder strengths (benchmarked against observed covariates), the finding is robust to unobserved confounding.


Tipping Point Analysis with tipr

The tipr package (McGowan & Greevy, 2022, JOSS) provides an intuitive tipping-point framework: given an observed effect, what characteristics would an unmeasured confounder need to have to “tip” the estimate to the null (or any other threshold)? Rather than computing a single sensitivity parameter, tipr returns the confounder prevalence and effect size combinations that would change the conclusion.

The package provides tipping-point functions for different effect scales:

  • tip_coef(): for regression coefficients (mean differences)
  • tip_rr(): for risk ratios
  • tip_or(): for odds ratios
  • adjust_coef(): adjust an estimate for a hypothetical confounder with specified properties
library(tipr)

# Simulate data and fit a model
set.seed(42)
n <- 500
x1 <- rnorm(n)
treatment <- rbinom(n, 1, plogis(0.3 * x1))
outcome <- 0.8 * treatment + 0.5 * x1 + rnorm(n)

df_tipr <- data.frame(outcome, treatment, x1)
model_tipr <- lm(outcome ~ treatment + x1, data = df_tipr)
est <- coef(model_tipr)["treatment"]
se_est <- summary(model_tipr)$coefficients["treatment", "Std. Error"]
cat("Estimated treatment effect:", round(est, 3),
    " (95% CI:", round(est - 1.96 * se_est, 3), ",",
    round(est + 1.96 * se_est, 3), ")\n")
# What unmeasured confounder would tip the result to the null?
# Tip the CI bound closest to zero: fix the confounder-outcome effect
# and solve for the confounder-exposure effect required to tip it
tip_result <- tip_coef(
  effect_observed           = est - 1.96 * se_est,
  confounder_outcome_effect = 0.5
)
tip_result
# Adjust the estimate for a hypothetical confounder:
# e.g., a confounder with a scaled mean difference of 0.20 between
# treatment groups and an effect of 0.5 on the outcome
adjusted <- adjust_coef(
  effect_observed            = est,
  exposure_confounder_effect = 0.20,
  confounder_outcome_effect  = 0.5
)
adjusted

Interpretation: tip_coef() returns combinations of confounder–exposure and confounder–outcome effects that would reduce the estimated coefficient to the null. If these required confounder characteristics are implausibly large compared to observed covariates, the result is robust. adjust_coef() lets researchers ask “what if” questions about specific hypothetical confounders, making sensitivity discussions concrete and accessible for applied audiences.


OVtool for Propensity Score Weighting

The OVtool package extends the TWANG framework for propensity score weighting by providing omitted variable sensitivity analysis. It asks: if an unobserved covariate were added to the propensity score model, how would the treatment effect estimate and covariate balance change? This is particularly relevant for studies using inverse probability weighting (IPW) or related methods, where balance on observed covariates does not guarantee balance on unobserved ones.

library(OVtool)
library(twang)

# Simulate data for propensity score analysis
set.seed(42)
n <- 600
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rbinom(n, 1, 0.5)
treatment <- rbinom(n, 1, plogis(0.5 * x1 + 0.3 * x2 + 0.4 * x3))
outcome <- 2.0 * treatment + 0.8 * x1 + 0.5 * x2 + 0.3 * x3 + rnorm(n)

df_ov <- data.frame(outcome, treatment, x1, x2, x3)

# Fit propensity score model with TWANG
ps_model <- ps(treatment ~ x1 + x2 + x3,
               data = df_ov,
               n.trees = 2000,
               interaction.depth = 2,
               stop.method = "es.mean",
               verbose = FALSE)
# Fit the outcome model using the TWANG weights, then simulate an
# omitted confounder U correlated with treatment and outcome
results <- outcome_model(ps_object = ps_model,
                         stop.method = "es.mean",
                         data = df_ov,
                         treatment = "treatment",
                         outcome = "outcome",
                         model_covariates = c("x1", "x2", "x3"),
                         estimand = "ATE")

ov_result <- ov_sim(model_results   = results,
                    plot_covariates = c("x1", "x2", "x3"),
                    es_grid  = seq(0, 0.5, by = 0.1),
                    rho_grid = seq(0, 0.5, by = 0.1),
                    n_reps   = 50)

# Visualize sensitivity
plot(ov_result)

Interpretation: The plot shows how the treatment effect estimate changes as the simulated omitted variable becomes more strongly related to treatment assignment and outcome. The approach uses observed covariates as benchmarks, making it easy to judge whether a plausible confounder could meaningfully shift the estimate. If the effect remains significant across the range of simulated confounders calibrated to observed covariate strengths, the weighting-based estimate is robust.


Selection Model Sensitivity with causalsens

The causalsens package provides sensitivity analysis for causal estimates by allowing researchers to specify a confounding function that describes how unobserved selection bias enters the model. This is particularly useful for treatment effects estimated via matching, regression, or IPW, where the functional form of confounding can be directly parameterized.

The key function causalsens() takes a fitted outcome model, a fitted treatment (propensity score) model, a covariate formula, and a confounding function, then recalculates the treatment effect across a range of confounding strengths.

library(causalsens)

# Simulate observational data
set.seed(42)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
treatment <- rbinom(n, 1, plogis(0.6 * x1 + 0.3 * x2))
outcome <- 1.0 * treatment + 0.5 * x1 + 0.3 * x2 + rnorm(n)

df_csens <- data.frame(outcome, treatment, x1, x2)

# Fit outcome model and propensity score model
outcome_model <- lm(outcome ~ treatment + x1 + x2, data = df_csens)
ps_model_csens <- glm(treatment ~ x1 + x2, data = df_csens, family = binomial)
# Run sensitivity analysis with one-sided confounding
# alpha controls confounding strength; the function explores the supplied range
sens_out <- causalsens(outcome_model, ps_model_csens, ~ x1 + x2,
                       confound = one.sided,
                       data  = df_csens,
                       alpha = seq(-0.5, 0.5, by = 0.1))

# Plot: how does the ATE change across confounding strengths?
plot(sens_out, type = "raw", bty = "l")
# Alignment-based confounding: bias is proportional to the propensity score
sens_aligned <- causalsens(outcome_model, ps_model_csens, ~ x1 + x2,
                           confound = alignment,
                           data  = df_csens,
                           alpha = seq(-0.5, 0.5, by = 0.1))

plot(sens_aligned, type = "raw", bty = "l")

Interpretation: Each curve shows the estimated treatment effect at a different level of confounding strength ($\alpha$). When $\alpha = 0$, the estimate equals the original (no confounding). Positive $\alpha$ indicates confounding that inflates the effect; negative $\alpha$ indicates confounding that attenuates it. The one.sided function models confounding that affects only the treated group, while alignment models confounding proportional to the propensity score. If the effect remains positive (or significant) across a wide range of $\alpha$, the causal conclusion is robust to selection on unobservables.


Practical Guidelines

Decision Framework

Before estimation: commit to a primary specification before looking at results to avoid specification searching. Pre-registration helps.

After estimation: probe robustness. Use the tools covered in this vignette to assess whether your conclusions are fragile or robust.

Which Sensitivity Analysis for Which Method?

Causal Method                     Primary Sensitivity Analysis             Additional Checks
OLS / Selection on observables    sensemakr (OVB), E-values                Specification curve, coefficient stability
Matching / Propensity score       Rosenbaum bounds, E-values               Balance checks, different matching algorithms
Difference-in-differences         HonestDiD, placebo tests                 Pre-trend tests, varying pre/post windows
Synthetic control                 In-space placebos, leave-one-out         Pre-treatment MSPE ratios
Regression discontinuity          Placebo cutoffs, bandwidth sensitivity   Density tests, donut-hole RD
Instrumental variables            Effective F-statistic, tF procedure      Reduced form, plausibly exogenous (Conley bounds)
RCT with attrition                Lee bounds, Manski bounds                Multiple imputation, bounding exercises

Reporting Standards

Following best practices, sensitivity analyses should be reported transparently:

  1. State assumptions explicitly. Write down the identifying assumptions in formal terms.

  2. Report robustness values. For OVB analysis, report the $RV_q$ and $RV_{q,\alpha}$ along with benchmark comparisons.

  3. Show placebo tests. Report placebo test results even (especially) when they are null.

  4. Present specification curves. Show the distribution of estimates across reasonable specifications.

  5. Discuss the sensitivity frontier. At what point do assumptions need to be violated for the result to break down? Is that violation plausible?

  6. Account for multiple testing. When reporting multiple outcomes or subgroups, apply appropriate corrections and report both adjusted and unadjusted p-values.

  7. Use confidence intervals, not just p-values. Sensitivity analyses that report how bounds or intervals change under assumption violations are more informative than binary significance results.

A Minimal Sensitivity Checklist

For any empirical causal study, consider including at minimum:

  • At least one formal sensitivity analysis appropriate to the identification strategy.
  • At least one placebo test (outcome or treatment).
  • At least two alternative specifications.
  • Appropriate multiple testing adjustment if reporting more than three outcomes.
  • Bootstrap or randomization-based inference as a supplement to asymptotic inference when sample size or cluster count is small.

References

  • Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3), 414-427.

  • Cinelli, C., and Hazlett, C. (2020). Making sense of sensitivity: Extending omitted variable bias. Journal of the Royal Statistical Society: Series B, 82(1), 39-67.

  • Conley, T. G., Hansen, C. B., and Rossi, P. E. (2012). Plausibly exogenous. Review of Economics and Statistics, 94(1), 260-272.

  • Imbens, G. W., and Manski, C. F. (2004). Confidence intervals for partially identified parameters. Econometrica, 72(6), 1845-1857.

  • Lee, D. S. (2009). Training, wages, and sample selection: Estimating sharp bounds on treatment effects. Review of Economic Studies, 76(3), 1071-1102.

  • Manski, C. F. (1990). Nonparametric bounds on treatment effects. American Economic Review, 80(2), 319-323.

  • Rambachan, A., and Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5), 2555-2591.

  • Romano, J. P., and Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4), 1237-1282.

  • Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). Springer.

  • Simonsohn, U., Simmons, J. P., and Nelson, L. D. (2020). Specification curve analysis. Nature Human Behaviour, 4(11), 1208-1214.

  • VanderWeele, T. J., and Ding, P. (2017). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268-274.

  • Webb, M. D. (2023). Reworking wild bootstrap-based inference for clustered errors. Canadian Journal of Economics, 56(3), 839-858.