Introduction

Instrumental Variables (IV) estimation is one of the most important and widely used identification strategies in applied econometrics and causal inference. When the standard unconfoundedness assumption fails—meaning there exist unobserved factors that jointly influence both the treatment and the outcome—ordinary least squares (OLS) produces biased and inconsistent estimates of causal effects. IV methods provide a path to consistent estimation by leveraging an external source of variation in the treatment that is unrelated to the confounders.

The Endogeneity Problem

Consider the linear model:

$$Y_i = \beta_0 + \beta_1 D_i + \mathbf{X}_i' \boldsymbol{\gamma} + \varepsilon_i$$

where $Y_i$ is the outcome, $D_i$ is the treatment (or endogenous regressor), $\mathbf{X}_i$ is a vector of exogenous covariates, and $\varepsilon_i$ is the structural error term. The parameter of interest is $\beta_1$, the causal effect of $D$ on $Y$.

OLS consistently estimates $\beta_1$ only if $\text{Cov}(D_i, \varepsilon_i) = 0$. Endogeneity occurs when this condition fails, i.e., $\text{Cov}(D_i, \varepsilon_i) \neq 0$. There are three primary sources of endogeneity:

Omitted Variable Bias (OVB)

Suppose there exists an unobserved variable $U_i$ that affects both $D_i$ and $Y_i$. If $U_i$ is omitted from the regression, it is absorbed into the error term $\varepsilon_i$, creating correlation between $D_i$ and $\varepsilon_i$. The OLS estimator then captures not only the causal effect of $D$ but also the spurious association induced by $U$.

Formally, if the true model is $Y_i = \beta_1 D_i + \delta U_i + e_i$ and we omit $U_i$, then:

$$\text{plim} \; \hat{\beta}_1^{OLS} = \beta_1 + \delta \cdot \frac{\text{Cov}(D_i, U_i)}{\text{Var}(D_i)}$$

The bias term $\delta \cdot \text{Cov}(D_i, U_i) / \text{Var}(D_i)$ is nonzero whenever $U$ affects both $D$ and $Y$.
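This bias formula is easy to verify by simulation. The sketch below uses base R with illustrative coefficients (a hypothetical DGP, not the one used later in this chapter):

```r
set.seed(1)
n <- 1e5
u <- rnorm(n)                           # unobserved confounder U
d <- 0.5 * u + rnorm(n)                 # treatment loads on U
y <- 1.0 * d + 0.7 * u + rnorm(n)       # true beta_1 = 1, delta = 0.7
b_ols     <- unname(coef(lm(y ~ d))["d"])
bias_pred <- 0.7 * cov(d, u) / var(d)   # delta * Cov(D, U) / Var(D)
c(ols = b_ols, formula = 1 + bias_pred) # the two agree closely
```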

Simultaneity (Reverse Causality)

When $D$ and $Y$ are jointly determined—for example, in supply-and-demand systems or when the outcome feeds back into the treatment—the treatment variable is mechanically correlated with the error term. Classic examples include estimating the effect of price on quantity demanded when price is itself determined by demand conditions.

Measurement Error

When the treatment variable $D_i$ is measured with classical error, i.e., we observe $\tilde{D}_i = D_i + \eta_i$ where $\eta_i$ is noise, OLS suffers from attenuation bias. The coefficient is biased toward zero by a factor related to the signal-to-noise ratio:

$$\text{plim} \; \hat{\beta}_1^{OLS} = \beta_1 \cdot \frac{\text{Var}(D_i)}{\text{Var}(D_i) + \text{Var}(\eta_i)}$$
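The attenuation factor can likewise be checked numerically. In this base-R sketch (hypothetical variances), the noise variance equals the signal variance, so the slope is cut roughly in half:

```r
set.seed(1)
n <- 1e5
d_true <- rnorm(n)                   # Var(D) = 1
eta    <- rnorm(n)                   # classical measurement error, Var(eta) = 1
y      <- 2 * d_true + rnorm(n)      # true effect = 2
d_obs  <- d_true + eta               # we only observe the noisy treatment
b <- unname(coef(lm(y ~ d_obs))["d_obs"])
b                                    # approx 2 * 1 / (1 + 1) = 1
```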

The IV Solution

An instrumental variable $Z_i$ resolves endogeneity by providing variation in $D_i$ that is exogenous. The simplest IV estimator (the Wald estimator) in the just-identified case with a single binary instrument is:

$$\hat{\beta}_1^{IV} = \frac{\text{Cov}(Z_i, Y_i)}{\text{Cov}(Z_i, D_i)} = \frac{\text{Reduced Form}}{\text{First Stage}}$$

This ratio isolates the part of the treatment variation driven by the instrument and relates it to the corresponding variation in the outcome.
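As a sanity check, the Wald estimator can be computed by hand. The base-R sketch below (hypothetical DGP with a binary instrument) shows that the covariance ratio and the ratio of mean differences are numerically identical:

```r
set.seed(2)
n <- 1e5
u <- rnorm(n)
z <- rbinom(n, 1, 0.5)                       # binary instrument
d <- 0.8 * z + 0.5 * u + rnorm(n)            # endogenous treatment
y <- 1.0 * d + 0.7 * u + rnorm(n)            # true effect = 1
wald_cov  <- cov(z, y) / cov(z, d)           # reduced form / first stage
wald_mean <- (mean(y[z == 1]) - mean(y[z == 0])) /
             (mean(d[z == 1]) - mean(d[z == 0]))
c(wald_cov, wald_mean)                       # both close to 1.0
```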

Potential Outcomes Framework for IV

In the potential outcomes (Rubin) framework, IV estimation targets a specific causal parameter. Following the seminal work of Imbens and Angrist (1994), define for each individual $i$:

  • $D_i(z)$: the potential treatment status when the instrument is set to $z$
  • $Y_i(d)$: the potential outcome when treatment is set to $d$

Individuals can be classified into four compliance types based on their response to the instrument:

Type            $D_i(1)$   $D_i(0)$   Description
Compliers           1          0      Take treatment when encouraged
Always-takers       1          1      Always take treatment
Never-takers        0          0      Never take treatment
Defiers             0          1      Do opposite of encouragement

The LATE Interpretation

Imbens and Angrist (1994) showed that under the IV assumptions, the IV estimand identifies the Local Average Treatment Effect (LATE)—the average causal effect for the subpopulation of compliers:

$$\beta^{IV} = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]} = E[Y_i(1) - Y_i(0) \mid \text{Complier}]$$

This is a crucial insight: IV does not generally estimate the Average Treatment Effect (ATE) for the entire population, but rather the effect for the subgroup whose treatment status is influenced by the instrument. The policy relevance of the LATE depends on the context and the nature of the instrument.
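The LATE result can be made concrete with a small simulation. In the base-R sketch below (hypothetical compliance shares and treatment effects), the Wald ratio recovers the complier effect of 2, not the population ATE of 1.4:

```r
set.seed(4)
n <- 1e5
# 60% compliers, 25% always-takers, 15% never-takers; no defiers
type <- sample(c("co", "at", "nt"), n, replace = TRUE, prob = c(.60, .25, .15))
z <- rbinom(n, 1, 0.5)                           # randomized encouragement
d <- ifelse(type == "at", 1, ifelse(type == "nt", 0, z))
tau <- ifelse(type == "co", 2, 0.5)              # heterogeneous effects
y <- rnorm(n) + tau * d
late_hat <- (mean(y[z == 1]) - mean(y[z == 0])) /
            (mean(d[z == 1]) - mean(d[z == 0]))
c(late = late_hat, ate = mean(tau))              # approx 2 vs. 1.4
```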

Core Assumptions

A valid instrument $Z$ must satisfy four conditions:

1. Relevance (First Stage)

The instrument must be correlated with the endogenous treatment variable:

$$\text{Cov}(Z_i, D_i) \neq 0 \quad \text{or equivalently} \quad E[D_i \mid Z_i = 1] \neq E[D_i \mid Z_i = 0]$$

This is the only assumption that is directly testable from the data, via the first-stage regression. A weak first stage (small F-statistic) leads to severe finite-sample bias and unreliable inference.

2. Exclusion Restriction

The instrument affects the outcome $Y$ only through its effect on $D$:

$$Y_i(d, z) = Y_i(d) \quad \text{for all } z, d$$

Equivalently, there is no direct effect of $Z$ on $Y$. This assumption is not testable and must be justified on theoretical or institutional grounds. It is typically the most contested assumption in applied work.

3. Independence (Exogeneity)

The instrument is independent of potential outcomes and potential treatment assignments:

$$Z_i \perp\!\!\!\perp (Y_i(d, z), D_i(z)) \quad \text{for all } d, z$$

This means the instrument is “as good as randomly assigned” with respect to all unobserved factors. In practice, this is often achieved through natural experiments, randomized encouragement, or institutional features.

4. Monotonicity

There are no defiers—the instrument shifts everyone in the same direction (or not at all):

$$D_i(1) \geq D_i(0) \quad \text{for all } i$$

This rules out individuals who would do the opposite of what the instrument encourages. Without monotonicity, the LATE interpretation breaks down. This assumption is automatically satisfied when the instrument is a direct randomization of the treatment, but must be argued in observational settings.

Mathematical Notation for 2SLS

Two-Stage Least Squares (2SLS) is the standard estimation procedure for IV. With endogenous variable $D$, instruments $\mathbf{Z}$, and exogenous covariates $\mathbf{X}$:

First stage: Regress the endogenous variable on the instruments and exogenous covariates:

$$D_i = \pi_0 + \mathbf{Z}_i' \boldsymbol{\pi}_1 + \mathbf{X}_i' \boldsymbol{\pi}_2 + v_i$$

Obtain the fitted values $\hat{D}_i$.

Second stage: Replace $D_i$ with $\hat{D}_i$ in the structural equation:

$$Y_i = \beta_0 + \beta_1 \hat{D}_i + \mathbf{X}_i' \boldsymbol{\gamma} + \varepsilon_i$$

The key insight is that $\hat{D}_i$ contains only the variation in $D$ that is driven by the instruments $\mathbf{Z}$, which is exogenous by assumption. The 2SLS estimator can be written in matrix form as:

$$\hat{\boldsymbol{\beta}}_{2SLS} = (\hat{\mathbf{W}}'\mathbf{W})^{-1} \hat{\mathbf{W}}'\mathbf{Y}$$

where $\mathbf{W} = [\mathbf{D}, \mathbf{X}]$ collects the regressors and $\hat{\mathbf{W}} = \mathbf{P}_Z \mathbf{W} = [\hat{\mathbf{D}}, \mathbf{X}]$ is the projection of $\mathbf{W}$ onto the space spanned by $\mathbf{Z}$ and $\mathbf{X}$ (the exogenous covariates are their own projections).

Important: While 2SLS can be understood as a two-step procedure, running the two stages manually by OLS yields incorrect second-stage standard errors, because the residuals must be computed with the actual $D_i$ rather than the fitted $\hat{D}_i$. Software implementations such as fixest::feols() handle this correctly.
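To see what goes wrong with manual two-step estimation, the base-R sketch below (simulated data, illustrative coefficients) runs both stages with lm() and contrasts the naive second-stage standard error with the correct IID 2SLS formula, which builds residuals from the actual treatment:

```r
set.seed(42)
n <- 2000
u <- rnorm(n); z <- rnorm(n)
d <- 0.8 * z + 0.5 * u + rnorm(n, sd = 0.5)
y <- 1.0 * d + 0.7 * u + rnorm(n, sd = 0.5)    # true effect = 1

d_hat  <- fitted(lm(d ~ z))                    # first stage
stage2 <- lm(y ~ d_hat)                        # second stage: point estimate is fine
b      <- coef(stage2)
se_naive <- summary(stage2)$coefficients["d_hat", "Std. Error"]

# Correct IID 2SLS SE: residuals use the actual d, not d_hat
e       <- y - cbind(1, d) %*% b
sigma2  <- sum(e^2) / (n - 2)
Xh      <- cbind(1, d_hat)
se_2sls <- sqrt(sigma2 * solve(crossprod(Xh))[2, 2])
c(naive = se_naive, correct = se_2sls)         # naive SE is too large in this DGP
```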

Simulated IV Data

We begin by constructing a rich simulated dataset that illustrates the core endogeneity problem and demonstrates why IV is needed. The data-generating process (DGP) includes an unobserved confounder, an exogenous instrument, covariates, and realistic noise.

set.seed(42)
n <- 2000

# Exogenous covariates
x1 <- rnorm(n)
x2 <- rnorm(n)

# Unobserved confounder: affects both treatment and outcome
u <- rnorm(n)

# Instrument: correlated with D but NOT with U or epsilon
z <- rnorm(n)

# A second instrument for over-identification examples later
z2 <- rnorm(n)

# Endogenous treatment: depends on instruments, covariates, and confounder
d <- 0.8 * z + 0.3 * z2 + 0.3 * x1 + 0.5 * u + rnorm(n, sd = 0.5)

# Outcome: true causal effect of D on Y is 1.0
# U creates the endogeneity problem
y <- 1.0 * d + 0.5 * x1 - 0.3 * x2 + 0.7 * u + rnorm(n, sd = 0.5)

# Panel structure for later examples
group_id <- sample(1:50, n, replace = TRUE)
time_id  <- sample(1:10, n, replace = TRUE)

df <- data.frame(
    y   = y,
    d   = d,
    z   = z,
    z2  = z2,
    x1  = x1,
    x2  = x2,
    gid = factor(group_id),
    tid = factor(time_id)
)

In this DGP:

  • The true causal effect of $D$ on $Y$ is $\beta_1 = 1.0$.
  • The confounder $U$ has a coefficient of 0.5 in the treatment equation and 0.7 in the outcome equation, creating positive OVB (OLS will overestimate $\beta_1$).
  • The instrument $Z$ has a first-stage coefficient of 0.8 (strong relevance).
  • The instrument $Z$ is independent of $U$ and $\varepsilon$ by construction (exclusion and independence hold).

Let us verify the key relationships in the data:

# Instrument is correlated with treatment (relevance)
cat("Cor(Z, D):", round(cor(df$z, df$d), 3), "\n")
#> Cor(Z, D): 0.691

# Instrument is uncorrelated with the confounder (by construction)
cat("Cor(Z, U):", round(cor(df$z, u), 3), "\n")
#> Cor(Z, U): 0.001

# Treatment is correlated with the confounder (endogeneity)
cat("Cor(D, U):", round(cor(df$d, u), 3), "\n")
#> Cor(D, U): 0.454

OLS vs. IV: Demonstrating the Bias

OLS Estimation (Biased)

OLS ignores the endogeneity problem and regresses $Y$ directly on $D$:

ols_naive <- fixest::feols(y ~ d, data = df)
ols_controls <- fixest::feols(y ~ d + x1 + x2, data = df)

Both OLS specifications will overestimate the true effect because the positive correlation between $D$ and the omitted confounder $U$ inflates the coefficient. Even controlling for observed covariates $X_1$ and $X_2$ does not solve the problem because $U$ remains unobserved.

2SLS Estimation (Consistent)

The IV estimator uses only the variation in $D$ that is driven by the instrument $Z$, purging the endogenous component:

iv_simple <- fixest::feols(y ~ 1 | 0 | d ~ z, data = df)
iv_controls <- fixest::feols(y ~ x1 + x2 | 0 | d ~ z, data = df)

Comparison Table

fixest::etable(
    ols_naive, ols_controls, iv_simple, iv_controls,
    headers = c("OLS (naive)", "OLS (controls)", "IV (simple)", "IV (controls)"),
    se.below = TRUE,
    fitstat = ~ r2 + n + ivf
)
#>                         ols_naive     ols_cont..   iv_simple    iv_contr..
#>                       OLS (naive) OLS (controls) IV (simple) IV (controls)
#> Dependent Var.:                 y              y           y             y
#>                                                                           
#> Constant                 0.0160        0.0153      0.0011        0.0054   
#>                         (0.0215)      (0.0182)    (0.0239)      (0.0197)  
#> d                        1.383***      1.300***    0.9911***     0.9983***
#>                         (0.0185)      (0.0162)    (0.0296)      (0.0246)  
#> x1                                     0.4188***                 0.5102***
#>                                       (0.0190)                  (0.0212)  
#> x2                                    -0.3235***                -0.3131***
#>                                       (0.0180)                  (0.0196)  
#> _____________________   _________     __________  __________    __________
#> S.E. type                     IID            IID         IID           IID
#> R2                        0.73624        0.81187     0.18059       0.38753
#> Observations                2,000          2,000       2,000         2,000
#> F-test (1st stage), d          --             --     1,827.8       2,070.4
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The OLS estimates will be noticeably above 1.0, while the IV estimates should be centered around the true value of 1.0. Adding covariates to the IV specification can improve precision (smaller standard errors) without affecting consistency.

2SLS with fixest

The fixest package (Berge, 2018) provides a fast and flexible implementation of 2SLS that integrates seamlessly with fixed effects, clustering, and multiple estimation. This section covers the full range of fixest IV capabilities.

Basic Syntax

The general formula structure for IV in fixest is:

feols(outcome ~ exog_covariates | fixed_effects | endog_var ~ instruments, data)

The three parts separated by | are:

  1. Linear part: The outcome on the left, exogenous covariates on the right.
  2. Fixed effects: Use 0 if none.
  3. IV specification: Endogenous variable(s) on the left, instruments on the right.

# No covariates, no fixed effects
m1 <- fixest::feols(y ~ 1 | 0 | d ~ z, data = df)

# With exogenous covariates, no fixed effects
m2 <- fixest::feols(y ~ x1 + x2 | 0 | d ~ z, data = df)

fixest::etable(m1, m2, headers = c("No controls", "With controls"),
               fitstat = ~ r2 + n + ivf)
#>                                       m1                  m2
#>                              No controls       With controls
#> Dependent Var.:                        y                   y
#>                                                             
#> Constant                 0.0011 (0.0239)     0.0054 (0.0197)
#> d                     0.9911*** (0.0296)  0.9983*** (0.0246)
#> x1                                        0.5102*** (0.0212)
#> x2                                       -0.3131*** (0.0196)
#> _____________________ __________________ ___________________
#> S.E. type                            IID                 IID
#> R2                               0.18059             0.38753
#> Observations                       2,000               2,000
#> F-test (1st stage), d            1,827.8             2,070.4
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Instruments

When multiple instruments are available, include them on the right side of the IV formula. This yields an over-identified model, which can improve efficiency:

# Two instruments for one endogenous variable
m3 <- fixest::feols(y ~ x1 + x2 | 0 | d ~ z + z2, data = df)
summary(m3)
#> TSLS estimation - Dep. Var.: y
#>                   Endo.    : d
#>                   Instr.   : z, z2
#> Second stage: Dep. Var.: y
#> Observations: 2,000
#> Standard-errors: IID 
#>              Estimate Std. Error    t value  Pr(>|t|)    
#> (Intercept)  0.005678   0.019651   0.288941   0.77266    
#> fit_d        1.007106   0.022823  44.127163 < 2.2e-16 ***
#> x1           0.507500   0.020928  24.249676 < 2.2e-16 ***
#> x2          -0.313396   0.019470 -16.096044 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.877144   Adj. R2: 0.41829
#> F-test (1st stage), d: stat = 1,419.22764, p < 2.2e-16 , on 2 and 1,995 DoF.
#>            Wu-Hausman: stat =   606.30077, p < 2.2e-16 , on 1 and 1,995 DoF.
#>                Sargan: stat =     0.97198, p = 0.324188, on 1 DoF.

With multiple instruments, the 2SLS procedure projects $D$ onto the space spanned by both $Z_1$ and $Z_2$ (and the exogenous covariates), using more information about the exogenous variation in $D$.
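The projection logic can be written out directly. The base-R sketch below (hypothetical DGP) implements over-identified 2SLS from the matrix formula $(\hat{\mathbf{W}}'\mathbf{W})^{-1}\hat{\mathbf{W}}'\mathbf{Y}$, where the regressors are projected onto the instrument space:

```r
set.seed(6)
n  <- 4000
u  <- rnorm(n); z1 <- rnorm(n); z2 <- rnorm(n)
d  <- 0.8 * z1 + 0.3 * z2 + 0.5 * u + rnorm(n)
y  <- 1.0 * d + 0.7 * u + rnorm(n)              # true effect = 1
Z  <- cbind(1, z1, z2)                          # instrument matrix
W  <- cbind(1, d)                               # regressor matrix
W_hat  <- Z %*% solve(crossprod(Z), crossprod(Z, W))   # P_Z W
b_2sls <- solve(crossprod(W_hat, W), crossprod(W_hat, y))
b_2sls[2]                                        # close to 1
```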

Fixed Effects + IV

One of the key strengths of fixest is the ability to combine high-dimensional fixed effects with IV estimation. The fixed effects are absorbed (projected out) in both stages:

# One-way fixed effects
m4 <- fixest::feols(y ~ x1 + x2 | gid | d ~ z, data = df)

# Two-way fixed effects
m5 <- fixest::feols(y ~ x1 + x2 | gid + tid | d ~ z, data = df)

# Two-way FE with multiple instruments
m6 <- fixest::feols(y ~ x1 + x2 | gid + tid | d ~ z + z2, data = df)

fixest::etable(m4, m5, m6,
               headers = c("One-way FE", "Two-way FE", "Two-way FE + 2 IVs"),
               fitstat = ~ r2 + n + ivf)
#>                                        m4                  m5
#>                                One-way FE          Two-way FE
#> Dependent Var.:                         y                   y
#>                                                              
#> d                       1.001*** (0.0249)   1.003*** (0.0248)
#> x1                     0.5111*** (0.0215)  0.5105*** (0.0215)
#> x2                    -0.3092*** (0.0197) -0.3090*** (0.0197)
#> Fixed-Effects:        ------------------- -------------------
#> gid                                   Yes                 Yes
#> tid                                    No                 Yes
#> _____________________ ___________________ ___________________
#> S.E. type                             IID                 IID
#> R2                                0.40265             0.40496
#> Observations                        2,000               2,000
#> F-test (1st stage), d             2,012.0             2,016.3
#> 
#>                                        m6
#>                        Two-way FE + 2 IVs
#> Dependent Var.:                         y
#>                                          
#> d                       1.010*** (0.0231)
#> x1                     0.5084*** (0.0213)
#> x2                    -0.3092*** (0.0197)
#> Fixed-Effects:        -------------------
#> gid                                   Yes
#> tid                                   Yes
#> _____________________ ___________________
#> S.E. type                             IID
#> R2                                0.43489
#> Observations                        2,000
#> F-test (1st stage), d             1,377.1
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is equivalent to first demeaning all variables by the fixed effects and then running 2SLS on the demeaned data, but fixest implements this much more efficiently through the method of alternating projections.
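This equivalence is easy to confirm in base R. The sketch below (simulated data with hypothetical group effects) runs 2SLS with group dummies in both stages and compares it to just-identified IV on within-group demeaned data:

```r
set.seed(3)
n  <- 2000
g  <- sample(1:50, n, replace = TRUE)
fe <- rnorm(50)[g]                        # group fixed effects
u  <- rnorm(n); z <- rnorm(n)
d  <- 0.8 * z + 0.5 * u + 0.5 * fe + rnorm(n)
y  <- 1.0 * d + 0.7 * u + fe + rnorm(n)   # true effect = 1

# (a) Dummy-variable 2SLS: the group dummies appear in both stages
d_hat <- fitted(lm(d ~ z + factor(g)))
b_a   <- unname(coef(lm(y ~ d_hat + factor(g)))["d_hat"])

# (b) Demean every variable by group, then just-identified IV
dm  <- function(v) v - ave(v, g)
b_b <- cov(dm(z), dm(y)) / cov(dm(z), dm(d))
c(dummies = b_a, demeaned = b_b)          # numerically identical
```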

First-Stage F-statistic with fitstat()

The fitstat() function extracts the first-stage F-statistic and other diagnostic statistics:

# First-stage F-statistic (Kleibergen-Paap for clustered SEs)
fitstat(m3, type = "ivf")
#> F-test (1st stage), d: stat = 1,419.2, p < 2.2e-16, on 2 and 1,995 DoF.

# Same diagnostic for the two-way FE model with two instruments
fitstat(m6, type = "ivf")
#> F-test (1st stage), d: stat = 1,377.1, p < 2.2e-16, on 2 and 1,937 DoF.

The first-stage F-statistic tests the null hypothesis that the instruments are jointly irrelevant in the first stage. Values below 10 suggest weak instruments (Staiger and Stock, 1997), though more recent guidance suggests higher thresholds (Lee et al., 2022).

Wu-Hausman Test

The Wu-Hausman test compares the OLS and IV estimates. Under the null hypothesis that DD is exogenous, OLS is efficient and consistent, while IV is consistent but less efficient. Rejection suggests that OLS is inconsistent and IV should be preferred:

fitstat(m3, type = "wh")
#> Wu-Hausman: stat = 606.30077, p < 2.2e-16, on 1 and 1,995 DoF.

Clustered Standard Errors

In many applications, standard errors should be clustered to account for within-group correlation. fixest makes this straightforward:

# Cluster by group
m7 <- fixest::feols(y ~ x1 + x2 | gid | d ~ z, data = df, cluster = ~gid)

# Two-way clustering
m8 <- fixest::feols(y ~ x1 + x2 | gid + tid | d ~ z, data = df,
                    cluster = ~gid + tid)

fixest::etable(m7, m8,
               headers = c("Cluster by group", "Two-way cluster"),
               fitstat = ~ r2 + n + ivf)
#>                                        m7                  m8
#>                          Cluster by group     Two-way cluster
#> Dependent Var.:                         y                   y
#>                                                              
#> d                       1.001*** (0.0201)   1.003*** (0.0334)
#> x1                     0.5111*** (0.0194)  0.5105*** (0.0165)
#> x2                    -0.3092*** (0.0173) -0.3090*** (0.0072)
#> Fixed-Effects:        ------------------- -------------------
#> gid                                   Yes                 Yes
#> tid                                    No                 Yes
#> _____________________ ___________________ ___________________
#> S.E.: Clustered                   by: gid       by: gid & tid
#> R2                                0.40265             0.40496
#> Observations                        2,000               2,000
#> F-test (1st stage), d             2,062.7             2,076.6
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When fixed effects are included, fixest clusters standard errors at the level of the first fixed effect by default. You can override this with the cluster or vcov arguments.

Multiple Estimations

fixest supports stepwise estimation using the sw() and csw() functions, which is useful for presenting robustness across specifications:

# Stepwise addition of controls
m_step <- fixest::feols(y ~ csw(x1, x2) | 0 | d ~ z, data = df)
fixest::etable(m_step, fitstat = ~ r2 + n + ivf)
#>                                 m_step.1            m_step.2
#> Dependent Var.:                        y                   y
#>                                                             
#> Constant                 0.0088 (0.0211)     0.0054 (0.0197)
#> d                     0.9868*** (0.0262)  0.9983*** (0.0246)
#> x1                    0.5079*** (0.0226)  0.5102*** (0.0212)
#> x2                                       -0.3131*** (0.0196)
#> _____________________ __________________ ___________________
#> S.E. type                            IID                 IID
#> R2                               0.36118             0.38753
#> Observations                       2,000               2,000
#> F-test (1st stage), d            2,074.4             2,070.4
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ivreg Package

The ivreg package (Fox, Kleiber, and Zeileis, 2023) provides a classical implementation of IV estimation with comprehensive built-in diagnostics. Its formula interface uses | to separate the structural equation from the list of instruments.

library(ivreg)

# Basic 2SLS: y ~ endogenous + exogenous | instruments + exogenous
iv_ivreg <- ivreg(y ~ d + x1 + x2 | z + x1 + x2, data = df)
summary(iv_ivreg)
#> 
#> Call:
#> ivreg(formula = y ~ d + x1 + x2 | z + x1 + x2, data = df)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -3.344980 -0.582228  0.002285  0.611791  2.799560 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.005389   0.019738   0.273    0.785    
#> d            0.998301   0.024616  40.555   <2e-16 ***
#> x1           0.510162   0.021192  24.073   <2e-16 ***
#> x2          -0.313092   0.019556 -16.010   <2e-16 ***
#> 
#> Diagnostic tests:
#>                   df1  df2 statistic p-value    
#> Weak instruments    1 1996    2070.4  <2e-16 ***
#> Wu-Hausman          1 1995     438.8  <2e-16 ***
#> Sargan              0   NA        NA      NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8818 on 1996 degrees of freedom
#> Multiple R-Squared: 0.7792,  Adjusted R-squared: 0.7788 
#> Wald test:  1168 on 3 and 1996 DF,  p-value: < 2.2e-16

The key feature of ivreg is its rich diagnostic output:

# Full diagnostics
summary(iv_ivreg, diagnostics = TRUE)
#> 
#> Call:
#> ivreg(formula = y ~ d + x1 + x2 | z + x1 + x2, data = df)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -3.344980 -0.582228  0.002285  0.611791  2.799560 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.005389   0.019738   0.273    0.785    
#> d            0.998301   0.024616  40.555   <2e-16 ***
#> x1           0.510162   0.021192  24.073   <2e-16 ***
#> x2          -0.313092   0.019556 -16.010   <2e-16 ***
#> 
#> Diagnostic tests:
#>                   df1  df2 statistic p-value    
#> Weak instruments    1 1996    2070.4  <2e-16 ***
#> Wu-Hausman          1 1995     438.8  <2e-16 ***
#> Sargan              0   NA        NA      NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8818 on 1996 degrees of freedom
#> Multiple R-Squared: 0.7792,  Adjusted R-squared: 0.7788 
#> Wald test:  1168 on 3 and 1996 DF,  p-value: < 2.2e-16

This reports three important tests:

  1. Weak instruments test: Tests whether the instruments are sufficiently correlated with the endogenous variable. Based on the first-stage F-statistic.
  2. Wu-Hausman test: Tests the null hypothesis that OLS is consistent (i.e., the treatment is exogenous). Rejection motivates the use of IV.
  3. Sargan test: In over-identified models, tests the null that all instruments are valid (satisfy the exclusion restriction). Rejection suggests at least one instrument may be invalid.

# Over-identified model with two instruments
iv_overid_ivreg <- ivreg(y ~ d + x1 + x2 | z + z2 + x1 + x2, data = df)
summary(iv_overid_ivreg, diagnostics = TRUE)
#> 
#> Call:
#> ivreg(formula = y ~ d + x1 + x2 | z + z2 + x1 + x2, data = df)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -3.312621 -0.583047 -0.002259  0.606098  2.786930 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.005678   0.019651   0.289    0.773    
#> d            1.007106   0.022823  44.127   <2e-16 ***
#> x1           0.507500   0.020928  24.250   <2e-16 ***
#> x2          -0.313396   0.019470 -16.096   <2e-16 ***
#> 
#> Diagnostic tests:
#>                   df1  df2 statistic p-value    
#> Weak instruments    2 1995  1419.228  <2e-16 ***
#> Wu-Hausman          1 1995   606.301  <2e-16 ***
#> Sargan              1   NA     0.972   0.324    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.878 on 1996 degrees of freedom
#> Multiple R-Squared: 0.781,   Adjusted R-squared: 0.7807 
#> Wald test:  1274 on 3 and 1996 DF,  p-value: < 2.2e-16
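The Sargan statistic itself is straightforward to compute by hand: regress the 2SLS residuals on the full instrument set and take $n R^2$, which is asymptotically chi-squared with degrees of freedom equal to the number of over-identifying restrictions. A base-R sketch on simulated data (both instruments valid by construction, so the test should not reject systematically):

```r
set.seed(7)
n  <- 4000
u  <- rnorm(n); z1 <- rnorm(n); z2 <- rnorm(n)
d  <- 0.8 * z1 + 0.3 * z2 + 0.5 * u + rnorm(n)
y  <- 1.0 * d + 0.7 * u + rnorm(n)
Z  <- cbind(1, z1, z2); W <- cbind(1, d)
W_hat <- Z %*% solve(crossprod(Z), crossprod(Z, W))
b  <- solve(crossprod(W_hat, W), crossprod(W_hat, y))   # 2SLS coefficients
e  <- as.vector(y - W %*% b)                            # 2SLS residuals
sargan <- n * summary(lm(e ~ z1 + z2))$r.squared        # ~ chi-sq(1) under H0
pchisq(sargan, df = 1, lower.tail = FALSE)              # p-value
```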

You can also extract components for custom analysis:

# Coefficients and confidence intervals
coef(iv_ivreg)
#>  (Intercept)            d           x1           x2 
#>  0.005389457  0.998301450  0.510162174 -0.313091895
confint(iv_ivreg)
#>                   2.5 %      97.5 %
#> (Intercept) -0.03331903  0.04409795
#> d            0.95002578  1.04657712
#> x1           0.46860077  0.55172357
#> x2          -0.35144463 -0.27473916

# Fitted values and residuals
head(fitted(iv_ivreg))
#>          1          2          3          4          5          6 
#>  1.6675428 -1.9572426 -0.6171868  1.7747916 -0.3444108 -1.5605920
head(residuals(iv_ivreg))
#>            1            2            3            4            5            6 
#> -0.086637807 -0.199299915 -0.136906115  0.005957688 -1.449318981 -0.935338184

First-Stage Diagnostics

Verifying instrument strength is arguably the most critical step in any IV analysis. A weak instrument leads to severe problems: the IV estimator is biased toward OLS in finite samples, standard confidence intervals have incorrect coverage, and hypothesis tests have incorrect size.

First-Stage Regression

Always examine the first-stage regression to understand the instrument-treatment relationship:

first_stage <- fixest::feols(d ~ z + x1 + x2, data = df)
summary(first_stage)
#> OLS estimation, Dep. Var.: d
#> Observations: 2,000
#> Standard-errors: IID 
#>              Estimate Std. Error   t value  Pr(>|t|)    
#> (Intercept) -0.005437   0.017618 -0.308637   0.75763    
#> z            0.808905   0.017778 45.501197 < 2.2e-16 ***
#> x1           0.297313   0.017715 16.782732 < 2.2e-16 ***
#> x2           0.011491   0.017451  0.658494   0.51030    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.786479   Adj. R2: 0.54183

The coefficient on $Z$ should be statistically significant and economically meaningful. A large t-statistic on the excluded instrument (roughly $|t| > 3.2$, equivalent to $F > 10$ with a single instrument) is necessary but not sufficient for strong identification.

F-Statistic and the Rule of Thumb

The first-stage F-statistic tests the joint significance of the excluded instruments. Staiger and Stock (1997) proposed the widely used rule of thumb that $F > 10$ indicates instruments strong enough to avoid severe weak-instrument problems:

# From the IV model directly
fitstat(m3, type = "ivf")
#> F-test (1st stage), d: stat = 1,419.2, p < 2.2e-16, on 2 and 1,995 DoF.

# Can also compute manually from the first stage
first_stage_both <- fixest::feols(d ~ z + z2 + x1 + x2, data = df)
summary(first_stage_both)
#> OLS estimation, Dep. Var.: d
#> Observations: 2,000
#> Standard-errors: IID 
#>              Estimate Std. Error   t value  Pr(>|t|)    
#> (Intercept) -0.006305   0.016159 -0.390158   0.69646    
#> z            0.809481   0.016306 49.642886 < 2.2e-16 ***
#> z2           0.311666   0.016040 19.430246 < 2.2e-16 ***
#> x1           0.293602   0.016250 18.067745 < 2.2e-16 ***
#> x2          -0.000279   0.016018 -0.017421   0.98610    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.721194   Adj. R2: 0.614544

Partial R-Squared

The partial R-squared of the excluded instruments measures how much of the variation in $D$ is explained by the instruments after controlling for the included exogenous variables. This complements the F-statistic:

# R-squared from regression including instruments
r2_full <- fixest::r2(fixest::feols(d ~ z + z2 + x1 + x2, data = df), "r2")

# R-squared from regression excluding instruments
r2_restricted <- fixest::r2(fixest::feols(d ~ x1 + x2, data = df), "r2")

partial_r2 <- (r2_full - r2_restricted) / (1 - r2_restricted)
cat("Partial R-squared of instruments:", round(partial_r2, 4), "\n")

A low partial R-squared (e.g., below 0.05) warrants concern about instrument strength, even if the F-statistic appears adequate.
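The same quantity can be cross-checked via the Frisch-Waugh theorem: residualize the treatment and the instruments on the included covariates, then take the R-squared of the auxiliary regression. A self-contained base-R sketch (hypothetical coefficients, not the chapter's df):

```r
set.seed(5)
n  <- 5000
x1 <- rnorm(n); z <- rnorm(n); z2 <- rnorm(n)
d  <- 0.8 * z + 0.3 * z2 + 0.3 * x1 + rnorm(n)
r2 <- function(m) summary(m)$r.squared

# Direct formula
partial_a <- (r2(lm(d ~ z + z2 + x1)) - r2(lm(d ~ x1))) / (1 - r2(lm(d ~ x1)))

# Frisch-Waugh: partial out x1 from d and from the instruments
d_t  <- resid(lm(d ~ x1)); z_t <- resid(lm(z ~ x1)); z2_t <- resid(lm(z2 ~ x1))
partial_b <- r2(lm(d_t ~ z_t + z2_t))
c(partial_a, partial_b)                  # identical by construction
```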

Visualization of the First Stage

Visual inspection of the first-stage relationship provides intuition that summary statistics alone cannot:

library(ggplot2)

# Residualize D and Z with respect to covariates for a clean visual
d_resid <- residuals(lm(d ~ x1 + x2, data = df))
z_resid <- residuals(lm(z ~ x1 + x2, data = df))

first_stage_plot_df <- data.frame(z_resid = z_resid, d_resid = d_resid)

ggplot(first_stage_plot_df, aes(x = z_resid, y = d_resid)) +
    geom_point(alpha = 0.12, size = 0.8, color = "grey40") +
    geom_smooth(method = "lm", se = TRUE, color = "#2c7bb6", linewidth = 1.2) +
    labs(
        title = "First Stage: Instrument vs. Treatment (Residualized)",
        subtitle = "After partialling out covariates X1 and X2",
        x = "Instrument (Z), residualized",
        y = "Treatment (D), residualized"
    ) +
    causalverse::ama_theme()
#> `geom_smooth()` using formula = 'y ~ x'

A strong first stage will show a clear linear relationship. If the scatter plot shows a flat or very noisy relationship, the instrument may be too weak.

Stock-Yogo Critical Values

Stock and Yogo (2005) derived critical values for the first-stage F-statistic that depend on the number of instruments and the tolerated degree of bias or size distortion. Their key results for 2SLS with one endogenous variable are:

Number of instruments   10% max size   15% max size   20% max size   25% max size
1                            16.38           8.96           6.66           5.53
2                            19.93          11.59           8.75           7.25
3                            22.30          12.83           9.54           7.80
5                            26.87          15.09          10.27           8.84
10                           26.80          15.38          11.49           9.43

These critical values are considerably higher than the rule of thumb of 10, especially when the researcher wants to limit size distortions to 10% or less. The table shows that with a single instrument, the F-statistic should exceed 16.38 to keep the actual size of a nominal 5% test below 10%.

Weak Instruments

The Problem

When instruments are weak (low first-stage F), several problems arise:

  1. Finite-sample bias: The 2SLS estimator is biased toward the OLS estimate. The bias is approximately Bias2SLSBiasOLS/F\text{Bias}_{2SLS} \approx \text{Bias}_{OLS} / F, so with F=5F = 5, the IV estimate retains about 20% of the OLS bias.
  2. Non-normal sampling distribution: The 2SLS estimator is no longer approximately normally distributed, making standard t-tests and confidence intervals unreliable.
  3. Size distortion: Hypothesis tests reject too often (or too rarely) at their nominal level.
  4. Confidence interval under-coverage: Standard 95% confidence intervals may cover the true parameter far less than 95% of the time.

To demonstrate, let us construct a weak instrument scenario:

set.seed(123)
n_weak <- 2000

u_w  <- rnorm(n_weak)
z_w  <- rnorm(n_weak)

# Weak first stage: coefficient on z is only 0.05
d_w <- 0.05 * z_w + 0.8 * u_w + rnorm(n_weak, sd = 0.5)
y_w <- 1.0 * d_w + 0.7 * u_w + rnorm(n_weak, sd = 0.5)

df_weak <- data.frame(y = y_w, d = d_w, z = z_w)

# IV estimation with weak instrument
iv_weak <- fixest::feols(y ~ 1 | 0 | d ~ z, data = df_weak)

# Compare OLS and weak-IV
ols_weak <- fixest::feols(y ~ d, data = df_weak)
fixest::etable(ols_weak, iv_weak, headers = c("OLS", "IV (weak)"),
               fitstat = ~ r2 + n + ivf)
#>                                ols_weak          iv_weak
#>                                     OLS        IV (weak)
#> Dependent Var.:                       y                y
#>                                                         
#> Constant               -0.0009 (0.0140)  0.0183 (0.0249)
#> d                     1.627*** (0.0149) 0.8018. (0.4829)
#> _____________________ _________________ ________________
#> S.E. type                           IID              IID
#> R2                              0.85627          0.00050
#> Observations                      2,000            2,000
#> F-test (1st stage), d                --           4.8357
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first-stage F-statistic (4.84) falls far below every conventional threshold. The weak-IV point estimate (0.80) is well off the true effect of 1.0, and its standard error (0.48) is roughly thirty times the OLS standard error.
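A single draw can understate the problem. The short Monte Carlo below, a sketch reusing the same weak-instrument DGP, shows that 2SLS estimates are systematically pulled from the true effect (1.0) toward OLS (about 1.6), with a wide and erratic sampling spread:

```r
set.seed(456)
n_sims <- 500
iv_draws <- numeric(n_sims)

for (s in seq_len(n_sims)) {
    u <- rnorm(2000)
    z <- rnorm(2000)
    d <- 0.05 * z + 0.8 * u + rnorm(2000, sd = 0.5)
    y <- 1.0 * d + 0.7 * u + rnorm(2000, sd = 0.5)
    # Just-identified IV estimator in closed form: Cov(Y, Z) / Cov(D, Z)
    iv_draws[s] <- cov(y, z) / cov(d, z)
}

median(iv_draws)                   # pulled above the true value of 1.0
quantile(iv_draws, c(0.05, 0.95))  # wide, occasionally wild, spread
```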

Anderson-Rubin Confidence Sets

The Anderson and Rubin (1949) test is robust to weak instruments. It tests the null H0:β1=β0H_0: \beta_1 = \beta_0 by checking whether the instruments are correlated with the residuals Yβ0DY - \beta_0 D after controlling for covariates; a confidence set is obtained by inverting the test.

The Anderson-Rubin confidence set is the collection of all β0\beta_0 values that are not rejected:

CSAR={β0:AR(β0)χk,1α2}CS_{AR} = \{\beta_0 : AR(\beta_0) \leq \chi^2_{k, 1-\alpha}\}

where kk is the number of instruments. This confidence set has correct coverage even when instruments are arbitrarily weak.

# Manual Anderson-Rubin confidence set construction
# Grid of hypothesized beta values
beta_grid <- seq(-1, 4, by = 0.05)
ar_pvals <- numeric(length(beta_grid))

for (i in seq_along(beta_grid)) {
    # Under H0: beta = beta_grid[i], form Y - beta*D
    y_tilde <- df$y - beta_grid[i] * df$d
    # Regress the implied residual on the instrument (covariates omitted here for simplicity)
    ar_reg <- lm(y_tilde ~ df$z)
    ar_pvals[i] <- summary(ar_reg)$coefficients["df$z", "Pr(>|t|)"]
}

ar_df <- data.frame(beta = beta_grid, pval = ar_pvals)

# AR confidence set: values where p-value > 0.05
ar_ci <- ar_df$beta[ar_df$pval > 0.05]
cat("Anderson-Rubin 95% Confidence Set: [",
    round(min(ar_ci), 3), ",", round(max(ar_ci), 3), "]\n")
#> Anderson-Rubin 95% Confidence Set: [ 0.95 , 1 ]
ggplot(ar_df, aes(x = beta, y = pval)) +
    geom_line(color = "#2c7bb6", linewidth = 0.8) +
    geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
    geom_vline(xintercept = 1.0, linetype = "dotted", color = "grey40") +
    annotate("text", x = 1.0, y = 0.9, label = "True effect = 1.0",
             hjust = -0.1, size = 3.5) +
    labs(
        title = "Anderson-Rubin Confidence Set",
        subtitle = "Values of beta where p-value > 0.05 form the confidence set",
        x = expression(beta[0]),
        y = "p-value"
    ) +
    causalverse::ama_theme()

The tF Procedure (Lee et al. 2022)

Lee, McCrary, Moreira, and Porter (2022) proposed the tF procedure, which adjusts critical values for the t-test based on the first-stage F-statistic. When the F-statistic is large, the adjusted critical value converges to the standard 1.96. When F is small, the critical value is inflated to account for weak instrument distortions.

The procedure works as follows:

  1. Compute the IV estimate β̂IV\hat{\beta}_{IV} and its standard error SE(β̂IV)SE(\hat{\beta}_{IV}).
  2. Compute the first-stage F-statistic.
  3. Look up the adjusted critical value cα(F)c_{\alpha}(F) from the Lee et al. tables.
  4. Reject H0:β=0H_0: \beta = 0 if |β̂IV/SE(β̂IV)|>cα(F)|\hat{\beta}_{IV} / SE(\hat{\beta}_{IV})| > c_{\alpha}(F).

Key critical values for a 5% test:

First-stage F Adjusted critical value
5 3.09
10 2.28
15 2.10
20 2.04
50 1.98
100 1.97

This approach provides valid inference even with moderately weak instruments and is straightforward to implement as a robustness check.
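Between the tabulated F values, linear interpolation of the published critical values gives a serviceable approximation. A sketch using the illustrative values above (for published work, use the full Lee et al. tables or the ivDiag package):

```r
# 5% tF critical value, linearly interpolated between the tabulated points
tf_critical <- approxfun(
    x = c(5, 10, 15, 20, 50, 100),
    y = c(3.09, 2.28, 2.10, 2.04, 1.98, 1.97),
    rule = 2  # clamp to the endpoint values outside the tabulated range
)

# Example: t = 2.15 clears the usual 1.96 bar but not the tF-adjusted one
f_first <- 12
t_ratio <- 2.15
tf_critical(f_first)
#> [1] 2.208
t_ratio > tf_critical(f_first)
#> [1] FALSE
```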

Over-identification Tests

When the number of instruments exceeds the number of endogenous variables, the model is over-identified. This extra information can be used both for efficiency and for testing instrument validity.

The Sargan-Hansen J Test

The Sargan (1958) and Hansen (1982) J test examines whether the over-identifying restrictions are satisfied. Under the null hypothesis that all instruments are valid (exogenous), the test statistic is:

J=nRε̂ on 𝐙,𝐗2J = n \cdot R^2_{\hat{\varepsilon} \text{ on } \mathbf{Z}, \mathbf{X}}

where ε̂\hat{\varepsilon} are the 2SLS residuals. Under H0H_0, Jχ2(mk)J \sim \chi^2(m - k) where mm is the number of instruments and kk is the number of endogenous variables.

# 2SLS with two instruments
iv_overid <- fixest::feols(y ~ x1 + x2 | 0 | d ~ z + z2, data = df)
summary(iv_overid)
#> TSLS estimation - Dep. Var.: y
#>                   Endo.    : d
#>                   Instr.   : z, z2
#> Second stage: Dep. Var.: y
#> Observations: 2,000
#> Standard-errors: IID 
#>              Estimate Std. Error    t value  Pr(>|t|)    
#> (Intercept)  0.005678   0.019651   0.288941   0.77266    
#> fit_d        1.007106   0.022823  44.127163 < 2.2e-16 ***
#> x1           0.507500   0.020928  24.249676 < 2.2e-16 ***
#> x2          -0.313396   0.019470 -16.096044 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.877144   Adj. R2: 0.41829
#> F-test (1st stage), d: stat = 1,419.22764, p < 2.2e-16 , on 2 and 1,995 DoF.
#>            Wu-Hausman: stat =   606.30077, p < 2.2e-16 , on 1 and 1,995 DoF.
#>                Sargan: stat =     0.97198, p = 0.324188, on 1 DoF.

# Manual Sargan test
resid_iv <- residuals(iv_overid)
sargan_reg <- lm(resid_iv ~ z + z2 + x1 + x2, data = df)
sargan_stat <- n * summary(sargan_reg)$r.squared
sargan_pval <- pchisq(sargan_stat, df = 1, lower.tail = FALSE)  # df = 2 instruments - 1 endog

cat("Sargan J-statistic:", round(sargan_stat, 3), "\n")
#> Sargan J-statistic: 0.972
cat("p-value:", round(sargan_pval, 3), "\n")
#> p-value: 0.324
cat("Degrees of freedom:", 1, "(2 instruments - 1 endogenous variable)\n")
#> Degrees of freedom: 1 (2 instruments - 1 endogenous variable)

Interpretation:

  • Failure to reject (p>0.05p > 0.05): The over-identifying restrictions are not violated. This is consistent with (but does not prove) all instruments being valid.
  • Rejection (p<0.05p < 0.05): At least one instrument likely violates the exclusion restriction.

Caveat: The Sargan test has low power when instruments are weak and can also reject when all instruments are invalid in the same direction. It tests whether different instruments give the same estimate, not whether any individual instrument is valid in an absolute sense.

# Using ivreg for automatic Sargan test
library(ivreg)
iv_overid_ivreg2 <- ivreg(y ~ d + x1 + x2 | z + z2 + x1 + x2, data = df)
summary(iv_overid_ivreg2, diagnostics = TRUE)
#> 
#> Call:
#> ivreg(formula = y ~ d + x1 + x2 | z + z2 + x1 + x2, data = df)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -3.312621 -0.583047 -0.002259  0.606098  2.786930 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.005678   0.019651   0.289    0.773    
#> d            1.007106   0.022823  44.127   <2e-16 ***
#> x1           0.507500   0.020928  24.250   <2e-16 ***
#> x2          -0.313396   0.019470 -16.096   <2e-16 ***
#> 
#> Diagnostic tests:
#>                   df1  df2 statistic p-value    
#> Weak instruments    2 1995  1419.228  <2e-16 ***
#> Wu-Hausman          1 1995   606.301  <2e-16 ***
#> Sargan              1   NA     0.972   0.324    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.878 on 1996 degrees of freedom
#> Multiple R-Squared: 0.781,   Adjusted R-squared: 0.7807 
#> Wald test:  1274 on 3 and 1996 DF,  p-value: < 2.2e-16

When to Use Multiple Instruments

Using more instruments can improve efficiency but raises concerns:

  1. More instruments = more efficiency (lower standard errors), but only if all instruments are valid.
  2. Many instruments bias: With too many instruments relative to sample size, 2SLS becomes biased toward OLS. LIML (Limited Information Maximum Likelihood) is more robust to this problem.
  3. Pre-testing: If you selected instruments based on their first-stage performance, the resulting estimates and tests may be unreliable.

Sensitivity Analysis with sensemakr

The sensemakr package (Cinelli and Hazlett, 2020) provides a formal framework for sensitivity analysis to assess how robust an estimated causal effect is to potential omitted variable bias (OVB). While originally designed for OLS, the approach offers valuable insights for the IV context, particularly for assessing the plausibility of the exclusion restriction.

Basic Sensitivity Analysis

library(sensemakr)

# Fit OLS model (sensemakr works with lm objects)
ols_for_sens <- lm(y ~ d + x1 + x2, data = df)

# Sensitivity analysis: how strong would a confounder need to be
# to explain away the effect of d?
sens <- sensemakr(
    model      = ols_for_sens,
    treatment  = "d",
    benchmark_covariates = "x1",
    kd = 1:3,  # multiples of x1's confounding strength
    ky = 1:3
)

summary(sens)

The summary reports the Robustness Value (RV): the minimum strength of confounding, measured as the partial R2R^2 of a hypothetical confounder with both the treatment and the outcome, needed to fully explain away the observed effect.

Contour Plots

# Contour plot of the estimated effect as a function of
# confounding strength with treatment (R2_DZ.X) and outcome (R2_YZ.X)
plot(sens, sensitivity.of = "estimate")

The contour plot shows:

  • The x-axis represents the partial R2R^2 of the hypothetical confounder with the treatment.
  • The y-axis represents the partial R2R^2 of the hypothetical confounder with the outcome.
  • Contour lines show the adjusted estimate after accounting for such a confounder.
  • The red line shows where the estimate would cross zero.

Benchmarking

# The benchmark uses observed covariates as reference points
# "How strong would the confounder need to be relative to x1?"
plot(sens, sensitivity.of = "estimate", type = "extreme")

Benchmarking allows researchers to frame sensitivity in interpretable terms: “A confounder would need to be 3 times as important as X1X_1 in explaining both treatment and outcome variation to reduce the effect below zero.” This makes the abstract question of unobserved confounding concrete and communicable.

Application to the IV Context

In IV estimation, sensemakr can be applied to:

  1. Assess the exclusion restriction: Treat the reduced-form regression (YY on ZZ) as the OLS model and evaluate sensitivity to violations of the exclusion restriction (direct effects of ZZ on YY).
  2. Compare OLS and IV sensitivity: If OLS is highly sensitive to confounding but IV is not, this strengthens the case for the IV strategy.
  3. Evaluate instrument independence: Assess sensitivity of the first-stage relationship to potential confounders.
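A minimal sketch of the first use: run sensemakr on the reduced form, treating the instrument as the "treatment" and asking how strong a direct Z-to-Y channel (an exclusion violation) would have to be to account for the reduced-form effect. The toy data below mirrors the chapter's simulated setup:

```r
library(sensemakr)

set.seed(1)
n  <- 2000
u  <- rnorm(n); z <- rnorm(n); x1 <- rnorm(n); x2 <- rnorm(n)
d  <- 0.8 * z + 0.6 * u + rnorm(n)
y  <- 1.0 * d + 0.5 * x1 - 0.3 * x2 + 0.7 * u + rnorm(n)
toy <- data.frame(y, z, x1, x2)

# Reduced form: outcome on instrument plus covariates
rf_model <- lm(y ~ z + x1 + x2, data = toy)

# Sensitivity of the reduced-form coefficient on z to a direct channel,
# benchmarked against the observed covariate x1
sens_rf <- sensemakr(
    model     = rf_model,
    treatment = "z",
    benchmark_covariates = "x1",
    kd = 1:3
)
summary(sens_rf)
```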

IV with Panel Data

Panel data offers additional opportunities to address endogeneity by combining IV with fixed effects that absorb time-invariant confounders.

Fixed Effects IV

When time-invariant unobserved heterogeneity is a concern, entity fixed effects can be combined with IV to control for both time-invariant confounders and time-varying endogeneity:

# Generate panel data
set.seed(42)
n_panel <- 5000
n_groups <- 100
n_time <- 50

panel <- data.frame(
    id   = rep(1:n_groups, each = n_time),
    time = rep(1:n_time, n_groups)
)

# Group-level unobserved heterogeneity
alpha_i <- rnorm(n_groups, sd = 1)
panel$alpha <- alpha_i[panel$id]

# Time-varying confounder
panel$u <- rnorm(n_panel)

# Instrument
panel$z <- rnorm(n_panel)

# Treatment
panel$d <- 0.7 * panel$z + 0.3 * panel$alpha + 0.4 * panel$u + rnorm(n_panel, sd = 0.5)

# Outcome
panel$y <- 1.0 * panel$d + 0.5 * panel$alpha + 0.6 * panel$u + rnorm(n_panel, sd = 0.5)

panel$id   <- factor(panel$id)
panel$time <- factor(panel$time)

# OLS without FE (biased by alpha and u)
panel_ols <- fixest::feols(y ~ d, data = panel)

# OLS with FE (still biased by u)
panel_ols_fe <- fixest::feols(y ~ d | id, data = panel)

# IV without FE (consistent if z is valid, but less efficient)
panel_iv <- fixest::feols(y ~ 1 | 0 | d ~ z, data = panel)

# IV with FE (preferred: controls for alpha, instruments for u)
panel_iv_fe <- fixest::feols(y ~ 1 | id | d ~ z, data = panel)

# IV with two-way FE
panel_iv_twfe <- fixest::feols(y ~ 1 | id + time | d ~ z, data = panel)

fixest::etable(panel_ols, panel_ols_fe, panel_iv, panel_iv_fe, panel_iv_twfe,
               headers = c("OLS", "OLS+FE", "IV", "IV+FE", "IV+TWFE"),
               fitstat = ~ r2 + n + ivf)
#>                               panel_ols      panel_ols_fe           panel_iv
#>                                     OLS            OLS+FE                 IV
#> Dependent Var.:                       y                 y                  y
#>                                                                             
#> Constant                0.0114 (0.0120)                      0.0100 (0.0132)
#> d                     1.382*** (0.0120) 1.256*** (0.0111) 0.9904*** (0.0186)
#> Fixed-Effects:        ----------------- ----------------- ------------------
#> id                                   No               Yes                 No
#> time                                 No                No                 No
#> _____________________ _________________ _________________ __________________
#> S.E. type                           IID               IID                IID
#> R2                              0.72557           0.79484            0.18971
#> Observations                      5,000             5,000              5,000
#> F-test (1st stage), d                --                --            5,178.8
#> 
#>                              panel_iv_fe      panel_iv_twfe
#>                                    IV+FE            IV+TWFE
#> Dependent Var.:                        y                  y
#>                                                            
#> Constant                                                   
#> d                     0.9921*** (0.0157) 0.9920*** (0.0158)
#> Fixed-Effects:        ------------------ ------------------
#> id                                   Yes                Yes
#> time                                  No                Yes
#> _____________________ __________________ __________________
#> S.E. type                            IID                IID
#> R2                               0.44424            0.45014
#> Observations                       5,000              5,000
#> F-test (1st stage), d            6,118.3            6,090.9
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The key insight: entity fixed effects eliminate bias from time-invariant confounders (αi\alpha_i), while IV addresses bias from time-varying confounders (UitU_{it}). Combining both provides the strongest identification.

First-Differenced IV

An alternative to within-transformation (fixed effects) is first differencing. For unit ii at time tt:

ΔYit=β1ΔDit+Δεit\Delta Y_{it} = \beta_1 \Delta D_{it} + \Delta \varepsilon_{it}

First differencing removes time-invariant confounders, just like fixed effects. The instrument in this specification is ΔZit\Delta Z_{it}.

# First-differenced IV: difference y, d, and z within each unit, then
# instrument the differenced treatment with the differenced instrument
panel_fd <- panel %>%
    group_by(id) %>%
    mutate(dy = y - lag(y), dd = d - lag(d), dz = z - lag(z)) %>%
    ungroup() %>%
    na.omit()

iv_fd <- fixest::feols(dy ~ 1 | 0 | dd ~ dz, data = panel_fd)
# Numerically identical to within-transformed (FE) IV only when T = 2

In practice, the fixed effects estimator is preferred for T>2T > 2 because it is more efficient than first differencing under standard assumptions (homoskedasticity and no serial correlation). However, first differencing may be preferred when errors follow a random walk.

Shift-Share (Bartik) Instruments

Shift-share instruments, also known as Bartik instruments after Bartik (1991), are among the most widely used IV strategies in applied economics. They exploit cross-sectional variation in exposure to aggregate shocks to instrument for local-level variables.

Structure

A Bartik instrument takes the form:

Bit=ksik,t1gktB_{it} = \sum_k s_{ik,t-1} \cdot g_{kt}

where:

  • sik,t1s_{ik,t-1} are shares measuring the initial exposure of unit ii to sector/category kk (e.g., industry employment shares in a region).
  • gktg_{kt} are shifts measuring aggregate changes in sector kk (e.g., national industry growth rates).

The idea is that regions with different initial industry compositions are differentially exposed to national industry-level shocks, generating exogenous variation in the local variable of interest (e.g., employment growth, import exposure).
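The construction translates directly into code. The sketch below builds the instrument from simulated region-by-sector data (all names illustrative), including a leave-one-out version of the shifts:

```r
set.seed(7)
n_regions <- 50
n_sectors <- 10

# Pre-period shares: each region's sectoral composition (rows sum to 1)
shares <- matrix(rexp(n_regions * n_sectors), n_regions, n_sectors)
shares <- shares / rowSums(shares)

# Sector growth observed in each region; the national shift is the mean
growth_rk  <- matrix(rnorm(n_regions * n_sectors, 0.02, 0.05),
                     n_regions, n_sectors)
g_national <- colMeans(growth_rk)

# Bartik instrument: B_i = sum_k s_ik * g_k
bartik <- as.vector(shares %*% g_national)

# Leave-one-out shifts: exclude the focal region from each sector mean
g_loo <- (matrix(g_national, n_regions, n_sectors, byrow = TRUE) *
              n_regions - growth_rk) / (n_regions - 1)
bartik_loo <- rowSums(shares * g_loo)
```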

Identification

Recent econometric work has clarified the conditions under which Bartik instruments are valid:

  • Goldsmith-Pinkham, Sorkin, and Swift (2020): Show that identification comes from the shares. The key identifying assumption is that the initial shares are exogenous (uncorrelated with local demand shocks). Valid if initial industry composition is quasi-random.

  • Borusyak, Hull, and Jaravel (2022): Show that identification can come from the shifts. The key assumption is that the national growth rates are exogenous (effectively random after conditioning on controls). This requires many sectors with no single sector dominating.

  • Adao, Kolesar, and Morales (2019): Highlight that standard errors are typically too small because residuals are correlated across regions with similar industry compositions. They propose a correction.

Practical Considerations

  1. Pre-period shares: Always use shares from before the treatment period to avoid simultaneity.
  2. Leave-one-out shifts: Compute national growth rates excluding the focal region to avoid mechanical correlation.
  3. Check share exogeneity: Test whether initial shares predict pre-treatment outcomes or other placebo outcomes.
  4. Use appropriate standard errors: Adao-Kolesar-Morales (AKM) standard errors or exposure-robust inference.

Judge/Examiner IV Designs

Judge (or examiner) IV designs exploit the quasi-random assignment of cases to decision-makers—such as judges, patent examiners, disability examiners, or loan officers—who vary in their leniency or propensity to assign treatment.

Structure

The instrument is typically the leave-out mean of the judge’s decision rate:

Zij=1nj1ii,iJjDiZ_{ij} = \frac{1}{n_j - 1} \sum_{i' \neq i, i' \in J_j} D_{i'}

where JjJ_j is the set of cases assigned to judge jj.
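The leave-out mean has a convenient closed form, subtracting the focal case from the judge's total, which avoids looping over cases. A sketch with simulated case-level data (all names illustrative):

```r
set.seed(99)
n_cases  <- 5000
n_judges <- 40

# Quasi-random assignment of cases to judges
cases <- data.frame(judge = sample(seq_len(n_judges), n_cases, replace = TRUE))

# Judge-specific leniency drives the binary treatment decision
leniency <- runif(n_judges, 0.2, 0.8)
cases$d  <- rbinom(n_cases, 1, leniency[cases$judge])

# Leave-out mean: (judge total - own decision) / (judge caseload - 1)
judge_sum <- ave(cases$d, cases$judge, FUN = sum)
judge_n   <- ave(cases$d, cases$judge, FUN = length)
cases$z_leniency <- (judge_sum - cases$d) / (judge_n - 1)
```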

Key Assumptions

  1. Random (or quasi-random) assignment: Cases must be assigned to judges independently of case characteristics. This is often ensured by institutional rules (e.g., random case rotation).
  2. Relevance: Judges must vary in their treatment propensities.
  3. Exclusion: A judge’s leniency affects the outcome only through the treatment decision.
  4. Monotonicity: A more lenient judge gives (weakly) more favorable treatment to every case. This is the most demanding assumption—it rules out judges who are lenient for some types of cases but strict for others.

Recent Methodological Advances

  • Frandsen, Lefgren, and Leslie (2023): Provide a test for the monotonicity assumption and show that violations can substantially bias estimates.
  • Blandhol, Bonney, Mogstad, and Torgovitsky (2022): Demonstrate that the standard IV estimand may not have a LATE interpretation when monotonicity fails, and propose alternative approaches.
  • Mueller-Smith (2015), Dobbie, Goldin, and Yang (2018): Classic applications in criminal justice.

Practical Considerations

  1. Leave-out construction: Always use leave-out means to avoid mechanical correlation.
  2. Control for court/time: Include fixed effects for the court or location and time period to account for the assignment pool.
  3. Check random assignment: Verify that the instrument is balanced with respect to case characteristics.
  4. Assess monotonicity: Use the Frandsen et al. test or examine whether results are stable across subgroups.

Visualizing IV

Good visualizations are essential for communicating the IV strategy and building intuition.

First-Stage Scatter

The first-stage scatter plot shows the relationship between the instrument and the treatment:

ggplot(df, aes(x = z, y = d)) +
    geom_point(alpha = 0.12, size = 0.8, color = "grey50") +
    geom_smooth(method = "lm", se = TRUE, color = "#2c7bb6", linewidth = 1.2) +
    labs(
        title    = "First Stage: Instrument (Z) vs. Treatment (D)",
        subtitle = paste0("Slope = ", round(coef(lm(d ~ z, data = df))["z"], 3),
                          ", F = ", round(summary(lm(d ~ z, data = df))$fstatistic[1], 1)),
        x = "Instrument (Z)",
        y = "Treatment (D)"
    ) +
    causalverse::ama_theme()
#> `geom_smooth()` using formula = 'y ~ x'

Reduced-Form Scatter

The reduced-form relationship between the instrument and the outcome captures the total effect of ZZ on YY through DD. The ratio of the reduced-form to the first-stage slope gives the IV estimate:

rf_slope <- coef(lm(y ~ z, data = df))["z"]
fs_slope <- coef(lm(d ~ z, data = df))["z"]

ggplot(df, aes(x = z, y = y)) +
    geom_point(alpha = 0.12, size = 0.8, color = "grey50") +
    geom_smooth(method = "lm", se = TRUE, color = "#d7191c", linewidth = 1.2) +
    labs(
        title    = "Reduced Form: Instrument (Z) vs. Outcome (Y)",
        subtitle = paste0("Slope = ", round(rf_slope, 3),
                          "; IV = RF/FS = ", round(rf_slope, 3), "/",
                          round(fs_slope, 3), " = ", round(rf_slope / fs_slope, 3)),
        x = "Instrument (Z)",
        y = "Outcome (Y)"
    ) +
    causalverse::ama_theme()
#> `geom_smooth()` using formula = 'y ~ x'

OLS vs. IV Coefficient Comparison

A forest plot comparing estimates across specifications makes the bias in OLS visible:

# Gather estimates from multiple models
coef_comparison <- data.frame(
    Model = c("OLS (naive)", "OLS (controls)", "IV (simple)", "IV (controls)",
              "IV (two instruments)", "IV + FE", "True Effect"),
    Estimate = c(
        coef(ols_naive)["d"],
        coef(ols_controls)["d"],
        coef(iv_simple)["fit_d"],
        coef(iv_controls)["fit_d"],
        coef(m3)["fit_d"],
        coef(m5)["fit_d"],
        1.0
    ),
    CI_lower = c(
        confint(ols_naive)["d", 1],
        confint(ols_controls)["d", 1],
        confint(iv_simple)["fit_d", 1],
        confint(iv_controls)["fit_d", 1],
        confint(m3)["fit_d", 1],
        confint(m5)["fit_d", 1],
        NA
    ),
    CI_upper = c(
        confint(ols_naive)["d", 2],
        confint(ols_controls)["d", 2],
        confint(iv_simple)["fit_d", 2],
        confint(iv_controls)["fit_d", 2],
        confint(m3)["fit_d", 2],
        confint(m5)["fit_d", 2],
        NA
    )
)

coef_comparison$Model <- factor(
    coef_comparison$Model,
    levels = rev(coef_comparison$Model)
)

ggplot(coef_comparison, aes(x = Estimate, y = Model)) +
    geom_point(size = 3) +
    geom_errorbarh(
        aes(xmin = CI_lower, xmax = CI_upper),
        height = 0.2, na.rm = TRUE
    ) +
    geom_vline(xintercept = 1.0, linetype = "dashed", color = "grey40") +
    labs(
        title    = "Comparison of OLS and IV Estimates",
        subtitle = "Dashed line indicates the true causal effect (1.0)",
        x        = "Estimated Coefficient on D",
        y        = NULL
    ) +
    causalverse::ama_theme()
#> Warning: `geom_errorbarh()` was deprecated in ggplot2 4.0.0.
#>  Please use the `orientation` argument of `geom_errorbar()` instead.
#> This warning is displayed once per session.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> `height` was translated to `width`.

The plot should clearly show that OLS estimates are biased upward, while IV estimates across various specifications are clustered around the true value.

Binned Scatter Plot for First Stage

For presentations and publications, a binned scatter plot can provide a cleaner visual of the first stage by averaging across bins of the instrument:

# Create 20 equal-sized bins of the instrument
df$z_bin <- cut(df$z, breaks = 20, labels = FALSE)
bin_means <- df %>%
    group_by(z_bin) %>%
    summarise(
        z_mean = mean(z),
        d_mean = mean(d),
        d_se   = sd(d) / sqrt(n()),
        .groups = "drop"
    )

ggplot(bin_means, aes(x = z_mean, y = d_mean)) +
    geom_point(size = 2.5, color = "#2c7bb6") +
    geom_errorbar(aes(ymin = d_mean - 1.96 * d_se,
                      ymax = d_mean + 1.96 * d_se),
                  width = 0.05, color = "#2c7bb6", alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE, color = "#d7191c",
                linewidth = 0.8, linetype = "dashed") +
    labs(
        title = "Binned Scatter: First Stage Relationship",
        subtitle = "Each point is the mean within a bin of the instrument",
        x = "Instrument (Z), bin mean",
        y = "Treatment (D), bin mean"
    ) +
    causalverse::ama_theme()
#> `geom_smooth()` using formula = 'y ~ x'

LIML, Fuller, and K-Class Estimators

The K-Class Family

The k-class estimator unifies several well-known IV estimators under a single parameterization. For the standard IV model Y=Dβ+Xγ+εY = D\beta + X\gamma + \varepsilon with instruments ZZ, the k-class estimator solves:

β̂k=(D(IkMZ)D)1D(IkMZ)Y\hat{\beta}_k = (D'(I - kM_Z)D)^{-1} D'(I - kM_Z)Y

where MZ=IZ(ZZ)1ZM_Z = I - Z(Z'Z)^{-1}Z' is the residual-maker matrix from projecting onto the instruments. Different values of kk yield different estimators:

  • k=0k = 0: OLS (ignores endogeneity)
  • k=1k = 1: Two-Stage Least Squares (2SLS)
  • k=λ̂mink = \hat{\lambda}_{\min}: Limited Information Maximum Likelihood (LIML), where λ̂min\hat{\lambda}_{\min} is the smallest eigenvalue of a certain matrix
  • k=λ̂mina/(nK)k = \hat{\lambda}_{\min} - a/(n-K): Fuller’s modified LIML (with bias parameter aa, commonly a=1a = 1 or a=4a = 4)

Why LIML? When instruments are weak or numerous, 2SLS suffers from finite-sample bias toward OLS. LIML is approximately median-unbiased even with weak instruments, making it a safer default. Fuller’s estimator further reduces bias and has finite moments (unlike LIML whose moments may not exist).

The Anderson-Rubin (AR) test and the Conditional Likelihood Ratio (CLR) test provide inference that remains valid regardless of instrument strength—they do not rely on first-stage F-statistics being large.

Implementation with ivmodel

library(ivmodel)

# Simulate IV data with a moderately weak instrument
set.seed(42)
n <- 500
z <- rnorm(n)
u <- rnorm(n)
x_covar <- rnorm(n)
d <- 0.3 * z + 0.8 * u + rnorm(n)  # weak instrument (pi = 0.3)
y <- 1.5 * d + 0.5 * x_covar + u + rnorm(n)  # true beta = 1.5

# Fit the ivmodel object
iv_fit <- ivmodel(
    Y = y, D = d, Z = z, X = x_covar,
    intercept = TRUE
)

# Extract estimates from different estimators
tsls_est  <- TSLS(iv_fit)
liml_est  <- LIML(iv_fit)
full_est  <- Fuller(iv_fit, b = 1)  # Fuller with a = 1
full4_est <- Fuller(iv_fit, b = 4)  # Fuller with a = 4

# Custom k-class: k = 0.5 (between OLS and 2SLS)
kclass_est <- KClass(iv_fit, k = 0.5)

# Anderson-Rubin test (valid with weak instruments)
ar_result <- AR.test(iv_fit)

# Conditional Likelihood Ratio test
clr_result <- CLR(iv_fit)

# Build comparison table
comparison <- data.frame(
    Estimator = c("2SLS", "LIML", "Fuller(1)", "Fuller(4)", "K-Class(0.5)"),
    Estimate  = c(
        tsls_est$point.est, liml_est$point.est,
        full_est$point.est, full4_est$point.est,
        kclass_est$point.est
    ),
    Std.Error = c(
        tsls_est$std.err, liml_est$std.err,
        full_est$std.err, full4_est$std.err,
        kclass_est$std.err
    ),
    CI.Lower = c(
        tsls_est$ci[1], liml_est$ci[1],
        full_est$ci[1], full4_est$ci[1],
        kclass_est$ci[1]
    ),
    CI.Upper = c(
        tsls_est$ci[2], liml_est$ci[2],
        full_est$ci[2], full4_est$ci[2],
        kclass_est$ci[2]
    )
)

knitr::kable(comparison, digits = 3,
             caption = "K-Class Estimator Comparison (True beta = 1.5)")

cat("\nAnderson-Rubin test p-value:", ar_result$p.value,
    "\nCLR test p-value:", clr_result$p.value, "\n")

When 2SLS and LIML estimates diverge substantially, this signals that weak instruments may be biasing 2SLS. In such cases, prefer LIML or Fuller, and rely on AR/CLR confidence sets for inference.

Comprehensive IV Diagnostics with ivDiag

The Effective F-Statistic

The conventional first-stage F-statistic (Staiger and Stock 1997) can be misleading with heteroskedastic, clustered, or serially correlated errors. Montiel Olea and Pflueger (2013) proposed the effective F-statistic, which accounts for non-iid errors and comes with its own critical values for detecting weak instruments under heteroskedasticity or clustering.

The tF procedure (Lee et al. 2022) adjusts the t-ratio critical values based on the first-stage F-statistic, providing valid inference even when instruments are not strong. This avoids a binary strong/weak classification and instead adjusts continuously for instrument strength.

Implementation with ivDiag

library(ivDiag)

# Simulate panel-like data
set.seed(42)
n <- 1000
df_diag <- data.frame(
    Y  = rnorm(n),
    D  = rnorm(n),
    Z  = rnorm(n),
    X1 = rnorm(n),
    X2 = rnorm(n),
    cl = sample(1:50, n, replace = TRUE)
)
# Create endogeneity and instrument relevance
u <- rnorm(n)
df_diag$D <- 0.4 * df_diag$Z + 0.3 * df_diag$X1 + 0.6 * u + rnorm(n, sd = 0.5)
df_diag$Y <- 2.0 * df_diag$D + 0.5 * df_diag$X1 - 0.3 * df_diag$X2 + u + rnorm(n, sd = 0.5)

# One-call comprehensive diagnostics
diag_result <- ivDiag(
    data    = df_diag,
    Y       = "Y",
    D       = "D",
    Z       = "Z",
    controls = c("X1", "X2"),
    cl       = "cl"  # name of the cluster variable
)

# Display all diagnostic results
diag_result

# Individual diagnostic functions
eff_f <- eff_F(
    data     = df_diag,
    Y        = "Y",
    D        = "D",
    Z        = "Z",
    controls = c("X1", "X2"),
    cl       = "cl"
)
cat("Effective F-statistic:", eff_f, "\n")

# tF-adjusted inference: tF() takes the 2SLS coefficient, its standard
# error, and the first-stage F-statistic
fit_2sls <- fixest::feols(Y ~ X1 + X2 | 0 | D ~ Z,
                          cluster = ~cl, data = df_diag)
tf_result <- tF(coef = coef(fit_2sls)["fit_D"],
                se   = fixest::se(fit_2sls)["fit_D"],
                Fstat = eff_f)
tf_result

The ivDiag package (Lal et al. 2023) consolidates best-practice diagnostics into a single call. The effective F-statistic can also be computed manually as a robust-variance analogue of the conventional first-stage F-statistic, closely related to the Kleibergen-Paap statistic.

LASSO-Selected Instruments and Many Instruments

The Many-Instruments Problem

When the number of candidate instruments K is large relative to the sample size n, 2SLS suffers from severe finite-sample bias; the bias approaches that of OLS as K/n grows. Even when K < n, overfitting in the first stage inflates the effective endogeneity of the fitted values.
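The pattern is easy to see in a small base-R simulation (the design here is illustrative): with one relevant instrument and many irrelevant ones, the 2SLS estimate drifts from the just-identified estimate toward OLS.

```r
set.seed(1)
n <- 400
K <- 160                             # many candidate instruments
Z <- matrix(rnorm(n * K), n, K)
u <- rnorm(n)
d <- 0.6 * Z[, 1] + u + rnorm(n)     # only the first instrument is relevant
y <- 1.0 * d + u + rnorm(n)          # true beta = 1.0

# Manual 2SLS: regress y on the first-stage fitted values
beta_just <- coef(lm(y ~ fitted(lm(d ~ Z[, 1]))))[2]  # just-identified
beta_many <- coef(lm(y ~ fitted(lm(d ~ Z))))[2]       # all 160 instruments
beta_ols  <- coef(lm(y ~ d))[2]

# With K/n = 0.4, expect the many-instrument estimate to be pulled toward OLS
round(c(just = beta_just, many = beta_many, ols = beta_ols), 3)
```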

LASSO-based instrument selection (Belloni, Chen, Chernozhukov, and Hansen 2012) addresses this by using ℓ1-penalized regression in the first stage to select the most relevant instruments, then performing IV estimation with only the selected instruments. The hdm package implements the post-LASSO IV estimator with formal guarantees on inference validity.

Three variants are available:

  • rlassoIV(): LASSO selects instruments in the first stage
  • rlassoIVselectZ(): LASSO selects among instruments only
  • rlassoIVselectX(): LASSO selects among controls only

Implementation with hdm

library(hdm)

# Simulate data with many candidate instruments
set.seed(42)
n <- 500
K <- 100  # many candidate instruments

# Generate instruments: only first 5 are truly relevant
Z_mat <- matrix(rnorm(n * K), n, K)
colnames(Z_mat) <- paste0("z", 1:K)

pi_true <- c(rep(0.3, 5), rep(0, K - 5))  # sparse first stage
u <- rnorm(n)
x_ctrl <- rnorm(n)

d <- Z_mat %*% pi_true + 0.7 * u + rnorm(n, sd = 0.5)
y <- 1.0 * d + 0.5 * x_ctrl + u + rnorm(n, sd = 0.5)  # true beta = 1.0

# LASSO-selected IV estimation (hdm expects matrix inputs)
lasso_iv <- rlassoIV(
    x = cbind(x_ctrl), d = d, y = y, z = Z_mat,
    select.X = TRUE, select.Z = TRUE
)
summary(lasso_iv)

# Variant: select only among instruments
lasso_z <- rlassoIVselectZ(
    x = cbind(x_ctrl), d = d, y = y, z = Z_mat
)
summary(lasso_z)

# Variant: select only among controls (passing only the five relevant
# instruments, since no selection is done over z here)
lasso_x <- rlassoIVselectX(
    x = cbind(x_ctrl), d = d, y = y, z = Z_mat[, 1:5]
)
summary(lasso_x)

# Compare the selected-instrument estimate with the true parameter
cat("\nLASSO IV estimate:", lasso_iv$coefficients,
    "\nTrue parameter: 1.0\n")

When the number of candidate instruments is large, LASSO selection dramatically reduces finite-sample bias while maintaining valid post-selection inference.

Jackknife Instrumental Variables (JIVE)

Leave-One-Out First Stage

The Jackknife IV Estimator (JIVE) addresses finite-sample bias in 2SLS by using leave-one-out predictions from the first stage. For each observation i, the first-stage fitted value is computed from a regression that excludes observation i:

\hat{D}_{-i} = Z_i' \hat{\pi}_{-i}

This breaks the mechanical correlation between the first-stage fitted values and the second-stage errors that drives 2SLS bias, particularly when instruments are numerous or weak. JIVE1 uses the exact leave-one-out prediction, computable as (Z_i'π̂ − h_ii D_i)/(1 − h_ii), where h_ii is the leverage of observation i; JIVE2 replaces the 1 − h_ii denominator with a common 1 − 1/n factor.

Angrist, Imbens, and Krueger (1999) showed that JIVE is approximately unbiased for the coefficient on the endogenous variable when the number of instruments is large.
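The leave-one-out fitted values can be computed without running n separate regressions by using the hat matrix of the first stage. The function below is an illustrative base-R sketch of JIVE1 for a single endogenous regressor with no additional covariates (mean-zero data, so no intercept):

```r
jive1 <- function(y, d, Z) {
    H <- Z %*% solve(crossprod(Z), t(Z))             # first-stage hat matrix
    h <- diag(H)
    d_loo <- (as.vector(H %*% d) - h * d) / (1 - h)  # leave-one-out fitted values
    sum(d_loo * y) / sum(d_loo * d)                  # IV with d_loo as instrument
}

set.seed(1)
n <- 300; K <- 20
Z <- matrix(rnorm(n * K), n, K)
u <- rnorm(n)
d <- as.vector(Z %*% rep(0.15, K)) + 0.6 * u + rnorm(n)
y <- 2 * d + u + rnorm(n)   # true beta = 2
jive1(y, d, Z)
```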

Implementation with SteinIV

library(SteinIV)

# Simulate data with multiple instruments
set.seed(42)
n <- 300
K <- 20  # number of instruments

Z_mat <- matrix(rnorm(n * K), n, K)
pi_coefs <- runif(K, 0.05, 0.15)
u <- rnorm(n)

d <- as.vector(Z_mat %*% pi_coefs + 0.6 * u + rnorm(n, sd = 0.5))
y <- 2.0 * d + u + rnorm(n, sd = 0.5)  # true beta = 2.0

# JIVE estimation (jive.est() expects matrix arguments)
jive_result <- jive.est(y = y, X = cbind(d), Z = Z_mat)

# 2SLS with the same instruments, for comparison
tsls_result <- tsls.est(y = y, X = cbind(d), Z = Z_mat)

cat("JIVE estimate:", jive_result$est,
    "\n2SLS estimate:", tsls_result$est,
    "\nOLS estimate:", coef(lm(y ~ d))["d"],
    "\nTrue parameter: 2.0\n")

JIVE typically produces estimates closer to the true parameter than 2SLS when instruments are numerous, at the cost of slightly larger standard errors.

Marginal Treatment Effects (MTE)

The MTE Framework

The Marginal Treatment Effect (Heckman and Vytlacil 2005) generalizes the LATE framework by defining a treatment effect for each value of the unobserved resistance to treatment U_D:

\text{MTE}(x, u_D) = E[Y_1 - Y_0 \mid X = x, U_D = u_D]

where U_D is the quantile of the first-stage unobservable such that D = \mathbf{1}[p(Z) \geq U_D], with p(Z) = P(D = 1 \mid Z) the propensity score. The MTE evaluated at u_D gives the treatment effect for individuals who are indifferent between treatment and control when the propensity score equals u_D.

All conventional treatment effect parameters can be recovered as weighted averages of the MTE:

  • ATE: \int_0^1 \text{MTE}(u) \, du
  • ATT: \int_0^1 \text{MTE}(u) \, h_{\text{ATT}}(u) \, du
  • LATE: \int_0^1 \text{MTE}(u) \, h_{\text{LATE}}(u) \, du

This makes the MTE a fundamental building block for extrapolation beyond LATE.
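As a quick numerical illustration of the first bullet: with the MTE shape used in the simulation below (at x = 0), the ATE is the unweighted integral of the MTE over u.

```r
# MTE(u) = 2 - 1.5 * u, matching the simulated DGP below at x = 0
mte <- function(u) 2 - 1.5 * u
ate <- integrate(mte, lower = 0, upper = 1)$value
ate  # 2 - 1.5/2 = 1.25
```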

Implementation with ivmte

library(ivmte)

# Simulate data for MTE estimation
set.seed(42)
n <- 2000
z <- sample(1:5, n, replace = TRUE)  # discrete instrument
x <- rnorm(n)
u_d <- runif(n)  # unobserved resistance

# Propensity score depends on Z
pscore <- plogis(-1 + 0.5 * z)
d <- as.integer(pscore >= u_d)

# Heterogeneous treatment effects: MTE varies with u_d
y0 <- 1 + 0.5 * x + rnorm(n, sd = 0.5)
y1 <- y0 + 2 - 1.5 * u_d + 0.3 * x  # MTE decreasing in u_d
y <- d * y1 + (1 - d) * y0

df_mte <- data.frame(y = y, d = d, z = z, x = x)

# Parametric MTE specification
# m0 and m1 are the potential outcome equations; "u" denotes the
# unobserved resistance (the default uname)
mte_fit <- ivmte(
    data       = df_mte,
    target     = "ate",
    m0         = ~ x + u,            # control potential outcome
    m1         = ~ x + u,            # treated potential outcome
    ivlike     = y ~ d + factor(z),  # IV-like moments used for identification
    propensity = d ~ factor(z)       # first stage for the propensity score
)

mte_fit

# More flexible specification: higher-order polynomial in u
mte_fit_semi <- ivmte(
    data       = df_mte,
    target     = "ate",
    m0         = ~ 0 + u + I(u^2) + x + x:u,
    m1         = ~ 0 + u + I(u^2) + x + x:u,
    ivlike     = y ~ d + factor(z),
    propensity = d ~ factor(z),
    uname      = "u"
)

mte_fit_semi

The MTE framework is particularly valuable when treatment effects vary across individuals and you want to extrapolate beyond the complier subpopulation identified by a specific instrument.

Control Function Approach

Two-Stage Residual Inclusion

The control function approach provides an alternative to standard IV by explicitly modeling the source of endogeneity. The key insight is that endogeneity arises because the treatment D is correlated with the structural error ε. If we can recover the component of D that is correlated with ε (via the first-stage residuals), we can include it as a control to remove the bias.

Procedure:

  1. Regress D on Z and X to obtain residuals v̂_i = D_i − D̂_i
  2. Regress Y on D, X, and v̂ to obtain β̂

In the linear model, this is numerically identical to 2SLS. However, the control function approach has two major advantages:

  • It extends naturally to nonlinear models (probit, Poisson, etc.) where standard 2SLS is inconsistent
  • The t-statistic on v̂ provides a direct test of endogeneity (equivalent to the Hausman test)

Important: Standard errors from the second stage are incorrect because they ignore estimation error in v̂. Bootstrap or analytical corrections are required.

Manual Implementation with fixest

library(fixest)

# Simulate data
set.seed(42)
n <- 1000
z <- rnorm(n)
u <- rnorm(n)
x <- rnorm(n)
d <- 0.5 * z + 0.3 * x + 0.7 * u + rnorm(n, sd = 0.5)
y <- 1.5 * d + 0.8 * x + u + rnorm(n, sd = 0.5)

df_cf <- data.frame(y = y, d = d, z = z, x = x)

# Step 1: First-stage regression
first_stage <- feols(d ~ z + x, data = df_cf)
df_cf$v_hat <- residuals(first_stage)

# Step 2: Second-stage with control function
cf_stage2 <- feols(y ~ d + x + v_hat, data = df_cf)

# Compare with standard 2SLS
iv_standard <- feols(y ~ x | d ~ z, data = df_cf)

# Display results
etable(
    iv_standard, cf_stage2,
    headers = c("2SLS", "Control Function"),
    notes = "Note: CF standard errors require bootstrap correction."
)
#>                        iv_standard          cf_stage2
#>                               2SLS   Control Function
#> Dependent Var.:                  y                  y
#>                                                      
#> Constant          -0.0128 (0.0350)   -0.0128 (0.0238)
#> d                1.518*** (0.0679)  1.518*** (0.0461)
#> x               0.7675*** (0.0385) 0.7675*** (0.0261)
#> v_hat                              0.9294*** (0.0535)
#> _______________ __________________ __________________
#> S.E. type                      IID                IID
#> Observations                 1,000              1,000
#> R2                         0.28794            0.92201
#> Adj. R2                    0.28651            0.92177
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# The t-stat on v_hat tests endogeneity
cat("\nEndogeneity test (t-stat on v_hat):",
    coef(cf_stage2)["v_hat"] / se(cf_stage2)["v_hat"], "\n")
#> 
#> Endogeneity test (t-stat on v_hat): 17.36462

# Bootstrap for correct standard errors
set.seed(42)
boot_coefs <- replicate(500, {
    idx <- sample(n, replace = TRUE)
    df_b <- df_cf[idx, c("y", "d", "z", "x")]
    fs <- lm(d ~ z + x, data = df_b)
    df_b$v_hat <- residuals(fs)
    ss <- lm(y ~ d + x + v_hat, data = df_b)
    coef(ss)["d"]
})
cat("Bootstrap SE for beta_d:", sd(boot_coefs),
    "\nBootstrap 95% CI: [", quantile(boot_coefs, 0.025),
    ",", quantile(boot_coefs, 0.975), "]\n")
#> Bootstrap SE for beta_d: 0.06780177 
#> Bootstrap 95% CI: [ 1.377734 , 1.642153 ]

The control function approach is especially useful in nonlinear settings (e.g., probit or Poisson models) where plugging in first-stage fitted values (as in standard 2SLS) produces inconsistent estimates.

Internal Instruments: Heteroskedasticity-Based and Copula Methods

When No External Instrument Is Available

In many empirical settings, a valid external instrument simply does not exist. Two classes of methods exploit features of the data itself to achieve identification:

Lewbel (2012) heteroskedasticity-based instruments exploit heteroskedasticity in the first-stage errors to construct instruments from the data. The identifying assumption is that the covariance between the first-stage regressors and the product of the first- and second-stage errors is zero, while the first-stage errors are heteroskedastic. The constructed instruments are (X − X̄)v̂, where v̂ are the first-stage residuals.

Gaussian copula correction (Park and Gupta 2012) exploits the assumption that the endogenous regressor is non-normally distributed while the errors are normal. Under a Gaussian copula, the correlation between the endogenous variable and the error can be identified from the marginal distribution of the regressor.

Both methods trade the standard exclusion restriction for alternative (untestable) assumptions. They are best used as robustness checks or when supplementing weak external instruments.
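To make Lewbel's construction concrete, here is a hedged base-R sketch on simulated data (the REndo implementation below handles the details more carefully; all names here are illustrative):

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
u <- rnorm(n)                    # unobserved confounder
v <- rnorm(n) * exp(0.5 * x)     # heteroskedastic first-stage error
d <- 0.5 * x + 0.6 * u + v
y <- 1.0 * d + 0.5 * x + u + rnorm(n)   # true effect of d: 1.0

v_hat <- residuals(lm(d ~ x))
z_lewbel <- (x - mean(x)) * v_hat       # constructed instrument

# Manual 2SLS with the constructed instrument
d_hat <- fitted(lm(d ~ z_lewbel + x))
b_lewbel <- coef(lm(y ~ d_hat + x))["d_hat"]
b_lewbel  # should recover roughly 1.0
```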

Implementation with REndo

library(REndo)

# Simulate data with endogeneity but no external instrument
set.seed(42)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)

# Create endogeneity: d correlated with error
u <- rnorm(n)
# Non-normal endogenous regressor (required for the copula correction)
d <- 0.5 * x1 + 0.3 * x2 + 0.6 * u + rexp(n, rate = 1)

y <- 1.0 * d + 0.5 * x1 - 0.3 * x2 + u + rnorm(n, sd = 0.5)

df_endo <- data.frame(y = y, d = d, x1 = x1, x2 = x2)

# Gaussian copula correction
cop_fit <- copulaCorrection(
    y ~ d + x1 + x2 | continuous(d),
    data = df_endo
)
summary(cop_fit)

# Lewbel (2012) heteroskedasticity-based instruments: the second part
# of the formula names the endogenous regressor, and IIV() lists the
# exogenous regressors used to build internal instruments
het_fit <- hetErrorsIV(
    y ~ d + x1 + x2 | d | IIV(x1, x2),
    data = df_endo
)
summary(het_fit)

These methods should be applied with caution: the Gaussian copula requires non-normality of the endogenous regressor, and Lewbel’s approach requires heteroskedasticity in the first stage. Diagnostic checks (e.g., testing for heteroskedasticity, assessing distributional assumptions) are essential.
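A minimal diagnostic sketch for the Lewbel requirement is a Breusch-Pagan-style test on the first stage (base R, simulated data with heteroskedasticity built in; the design is illustrative):

```r
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
d  <- 0.5 * x1 + 0.3 * x2 + rnorm(n) * exp(0.3 * x1)  # error variance depends on x1

fs  <- lm(d ~ x1 + x2)                  # first-stage regression
aux <- lm(residuals(fs)^2 ~ x1 + x2)    # auxiliary regression on squared residuals
bp_stat <- n * summary(aux)$r.squared   # ~ chi-squared(2) under homoskedasticity
p_value <- pchisq(bp_stat, df = 2, lower.tail = FALSE)
p_value  # a small p-value supports heteroskedasticity-based identification
```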

Partial Identification and Bounds

When the Exclusion Restriction is Imperfect

When the exclusion restriction is only approximately valid—the instrument may have a small direct effect on the outcome—point identification fails. Instead, we can obtain bounds on the causal effect that remain valid under weaker assumptions.

Nevo and Rosen (2012) derive bounds when the instrument is “imperfect” (correlated with the error, but the direction and rough magnitude of the correlation are known). Conley, Hansen, and Rossi (2012) propose a framework where the direct effect γ of the instrument on the outcome lies in a known interval [γ_l, γ_u], yielding bounds on the structural parameter.

For the binary treatment and binary instrument case, Balke and Pearl (1997) derived sharp bounds on the ATE using the potential outcomes framework without assuming monotonicity. The sharp bounds take maxima and minima over several linear combinations of observed conditional probabilities; in particular, the ATE always satisfies

P(Y=1, D=1 \mid Z=1) + P(Y=0, D=0 \mid Z=0) - 1 \;\leq\; \text{ATE} \;\leq\; 1 - P(Y=0, D=1 \mid Z=1) - P(Y=1, D=0 \mid Z=0)
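These outer (Manski-type) bounds, which bracket the sharp Balke-Pearl bounds, are simple to compute from binary data; a base-R sketch with simulated (Y, D, Z):

```r
set.seed(1)
n <- 5000
z <- rbinom(n, 1, 0.5)
u <- rbinom(n, 1, 0.5)                       # unobserved confounder
d <- rbinom(n, 1, plogis(-1 + 1.5 * z + u))
y <- rbinom(n, 1, plogis(-0.5 + d + u))

# Conditional cell probabilities
p11_1 <- mean(y == 1 & d == 1 & z == 1) / mean(z == 1)
p00_0 <- mean(y == 0 & d == 0 & z == 0) / mean(z == 0)
p01_1 <- mean(y == 0 & d == 1 & z == 1) / mean(z == 1)
p10_0 <- mean(y == 1 & d == 0 & z == 0) / mean(z == 0)

lower <- p11_1 + p00_0 - 1
upper <- 1 - p01_1 - p10_0
c(lower = lower, upper = upper)  # contains the ATE under instrument validity
```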

Manual Implementation: Sensitivity to Exclusion Restriction Violations

library(fixest)
library(ggplot2)

# Simulate IV data
set.seed(42)
n <- 2000
z <- rnorm(n)
u <- rnorm(n)
d <- 0.5 * z + 0.6 * u + rnorm(n)
y <- 1.5 * d + u + rnorm(n)  # true beta = 1.5
df_sens <- data.frame(y = y, d = d, z = z)

# Standard 2SLS (assumes gamma = 0)
iv_fit <- feols(y ~ 1 | d ~ z, data = df_sens)

# Sensitivity analysis: what if Z has direct effect gamma on Y?
# Y = beta*D + gamma*Z + u
# 2SLS then converges to beta + gamma/pi, where pi is the first-stage coefficient
pi_hat <- coef(feols(d ~ z, data = df_sens))["z"]
beta_2sls <- coef(iv_fit)["fit_d"]

# Bounds for different assumptions about gamma
gamma_grid <- seq(-0.3, 0.3, by = 0.05)
beta_bounds <- beta_2sls - gamma_grid / pi_hat

bounds_df <- data.frame(
    gamma = gamma_grid,
    beta_adjusted = beta_bounds
)

# Plot sensitivity
ggplot(bounds_df, aes(x = gamma, y = beta_adjusted)) +
    geom_line(linewidth = 1) +
    geom_hline(yintercept = 1.5, linetype = "dashed", color = "red") +
    geom_vline(xintercept = 0, linetype = "dotted") +
    geom_ribbon(
        data = subset(bounds_df, gamma >= -0.1 & gamma <= 0.1),
        aes(ymin = -Inf, ymax = Inf), alpha = 0.15, fill = "steelblue"
    ) +
    labs(
        title = "Sensitivity of IV Estimate to Exclusion Restriction Violations",
        x = expression(gamma ~ "(direct effect of Z on Y)"),
        y = expression(hat(beta) ~ "(adjusted estimate)"),
        caption = "Dashed red line = true beta. Shaded region: |gamma| < 0.1"
    ) +
    causalverse::ama_theme()

Partial identification provides honest inference when the exclusion restriction is doubtful. Reporting bounds alongside point estimates can strengthen the credibility of IV analyses.

Nonparametric IV Estimation

Beyond Linearity

Standard IV assumes a linear relationship between the treatment and outcome. When this functional form is misspecified, 2SLS estimates a weighted average of marginal effects that may not correspond to any policy-relevant parameter.

Nonparametric IV (NPIV) estimation relaxes the linearity assumption entirely, estimating the structural function g(·) in:

Y = g(D) + \varepsilon, \quad E[\varepsilon \mid Z] = 0

This is an ill-posed inverse problem: the conditional expectation restriction does not uniquely pin down g without regularization. Various approaches exist, including sieve estimation (Newey and Powell 2003), Tikhonov regularization, and kernel-based methods.
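The sieve idea can be sketched in base R before reaching for a package: approximate g with a low-order polynomial in D and instrument the polynomial terms with a richer polynomial in Z (the basis orders here are illustrative choices, not a tuned implementation):

```r
set.seed(1)
n <- 1000
z <- rnorm(n); u <- rnorm(n)
d <- 0.6 * z + 0.5 * u + rnorm(n, sd = 0.5)
y <- sin(d) + 0.5 * d + u + rnorm(n, sd = 0.3)   # true g(d) = sin(d) + 0.5*d

D_basis <- poly(d, 3, raw = TRUE)   # sieve basis for g
Z_basis <- poly(z, 5, raw = TRUE)   # instrument basis

# 2SLS with the sieve: project each power of d on the instrument basis
D_hat <- fitted(lm(D_basis ~ Z_basis))
fit   <- lm(y ~ D_hat)

# Evaluate the estimated g on a grid
grid  <- seq(-2, 2, by = 0.1)
g_hat <- cbind(1, poly(grid, 3, raw = TRUE)) %*% coef(fit)
head(cbind(grid, g_hat))
```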

Implementation with npiv

library(npiv)

# Simulate data with nonlinear structural function
set.seed(42)
n <- 1000
z <- rnorm(n)
u <- rnorm(n)
d <- 0.6 * z + 0.5 * u + rnorm(n, sd = 0.5)

# True structural function: g(d) = sin(d) + 0.5*d
y <- sin(d) + 0.5 * d + u + rnorm(n, sd = 0.3)

# NPIV estimation using a series/sieve approach; the npiv package
# takes Y (outcome), X (endogenous regressor), and W (instrument)
npiv_fit <- npiv(Y = y, X = d, W = z)

# Plot estimated vs true structural function (the fitted object
# stores the estimated function values in npiv_fit$h)
ord <- order(d)
plot(d[ord], npiv_fit$h[ord], type = "l",
     xlab = "d", ylab = "g(d)",
     main = "Nonparametric IV: Estimated vs True Structural Function")
lines(d[ord], sin(d[ord]) + 0.5 * d[ord], lty = 2, col = "red")
legend("topleft", legend = c("NPIV estimate", "true g"),
       lty = c(1, 2), col = c("black", "red"))

NPIV is most useful for exploratory analysis to check whether the linear specification is adequate, or when theory suggests a nonlinear structural relationship.

Randomization Inference for IV

Finite-Sample Valid P-Values

Standard IV inference relies on asymptotic approximations that may perform poorly in small samples or with weak instruments. Randomization inference (also called permutation inference) provides exact, finite-sample valid p-values by exploiting the known or assumed assignment mechanism of the instrument.

The key idea: under the sharp null hypothesis H_0: β = β_0 for all units, we can compute the implied residual Ỹ_i = Y_i − β_0 D_i for each observation. If the instrument is randomly assigned, permuting Z should not systematically affect the relationship between Z and Ỹ. The randomization distribution of the Wald statistic under H_0 provides exact p-values.

Manual Implementation

# Simulate IV data
set.seed(42)
n <- 200
z <- rbinom(n, 1, 0.5)  # randomly assigned binary instrument
u <- rnorm(n)
d <- 0.4 * z + 0.5 * u + rnorm(n, sd = 0.5)
y <- 1.0 * d + u + rnorm(n, sd = 0.5)  # true beta = 1.0

# Observed Wald statistic
wald_obs <- (mean(y[z == 1]) - mean(y[z == 0])) /
    (mean(d[z == 1]) - mean(d[z == 0]))

# Permutation distribution under H0: beta = 0
set.seed(42)
n_perm <- 2000
wald_perm <- numeric(n_perm)

for (b in seq_len(n_perm)) {
    z_perm <- sample(z)  # permute instrument assignment
    wald_perm[b] <- (mean(y[z_perm == 1]) - mean(y[z_perm == 0])) /
        (mean(d[z_perm == 1]) - mean(d[z_perm == 0]))
}

# Two-sided p-value
p_value_ri <- mean(abs(wald_perm) >= abs(wald_obs))

# Visualize
perm_df <- data.frame(wald = wald_perm)
ggplot(perm_df, aes(x = wald)) +
    geom_histogram(bins = 50, fill = "grey70", color = "white") +
    geom_vline(xintercept = wald_obs, color = "red", linewidth = 1) +
    geom_vline(xintercept = -wald_obs, color = "red",
               linewidth = 1, linetype = "dashed") +
    labs(
        title = "Randomization Inference for IV",
        subtitle = paste0("Observed Wald = ", round(wald_obs, 3),
                          ", RI p-value = ", round(p_value_ri, 3)),
        x = "Wald statistic (permutation distribution)",
        y = "Count"
    ) +
    causalverse::ama_theme()

Randomization inference is especially valuable when the instrument is experimentally assigned (e.g., lottery-based instruments) and the sample is small.

Regression Kink Design

Exploiting Kinks in Policy Rules

The Regression Kink Design (RKD; Card, Lee, Pei, and Weber 2015) is an IV strategy that exploits a kink (change in slope) in the relationship between a running variable and the treatment at a known threshold, rather than a discontinuous jump as in the standard RDD.

If a policy rule creates a kink in the treatment at threshold c:

D = \begin{cases} f_0(X) & \text{if } X < c \\ f_1(X) & \text{if } X \geq c \end{cases}

where f_0 and f_1 join continuously at c but have different slopes, then the ratio of the change in the slope of E[Y | X] to the change in the slope of E[D | X] at c identifies a causal effect:

\tau_{\text{RKD}} = \frac{\lim_{x \downarrow c} \frac{dE[Y|X=x]}{dx} - \lim_{x \uparrow c} \frac{dE[Y|X=x]}{dx}}{\lim_{x \downarrow c} \frac{dE[D|X=x]}{dx} - \lim_{x \uparrow c} \frac{dE[D|X=x]}{dx}}

This is estimated using local polynomial regression with deriv = 1 in the rdrobust package.

Implementation with rdrobust

library(rdrobust)

# Simulate data with a kink in the treatment at x = 0
set.seed(42)
n <- 2000
x <- runif(n, -2, 2)  # running variable
u <- rnorm(n)

# Treatment has different slopes on each side of the kink
d <- 0.5 * x + 0.8 * pmax(x, 0) + 0.3 * u + rnorm(n, sd = 0.3)  # slope 0.5 below, 1.3 above

# Outcome depends on treatment
y <- 1.5 * d + 0.5 * x + u + rnorm(n, sd = 0.5)

# RKD estimation: deriv = 1 targets the change in slope (kink);
# supplying fuzzy = d makes rdrobust rescale the outcome kink by the
# treatment kink in one step
rkd_fit <- rdrobust(y = y, x = x, c = 0, deriv = 1, fuzzy = d)
summary(rkd_fit)

# Reduced-form and first-stage kinks separately
rkd_reduced <- rdrobust(y = y, x = x, c = 0, deriv = 1)
rkd_first   <- rdrobust(y = d, x = x, c = 0, deriv = 1)

# Manual fuzzy RKD: ratio of outcome kink to treatment kink
cat("\nFuzzy RKD estimate (ratio):",
    rkd_reduced$coef["Conventional", 1] / rkd_first$coef["Conventional", 1], "\n")

The RKD is identified under weaker conditions than the standard RDD (no discontinuity needed), but it requires smoothness of potential outcomes at a higher order and is more sensitive to bandwidth choice.

Mendelian Randomization

Genetic Variants as Instruments

Mendelian randomization (MR) applies the IV framework to epidemiology, using genetic variants (SNPs) as instruments for modifiable risk factors (e.g., BMI, cholesterol) to estimate their causal effects on health outcomes. The random assortment of alleles during meiosis provides a natural randomization that satisfies instrument relevance, and the exclusion restriction holds if the SNP affects the outcome only through the risk factor (no horizontal pleiotropy).

The standard MR assumptions mirror IV:

  1. Relevance: The SNP is associated with the risk factor
  2. Independence: The SNP is independent of confounders (satisfied by random genetic inheritance)
  3. Exclusion: The SNP affects the outcome only through the risk factor

When multiple SNPs are available, MR-Egger regression provides a test for and correction of horizontal pleiotropy.

Implementation with MendelianRandomization

library(MendelianRandomization)

# Simulated summary-level MR data (as typically available from GWAS)
set.seed(42)
n_snps <- 20

# True causal effect of risk factor on outcome: beta = 0.5
beta_true <- 0.5
bx <- rnorm(n_snps, mean = 0.3, sd = 0.1)  # SNP-exposure associations
by <- beta_true * bx + rnorm(n_snps, sd = 0.05)  # SNP-outcome associations
sx <- rep(0.05, n_snps)  # SE of SNP-exposure
sy <- rep(0.05, n_snps)  # SE of SNP-outcome

# Create the MR input object (named mr_obj to avoid masking mr_input())
mr_obj <- mr_input(
    bx = bx, bxse = sx,
    by = by, byse = sy
)

# IVW (inverse-variance weighted) estimate
mr_ivw(mr_obj)

# MR-Egger (allows for directional pleiotropy)
mr_egger(mr_obj)

# Weighted median (robust to up to 50% invalid instruments)
mr_median(mr_obj)

Mendelian randomization is an active area connecting genetics and causal inference. The MendelianRandomization package provides a comprehensive toolkit for summary-level MR analysis.

Practical Guidelines

This section provides a checklist for conducting and reporting IV analyses in applied research.

Before Estimation

  1. Justify the instrument theoretically. Explain the institutional or natural variation that generates the instrument. The exclusion restriction cannot be tested—it must be argued.

  2. State the target parameter. Are you estimating a LATE? If so, who are the compliers? Is the LATE policy-relevant?

  3. Discuss monotonicity. Is it plausible that the instrument shifts everyone in the same direction? Are there subgroups for whom the instrument might work in opposite directions?

  4. Consider the exclusion restriction carefully. List all possible channels through which the instrument might affect the outcome. Discuss why these alternative channels are unlikely or have been controlled for.

During Estimation

  1. Report the first-stage regression. Show the coefficient, standard error, and F-statistic of the excluded instruments. Plot the first stage.

  2. Check for weak instruments. The first-stage F-statistic should exceed the Stock-Yogo critical values for the desired bias or size threshold. Consider the Lee et al. (2022) tF procedure for robust inference.

  3. Report both OLS and IV. The comparison is informative: if OLS and IV are similar, either the endogeneity problem is small or the instrument is weak.

  4. Test for endogeneity. The Wu-Hausman test helps determine whether IV is actually needed. If it fails to reject, OLS may be preferred (more efficient).

  5. Test over-identifying restrictions. If you have more instruments than endogenous variables, report the Sargan/Hansen J test. But remember: failure to reject does not prove instruments are valid.

  6. Cluster standard errors appropriately. Cluster at the level of variation of the instrument or treatment assignment. When in doubt, use larger clusters.

Robustness Checks

  1. Vary the set of controls. The IV estimate should be stable when adding or removing exogenous covariates (they affect precision, not consistency).

  2. Try different instrument sets. If multiple instruments are available, report results using each separately and all together.

  3. Conduct placebo tests. Test whether the instrument predicts pre-treatment outcomes or outcomes that it should not affect under the exclusion restriction.

  4. Perform sensitivity analysis. Use frameworks like sensemakr to quantify how strong violations of the exclusion restriction would need to be to overturn the results.

  5. Check for many-instruments bias. If the number of instruments is large relative to the sample size, consider LIML or JIVE estimators as alternatives to 2SLS.

Reporting

  1. Present a clear IV “story.” Readers should understand the source of exogenous variation, who the compliers are, and why the exclusion restriction is plausible.

  2. Include all diagnostic tests. First-stage F, Wu-Hausman, Sargan (if applicable), and any weak instrument-robust inference.

  3. Report reduced-form estimates. The reduced form (the regression of Y on Z) provides a model-free intent-to-treat effect that does not rely on the first-stage specification.

  4. Discuss external validity. The LATE applies to compliers. Discuss whether the complier subpopulation is representative and whether the results can be generalized.

  5. Be transparent about limitations. No instrument is perfect. Acknowledge the strongest threats to validity and discuss how they might affect the conclusions.

References

  • Adao, R., Kolesar, M., and Morales, E. (2019). “Shift-Share Designs: Theory and Inference.” Quarterly Journal of Economics, 134(4), 1949-2010.

  • Anderson, T. W. and Rubin, H. (1949). “Estimation of the Parameters of a Single Equation in a Complete System of Stochastic Equations.” Annals of Mathematical Statistics, 20(1), 46-63.

  • Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.

  • Bartik, T. J. (1991). Who Benefits from State and Local Economic Development Policies? W.E. Upjohn Institute.

  • Berge, L. (2018). “Efficient Estimation of Maximum Likelihood Models with Multiple Fixed-Effects: the R Package FENmlm.” CREA Discussion Paper.

  • Blandhol, C., Bonney, J., Mogstad, M., and Torgovitsky, A. (2022). “When is TSLS Actually LATE?” Review of Economic Studies, forthcoming.

  • Borusyak, K., Hull, P., and Jaravel, X. (2022). “Quasi-Experimental Shift-Share Research Designs.” Review of Economic Studies, 89(1), 181-213.

  • Cinelli, C. and Hazlett, C. (2020). “Making Sense of Sensitivity: Extending Omitted Variable Bias.” Journal of the Royal Statistical Society: Series B, 82(1), 39-67.

  • Dobbie, W., Goldin, J., and Yang, C. S. (2018). “The Effects of Pretrial Detention on Conviction, Future Crime, and Employment.” American Economic Review, 108(2), 201-240.

  • Fox, J., Kleiber, C., and Zeileis, A. (2023). “ivreg: Instrumental-Variables Regression by 2SLS, 2SM, or 2SMM, with Diagnostics.” R package.

  • Frandsen, B. R., Lefgren, L. J., and Leslie, E. C. (2023). “Judging Judge Fixed Effects.” American Economic Review, 113(1), 253-277.

  • Goldsmith-Pinkham, P., Sorkin, I., and Swift, H. (2020). “Bartik Instruments: What, When, Why, and How.” American Economic Review, 110(8), 2586-2624.

  • Hansen, L. P. (1982). “Large Sample Properties of Generalized Method of Moments Estimators.” Econometrica, 50(4), 1029-1054.

  • Imbens, G. W. and Angrist, J. D. (1994). “Identification and Estimation of Local Average Treatment Effects.” Econometrica, 62(2), 467-475.

  • Lee, D. S., McCrary, J., Moreira, M. J., and Porter, J. (2022). “Valid t-Ratio Inference for IV.” American Economic Review, 112(10), 3260-3290.

  • Mueller-Smith, M. (2015). “The Criminal and Labor Market Impacts of Incarceration.” Working Paper.

  • Sargan, J. D. (1958). “The Estimation of Economic Relationships Using Instrumental Variables.” Econometrica, 26(3), 393-415.

  • Staiger, D. and Stock, J. H. (1997). “Instrumental Variables Regression with Weak Instruments.” Econometrica, 65(3), 557-586.

  • Stock, J. H. and Yogo, M. (2005). “Testing for Weak Instruments in Linear IV Regression.” In Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, Cambridge University Press.

  • Stock, J. H. and Watson, M. W. (2019). Introduction to Econometrics. 4th Edition. Pearson.

  • Angrist, J. D., Imbens, G. W., and Krueger, A. B. (1999). “Jackknife Instrumental Variables Estimation.” Journal of Applied Econometrics, 14(1), 57-67.

  • Balke, A. and Pearl, J. (1997). “Bounds on Treatment Effects from Studies with Imperfect Compliance.” Journal of the American Statistical Association, 92(439), 1171-1176.

  • Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012). “Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain.” Econometrica, 80(6), 2369-2429.

  • Card, D., Lee, D. S., Pei, Z., and Weber, A. (2015). “Inference on Causal Effects in a Generalized Regression Kink Design.” Econometrica, 83(6), 2453-2483.

  • Conley, T. G., Hansen, C. B., and Rossi, P. E. (2012). “Plausibly Exogenous.” Review of Economics and Statistics, 94(1), 260-272.

  • Heckman, J. J. and Vytlacil, E. (2005). “Structural Equations, Treatment Effects, and Econometric Policy Evaluation.” Econometrica, 73(3), 669-738.

  • Lal, A., Lockhart, M., Xu, Y., and Zu, Z. (2023). “How Much Should We Trust Instrumental Variable Estimates in Political Science? Practical Advice Based on 67 Replicated Studies.” Working Paper.

  • Lewbel, A. (2012). “Using Heteroscedasticity to Identify and Estimate Mismeasured and Endogenous Regressor Models.” Journal of Business and Economic Statistics, 30(1), 67-80.

  • Nevo, A. and Rosen, A. M. (2012). “Identification with Imperfect Instruments.” Review of Economics and Statistics, 94(3), 659-671.

  • Newey, W. K. and Powell, J. L. (2003). “Instrumental Variable Estimation of Nonparametric Models.” Econometrica, 71(5), 1565-1578.

  • Montiel Olea, J. L. and Pflueger, C. (2013). “A Robust Test for Weak Instruments.” Journal of Business and Economic Statistics, 31(3), 358-369.

  • Park, S. and Gupta, S. (2012). “Handling Endogenous Regressors by Joint Estimation Using Copulas.” Marketing Science, 31(4), 567-586.