Semiparametric and benchmark methods for longitudinal EHR data with informative visiting and informative observation processes.

Overview

CIMEHR (Clinical Informative Missingness for Electronic Health Record data) provides methods for longitudinal EHR analyses where patients visit irregularly and outcomes may be selectively measured at visits. The primary methods jointly characterize three linked processes:

the visiting process: when a patient has an EHR visit;
the observation process: whether the outcome/biomarker is recorded at a visit;
the longitudinal outcome process: the outcome trajectory among the target population.

The main estimator is CIMEHR(). Two extensions, CIMEHR_timevarying_integral() and CIMEHR_timevarying_ou(), allow more flexible time-varying/serially correlated observation-process structures.

Installation

{r, eval = FALSE} # install.packages("devtools") devtools::install_github("ysph-dsde/CIMEHR")

Data format

The analysis functions expect long-format data, with one row per subject-visit or subject-time record. The CIMEHR estimators expect at least the following column roles, with default column names shown:

Column role	Default name	Meaning
Subject identifier	`id`	Unique subject identifier
Visit time	`time`	Visit time; `NA` indicates a row without a visit
Outcome	`Y`	Longitudinal outcome; usually `NA` when the outcome was not recorded
Observation indicator	`R`	1 if the outcome was recorded at the visit, 0 otherwise
Censoring/follow-up end	`C`	End of follow-up for each subject
Covariates	user-named	Baseline or model covariates

A note on time units. The methods do not assume a specific unit for time and C — months, weeks, days, or years all work, as long as a single unit is used consistently within a dataset. The choice affects only the interpretation of the slope on time in the longitudinal outcome model: that coefficient is the expected change in outcome per one unit of time. The bundled sim_ehr_data was generated with time and C in months over a 120-month (10-year) follow-up window, parameter values calibrated to match an outpatient HbA1c cadence. For your own data, choose the unit that gives a clinically meaningful slope (typically months or years for chronic-disease EHR cohorts) and document it alongside your analysis.

All package methods should be given long/panel data. The CIMEHR-family functions accept custom column names through id_col, time_col, y_col, r_col, and censor_col, so renaming your data is not required:

Converting short/wide data to long data

If your data are stored with one row per subject and repeated outcome/time columns, convert them to long format before fitting a model:

```{r, eval = FALSE} library(dplyr) library(tidyr)

wide_dat <- tibble::tibble( id = 1:2, Age = c(0.4, -0.2), Gender = c(1, 0), NSES = c(0.2, -0.4), C = c(60, 60), time_1 = c(5, 8), time_2 = c(20, NA), time_3 = c(45, 50), Y_1 = c(1.2, NA), Y_2 = c(1.6, NA), Y_3 = c(2.0, 1.1) )

long_dat <- wide_dat %>% pivot_longer( cols = matches(“^(time|Y)\d+$”), names_to = c(”.value”, ”visit”), names_pattern = ”(time|Y)(\d+)” ) %>% mutate(R = as.integer(!is.na(Y))) %>% arrange(id, time)


## Example dataset

The package includes `sim_ehr_data`, a long-format simulated EHR cohort designed to look like an outpatient HbA1c extract. The bundled dataset is one replicate from the package's Monte Carlo simulation framework (see Section 8, [Simulation study](#simulation-study)), specifically the `continuous_phi0` scenario (focal Z = NSES, phi = 0). The columns are listed below. Latent quantities used internally by the data-generating process (the shared factor, the visiting frailty, and per-process random effects) are not included, since they are not observable in real EHR data.

| Column      | Type        | Description                                                  |
|:------------|:------------|:-------------------------------------------------------------|
| `id`        | integer     | Subject identifier                                           |
| `Age`       | continuous  | Standardized age                                             |
| `Gender`    | binary      | 1 = male, 0 = female                                         |
| `Marital`   | binary      | 1 = married, 0 = not                                         |
| `Black`     | binary      | 1 = Black race, 0 = otherwise                                |
| `Hispanic`  | binary      | 1 = Hispanic ethnicity, 0 = otherwise                        |
| `NSES`      | continuous  | Standardized neighborhood median household income            |
| `time`      | continuous  | Visit time on [0, 120] months; `NA` for subjects with no visits|
| `R`         | binary      | 1 if HbA1c was recorded at the visit, 0 otherwise            |
| `log_HbA1c` | continuous  | Natural log of HbA1c; recommended outcome for analysis. `NA` when `R = 0` |
| `C`         | continuous  | End-of-follow-up time (120 months for all subjects)       |          |

```{r, eval = FALSE}
library(CIMEHR)
data("sim_ehr_data")
head(sim_ehr_data)

fit <- CIMEHR(
  data = sim_ehr_data,
  y_col                         = "log_HbA1c",
  covars_visit_XV               = c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES"),
  covars_outcome_fixed_XY       = c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES"),
  covars_outcome_random_link_ZY = "NSES",
  covars_obs_fixed_XO           = c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES"),
  covars_obs_random_link_ZO     = "NSES"
)
print(fit)
summary(fit)

To simulate new data with custom covariate sets, see ?sim_data_gen. The function takes a covariates list spec — each element declares one covariate’s name, type ("continuous" or "binary"), and distributional parameters — and per-process coefficients are passed as named numeric vectors keyed by those covariate names. The bundled sim_ehr_data is one realisation; the data-raw script in the package documents the full call.

Method comparison

This section compares the statistical methods for longitudinal EHR data implemented in (or referenced by) the package. Methods are differentiated by whether they account for informative visiting (IP) and informative observation (IO), the specific scope of the shared latent structure linking these processes to the outcome, and the adjustment mechanism applied. Symbols: ✓ = explicitly handled or modeled; ✗ = ignored or not handled.

Abbreviations: LMM, linear mixed model; VA/OA, visit/observation-adjusted; JMVL, joint modeling of visiting and longitudinal data; IIRR, inverse intensity rate ratio; MAR, missing at random; MI, multiple imputation; LI, linear increment.

Methods implemented in CIMEHR

The following methods have function implementations in the package:

Method	Function name
Summary statistics	`Summary_stat()`
Standard LMM (Laird and Ware, 1982)	`Linear_mixed_model()`
Liang (Liang et al., 2009)	`Joint_modeling_visiting_and_longitudinal_Liang()`
IIRR-Weighting (Bůžková and Lumley, 2007)	`Inverse_intensity_rate_ratio_weighting()`
IIRR-Balanced (Yiu and Su, 2025)	`Inverse_intensity_rate_ratio_balancing()`
Pairwise Likelihood (Chen et al., 2015)	`Pairwise_likelihood()`
Multiple Imputation (Rubin, 1987)	`Multiple_imputation_IP()`
Linear Increment (Aalen and Gunnes, 2010)	`Linear_increment_IP()`
EHRJoint (Du et al., 2025)	`EHRJoint()`
Proposed (Yang, Shi, and Mukherjee, 2026)	`CIMEHR()`, `CIMEHR_timevarying_integral()`, `CIMEHR_timevarying_ou()`

VA-LMM, OA-LMM, JMVL-G, and Adapted-Liang appear in the comparison tables below for context but are not implemented as separate functions; VA-LMM and OA-LMM are available through Linear_mixed_model() via the visit_adjust argument.

Outcome-only

Method	IP	IO	Shared Latent	Adjustment Mechanism	Package
Summary statistics	✗	✗	None	Reduces the longitudinal series to a single scalar (e.g., min, mean, median, max) for regression.	—
Standard LMM (Laird and Ware, 1982)	✗	✗	None	Conditions on observed covariates (valid only under MAR).	lme4
VA-LMM (Goldstein et al., 2016)	✓	✗	None	Includes visit count as a fixed covariate proxy.	lme4
OA-LMM	✗	✓	None	Includes observation count as a fixed covariate proxy.	lme4

IP-only

Method	IP	IO	Shared Latent	Adjustment Mechanism	Package
Liang (Liang et al., 2009)	✓	✗	Visiting & Outcome	Joint likelihood estimation via shared frailty linking visiting and outcome.	IrregLong
JMVL-G (Gasparini et al., 2020)	✓	✗	Visiting & Outcome	Joint likelihood modeling inter-visit times via shared random effects.	merlin
IIRR-Weighting (Bůžková and Lumley, 2007)	✓	✗	None	Inverse intensity weighting based on visiting intensity.	—
IIRR-Balanced (Yiu and Su, 2025)	✓	✗	None	Inverse intensity weighting adjusted for covariate balance.	—
Pairwise Likelihood (Chen et al., 2015)	✓	✗	None	Constructs a likelihood from all within-subject pairs of observations; conditioning on observed visit times eliminates the visit intensity from the likelihood.	—

Imputation + IP-only

Method	IP	IO	Shared Latent	Adjustment Mechanism	Package
Multiple Imputation (MI) (Rubin, 1987)	✓	✓	Determined by IP-only	Multiple imputation based on posterior draws, followed by IP-only methods.	mice
Linear Increment Methods (LI) (Aalen and Gunnes, 2010)	✓	✓	Determined by IP-only	Deterministic linear interpolation between observed visits to impute missing values, followed by IP-only methods.	slim

IP+IO

Method	IP	IO	Shared Latent	Adjustment Mechanism	Package
Adapted-Liang (Liang et al., 2009)	✓	✓	Visiting & Outcome	Joint model structure defined, but observation coefficients are constrained to zero.	—
EHRJoint (Du et al., 2025)	✓	✓	Visiting & Outcome	Tripartite model; observation process modeled via covariates only (no latent link).	—
Proposed (this package)	✓	✓	All three processes	Tripartite model; full shared latent structure links Visiting, Observation, and Outcome.	CIMEHR

IP = informative visiting / informative presence adjustment; IO = informative observation adjustment. The “Proposed” row corresponds to CIMEHR(); CIMEHR_timevarying_integral() and CIMEHR_timevarying_ou() extend this framework with more flexible observation-process modeling.

Primary CIMEHR methods

```{r, eval = FALSE} covars_all <- c(“Age”, “Gender”, “Marital”, “Black”, “Hispanic”, “NSES”)

fit_base <- CIMEHR( data = sim_ehr_data, y_col = “log_HbA1c”, covars_visit_XV = covars_all, covars_outcome_fixed_XY = covars_all, covars_outcome_random_link_ZY = “NSES”, covars_obs_fixed_XO = covars_all, covars_obs_random_link_ZO = “NSES” )

fit_integral <- CIMEHR_timevarying_integral( data = sim_ehr_data, y_col = “log_HbA1c”, covars_visit_XV = covars_all, covars_outcome_fixed_XY = covars_all, covars_outcome_random_link_ZY = “NSES”, covars_obs_fixed_XO = covars_all, covars_obs_random_link_ZO = “NSES”, gh_points_U = 7, gh_points_q = 5 )

fit_ou <- CIMEHR_timevarying_ou( data = sim_ehr_data, y_col = “log_HbA1c”, covars_visit_XV = covars_all, covars_outcome_fixed_XY = covars_all, covars_outcome_random_link_ZY = “NSES”, covars_obs_fixed_XO = covars_all, covars_obs_random_link_ZO = “NSES”, pair_type = “adjacent” )


The `summary()` output for CIMEHR-family fits reports the frailty model and the observation-process link. For example, `CIMEHR()` and `CIMEHR_timevarying_integral()` use a probit observation model, while `CIMEHR_timevarying_ou()` uses a probit observation model with OU serial correlation estimated by pairwise composite likelihood.


## Extracting stage summaries and coefficients

The regular `summary()` call prints model output but does not need to be saved before extracting coefficients. You can print one process at a time with the stage-specific summary helpers:

```{r, eval = FALSE}
summary_visiting(fit_base)
summary_observation(fit_base)
summary_outcome(fit_base)

If a method does not have the requested stage, the helper prints a note and exits without error. For example, outcome-only methods do not have a visiting or observation stage.

You can extract all coefficients from a stage using coef() for CIMEHR-family methods or the more general extract_coefficient() helper:

```{r, eval = FALSE} # All outcome coefficients coef(fit_base, type = “outcome”)

A particular outcome coefficient

extract_coefficient(fit_base, stage = “outcome”, parameter = “NSES”)

A particular visiting-process coefficient

extract_coefficient(fit_base, stage = “visiting”, parameter = “NSES”)

A particular observation-process coefficient

extract_coefficient(fit_base, stage = “observation”, parameter = “NSES”)



## Comparing fitted methods

`method_comparisons()` can fit one or more methods and return process-specific comparison tables. Use `methods = "all"` to fit all methods supported by the comparison helper, or provide any subset of method names. Method-specific arguments are supplied through `method_args`.

```{r, eval = FALSE}
# See all supported method names
available_comparison_methods()

covars_all <- c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES")
out_form <- log_HbA1c ~ Age + Gender + Marital + Black + Hispanic + NSES + time

cmp_subset <- method_comparisons(
  data    = sim_ehr_data,
  methods = c("Summary_stat", "Linear_mixed_model",
              "Inverse_intensity_rate_ratio_weighting",
              "Joint_modeling_visiting_and_longitudinal_Liang",
              "CIMEHR"),
  method_args = list(
    Summary_stat       = list(outcome = "log_HbA1c", 
                              formula = ~ Age + Gender + Marital + Black + Hispanic + NSES),
    Linear_mixed_model = list(fixed         = out_form,
                              random        = ~ 1 | id,
                              obs_indicator = "R"),
    Inverse_intensity_rate_ratio_weighting = list(outcome = "log_HbA1c", visit_covs = covars_all),
    Joint_modeling_visiting_and_longitudinal_Liang = list(
      outcome = "log_HbA1c", visit_covs = covars_all, long_covs = covars_all, random_covs = "NSES"
    ),
    CIMEHR = list(
      y_col                         = "log_HbA1c",
      covars_visit_XV               = covars_all,
      covars_outcome_fixed_XY       = covars_all,
      covars_outcome_random_link_ZY = "NSES",
      covars_obs_fixed_XO           = covars_all,
      covars_obs_random_link_ZO     = "NSES"
    )
  ),
  printSE = TRUE,
  print95CI = TRUE,
  true_value_verbose = TRUE
)

print(cmp_subset)
cmp_subset$tables$outcome

For real, non-simulated data, the comparison tables omit true-value, bias, and RMSE columns unless you explicitly pass a true_values list. If a method fails, method_comparisons() keeps going by default, stores the error in cmp_subset$errors, and prints NA rows for that method.

Simulation study

A Monte Carlo simulation study is provided in simulations/simulation_study.R. The script crosses two focal-covariate types (continuous NSES and binary Gender) with four values of the Ornstein-Uhlenbeck serial-correlation parameter (phi $\in$ {0, 0.001, 0.1, 0.3}), producing eight scenarios. Each scenario runs N_SIM = 1000 Monte Carlo replications; each replication generates a fresh dataset of n = 2000 participants on the [0, 120] window via sim_data_gen() and fits four methods:

CIMEHR()
Joint_modeling_visiting_and_longitudinal_Liang()
Inverse_intensity_rate_ratio_weighting()
IrregLong::iiwgee() (external comparator)

For each method × scenario × parameter (beta on each of Age, Gender, Marital, Black, Hispanic, NSES), the script reports bias, empirical SE, mean analytic SE, SE ratio, RMSE, and 95% CI coverage. Coverage is reported as --- for Inverse_intensity_rate_ratio_weighting() because that method does not return analytic standard errors; bootstrap inference would be needed there.

The bundled sim_ehr_data dataset is itself one replicate from this framework (the continuous_phi0 scenario, sim_id = 1), so the data-generating spec used by the study and the spec of the bundled example dataset are identical aside from phi and the focal covariate.

To run the study:

{bash, eval = FALSE} Rscript simulations/simulation_study.R Rscript simulations/simulation_study.R --cores 8 --nsim 100 Rscript simulations/simulation_study.R --scenarios continuous_phi0,binary_phi0.3

Bootstrap inference

Use bootstrap() for methods without analytic standard errors or as a robustness check for CIMEHR-family methods:

{r, eval = FALSE} fit_iirr <- Inverse_intensity_rate_ratio_weighting( sim_ehr_data, outcome = "log_HbA1c", visit_covs = c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES") ) bs <- bootstrap(fit_iirr, data = sim_ehr_data, B = 200, seed = 1) print(bs) summary(bs)

Model-specific arguments can be passed through covars:

{r, eval = FALSE} covars_all <- c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES") bs_cimehr <- bootstrap( fit_base, data = sim_ehr_data, B = 200, seed = 1, covars = list( covars_visit_XV = covars_all, covars_outcome_fixed_XY = covars_all, covars_outcome_random_link_ZY = "NSES", covars_obs_fixed_XO = covars_all, covars_obs_random_link_ZO = "NSES" ) )

Getting help

Run vignette("getting-started", package = "CIMEHR") for a more detailed tutorial.

References

Aalen, O. O., Borgan, Ø., and Gjessing, H. K. (2008). Survival and event history analysis: A process point of view. Springer.

Aalen, O. O. and Gunnes, N. (2010). A dynamic approach for reconstructing missing longitudinal data using the linear increments model. Biostatistics, 11, 453–472.

Bůžková, P. and Lumley, T. (2007). Longitudinal data analysis for generalized linear models with follow-up dependent on outcome-related variables. Canadian Journal of Statistics, 35, 485–500.

Chen, Y., Ning, J., and Cai, C. (2015). Regression analysis of longitudinal data with irregular and informative observation times. Biostatistics, 16, 727–739.

Du, J., Shi, X., and Mukherjee, B. (2025). A new statistical approach for joint modeling of longitudinal outcomes measured in electronic health records with clinically informative presence and observation processes.

Gasparini, A., Abrams, K. R., Barrett, J. K., Major, R. W., Sweeting, M. J., Brunskill, N. J., and Crowther, M. J. (2020). Mixed-effects models for health care longitudinal data with an informative visiting process: A Monte Carlo simulation study. Statistica Neerlandica, 74, 5–23.

Goldstein, B. A., Bhavsar, N. A., Phelan, M., and Pencina, M. J. (2016). Controlling for informed presence bias due to the number of health encounters in an electronic health record. American Journal of Epidemiology, 184, 847–855.

Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963–974.

Liang, Y., Lu, W., and Ying, Z. (2009). Joint modeling and analysis of longitudinal data with informative observation times. Biometrics, 65, 377–384.

Pullenayegum, E. M. (2026). IrregLong: Analysis of Longitudinal Data with Irregular Observation Times. R package version 0.4.1.

Rubin, D. (1987). Multiple imputation for nonresponse in surveys. John Wiley & Sons, New York.

Yiu, S. and Su, L. (2025). Accommodating informative visit times for analysing irregular longitudinal data: A sensitivity analysis approach with balancing weights estimators. Journal of the Royal Statistical Society Series C: Applied Statistics, 74, 824–843.

Yang, C.-H., Shi, X., and Mukherjee, B. (2026). Joint modeling of longitudinal EHR data with shared random effects for informative visiting and observation processes. arXiv:2602.15374.