Semiparametric and benchmark methods for longitudinal EHR data with informative visiting and informative observation processes.
CIMEHR (Clinical Informative Missingness for Electronic Health Record data) provides methods for longitudinal EHR analyses where patients visit irregularly and outcomes may be selectively measured at visits. The primary methods jointly characterize three linked processes:
The main estimator is CIMEHR(). Two extensions,
CIMEHR_timevarying_integral() and
CIMEHR_timevarying_ou(), allow more flexible
time-varying/serially correlated observation-process structures.
{r, eval = FALSE} # install.packages("devtools") devtools::install_github("ysph-dsde/CIMEHR")
The analysis functions expect long-format data, with one row per subject-visit or subject-time record. The CIMEHR estimators expect at least the following column roles, with default column names shown:
| Column role | Default name | Meaning |
|---|---|---|
| Subject identifier | id |
Unique subject identifier |
| Visit time | time |
Visit time; NA indicates a
row without a visit |
| Outcome | Y |
Longitudinal outcome; usually
NA when the outcome was not recorded |
| Observation indicator | R |
1 if the outcome was recorded at the visit, 0 otherwise |
| Censoring/follow-up end | C |
End of follow-up for each subject |
| Covariates | user-named | Baseline or model covariates |
A note on time units. The methods do not assume a
specific unit for time and C — months, weeks,
days, or years all work, as long as a single unit is used consistently
within a dataset. The choice affects only the interpretation of the
slope on time in the longitudinal outcome model: that
coefficient is the expected change in outcome per one unit of time. The
bundled sim_ehr_data was generated with time
and C in months over a 120-month (10-year) follow-up
window, parameter values calibrated to match an outpatient HbA1c
cadence. For your own data, choose the unit that gives a clinically
meaningful slope (typically months or years for chronic-disease EHR
cohorts) and document it alongside your analysis.
All package methods should be given long/panel data. The
CIMEHR-family functions accept custom column names through
id_col, time_col, y_col,
r_col, and censor_col, so renaming your data
is not required:
If your data are stored with one row per subject and repeated outcome/time columns, convert them to long format before fitting a model:
```{r, eval = FALSE} library(dplyr) library(tidyr)
wide_dat <- tibble::tibble( id = 1:2, Age = c(0.4, -0.2), Gender = c(1, 0), NSES = c(0.2, -0.4), C = c(60, 60), time_1 = c(5, 8), time_2 = c(20, NA), time_3 = c(45, 50), Y_1 = c(1.2, NA), Y_2 = c(1.6, NA), Y_3 = c(2.0, 1.1) )
long_dat <- wide_dat %>% pivot_longer( cols = matches(“^(time|Y)\d+$”), names_to = c(”.value”, ”visit”), names_pattern = ”(time|Y)(\d+)” ) %>% mutate(R = as.integer(!is.na(Y))) %>% arrange(id, time)
## Example dataset
The package includes `sim_ehr_data`, a long-format simulated EHR cohort designed to look like an outpatient HbA1c extract. The bundled dataset is one replicate from the package's Monte Carlo simulation framework (see Section 8, [Simulation study](#simulation-study)), specifically the `continuous_phi0` scenario (focal Z = NSES, phi = 0). The columns are listed below. Latent quantities used internally by the data-generating process (the shared factor, the visiting frailty, and per-process random effects) are not included, since they are not observable in real EHR data.
| Column | Type | Description |
|:------------|:------------|:-------------------------------------------------------------|
| `id` | integer | Subject identifier |
| `Age` | continuous | Standardized age |
| `Gender` | binary | 1 = male, 0 = female |
| `Marital` | binary | 1 = married, 0 = not |
| `Black` | binary | 1 = Black race, 0 = otherwise |
| `Hispanic` | binary | 1 = Hispanic ethnicity, 0 = otherwise |
| `NSES` | continuous | Standardized neighborhood median household income |
| `time` | continuous | Visit time on [0, 120] months; `NA` for subjects with no visits|
| `R` | binary | 1 if HbA1c was recorded at the visit, 0 otherwise |
| `log_HbA1c` | continuous | Natural log of HbA1c; recommended outcome for analysis. `NA` when `R = 0` |
| `C` | continuous | End-of-follow-up time (120 months for all subjects) | |
```{r, eval = FALSE}
library(CIMEHR)
data("sim_ehr_data")
head(sim_ehr_data)
fit <- CIMEHR(
data = sim_ehr_data,
y_col = "log_HbA1c",
covars_visit_XV = c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES"),
covars_outcome_fixed_XY = c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES"),
covars_outcome_random_link_ZY = "NSES",
covars_obs_fixed_XO = c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES"),
covars_obs_random_link_ZO = "NSES"
)
print(fit)
summary(fit)
To simulate new data with custom covariate sets, see
?sim_data_gen. The function takes a covariates
list spec — each element declares one covariate’s name, type
("continuous" or "binary"), and distributional
parameters — and per-process coefficients are passed as named numeric
vectors keyed by those covariate names. The bundled
sim_ehr_data is one realisation; the data-raw script in the
package documents the full call.
This section compares the statistical methods for longitudinal EHR data implemented in (or referenced by) the package. Methods are differentiated by whether they account for informative visiting (IP) and informative observation (IO), the specific scope of the shared latent structure linking these processes to the outcome, and the adjustment mechanism applied. Symbols: ✓ = explicitly handled or modeled; ✗ = ignored or not handled.
Abbreviations: LMM, linear mixed model; VA/OA, visit/observation-adjusted; JMVL, joint modeling of visiting and longitudinal data; IIRR, inverse intensity rate ratio; MAR, missing at random; MI, multiple imputation; LI, linear increment.
The following methods have function implementations in the package:
| Method | Function name |
|---|---|
| Summary statistics | Summary_stat() |
| Standard LMM (Laird and Ware, 1982) | Linear_mixed_model() |
| Liang (Liang et al., 2009) | Joint_modeling_visiting_and_longitudinal_Liang() |
| IIRR-Weighting (Bůžková and Lumley, 2007) | Inverse_intensity_rate_ratio_weighting() |
| IIRR-Balanced (Yiu and Su, 2025) | Inverse_intensity_rate_ratio_balancing() |
| Pairwise Likelihood (Chen et al., 2015) | Pairwise_likelihood() |
| Multiple Imputation (Rubin, 1987) | Multiple_imputation_IP() |
| Linear Increment (Aalen and Gunnes, 2010) | Linear_increment_IP() |
| EHRJoint (Du et al., 2025) | EHRJoint() |
| Proposed (Yang, Shi, and Mukherjee, 2026) | CIMEHR(),
CIMEHR_timevarying_integral(),
CIMEHR_timevarying_ou() |
VA-LMM, OA-LMM, JMVL-G, and Adapted-Liang appear in the comparison
tables below for context but are not implemented as separate functions;
VA-LMM and OA-LMM are available through
Linear_mixed_model() via the visit_adjust
argument.
| Method | IP | IO | Shared Latent | Adjustment Mechanism | Package |
|---|---|---|---|---|---|
| Summary statistics | ✗ | ✗ | None | Reduces the longitudinal series to a single scalar (e.g., min, mean, median, max) for regression. | — |
| Standard LMM (Laird and Ware, 1982) | ✗ | ✗ | None | Conditions on observed covariates (valid only under MAR). | lme4 |
| VA-LMM (Goldstein et al., 2016) | ✓ | ✗ | None | Includes visit count as a fixed covariate proxy. | lme4 |
| OA-LMM | ✗ | ✓ | None | Includes observation count as a fixed covariate proxy. | lme4 |
| Method | IP | IO | Shared Latent | Adjustment Mechanism | Package |
|---|---|---|---|---|---|
| Liang (Liang et al., 2009) | ✓ | ✗ | Visiting & Outcome | Joint likelihood estimation via shared frailty linking visiting and outcome. | IrregLong |
| JMVL-G (Gasparini et al., 2020) | ✓ | ✗ | Visiting & Outcome | Joint likelihood modeling inter-visit times via shared random effects. | merlin |
| IIRR-Weighting (Bůžková and Lumley, 2007) | ✓ | ✗ | None | Inverse intensity weighting based on visiting intensity. | — |
| IIRR-Balanced (Yiu and Su, 2025) | ✓ | ✗ | None | Inverse intensity weighting adjusted for covariate balance. | — |
| Pairwise Likelihood (Chen et al., 2015) | ✓ | ✗ | None | Constructs a likelihood from all within-subject pairs of observations; conditioning on observed visit times eliminates the visit intensity from the likelihood. | — |
| Method | IP | IO | Shared Latent | Adjustment Mechanism | Package |
|---|---|---|---|---|---|
| Multiple Imputation (MI) (Rubin, 1987) | ✓ | ✓ | Determined by IP-only | Multiple imputation based on posterior draws, followed by IP-only methods. | mice |
| Linear Increment Methods (LI) (Aalen and Gunnes, 2010) | ✓ | ✓ | Determined by IP-only | Deterministic linear interpolation between observed visits to impute missing values, followed by IP-only methods. | slim |
| Method | IP | IO | Shared Latent | Adjustment Mechanism | Package |
|---|---|---|---|---|---|
| Adapted-Liang (Liang et al., 2009) | ✓ | ✓ | Visiting & Outcome | Joint model structure defined, but observation coefficients are constrained to zero. | — |
| EHRJoint (Du et al., 2025) | ✓ | ✓ | Visiting & Outcome | Tripartite model; observation process modeled via covariates only (no latent link). | — |
| Proposed (this package) | ✓ | ✓ | All three processes | Tripartite model; full shared latent structure links Visiting, Observation, and Outcome. | CIMEHR |
IP = informative visiting / informative presence adjustment; IO =
informative observation adjustment. The “Proposed” row corresponds to
CIMEHR(); CIMEHR_timevarying_integral() and
CIMEHR_timevarying_ou() extend this framework with more
flexible observation-process modeling.
```{r, eval = FALSE} covars_all <- c(“Age”, “Gender”, “Marital”, “Black”, “Hispanic”, “NSES”)
fit_base <- CIMEHR( data = sim_ehr_data, y_col = “log_HbA1c”, covars_visit_XV = covars_all, covars_outcome_fixed_XY = covars_all, covars_outcome_random_link_ZY = “NSES”, covars_obs_fixed_XO = covars_all, covars_obs_random_link_ZO = “NSES” )
fit_integral <- CIMEHR_timevarying_integral( data = sim_ehr_data, y_col = “log_HbA1c”, covars_visit_XV = covars_all, covars_outcome_fixed_XY = covars_all, covars_outcome_random_link_ZY = “NSES”, covars_obs_fixed_XO = covars_all, covars_obs_random_link_ZO = “NSES”, gh_points_U = 7, gh_points_q = 5 )
fit_ou <- CIMEHR_timevarying_ou( data = sim_ehr_data, y_col = “log_HbA1c”, covars_visit_XV = covars_all, covars_outcome_fixed_XY = covars_all, covars_outcome_random_link_ZY = “NSES”, covars_obs_fixed_XO = covars_all, covars_obs_random_link_ZO = “NSES”, pair_type = “adjacent” )
The `summary()` output for CIMEHR-family fits reports the frailty model and the observation-process link. For example, `CIMEHR()` and `CIMEHR_timevarying_integral()` use a probit observation model, while `CIMEHR_timevarying_ou()` uses a probit observation model with OU serial correlation estimated by pairwise composite likelihood.
## Extracting stage summaries and coefficients
The regular `summary()` call prints model output but does not need to be saved before extracting coefficients. You can print one process at a time with the stage-specific summary helpers:
```{r, eval = FALSE}
summary_visiting(fit_base)
summary_observation(fit_base)
summary_outcome(fit_base)
If a method does not have the requested stage, the helper prints a note and exits without error. For example, outcome-only methods do not have a visiting or observation stage.
You can extract all coefficients from a stage using
coef() for CIMEHR-family methods or the more general
extract_coefficient() helper:
```{r, eval = FALSE} # All outcome coefficients coef(fit_base, type = “outcome”)
extract_coefficient(fit_base, stage = “outcome”, parameter = “NSES”)
extract_coefficient(fit_base, stage = “visiting”, parameter = “NSES”)
extract_coefficient(fit_base, stage = “observation”, parameter = “NSES”)
## Comparing fitted methods
`method_comparisons()` can fit one or more methods and return process-specific comparison tables. Use `methods = "all"` to fit all methods supported by the comparison helper, or provide any subset of method names. Method-specific arguments are supplied through `method_args`.
```{r, eval = FALSE}
# See all supported method names
available_comparison_methods()
covars_all <- c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES")
out_form <- log_HbA1c ~ Age + Gender + Marital + Black + Hispanic + NSES + time
cmp_subset <- method_comparisons(
data = sim_ehr_data,
methods = c("Summary_stat", "Linear_mixed_model",
"Inverse_intensity_rate_ratio_weighting",
"Joint_modeling_visiting_and_longitudinal_Liang",
"CIMEHR"),
method_args = list(
Summary_stat = list(outcome = "log_HbA1c",
formula = ~ Age + Gender + Marital + Black + Hispanic + NSES),
Linear_mixed_model = list(fixed = out_form,
random = ~ 1 | id,
obs_indicator = "R"),
Inverse_intensity_rate_ratio_weighting = list(outcome = "log_HbA1c", visit_covs = covars_all),
Joint_modeling_visiting_and_longitudinal_Liang = list(
outcome = "log_HbA1c", visit_covs = covars_all, long_covs = covars_all, random_covs = "NSES"
),
CIMEHR = list(
y_col = "log_HbA1c",
covars_visit_XV = covars_all,
covars_outcome_fixed_XY = covars_all,
covars_outcome_random_link_ZY = "NSES",
covars_obs_fixed_XO = covars_all,
covars_obs_random_link_ZO = "NSES"
)
),
printSE = TRUE,
print95CI = TRUE,
true_value_verbose = TRUE
)
print(cmp_subset)
cmp_subset$tables$outcome
For real, non-simulated data, the comparison tables omit true-value,
bias, and RMSE columns unless you explicitly pass a
true_values list. If a method fails,
method_comparisons() keeps going by default, stores the
error in cmp_subset$errors, and prints NA rows
for that method.
A Monte Carlo simulation study is provided in
simulations/simulation_study.R. The script crosses two
focal-covariate types (continuous NSES and binary
Gender) with four values of the Ornstein-Uhlenbeck
serial-correlation parameter (phi \(\in\) {0, 0.001, 0.1, 0.3}), producing
eight scenarios. Each scenario runs N_SIM = 1000 Monte
Carlo replications; each replication generates a fresh dataset of
n = 2000 participants on the [0, 120] window via
sim_data_gen() and fits four methods:
CIMEHR()Joint_modeling_visiting_and_longitudinal_Liang()Inverse_intensity_rate_ratio_weighting()IrregLong::iiwgee() (external comparator)For each method × scenario × parameter (beta on each of Age, Gender,
Marital, Black, Hispanic, NSES), the script reports bias, empirical SE,
mean analytic SE, SE ratio, RMSE, and 95% CI coverage. Coverage is
reported as --- for
Inverse_intensity_rate_ratio_weighting() because that
method does not return analytic standard errors; bootstrap inference
would be needed there.
The bundled sim_ehr_data dataset is itself one replicate
from this framework (the continuous_phi0 scenario,
sim_id = 1), so the data-generating spec used by the study
and the spec of the bundled example dataset are identical aside from
phi and the focal covariate.
To run the study:
{bash, eval = FALSE} Rscript simulations/simulation_study.R Rscript simulations/simulation_study.R --cores 8 --nsim 100 Rscript simulations/simulation_study.R --scenarios continuous_phi0,binary_phi0.3
Use bootstrap() for methods without analytic standard
errors or as a robustness check for CIMEHR-family methods:
{r, eval = FALSE} fit_iirr <- Inverse_intensity_rate_ratio_weighting( sim_ehr_data, outcome = "log_HbA1c", visit_covs = c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES") ) bs <- bootstrap(fit_iirr, data = sim_ehr_data, B = 200, seed = 1) print(bs) summary(bs)
Model-specific arguments can be passed through
covars:
{r, eval = FALSE} covars_all <- c("Age", "Gender", "Marital", "Black", "Hispanic", "NSES") bs_cimehr <- bootstrap( fit_base, data = sim_ehr_data, B = 200, seed = 1, covars = list( covars_visit_XV = covars_all, covars_outcome_fixed_XY = covars_all, covars_outcome_random_link_ZY = "NSES", covars_obs_fixed_XO = covars_all, covars_obs_random_link_ZO = "NSES" ) )
Run vignette("getting-started", package = "CIMEHR") for
a more detailed tutorial.
Aalen, O. O., Borgan, Ø., and Gjessing, H. K. (2008). Survival and event history analysis: A process point of view. Springer.
Aalen, O. O. and Gunnes, N. (2010). A dynamic approach for reconstructing missing longitudinal data using the linear increments model. Biostatistics, 11, 453–472.
Bůžková, P. and Lumley, T. (2007). Longitudinal data analysis for generalized linear models with follow-up dependent on outcome-related variables. Canadian Journal of Statistics, 35, 485–500.
Chen, Y., Ning, J., and Cai, C. (2015). Regression analysis of longitudinal data with irregular and informative observation times. Biostatistics, 16, 727–739.
Du, J., Shi, X., and Mukherjee, B. (2025). A new statistical approach for joint modeling of longitudinal outcomes measured in electronic health records with clinically informative presence and observation processes.
Gasparini, A., Abrams, K. R., Barrett, J. K., Major, R. W., Sweeting, M. J., Brunskill, N. J., and Crowther, M. J. (2020). Mixed-effects models for health care longitudinal data with an informative visiting process: A Monte Carlo simulation study. Statistica Neerlandica, 74, 5–23.
Goldstein, B. A., Bhavsar, N. A., Phelan, M., and Pencina, M. J. (2016). Controlling for informed presence bias due to the number of health encounters in an electronic health record. American Journal of Epidemiology, 184, 847–855.
Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963–974.
Liang, Y., Lu, W., and Ying, Z. (2009). Joint modeling and analysis of longitudinal data with informative observation times. Biometrics, 65, 377–384.
Pullenayegum, E. M. (2026). IrregLong: Analysis of Longitudinal Data with Irregular Observation Times. R package version 0.4.1.
Rubin, D. (1987). Multiple imputation for nonresponse in surveys. John Wiley & Sons, New York.
Yiu, S. and Su, L. (2025). Accommodating informative visit times for analysing irregular longitudinal data: A sensitivity analysis approach with balancing weights estimators. Journal of the Royal Statistical Society Series C: Applied Statistics, 74, 824–843.
Yang, C.-H., Shi, X., and Mukherjee, B. (2026). Joint modeling of longitudinal EHR data with shared random effects for informative visiting and observation processes. arXiv:2602.15374.