Generates synthetic data from a sophisticated joint modeling framework that seamlessly integrates longitudinal biomarker trajectories with survival outcomes through ordinary differential equations (ODEs). This simulation engine produces realistic clinical trial datasets where complex biomarker dynamics govern event hazards via shared random effects and trajectory features, capturing the intricate interplay between disease progression and time-to-event processes.
Arguments
- n
Integer. Number of subjects to simulate. Larger cohorts provide more stable parameter estimates (default: 50).
- alpha
Numeric vector of length 3. Association parameters quantifying how trajectory features influence survival hazard:
[biomarker, velocity, acceleration]
. Positive values indicate increased risk (default: c(0.5, 0.4, -0.5)).- beta
Numeric vector governing ODE dynamics (length 5). Controls biomarker trajectory evolution:
[biomarker, velocity, x1, x2, time]
. Automatically normalized to unit length for identifiability (default: c(-0.3, -0.5, 0.2, 0.1, 0.05)).- phi
Numeric vector of length 2. Baseline covariate effects modulating survival hazard independently of biomarker dynamics:
[w1, w2]
(default: c(0.2, -0.15)).- weibull_shape
Numeric. Weibull shape parameter (\(\kappa\)) characterizing baseline hazard evolution. Values > 1 yield increasing hazard (aging effect), < 1 decreasing hazard (selection effect), = 1 constant hazard (exponential) (default: 1.5).
- weibull_scale
Numeric. Weibull scale parameter (\(\theta\)) determining the characteristic event time. Larger values shift the hazard curve rightward (default: 8).
- sigma_b
Numeric. Standard deviation of subject-specific random effects capturing unobserved heterogeneity. Larger values increase between-subject variability in both trajectories and hazards (default: 0.1).
- sigma_e
Numeric. Measurement error standard deviation reflecting assay precision and biological fluctuations. Smaller values indicate more reliable biomarker measurements (default: 0.1).
- seed
Integer. Random seed ensuring reproducible simulations. Essential for method validation and comparison studies (default: 42).
- verbose
Logical. Display informative progress messages during simulation workflow (default: TRUE).
Value
A list containing two complementary datasets:
- longitudinal_data
Data frame with longitudinal measurements:
id
: Subject identifiertime
: Measurement timev
: Observed biomarker value incorporating latent trajectory, random effect, and measurement errorx1
: First longitudinal covariate (standardized)x2
: Second longitudinal covariate (standardized)biomarker
: True latent biomarker trajectory valuevelocity
: True latent biomarker velocity valueacceleration
: True latent biomarker acceleration value
- survival_data
Data frame with survival outcomes:
id
: Subject identifiertime
: Observed event/censoring timestatus
: Event indicator (1 = event, 0 = censored)w1
: First survival covariate (standardized)w2
: Second survival covariate (standardized)b
: Shared random effect
Details
Mathematical Framework
The simulation orchestrates a sophisticated joint model architecture comprising two intricately coupled components:
1. Longitudinal Dynamics
Biomarker evolution is governed by a second-order nonlinear ODE system that captures complex temporal dynamics: $$\ddot{m}_i(t) = g(\boldsymbol{\beta}^{\top} \mathbf{Z}_i(t))$$
where:
\(m_i(t)\): Latent biomarker trajectory for subject \(i\)
\(\mathbf{Z}_i(t) = [m_i(t), \dot{m}_i(t), X_1, X_2, t]^{\top}\): State vector (5-dimensional)
\(g(u) = 2 \tanh(u/3)\): Bounded nonlinear transformation ensuring numerical stability and biological plausibility
\(\boldsymbol{\beta}\): Parameter vector governing homeostatic feedback, damping forces, and external influences
Observed measurements arise from a hierarchical structure incorporating both systematic and stochastic components: $$V_{ij} = m_i(T_{ij}) + b_i + \varepsilon_{ij}$$
where \(b_i \sim N(0, \sigma_b^2)\) captures subject-specific deviations and \(\varepsilon_{ij} \sim N(0, \sigma_e^2)\) represents measurement variability.
2. Survival Process
The instantaneous hazard function elegantly links biomarker dynamics to event risk through a multiplicative model: $$\lambda_i(t|b_i) = \lambda_0(t) \exp[\boldsymbol{\alpha}^{\top} \mathbf{m}_i(t) + \mathbf{W}_i^{\top} \boldsymbol{\phi} + b_i]$$
where:
\(\lambda_0(t) = (\kappa/\theta)(t/\theta)^{\kappa-1}\): Weibull baseline hazard capturing population-level risk evolution
\(\mathbf{m}_i(t) = [m_i(t), \dot{m}_i(t), \ddot{m}_i(t)]^{\top}\): Comprehensive trajectory feature vector
\(\mathbf{W}_i\): Time-invariant baseline characteristics
\(b_i\): Shared random effect inducing correlation between longitudinal and survival processes
Simulation Workflow
The data generation proceeds through a carefully orchestrated pipeline:
ODE Integration: Numerical solution of the coupled ODE system using adaptive Runge-Kutta methods with automatic step size control for optimal accuracy-efficiency tradeoff
Event Time Generation: Sophisticated sampling from the conditional hazard distribution via the
simsurv
engine, accounting for complex time-varying covariatesCensoring Mechanism: Realistic administrative censoring uniformly distributed between the 50th and 95th percentiles of event times, mimicking clinical trial follow-up patterns
Visit Scheduling: Adaptive measurement protocols with higher frequency during critical periods (quarterly initially, semi-annual subsequently) and stochastic 10% missingness
Parameter Interpretation Guide
Trajectory Dynamics (
beta
):beta[1]
: Homeostatic feedback strength (negative values promote stability)beta[2]
: Damping coefficient controlling oscillation suppressionbeta[3-4]
: Sensitivity to longitudinal covariatesbeta[5]
: Time trend capturing systematic changes
Hazard Association (
alpha
):alpha[1]
: Current biomarker value effect (positive indicates deleterious biomarker)alpha[2]
: Velocity effect capturing prognostic value of trajectory directionalpha[3]
: Acceleration effect reflecting stability importance (negative suggests protective stability)
Note
Parameter vector
beta
undergoes automatic normalization to unit length, ensuring model identifiability without user interventionVisit schedules intelligently adapt to individual follow-up durations, balancing information gain with practical constraints
Progress reporting can be silenced via
verbose = FALSE
for batch simulations
Examples
# Basic usage with default parameters
sim <- simulate()
#> Step 1/2: Generating survival times...
#> Step 2/2: Generating longitudinal data...
str(sim)
#> List of 2
#> $ longitudinal_data:'data.frame': 747 obs. of 8 variables:
#> ..$ id : int [1:747] 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ time : num [1:747] 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.5 ...
#> ..$ v : num [1:747] 0.1872 0.1778 0.1431 0.0385 -0.2236 ...
#> ..$ x1 : num [1:747] -0.0407 -0.0407 -0.0407 -0.0407 -0.0407 ...
#> ..$ x2 : num [1:747] -2 -2 -2 -2 -2 ...
#> ..$ biomarker : num [1:747] 0 -0.00646 -0.02413 -0.05051 -0.08327 ...
#> ..$ velocity : num [1:747] 0 -0.05 -0.0897 -0.1198 -0.1408 ...
#> ..$ acceleration: num [1:747] -0.2207 -0.1792 -0.1393 -0.1017 -0.0669 ...
#> $ survival_data :'data.frame': 50 obs. of 6 variables:
#> ..$ id : int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ time : num [1:50] 8.13 7.77 1.64 8.38 5.17 ...
#> ..$ status: int [1:50] 0 0 1 0 1 1 0 1 1 0 ...
#> ..$ w1 : num [1:50] 0.3219 -0.7838 1.5757 0.6429 0.0898 ...
#> ..$ w2 : num [1:50] 1.201 1.045 -1.003 1.848 -0.667 ...
#> ..$ b : num [1:50] 0.1371 -0.0565 0.0363 0.0633 0.0404 ...
# Check data characteristics
cat("Event rate:", mean(sim$survival_data$status), "\n")
#> Event rate: 0.68
cat(
"Observations per subject:",
nrow(sim$longitudinal_data) / nrow(sim$survival_data), "\n"
)
#> Observations per subject: 14.94
# Visualize trajectories with survival information
library(ggplot2)
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
# Select subjects with different outcomes
plot_data <- sim$longitudinal_data %>%
left_join(sim$survival_data[, c("id", "time", "status")],
by = "id", suffix = c("", "_event")
) %>%
filter(id %in% sample(unique(id), 9))
ggplot(plot_data, aes(x = time, y = v)) +
geom_line(aes(color = factor(status)), alpha = 0.7) +
geom_point(aes(color = factor(status)), size = 1) +
geom_vline(aes(xintercept = time_event),
linetype = "dashed", alpha = 0.5
) +
facet_wrap(~id, scales = "free_y", ncol = 3) +
scale_color_manual(
values = c("0" = "blue", "1" = "red"), labels = c("Censored", "Event")
) +
theme_minimal() +
labs(
x = "Time (years)", y = "Biomarker Value", color = "Outcome",
title = "Simulated Longitudinal Trajectories"
)