--- title: "6. Building a data object in R" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{6. Building a data object in R} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE, fig.width = 7, fig.height = 5 ) ``` Rceattle normally reads input data from an Excel workbook via `read_data()`. The same data object can be built entirely in R as a named list — useful when: - Simulating data for testing or MSE operating models - Generating inputs programmatically (e.g., multi-scenario loops or sensitivity analyses) - Integrating Rceattle into pipelines that don't use Excel Everything passed to `fit_mod()` lives in this list, so understanding its structure also helps you modify existing data objects read from Excel. This vignette walks through each component. The result can be validated with `data_check()` and fitted with `fit_mod()`, or exported back to Excel with `write_data()`. ### Required vs. optional fields The list are optional and only needed when a particular feature is turned on. `Rceattle` fills missing optional fields with safe defaults and stops model runs under the conditions that actually require them: | Field | Required when | |------------------------------------------|------------------------------------------------| | `comp_data` | (always optional) | | `caal_data` | `any(growth_model > 0)` (estimating growth) | | `emp_sel` | any fleet has `Selectivity = "Fixed"` | | `NByageFixed` | `any(estDynamics > 0)` | | `env_data` | env linkages | | `diet_data` | `msmMode > 0` | | `ration_data` | `msmMode > 0` | | Bioenergetics scalars (`Ceq`, `CA`, `CB`, `Tco`, ...) | `msmMode > 0` | In default single-species mode (`msmMode = 0`) you can omit any of these by simply not setting the field on the list. The single-species example below shows the full machinery, but feel free to skip sections you don't need. ## Setup ```{r} library(Rceattle) # Define core dimensions up front — these drive the size of all downstream arrays. nspp <- 2 # number of species (or stocks) nages <- 10 # maximum age classes (same for all species here) nlengths <- 20 # number of length bins (for age-length keys / CAAL data) nyrs <- 30 # number of hindcast years years <- 1:nyrs simData <- list() ``` ## 1. Species / model controls These scalars and vectors describe the biological structure of each species and control which population model is used. ```{r} simData$nspp <- nspp simData$styr <- 1 # first year of hindcast simData$endyr <- nyrs # last year of hindcast simData$projyr <- nyrs + 10 # last year of the projection period # Character labels shown in plots and output tables simData$spnames <- paste0("Species", 1:nspp) # nsex: 1 = combined-sex model (male and female treated as one group), # 2 = sex-structured model (separate selectivity and mortality by sex) simData$nsex <- rep(1, nspp) # Calendar month of spawning — used to compute the fraction-of-year weight # applied to spawning stock biomass (e.g., 6 = mid-year spawning) simData$spawn_month <- rep(6, nspp) simData$nages <- rep(nages, nspp) # number of age classes per species simData$minage <- rep(1, nspp) # minimum age (usually 1) simData$nlengths <- rep(nlengths, nspp) # estDynamics controls how population abundance is estimated: # 0 = estimate population dynamics (standard statistical catch-at-age) # 1 = fix N-at-age exactly to values in NByageFixed (e.g., external estimates) # 2 = scale NByageFixed by a single estimated scalar per species simData$estDynamics <- rep(0, nspp) # Indices linking each species to its weight and growth datasets. # Multiple weight schedules can exist (e.g., fleet-specific); these point to # which one is used for total biomass (pop) and spawning biomass (ssb). simData$pop_wt_index <- 1:nspp simData$ssb_wt_index <- 1:nspp simData$pop_age_transition_index <- rep(1, nspp) # Length-weight relationship coefficients: W = alpha * L ^ beta. # Used internally to convert length-based selectivity to weight when estimating growth. simData$alpha_wt_len <- rep(0.00001, nspp) simData$beta_wt_len <- rep(3, nspp) # Prior or initial standard deviation for log-recruitment deviations. Controls the # strength of the recruitment penalty in penalized-likelihood mode. simData$sigma_rec_prior <- rep(1, nspp) # Non-modelled ("other") prey biomass available to predators. Used in # multi-species mode to prevent unrealistically high predation mortality when # modelled prey are scarce. Setting this too low forces predators to over-eat # modelled prey; too high dilutes the predation signal. simData$other_food <- rep(1e5, nspp) ``` ## 2. Fleet control table This is the most important configuration table. Each row defines one fleet (fishery or survey) and controls how it is modelled. Multiple fleets can target the same species. ```{r} # Fleet_type: # "Fishery" -- contributes catch and catch-at-age/length composition data # "Survey" -- contributes an abundance index and composition data # "Off" -- fleet is present in the data but excluded from the likelihood # Selectivity options (Selectivity column) — use the string name or integer code: # "Fixed" (0) -- fixed at values in emp_sel; not estimated # "Logistic" (1) -- two-parameter ascending logistic (a50, slope) # "NonParametric" (2) -- freely estimated selectivity-at-age/length with smoothness penalty # "DoubleLogistic" (3) -- ascending + descending dome-shaped logistic # "DescendingLogistic" (4) -- descending logistic only (dome-shaped for oldest ages) # "Hake" (5) -- age-specific logistic with spawning migration pattern # "2DAR1" (6) -- 2D autoregressive random effects (age × year) # "3DAR1" (7) -- 3D autoregressive random effects (age × year × fleet) # # Selectivity_dimension: "Age" or "Length" # Length-based selectivity requires an age-to-length transition matrix (see below). # Catchability options (Catchability column) — use the string name or integer code: # "Fixed" (0) -- Q fixed at Q_prior; not estimated # "Estimated" (1) -- freely estimated on log scale # "Estimated-with-prior" (2) -- estimated with a lognormal prior (Q_prior, Q_sd_prior) # "Analytical" (3) -- Ludwig & Walters / Martell (1994) analytical Q # "PowerEquation" (4) -- power-equation Q (habitat area scaling) # "Environmental" (5) -- mu_q + X * beta, where X is a covariate in env_data # "AR1" (6) -- auto-regressive catchability (Rogers et al. 2024) # NA -- not applicable (use for fisheries without CPUE) # Comp_loglike / CAAL_loglike options: # "Multinomial" -- standard multinomial # "DirichletMultinomial" -- overdispersed; estimates a concentration parameter # "MultinomialAFSC" -- AFSC variant of multinomial (marginal age + length) simData$fleet_control <- data.frame( Fleet_name = paste0(c("Survey", "Fishery"), rep(paste(" Species", 1:nspp), each = 2)), Fleet_code = 1:(nspp * 2), # unique integer ID per fleet Fleet_type = rep(c("Survey", "Fishery"), nspp), Species = rep(1:nspp, each = 2), Month = 0, # 0 = mid-year; calendar month otherwise # Selectivity_index links fleets that share the same selectivity curve. # Fleets with the same index share all selectivity parameters. Selectivity_index = 1:(nspp * 2), Selectivity = "Logistic", Selectivity_dimension = "Age", N_sel_bins = NA, # used for NonParametric only Sel_curve_pen1 = NA, # smoothness penalty (NonParametric) Sel_curve_pen2 = NA, # Time_varying_sel: # 0 / "Off" = time-invariant (default) # 1 / "IID" = penalized log-normal deviates (SD fixed at Time_varying_sel_sd_prior # unless random_sel = TRUE in fit_mod) # 2 / "AR1" = AR(1) (not yet implemented) # 3 / "Block" = time blocks (specify via R-side mapping) # 4 / "RandomWalk" = random walk # 5 / "RandomWalkAscending" = random walk on ascending portion of double logistic Time_varying_sel = 0, Time_varying_sel_sd_prior = 1, Bin_first_selected = 1, # youngest fully-selected age or length bin Sel_norm_bin1 = NA, # bin to normalise selectivity from (optional) Sel_norm_bin2 = NA, # Composition likelihoods Comp_loglike = "Multinomial", Comp_weights = 1, # input effective sample size multiplier CAAL_loglike = "Multinomial", CAAL_weights = 1, # Index units and weight/growth linkage Weight1_Numbers2 = 1, # 1 = weight (mt), 2 = numbers Weight_index = rep(1:nspp, each = 2), Age_transition_index = 1, # Q_index links fleets that share the same catchability parameter. Q_index = 1:(nspp * 2), Catchability = rep(c("Estimated", NA), nspp), Q_prior = rep(c(1, NA), nspp), Q_sd_prior = rep(c(0.2, NA), nspp), # Time_varying_q: # 0 / "Off" = time-invariant # 1 / "IID" = penalized log-normal deviates (SD fixed at Time_varying_q_sd_prior # unless random_q = TRUE in fit_mod) # 2 / "AR1" = AR(1) (not yet implemented) # 3 / "Block" = time blocks # 4 / "RandomWalk" = random walk Time_varying_q = rep(c(0, NA), nspp), Time_varying_q_sd_prior = rep(c(1, NA), nspp), # Estimate_index_sd: # 0 = fix at Log_sd from index_data # 1 = estimate a single SD # 2 = analytical sigma (Ludwig & Walters 1994) Estimate_index_sd = rep(c(0, NA), nspp), Index_sd_prior = rep(c(1, NA), nspp), # Estimate_catch_sd: # 0 = fix at Log_sd from catch_data (treats catch as known) # 1 = estimate Estimate_catch_sd = rep(c(NA, 0), nspp), Catch_sd_prior = rep(c(NA, 1), nspp), # proj_F_prop: how future F is distributed across fishery fleets per species. # Must sum to 1 across fishery fleets for each species. NA for surveys. proj_F_prop = rep(c(NA, 1), nspp) ) ``` ## 3. Survey index data One row per fleet-year combination for survey fleets. `Observation` is the log-scale biomass index (or numbers if `Weight1_Numbers2 = 2`). A negative `Year` value tells Rceattle to predict the index without including it in the likelihood, which is useful for model checking. ```{r} simData$index_data <- data.frame( Fleet_name = rep(paste("Survey Species", 1:nspp), each = nyrs), Fleet_code = rep(seq(1, nspp * 2, by = 2), each = nyrs), Species = rep(1:nspp, each = nyrs), Year = rep(years, nspp), Month = 0, # Selectivity_block: integer index into the selectivity/Q parameter block. # Used when Time_varying_sel or Time_varying_q = 3 (time blocks). Selectivity_block = 1, Observation = rnorm(nyrs * nspp), # replace with real log-index values Log_sd = 0.1 # observation SE on the log scale ) ``` ## 4. Catch data One row per fishery fleet-year combination. `Log_sd` is typically set small (0.01–0.05) when catch is treated as known; increase it if catch is highly uncertain. ```{r} simData$catch_data <- data.frame( Fleet_name = rep(paste("Fishery Species", 1:nspp), each = nyrs), Fleet_code = rep(seq(2, nspp * 2, by = 2), each = nyrs), Species = rep(1:nspp, each = nyrs), Year = rep(years, nspp), Month = 0, Selectivity_block = 1, Catch = abs(rnorm(nyrs * nspp)), # replace with real catch (mt) Log_sd = 0.05 ) ``` ## 5. Composition data (optional) Age or length composition data. `Age0_Length1 = 0` for age compositions, `1` for length compositions. `Sex = 0` for combined sex, `1` for females, `2` for males, `3` for joint female+male. Composition columns (`Comp_1`, `Comp_2`, ...) can be raw counts or proportions — Rceattle normalises internally. Skip this section entirely (don't set `simData$comp_data`) if no composition data are available — `clean_data()` will fill it with an empty data.frame. ```{r} comp_matrix <- matrix(abs(rnorm(nyrs * nspp * nages * 2)), nrow = nyrs * nspp * 2, ncol = nages) colnames(comp_matrix) <- paste0("Comp_", 1:nages) simData$comp_data <- cbind( data.frame( Fleet_name = c(rep(paste("Survey Species", 1:nspp), each = nyrs), rep(paste("Fishery Species", 1:nspp), each = nyrs)), Fleet_code = rep(c(seq(1, nspp * 2, 2), seq(2, nspp * 2, 2)), each = nyrs), Species = c(rep(1:nspp, each = nyrs), rep(1:nspp, each = nyrs)), Sex = 0, Age0_Length1 = 0, Year = rep(years, 2 * nspp), Month = 0, Sample_size = 200 ), comp_matrix ) ``` ## 6. Conditional age-at-length (CAAL) data (optional) CAAL data provide observed age frequencies within each length bin and allow the model to jointly fit length and age compositions, and optionally estimate the growth curve internally (see `vignette("model-parameterizations")` and `?build_growth`). One row per fleet-sex-year-length-bin combination. `caal_data` becomes **required** when growth is being estimated internally (`any(growth_model > 0)`); otherwise it can be omitted. ```{r} caal_matrix <- matrix(abs(rnorm(nyrs * nspp * nages * 2 * nlengths)), nrow = nyrs * nspp * 2 * nlengths, ncol = nages) colnames(caal_matrix) <- paste0("CAAL_", 1:nages) simData$caal_data <- cbind( data.frame( Fleet_name = c(rep(paste("Survey Species", 1:nspp), each = nyrs * nlengths), rep(paste("Fishery Species", 1:nspp), each = nyrs * nlengths)), Fleet_code = rep(c(seq(1, nspp * 2, 2), seq(2, nspp * 2, 2)), each = nyrs * nlengths), Species = c(rep(1:nspp, each = nyrs * nlengths), rep(1:nspp, each = nyrs * nlengths)), Sex = 0, Year = rep(years, 2 * nspp * nlengths), Length = rep(1:nlengths, 2 * nspp * nyrs), Sample_size = 200 ), caal_matrix ) ``` ## 7. Biological inputs ### Age-to-length transition matrix The probability that a fish of age *a* is observed in length bin *l*. Required when fitting length compositions or estimating growth internally. Each row must sum to 1. The placeholder below uses an identity matrix (age *a* maps exactly to length bin *a*); replace with empirical or model-based growth matrices. ```{r} age_trans_list <- lapply(1:nspp, function(sp) { tmp <- matrix(0, nrow = nages, ncol = nlengths) diag(tmp[1:min(nages, nlengths), 1:min(nages, nlengths)]) <- 1 colnames(tmp) <- paste0("Length_", 1:nlengths) cbind(data.frame(Age_transition_name = paste("Species", sp), Age_transition_index = sp, Species = sp, Sex = 0, Age = 1:nages), tmp) }) simData$age_trans_matrix <- do.call(rbind, age_trans_list) ``` ### Ageing error matrix The probability that a fish of true age *t* is read as observed age *o*. An identity matrix (no error) is a conservative starting point. Ageing error matrices can be generated with standard tools (e.g., the AgeingError package) and substituted here. ```{r} age_error_list <- lapply(1:nspp, function(sp) { tmp <- as.data.frame(diag(1, nages)) colnames(tmp) <- paste0("Obs_age", 1:nages) cbind(data.frame(Species = sp, True_age = 1:nages), tmp) }) simData$age_error <- do.call(rbind, age_error_list) ``` ### Weight-at-age Mean weight (metric tonnes) at each age. Setting `Year = 0` applies the same schedule to all years (time-invariant). Providing multiple rows with different years enables time-varying weight-at-age, which is used in empirical weight-at-age models. ```{r} weight_matrix <- matrix((1:nages / nages)^3 * 0.01, nrow = nspp, ncol = nages, byrow = TRUE) colnames(weight_matrix) <- paste0("Age", 1:nages) simData$weight <- cbind( data.frame(Wt_name = paste("Species", 1:nspp), Wt_index = 1:nspp, Species = 1:nspp, Sex = 0, Year = 0), # Year = 0 → time-invariant weight_matrix ) ``` ### Maturity-at-age Proportion mature at each age, used to compute spawning stock biomass. Values should range from 0 to 1. For sex-structured models (`nsex = 2`) this represents the proportion of females that are mature. ```{r} mat_matrix <- matrix(1, nrow = nspp, ncol = nages) colnames(mat_matrix) <- paste0("Age", 1:nages) simData$maturity <- cbind(data.frame(Species = 1:nspp), mat_matrix) ``` ### Sex ratio at age Proportion female at each age. Used in combined-sex models (`nsex = 1`) to weight spawning biomass contributions. In two-sex models (`nsex = 2`) this is the proportion female at recruitment. ```{r} sex_matrix <- matrix(0.5, nrow = nspp, ncol = nages) colnames(sex_matrix) <- paste0("Age", 1:nages) simData$sex_ratio <- cbind(data.frame(Species = 1:nspp), sex_matrix) ``` ### Fixed selectivity (optional) Pre-specified selectivity-at-age values for fleets whose selectivity is fixed (not estimated). **Required** when any fleet has `Selectivity = "Fixed"`. Skip if all fleets estimate their own selectivity (as in this example). When you do need it, the table looks like: ```{r} # Example for one fleet whose selectivity is fixed at the supplied row of values emp_sel <- data.frame( Fleet_name = "Survey Species 1", Fleet_code = 1, Species = 1, Sex = 0, Year = 0, # 0 = applied to every year t(matrix(c(0.1, 0.5, 1, rep(1, nages - 3)), ncol = 1, dimnames = list(paste0("Comp_", 1:nages), NULL))) ) simData$emp_sel <- NULL ``` ### Fixed numbers-at-age (optional) Used when `estDynamics > 0`: - `1` = numbers-at-age fixed exactly to these values - `2` = numbers-at-age scaled by a single estimated scalar per species - `3` = numbers-at-age scaled by age-specific estimated scalars per species **Required** when `any(estDynamics > 0)`. Skip if you're estimating population dynamics from scratch (the default `estDynamics = 0`). When you do need it, the table has columns `Species_name`, `Species`, `Sex`, `Year`, then `Age1 ... Age` for the fixed numbers-at-age. ### Natural mortality (M1 base) Residual natural mortality-at-age before any predation component. In single-species mode, total mortality M = M1. In multi-species mode, M = M1 + M2 where M2 is predation mortality estimated from the bioenergetics model. When switching to multi-species mode it is important to reduce M1 accordingly (see `vignette("single-vs-multispecies")`). ```{r} m_matrix <- matrix(0.2, nrow = nspp, ncol = nages) colnames(m_matrix) <- paste0("Age", 1:nages) simData$M1_base <- cbind(data.frame(Species = 1:nspp, Sex = 0), m_matrix) ``` M1 can also be estimated rather than fixed. See `?build_M1` for options including estimating a single scalar, an age-specific vector, or a temperature-dependent function. ### Weight-at-age Used for biomass calculations. Linked to species or fleets. Set Year = 0 to apply the same weight-at-age to all years (time-invariant). Age_1 ... Age_nages columns hold mean weight (usually kg) at each age. ```{r} weight_matrix <- matrix( (1:nages / nages)^3 * 0.01, # Simple power curve placeholder nrow = nspp, ncol = nages, byrow = TRUE ) colnames(weight_matrix) <- paste0("Age", 1:nages) simData$weight <- cbind( data.frame( Wt_name = paste("Species", 1:nspp), Wt_index = 1:nspp, Species = 1:nspp, Sex = 0, Year = 0 # Year = 0 applies to all years (time-invariant) ), weight_matrix ) ``` ### Environmental covariate data (optional) An optional time series used in environment-recruitment, -M1, -catchability, or -growth relationships. Missing years are filled with the column mean. Column names after `Year` can be anything and are referenced by column order. Skip this if no environmental covariates are used. `clean_data()` will fall back to a Year-only data.frame with no indices, and `data_check()` will only complain when a feature actually needs an index (env-q catchability, `Ceq[sp] > 1`, or any env linkage). ```{r} simData$env_data <- data.frame( Year = 1:nyrs, EnvData = 1 # constant = no environmental effect (placeholder) ) ``` ## 8. Multi-species bioenergetics parameters (optional in single-species mode) These parameters govern the Wisconsin bioenergetics model used to compute predation mortality (M2) in multi-species mode. They are unused in single-species mode and can be omitted entirely — `switch_check()` fills any missing scalars with safe sentinels (`Ceq = 1`, `Pvalue = 1`, `fday = 365`, others `0`) so TMB's length-`nspp` `DATA_VECTOR` requirements are satisfied. In multi-species mode (`msmMode > 0`) **all** of `Ceq`, `Cindex`, `Pvalue`, `fday`, `CA`, `CB`, `Qc`, `Tco`, `Tcm`, `Tcl`, `CK1`, `CK4` must be supplied as length-`nspp` vectors; `data_check()` will list any that are missing or wrong-length. Likewise `diet_data` and `ration_data` are required. See `?build_M1` and the CEATTLE technical documentation for full details on each parameter. ```{r} # Ceq: consumption equation form following Hanson et al. (1997) # 1 = Kitchell et al. (1977) # 2 = Stewart et al. (1983) -- most common for gadids # 3 = Kitchell et al. (1977) with temperature dependence on both limbs # 4 = Thornton and Lessman (1978) simData$Ceq <- rep(4, nspp) simData$Cindex <- rep(1, nspp) # temperature index type simData$Pvalue <- rep(1, nspp) # proportion of maximum consumption realised simData$fday <- rep(1, nspp) # foraging days per year # Wisconsin model coefficients (CA, CB) and temperature dependence parameters simData$CA <- rep(1, nspp) simData$CB <- rep(-1, nspp) simData$Qc <- rep(1, nspp) simData$Tco <- rep(1, nspp) simData$Tcm <- rep(1, nspp) simData$Tcl <- rep(1, nspp) simData$CK1 <- rep(1, nspp) simData$CK4 <- rep(1, nspp) # Diet composition likelihood for each predator species # 0 = Multinomial, 1 = Dirichlet-multinomial simData$Diet_loglike <- rep(0, nspp) simData$Diet_comp_weights <- rep(1, nspp) # Ration data (optional in single-species mode): observed mean ration-at-age from # bioenergetics studies. Required when msmMode > 0. Skip in single-species mode. # # When provided, columns are: Species, Sex, Year, Age1, ..., Age. # Diet composition data: observed stomach content proportions (predator-prey pairs). # Required when msmMode > 0; skip in single-species mode. # # When provided, columns are: Pred, Prey, Pred_sex, Prey_sex, Pred_age, # Prey_age, Year, Sample_size, Stomach_proportion_by_weight. ``` ## 9. Validate and fit `plot_data()` prints a summary of the data list. ```{r} plot_data(simData) ``` ```{r} # Key fit_mod() arguments: # # estimateMode: 0 = full fit (hindcast + projection) # 1 = hindcast only # 2 = projection only (fix parameters, run HCR) # 3 = evaluate only (MakeADFun, no optimisation) -- useful for debugging # # msmMode: 0 = single-species, 1 = Type II MSVPA, 2 = Type III MSVPA # # initMode: 0 = freely estimated initial N-at-age # 1 = equilibrium, F_init = 0, no init deviations # 2 = equilibrium, F_init = 0, with init deviations (default) # 3 = non-equilibrium: F_init estimated, init devs included # 4 = non-equilibrium: F_init scales R0 # # phase: TRUE = use parameter phasing for improved convergence (recommended) sim_run <- fit_mod( data_list = simData, estimateMode = 3, # change to 0 for a real fit msmMode = 0, initMode = 2, random_rec = FALSE, fit_control = fit_control( phase = TRUE, verbose = 1)) ``` ## 10. Export back to Excel (optional) `write_data()` exports the data list to an Excel workbook in the standard Rceattle template format, which can then be edited and re-read with `read_data()`. This makes it easy to share data objects or move between R-based and spreadsheet-based workflows. ```{r} write_data(simData, file = tempfile(fileext = ".xlsx")) ```