### Embracing sampling uncertainty in analyses with COM(P)ADRE

by Owen Jones on Jun 8, 2018

**by Patrick Barks (University of Southern Denmark, email: barks@biology.sdu.dk)**

The COM(P)ADRE Plant and Animal Matrix Databases together contain thousands of population projection matrices from hundreds of individual studies. The availability of these matrices to researchers has led to fascinating comparative analyses in the fields of ecology, evolution, and demography, at taxonomic, spatial, and temporal scales that would not otherwise be possible (see here for a list of relevant publications).

One of the challenges inherent in such analyses is that it’s often difficult to obtain information regarding the degree of sampling uncertainty associated with the values that populate projection matrices (i.e. stage- or age-specific transition rates based on survival, growth, and reproduction). These transition rates are almost always estimates of population parameters (“population” in the statistical sense) based on samples, and therefore have associated sampling uncertainty, as do any parameters derived from them (e.g. population growth rate, damping ratio, life expectancy)^{1}. Transition estimates based on a small number of individuals will tend to have large uncertainty, while those based on larger samples will have less. For example, the figure below depicts the sampling uncertainty associated with a stage-specific survival rate of 40% estimated from a random sample of N = 5 individuals vs. N = 50 individuals.

Whereas sampling uncertainty is routinely incorporated into statistical analyses in the original studies that produce projection matrices, it is rarely incorporated into analyses that use published projection matrix data from sources like COM(P)ADRE. This omission may lead to bias or overconfidence in some types of analyses. To investigate this possibility, we are initiating a study to examine the nature and distribution of sampling uncertainty among projection matrices in the COMPADRE Plant Matrix Database. Our goals are to:

- understand whether uncertainty in transition rates is relevant for analyses based on COMPADRE,
- assess the types of variables or analyses that are most likely to be affected by sampling uncertainty, and
- develop resources to help researchers incorporate sampling uncertainty into their analyses.
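
The N = 5 versus N = 50 comparison described earlier can be reproduced with a quick simulation. The sketch below is illustrative and is not the code behind the figure: it assumes each individual survives independently with probability 0.4 (a binomial sampling model), and the function names are hypothetical ones chosen for this example.

```python
import random
import statistics


def simulate_survival_estimates(true_rate, n_individuals, n_draws=10_000, seed=1):
    """Simulate the survival rate we would estimate from repeated random samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_draws):
        # Each sampled individual survives independently with probability true_rate.
        survivors = sum(rng.random() < true_rate for _ in range(n_individuals))
        estimates.append(survivors / n_individuals)
    return estimates


def central_interval(values, level=0.95):
    """Return the central `level` interval of a list of simulated values."""
    ordered = sorted(values)
    lower = ordered[int((1 - level) / 2 * (len(ordered) - 1))]
    upper = ordered[int((1 + level) / 2 * (len(ordered) - 1))]
    return lower, upper


small = simulate_survival_estimates(0.4, 5)    # N = 5 individuals
large = simulate_survival_estimates(0.4, 50)   # N = 50 individuals

print("N = 5: ", central_interval(small))
print("N = 50:", central_interval(large))
```

Both sets of estimates are centred on 0.4, but the interval for N = 5 is much wider, which is the pattern the figure is meant to convey.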

### An example analysis using COMPADRE

To make the issue of sampling uncertainty more concrete, we’ll work through an example analysis with COMPADRE. Specifically, we’ll test the hypothesis that relatively long-lived species tend to experience relatively low year-to-year variation in population growth rates (λ). For simplicity, we’ll limit this analysis to species categorized as herbaceous perennials, and to unmanipulated populations with at least three annual transition matrices in COMPADRE (plus a few more selection criteria noted in the RMarkdown document here).

In the figure above, each point represents a population (as defined in COMPADRE), and the best-fit line is from a linear mixed model that accounts for non-independence of populations from the same species. There are of course different modeling approaches we could have taken (e.g. estimating life expectancy at the species level rather than the population level, or using a more complete model of phylogenetic non-independence), but we’ll save some of that for later.

For now, we’d like to know: how wide are the error bars associated with each point in the figure above? The regression model assumed zero uncertainty in both life expectancy and variance(log λ), but as previously noted, both variables are estimates of population parameters with inherent sampling uncertainty. Let’s take a detour here to try to estimate sampling uncertainty for a single population.

### Modeling uncertainty in transition rates

Consider a set of matrices available in COMPADRE from a 6-year study of the perennial forb *Agrimonia eupatoria* (Rosaceae) in southern Sweden (Kiviniemi 2002)^{2}. The matrices give us point estimates for each transition rate in each year, which we can use to calculate point estimates for derived parameters such as life expectancy, λ, and variance(log λ). But to estimate the uncertainty in all these parameters, we need information from outside COMPADRE^{3}.

First, we need to know how the transition rates were estimated. Based on the original paper, survival transitions were estimated directly from the fates of marked individuals, i.e.

$$A_{ij} = \frac{\text{number transitioning from stage } i \text{ in year } t \text{ to stage } j \text{ in year } t+1}{\text{number in stage } i \text{ in year } t}$$

and the single fecundity transition was estimated using the anonymous reproduction method, i.e.

$$A_{sr} = \frac{\text{number in seedling stage in year } t+1}{\text{number in reproductive stage in year } t}$$

Given this methodology, to reconstruct the raw counts from which each transition rate was estimated, all we need are the denominators in the equations above (i.e. stage-specific sample sizes for each transition period), which the original paper helpfully provides. The figure below shows point estimates for each transition rate (open circles), as well as 90% and 99% confidence intervals based on the relevant sampling distribution (thin and thick bars; assuming multinomial and Poisson distributions for the survival and fecundity transitions, respectively)^{4}.

To estimate the sampling distributions of the derived parameters, we generate thousands of simulated projection matrices by repeatedly drawing from the sampling distribution of each transition rate. The sampling distributions for the derived parameters are summarized below, again alongside the corresponding point estimates.

For some transition periods, the point estimate for life expectancy is quite far from the respective confidence interval. This can occur when the sample of individuals in one or more stage classes experiences 100% survival, in which case the point estimate for life expectancy may be very high, whereas the sampling distribution for those survival parameters only includes values ≤ the point estimate (e.g. see the 1996-97 seedling-to-juvenile transition)^{5}.
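
The simulation procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the figures: the counts are invented (they are not the Kiviniemi data), the three-stage life cycle is hypothetical, and we assume a simple parametric bootstrap in which fates are redrawn from a multinomial distribution and seedling counts from a Poisson distribution. All function names are our own for this example.

```python
import math
import random

# Hypothetical counts for illustration only; NOT the Kiviniemi data.
# FATE_COUNTS[j] = fates of individuals marked in stage j in year t:
# [found as seedling, found as juvenile, found as reproductive, died].
FATE_COUNTS = [[0, 12, 0, 28], [0, 10, 6, 4], [0, 2, 21, 2]]
SEEDLINGS_NEXT_YEAR = 35  # seedlings counted in year t+1
N_STAGES = 3
REPRODUCTIVE = 2


def build_matrices(fate_counts, seedlings):
    """Turn raw counts into (U, A): survival-only and full projection matrices."""
    U = [[0.0] * N_STAGES for _ in range(N_STAGES)]
    for j, counts in enumerate(fate_counts):
        for i in range(N_STAGES):
            U[i][j] = counts[i] / sum(counts)  # column j holds the fates of stage j
    A = [row[:] for row in U]
    # Anonymous reproduction: seedlings per reproductive individual.
    A[0][REPRODUCTIVE] += seedlings / sum(fate_counts[REPRODUCTIVE])
    return U, A


def draw_multinomial(rng, counts):
    """Resample the fates of one stage class, keeping its sample size fixed."""
    draws = rng.choices(range(len(counts)), weights=counts, k=sum(counts))
    return [draws.count(k) for k in range(len(counts))]


def draw_poisson(rng, mean):
    """Draw a Poisson variate with Knuth's method (adequate for modest means)."""
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def growth_rate(A, n_iter=200):
    """Approximate the dominant eigenvalue (lambda) by power iteration."""
    v = [1.0 / len(A)] * len(A)
    lam = 0.0
    for _ in range(n_iter):
        w = [sum(a * x for a, x in zip(row, v)) for row in A]
        lam = sum(w)
        if lam == 0:
            return 0.0
        v = [x / lam for x in w]
    return lam


def life_expectancy(U, start_stage=0):
    """Mean years lived from start_stage: a column sum of (I - U)^-1."""
    n = len(U)
    # Augmented system (I - U) x = e_start, solved by Gauss-Jordan elimination.
    M = [[(1.0 if i == j else 0.0) - U[i][j] for j in range(n)]
         + [1.0 if i == start_stage else 0.0] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            return math.inf  # singular system, e.g. a 100% survival loop
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return sum(M[i][n] / M[i][i] for i in range(n))


def simulate(n_sims=1000, seed=1):
    """Parametric bootstrap of lambda and life expectancy."""
    rng = random.Random(seed)
    lambdas, expectancies = [], []
    for _ in range(n_sims):
        fates = [draw_multinomial(rng, counts) for counts in FATE_COUNTS]
        seedlings = draw_poisson(rng, SEEDLINGS_NEXT_YEAR)
        U, A = build_matrices(fates, seedlings)
        lambdas.append(growth_rate(A))
        expectancies.append(life_expectancy(U))
    return lambdas, expectancies


def interval(values, level=0.90):
    """Central `level` interval of the simulated values."""
    ordered = sorted(values)
    return (ordered[int((1 - level) / 2 * (len(ordered) - 1))],
            ordered[int((1 + level) / 2 * (len(ordered) - 1))])


point_U, point_A = build_matrices(FATE_COUNTS, SEEDLINGS_NEXT_YEAR)
lambdas, expectancies = simulate()
print("lambda:", round(growth_rate(point_A), 3), interval(lambdas))
print("life expectancy:", round(life_expectancy(point_U), 3), interval(expectancies))
```

Note that the code stores transitions with the source stage in the column, which is the usual matrix-model layout. Repeating this for each annual matrix and taking the variance of log λ across years within each simulation would give a sampling distribution for variance(log λ) as well.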

### Incorporating sampling uncertainty into our analysis

We can now add the sampling uncertainty for the population of *Agrimonia eupatoria* to our original figure. On one hand, the sampling uncertainty for this population seems high. On the other hand, we have a lot of data, and the relationship between life expectancy and variance(log λ) is strong. If we extrapolate (wildly) from this and a few other populations for which we have data, to make simple assumptions about the distribution of sampling uncertainty among all populations, we’ll find that the observed degree of uncertainty is unlikely to change the results of the current analysis.

The figure below depicts predictions from an extension of the previously described mixed-effects model that now also incorporates simulated measurement error in both life expectancy and variance(log λ). The results are essentially unchanged. Perhaps this will be the case for many analyses with COMPADRE. But perhaps some types of analyses based on smaller subsets of data, or with more marginal effect sizes, will be more strongly influenced by sampling uncertainty. Either way, we think it warrants investigation, and we hope to report back with the answer.
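
One simple way to get a feel for how much measurement error matters is a Monte Carlo sensitivity check: redraw every point from its assumed sampling distribution, refit the regression, and see how much the slope moves. The sketch below is a deliberately simplified stand-in for the mixed-effects model described above. It uses ordinary least squares, assumes independent normal errors on the log scale, and runs on invented values (`POPULATIONS` is hypothetical, not COMPADRE data).

```python
import random
import statistics

# Hypothetical populations, for illustration only:
# (log life expectancy, its SE, log variance(log lambda), its SE)
POPULATIONS = [
    (0.8, 0.20, -2.1, 0.50),
    (1.2, 0.15, -2.9, 0.40),
    (1.6, 0.25, -3.2, 0.60),
    (2.1, 0.10, -4.0, 0.30),
    (2.5, 0.30, -4.3, 0.50),
    (3.0, 0.20, -5.1, 0.40),
]


def ols_slope(xs, ys):
    """Slope of the ordinary least squares line of ys on xs."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def slopes_under_measurement_error(populations, n_sims=2000, seed=1):
    """Refit the regression after redrawing each point from its assumed error."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_sims):
        xs = [rng.gauss(x, se_x) for x, se_x, _, _ in populations]
        ys = [rng.gauss(y, se_y) for _, _, y, se_y in populations]
        slopes.append(ols_slope(xs, ys))
    return slopes


point_slope = ols_slope([p[0] for p in POPULATIONS], [p[2] for p in POPULATIONS])
slopes = slopes_under_measurement_error(POPULATIONS)
share_negative = sum(s < 0 for s in slopes) / len(slopes)

print("point-estimate slope:", round(point_slope, 3))
print("share of simulated slopes below zero:", share_negative)
```

If the sign and rough magnitude of the slope survive this kind of perturbation, sampling uncertainty is unlikely to overturn the conclusion. A formal errors-in-variables model would be needed to correct the estimate itself, because adding noise to the predictor tends to pull the fitted slope toward zero.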

### References

Kiviniemi, K. (2002). Population dynamics of *Agrimonia eupatoria* and *Geum rivale*, two perennial grassland species. *Plant Ecology*, 159, 153-169. https://doi.org/10.1023/A:1015506019670

### Notes

^{1}Some of the matrices in COM(P)ADRE may in fact be based on data from entire biological populations rather than samples. Whether these map to ‘statistical populations’ will depend on the research question.

^{2}Kiviniemi (2002) studied two populations of *A. eupatoria*, denoted A and B. Our analysis only includes population B, because the annual matrices for population A were mostly non-ergodic.

^{3}Apart from uncertainty in the underlying transition rates, the uncertainty in variance(log λ) is also a function of the number of years over which it was estimated. This latter component of uncertainty is straightforward to model, but we ignore it here for simplicity.

^{4}Because the transition rates reported in the original paper were estimated independently across years and transition types, our estimates of sampling error make the same assumptions. But now that we’ve reconstructed the raw data, we could of course model the transitions using a more nuanced correlational structure — e.g. partially pooling across years, or allowing for correlations among transition types.

^{5}Note also that the point estimate for life expectancy for the 1994-95 transition was incalculable, because the estimated transition rates implied a 100% survival loop between the final two stage classes (i.e. infinite life expectancy).