# Develop best practice recommendations for combining seabird study data collected from different platforms: study

This study developed best practice guidance to combine seabird survey data collected from different platforms based on a literature review, expert knowledge and a bespoke model development including sensitivity analysis. This can be used in environmental assessments for planning and licensing.

## 6 Best practice for multi-survey analyses of seabird distributions

Ideally, the data used for robust spatial prediction of species distributions should be both high-resolution and spatially expansive. However, logistical trade-offs between spatiotemporal extent and resolution mean that such in-depth and geographically broad data are rarely available in practice. Instead, researchers need to piece together data from different places, times, or survey methods. Such integration presents several challenges (see Section 5, above), but it also offers remarkable opportunities. For example, data from different places and times allow us to increase the spatial extent of our maps and of our historical reconstructions. More importantly, they allow us to model the focal species under distinct circumstances, hence increasing the transferability of our model predictions. Also, if the survey designs differ (e.g. different resolutions and different field methodologies), simultaneous analysis has the potential to allow different surveys to effectively ground-truth (i.e. calibrate) each other. We will approach the objective of this project in four stages.

The following recommendations build upon the closing section of Fletcher et al. (2019). We have arrived at these based not only on our literature review (see above), but also on practical experimentation with realistic simulated data (see accompanying R library, manual and vignette).

### 6.1 Appropriate response and explanatory variables

#### Keep the highest-grade form of data

Even if occasionally true, the notion that occupancy models are more robust than models based on abundance can be misleading, since occupancy represents lower-grade information. Intentionally thresholding abundance data into presence and absence represents considerable information loss, precludes predictions of spatial distribution (instead, yielding surfaces for the probability of presence) and is therefore best avoided. If individual detections are available, these may be used in preference to aggregated counts (Section 5.1). Similar arguments apply to downgrading explanatory data (Section 5.2).

#### Analyse even low-grade data as if originating from abundance

Many data types may be curtailed at the stage of data collection. Citizen scientists may record species presence only once and transect surveys may aggregate counts of birds in each transect segment. Therefore, we may not have the option to analyse high-grade data, but this should not preclude us from modelling the underlying data-generating process as an intensity surface, referring to expected abundance. Treating this surface as latent, but common to all surveys in the pooled data set, enables the integration of multiple surveys and data types into a common statistical platform, a pre-requisite for pooled analyses.
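As a minimal sketch of this idea, the Python snippet below scores a single shared intensity (expected birds per km²) against two data types at once: segment counts and detection/non-detection visits. It is an illustration under strong simplifying assumptions (a spatially constant intensity, perfect detection, made-up data), and the function and variable names are hypothetical rather than taken from the project library.

```python
import math

def poisson_logpmf(k, mu):
    """Log probability of observing count k under a Poisson with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def joint_loglik(intensity, count_records, visit_records):
    """Log-likelihood of one shared intensity (birds per km^2).

    count_records: (count, area_km2) pairs from transect segments.
    visit_records: (detected, area_km2) pairs from presence/absence visits.
    Assumes perfect detection; a real analysis would add an observation model.
    """
    ll = 0.0
    for count, area in count_records:
        ll += poisson_logpmf(count, intensity * area)
    for detected, area in visit_records:
        # Under a Poisson intensity, P(at least one bird) = 1 - exp(-intensity * area).
        p_present = 1.0 - math.exp(-intensity * area)
        ll += math.log(p_present if detected else 1.0 - p_present)
    return ll

# Hypothetical data: both data types inform the same latent intensity.
counts = [(3, 1.0), (5, 2.0), (0, 0.5)]
visits = [(True, 1.0), (False, 0.2), (True, 0.5)]

# Crude grid search over candidate intensities, for illustration only.
grid = [0.1 * i for i in range(1, 101)]
best = max(grid, key=lambda lam: joint_loglik(lam, counts, visits))
```

The presence/absence records enter through the probability of at least one bird being present, so they constrain the same abundance-scale parameter as the counts instead of requiring a separate occupancy model.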

#### Avoid inflated error structures until the end of modelling

Zero-inflated and over-dispersed data are the norm in spatial ecology. Often, this leads to hurdle analyses (e.g. modelling spatial occupancy first and conditional abundance second) or use of over-dispersed likelihood models (such as the negative binomial). However, the decision of whether this is an issue with a particular data set should not be taken a priori. Modelling with covariates will generally explain some of that variability, and the use of spatially and temporally autocorrelated errors will account for unexplained hot- and cold-spots in distribution.
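The point can be illustrated with a Pearson dispersion statistic computed before and after a covariate is included. The sketch below uses made-up counts and a hypothetical two-level habitat covariate; for a Poisson model with a single categorical covariate, the fitted means are simply the group means, so no model-fitting library is needed. Values well above 1 suggest over-dispersion relative to the Poisson.

```python
def pearson_dispersion(counts, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom."""
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted))
    return chi2 / (len(counts) - n_params)

# Hypothetical segment counts and a hypothetical habitat covariate.
counts = [0, 1, 0, 2, 1, 0, 9, 12, 8, 11, 10, 7]
habitat = ["offshore"] * 6 + ["front"] * 6

# Intercept-only Poisson model: every segment gets the overall mean.
overall_mean = sum(counts) / len(counts)
null_fit = [overall_mean] * len(counts)

# Poisson model with habitat as a factor: fitted values are the group means.
group_means = {}
for h in set(habitat):
    values = [y for y, g in zip(counts, habitat) if g == h]
    group_means[h] = sum(values) / len(values)
cov_fit = [group_means[h] for h in habitat]

disp_null = pearson_dispersion(counts, null_fit, n_params=1)
disp_cov = pearson_dispersion(counts, cov_fit, n_params=2)
```

In this toy example the apparent over-dispersion of the raw counts largely disappears once the covariate is included, which is why the choice of an inflated error structure is better deferred until the covariates and autocorrelation terms are in place.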

#### Partly missing covariates should not necessarily lead to data censoring

When parts of a spatial covariate layer are missing, the tendency is to curtail the data set, either by removing the covariate or by reducing the number of points to a subset for which complete covariate data exist. This may prove necessary in the end. However, it may be worth first attempting to reconstruct the covariate, either as a separate interpolation step or as part of an integrated analysis with partially missing data (Section 5.2).

### 6.2 Treatment of survey design attributes and observation errors

#### Use distance sampling

Distance sampling techniques have a long pedigree in ecological surveys and facilitate the pooling of surveys with different protocols by reducing them to a common (if numerically different) set of detectability characteristics (Section 3.2). The extensions of distance sampling that deal with transect design and the incorporation of covariates facilitate the correction of errors intrinsic to the observation process (Section 5.1).
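To make the notion of a common set of detectability characteristics concrete, the following sketch computes a line-transect density estimate under a half-normal detection function. It assumes the scale parameter `sigma` is already known (in practice it is estimated from the recorded perpendicular distances), and all numbers are hypothetical.

```python
import math

def half_normal_detection(distance, sigma):
    """Probability of detecting a bird at a given perpendicular distance."""
    return math.exp(-distance ** 2 / (2.0 * sigma ** 2))

def effective_strip_half_width(sigma, truncation):
    """Integral of the half-normal detection function from 0 to the truncation distance."""
    return sigma * math.sqrt(math.pi / 2.0) * math.erf(truncation / (sigma * math.sqrt(2.0)))

def density_estimate(n_detected, transect_length, sigma, truncation):
    """Birds per unit area, correcting the count for imperfect detection."""
    esw = effective_strip_half_width(sigma, truncation)
    return n_detected / (2.0 * transect_length * esw)

# Two hypothetical surveys with different protocols (distances in km).
boat_density = density_estimate(n_detected=40, transect_length=25.0, sigma=0.12, truncation=0.3)
aerial_density = density_estimate(n_detected=90, transect_length=60.0, sigma=0.20, truncation=0.5)
```

Each survey keeps its own detection parameters, but both estimates end up on the same scale (birds per km²), which is what allows them to be pooled.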

#### Prioritise cross-calibration between surveys

Different surveys may rank differently in terms of their detectability (accuracy/precision) and spatiotemporal span. These qualities must often be traded off at the design stage. However, the joint analysis of multiple surveys allows the combination of high detectability and high span (Section 5.1). Surveys for which the detectability errors have been quantified (e.g. by multiple observer platforms) should be prized highly in this process, because they can be used within a joint analysis to cross-calibrate other, less detailed surveys that may have happened close in space and time. Such calibration may be shared hierarchically by all the surveys in the data, stepping-stone-fashion, depending on their proximity to each other.
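The simplest possible version of cross-calibration can be sketched as follows. It assumes two surveys sampled the same underlying density (i.e. they were close enough in space and time), that counts are Poisson, and that the detection probability of the reference survey is known. Under those assumptions, the joint maximum-likelihood estimates have the closed form used below. The function name and the numbers are hypothetical, and a real joint analysis would also carry the uncertainty in each quantity.

```python
def calibrate_detection(n_ref, area_ref, p_ref, n_other, area_other):
    """Estimate shared density and the unknown detection probability of a second survey.

    n_ref, area_ref, p_ref: count, area covered and known detection probability
        of the reference (e.g. double-observer) survey.
    n_other, area_other: count and area covered by the uncalibrated survey.
    """
    density = n_ref / (area_ref * p_ref)          # birds per unit area
    p_other = n_other / (density * area_other)    # implied detection probability
    return density, p_other

# Hypothetical overlap: reference survey with 80% detection, second survey unknown.
density, p_other = calibrate_detection(n_ref=80, area_ref=50.0, p_ref=0.8,
                                       n_other=120, area_other=100.0)
```

Once calibrated in this way, the second survey can in turn inform a third survey that overlaps with it but not with the reference, which is the stepping-stone behaviour described above.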

#### Consider state-space approaches

State-space approaches acknowledge both the dynamic nature of marine distribution data (Section 5.4) and the importance of modelling complex observation processes explicitly (Sections 5.1 & 5.2). In this way, rather than "correcting" the observations for biases prior to the formal analysis, a statistical observation model is included in the model likelihood to effect the necessary correction in an integrated way (i.e. together with parameter estimation). This has the advantages that the biological and observation models are tuned with regard to each other, and that uncertainty propagation from the observation model to the final predictions happens automatically. Although we have not reviewed this option extensively in this report, it will be worth considering as available software becomes optimised and the computation times of multi-survey models decline.

### 6.3 Treatment of space and time

#### Use point process models

Point process models allow space-time to be modelled jointly and continuously. They can also subsume all other valid approaches to species distribution modelling (Section 3.1). Finally, they are compatible with other features of modelling developed to enhance predictive power (Section 5.5). Heterogeneous point process approaches are fast becoming the gold standard for spatiotemporal analyses, and their implementation in speed-optimised frameworks such as INLA has attracted a lot of interest from management practitioners.
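For readers unfamiliar with the approach, the log-likelihood of an inhomogeneous Poisson point process is the sum of the log-intensity at the observed locations minus the integral of the intensity over the surveyed region. The sketch below evaluates it along a one-dimensional transect using midpoint quadrature. It is a toy illustration with hypothetical sightings, not the INLA implementation.

```python
import math

def ipp_loglik(points, log_intensity, lower, upper, n_quad=200):
    """Log-likelihood of a Poisson point process on the interval [lower, upper].

    points: observed locations; log_intensity: function returning log(lambda(x)).
    The integral of the intensity is approximated by midpoint quadrature.
    """
    width = (upper - lower) / n_quad
    integral = sum(
        math.exp(log_intensity(lower + (i + 0.5) * width)) * width
        for i in range(n_quad)
    )
    return sum(log_intensity(x) for x in points) - integral

def flat(log_rate):
    """Constant (homogeneous) log-intensity, used here as the simplest case."""
    return lambda x: log_rate

# Hypothetical sightings (km along a 10 km transect).
sightings_km = [1.0, 2.5, 2.7, 6.0, 9.1]
transect_km = 10.0

# For a constant intensity the maximum-likelihood rate is n / length.
mle_log_rate = math.log(len(sightings_km) / transect_km)
ll_mle = ipp_loglik(sightings_km, flat(mle_log_rate), 0.0, transect_km)
```

Replacing `flat` with a function of covariates and location gives the heterogeneous case, and the same likelihood applies in two spatial dimensions plus time with a suitable quadrature scheme.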

#### Use autocorrelated structures

Spatially and temporally autocorrelated structures can perform a multiplicity of tasks. They can account for missing covariates (hence explaining residual over-dispersion; Section 5.4). They can also be used to impute gaps in covariate values (Section 5.2). Most importantly, however, they can be used to inform models of pooled survey data about the spatiotemporal proximity between abundance observations. In this way, even if exact replication is not part of the survey design, an indirect form of replication can be achieved. There are caveats associated with the implementation and interpretation of autocorrelated structures, and their use is far from automatic (Section 5.4). However, the rewards, particularly for multi-survey data sets, are very high.
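One simple way to encode spatiotemporal proximity is a separable covariance function that decays with both distance and time lag. The sketch below uses exponential decay in each dimension; this particular form and the range parameters are assumptions chosen for illustration (other families, such as the Matérn, are common in practice).

```python
import math

def space_time_covariance(dist_km, lag_days, sigma2=1.0, range_km=20.0, range_days=10.0):
    """Separable exponential covariance between two observations.

    dist_km: spatial separation; lag_days: temporal separation.
    sigma2, range_km and range_days are hypothetical values, normally estimated.
    """
    spatial = math.exp(-dist_km / range_km)
    temporal = math.exp(-abs(lag_days) / range_days)
    return sigma2 * spatial * temporal

# Two observations from different surveys, close in space and time...
near = space_time_covariance(dist_km=2.0, lag_days=1.0)
# ...and two that are far apart in both.
far = space_time_covariance(dist_km=60.0, lag_days=30.0)
```

Observations with high covariance act as near-replicates of each other, which is how a model can borrow strength across surveys that never sampled exactly the same place at the same time.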

#### Take dynamics into account

The pseudo-equilibrium assumption for SDMs is difficult to justify in applications that require more than spatial interpolation in the time-frame of data collection (Section 5.5). For example, if we need to account for multi-survey data that include before-and-after control impact, it is important to account for temporal non-stationarity. In some cases, non-linearity in the habitat responses of a species can be captured by simple extensions such as statistical interaction terms in the linear predictors of models. In other cases, a more explicitly biological model may be required. Temporal autocorrelation structures (see above) are also helpful in this respect.

### 6.4 Accessibility and density dependence

#### Use realistic distance measures

For colonial species, accessibility and density dependence in spatial usage are most often represented as non-linear transformations of the distance of points at sea from colony locations. Therefore, using appropriate distance measures is essential if birds do not transit between locations in straight lines. Depending on the species, if we are concerned that they avoid flying over land, or if due to glide-flight they rely on prevalent wind direction, it is important that we account for these effects in the measure of distance. This is particularly relevant for behaviours such as avoidance of anthropogenic structures, where birds need to circumnavigate. The distribution of usage may alter in the vicinity of a structure, but an SDM will be unable to capture the changes without an appropriate measure of distance.
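A minimal sketch of an at-sea distance is given below: a shortest-path (Dijkstra) search over a raster in which land cells are impassable. The grid, the cell coding and the function name are hypothetical. The 8-neighbour moves are a simplification (they overestimate some distances and allow diagonal steps between land cells), and wind effects would require direction-dependent step costs.

```python
import heapq
import math

def at_sea_distances(grid, start):
    """Shortest over-water distance (in cell units) from start to every reachable sea cell.

    grid: list of equal-length strings, '.' for sea and '#' for land.
    start: (row, col) of the sea cell adjacent to the colony.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]
    dist = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        d, (r, c) = heapq.heappop(queue)
        if d > dist[(r, c)]:
            continue  # stale queue entry, a shorter route was already found
        for dr, dc in steps:
            nr, nc = r + dr, c + dc
            if not (0 <= nr < n_rows and 0 <= nc < n_cols):
                continue
            if grid[nr][nc] == "#":
                continue  # birds are assumed not to cross land
            nd = d + math.hypot(dr, dc)
            if nd < dist.get((nr, nc), math.inf):
                dist[(nr, nc)] = nd
                heapq.heappush(queue, (nd, (nr, nc)))
    return dist

# Hypothetical map: a peninsula separates the colony (top-left) from the cell at (0, 2).
sea_map = [
    ".#...",
    ".#...",
    ".#...",
    ".....",
]
distances = at_sea_distances(sea_map, start=(0, 0))
straight_line = math.hypot(0 - 0, 2 - 0)
around_land = distances[(0, 2)]
```

Here the straight-line distance to the target cell is 2 cell units, whereas the route around the peninsula is more than three times longer, so a distance-decay term based on Euclidean distance would badly overstate accessibility.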

#### In the present, use abstracted models for density dependence

We consider that, currently, the computational demands of a fully spatially explicit model of intra-colony, inter-colony and interspecific competition are prohibitive for the purposes of applied SDMs. We have therefore provided an illustration (in the project vignette), of how a pragmatic model for these processes can be developed and incorporated into the linear predictor of an SDM. We recognise that such models are crude approximations of the truth, but even such relatively simple formulations are currently missing from most seabird SDM approaches.

#### In the future, consider spatially explicit models for density dependence

As computational approaches (particularly in the area of Approximate Bayesian Computation and Integrated Nested Laplace Approximation) become more widespread in the field of SDMs, it may become possible to model competition in a fully spatially explicit way. For example, INLA is already capable of modelling multiple, coupled response variables. This would allow the spatial interactions of different colonies to be captured as part of simultaneous regression where the distribution of animals from any given colony is allowed to affect and be affected by the distributions of members of other colonies and species.

### 6.5 Inferential Platforms

#### Use hierarchical models

Three important features of multi-survey models described above rely on hierarchical models: cross-calibration of observation models, covariate imputation and latency, and the use of spatio-temporal proximity to allow the predictions to borrow strength from multiple surveys.

#### Use Bayesian approaches

Computer-intensive Bayesian model-fitting deserves attention because it is implemented in flexible software frameworks (such as JAGS or Stan), that allow state-space and hierarchical structures. More importantly, Bayesian inference permits the elicitation of expert opinion in the form of parameter priors. The expert knowledge on the attributes of field survey practices will prove invaluable at this stage for specifying parameter priors for the observation models.
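As an illustration of how expert opinion might be turned into a prior for an observation model, the sketch below converts an elicited mean and standard deviation for a detection probability into a Beta prior by moment matching, and then updates it with hypothetical double-observer trial data using the standard Beta-binomial conjugate result. The elicited values and trial counts are invented for the example.

```python
def beta_from_mean_sd(mean, sd):
    """Beta(alpha, beta) parameters matching an elicited mean and standard deviation."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if sd ** 2 >= mean * (1.0 - mean):
        raise ValueError("sd is too large for a Beta distribution with this mean")
    concentration = mean * (1.0 - mean) / sd ** 2 - 1.0
    return mean * concentration, (1.0 - mean) * concentration

def update_beta(alpha, beta, n_detected, n_trials):
    """Posterior Beta parameters after observing n_detected successes in n_trials."""
    return alpha + n_detected, beta + (n_trials - n_detected)

# Hypothetical elicitation: experts expect about 80% detection, give or take 10%.
prior_alpha, prior_beta = beta_from_mean_sd(mean=0.8, sd=0.1)

# Hypothetical calibration trial: 45 of 50 known birds were detected.
post_alpha, post_beta = update_beta(prior_alpha, prior_beta, n_detected=45, n_trials=50)
posterior_mean = post_alpha / (post_alpha + post_beta)
```

In JAGS or Stan the same prior would simply be declared on the detection parameter, with the updating handled by the sampler alongside the rest of the model.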

#### Use data integration

Although approaches for using multiple data sources could take the form of a comparison (e.g. so that predictions derived from an expansive data set are validated by use of a localised, high-resolution set of data), this is a relatively weak approach that does not make best use of the combined data. The alternative is joint inference, whereby both data sets are analysed simultaneously to extract maximum power. Joint approaches are also particularly useful for extending the analyses to non-survey data (Section 7).

#### Fully propagate uncertainty to the final predictions

We have outlined methods that can usefully reconstruct or account for biases, imprecisions and autocorrelations in explanatory and response data, as well as coarse, misaligned, partly or wholly missing covariates. Such methods can go a considerable way towards 1) correcting predictions, 2) realistically representing inherent uncertainties and 3) increasing the spatial extent of model fitting regions, by allowing more of the data to be retained in (i.e. not censored out of) the analysis. However, there is always a limit to how much missing information can be statistically imputed and therefore some prudence may be needed in determining which variables to include in the analysis. This is best illustrated in the case of SDM forecasts that are based on dynamic environmental variables. It may be biologically known that a particular environmental variable is shaping the distribution of a species, but if that variable is not available for future predictions, then it will either need to be excluded from the original analysis, or its effect integrated out of the final predictions.

### 6.6 Computational platforms

#### Support open source

As a matter of process, all code developed with government funding should be made available to the scientific community. We have used R (R Core Team 2019) to develop the demo libraries for this project. It is a good idea to keep all functions within a single environment and push for the standardisation and quality control of these libraries.

#### Ensure strong interface with Geographic Information Systems

Much of the effort in preparing for modelling goes into interfacing the analysis framework with the raw data. This would be greatly assisted by establishing stable protocols for data formatting, and by using the GIS functionality in R to keep all data processing on a single platform.

#### Parameterise non-linear model components with exact methods

We have used a flexible MCMC approach to implement autocorrelation structures and non-linear features of biology within statistical models for inference. The prototype models presented in the jointSurvey library are computationally greedy, but they have the best chance of retrieving the difficult parameters pertaining to density dependence and competition. The JAGS environment used here interfaces seamlessly with R, and therefore keeps model usage (if not model development) to the same platform.

#### Implement large scale predictions using fast approximate methods

To generate large scale predictions with the spatio-temporal autocorrelation features stipulated above, it will be imperative to move towards efficient methods, such as INLA. These can be deployed from within R and are therefore the next logical step for real-world applications. In addition, exchange of information between the JAGS models already enclosed in the jointSurvey library and the more efficient INLA models used for large-scale predictions would be both necessary and efficient under the proposed scheme.

### Contact

Email: ScotMER@gov.scot
