Using GitHub Actions continuous integration to automate quality assurance and control of data on ecological dynamics

Winter in the ForestGEO plot at SCBI.

Abstract

In an era of rapid environmental change, accurate data on ecological dynamics are essential to understanding the resistance and resilience of ecological systems, and the services they provide, to multiple global change drivers. Field data collection errors are common, and researchers often struggle to keep up with data checking until months or even years after data have been collected, at which point many errors can no longer be corrected. The lag between data collection and analysis can also result in slow detection of anomalous dynamics. Needed is a system in which data quality assurance and control (QA/QC), along with basic data summaries, can be automatically conducted immediately following data collection. Here, we implement and test a cyberinfrastructure system to accomplish this. We used GitHub Actions continuous integration (CI) to automate 1) data QA/QC, in particular error checking and naive anomaly detection, and 2) the running of routine scripts for data wrangling to produce cleaned data sets ready for analysis. We implemented and tested this system on two annual tree mortality censuses and a dendrometer band survey at two ForestGEO large forest dynamics plots: Smithsonian Conservation Biology Institute and Harvard Forest. This automation had numerous benefits. First, it produced a dashboard providing near real-time information on data collection status and errors requiring correction, resulting in final data sets free of detectable errors. Second, it produced an apparent learning effect among field technicians: original error rates in field data collection declined significantly following implementation of the system. By implementing CI schemes, researchers can ensure that data sets are free of any errors for which a test can be coded. The result is dramatically improved data quality, increased skill among technicians, and reduced need for expert oversight.
Furthermore, CI implementation can reveal anomalous ecological trends as they occur, allowing researchers to adjust sampling to capture unexpected dynamics. Thus, by reducing the time between data collection and analysis, CI stands to accelerate the pace of ecological field research.
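
To make the idea of "errors for which a test can be coded" concrete, the following is a minimal illustrative sketch in Python of the kind of check a CI job could run on each new data upload. It is not the authors' implementation. The `Record` fields, the status codes, the `check_records` function, and the 50 mm diameter-change threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical status codes; a real census protocol defines its own.
VALID_STATUSES = {"alive", "dead"}


@dataclass
class Record:
    """One tree observation in a census (hypothetical schema)."""
    tag: str
    status: Optional[str]
    dbh_mm: Optional[float]  # diameter at breast height, in millimetres


def check_records(
    current: List[Record],
    previous: Dict[str, Record],
    max_change_mm: float = 50.0,  # assumed threshold for a naive anomaly flag
) -> List[Tuple[str, str]]:
    """Return (tag, message) pairs for every problem found in `current`."""
    issues: List[Tuple[str, str]] = []
    seen = set()
    for rec in current:
        # Error check: the same tag entered twice in one census.
        if rec.tag in seen:
            issues.append((rec.tag, "duplicate tag"))
        seen.add(rec.tag)

        # Error check: status missing or not an allowed code.
        if rec.status not in VALID_STATUSES:
            issues.append((rec.tag, "missing or invalid status"))
            continue

        prev = previous.get(rec.tag)
        if prev is None:
            continue  # new tree, nothing to compare against

        # Error check: a tree cannot come back to life between censuses.
        if prev.status == "dead" and rec.status == "alive":
            issues.append((rec.tag, "recorded alive after being recorded dead"))

        # Naive anomaly detection: flag implausibly large diameter changes.
        if (
            rec.dbh_mm is not None
            and prev.dbh_mm is not None
            and abs(rec.dbh_mm - prev.dbh_mm) > max_change_mm
        ):
            issues.append((rec.tag, "implausible diameter change"))
    return issues


# Small made-up data set used only to exercise the checks.
previous_census = {
    "T1": Record("T1", "alive", 120.0),
    "T2": Record("T2", "dead", 300.0),
}
current_census = [
    Record("T1", "alive", 125.0),
    Record("T2", "alive", 301.0),
    Record("T3", None, 80.0),
]

found = check_records(current_census, previous_census)
for tag, message in found:
    print(f"{tag}: {message}")

# A CI step could use a nonzero exit code to mark the run as failed.
exit_code = 1 if found else 0
```

In a CI setting, a script like this would run automatically whenever new field data are pushed, and the list of issues would feed a report for technicians to correct while they are still in the field.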

Publication
Methods in Ecology and Evolution
Ian R. McGregor
Postdoctoral Associate at the Cary Institute