Big Data in the Geosciences: 4

We are inundated with environmental data: Earth-observing satellites stream terabytes of data back to us daily; ground-based sensor networks track weather, water quality and air pollution, taking readings every few minutes; and community scientists log hundreds or even thousands of observations every day, recording everything from bird sightings to road closures and accidents. But this very richness of data has created a new set of problems.
This last post in our four-part series briefly summarizes the data skills that geoscientists will need to work effectively in a data-rich, connected, open-source world. It is loosely based on the town halls and open-source sessions, as well as the more formal Earth and Space Science Informatics sessions, at the AGU Fall Meeting in December 2016.

Data Skills
Twenty-first century science is marked by the availability of huge environmental datasets, unprecedented access to computing power, and an urgent need to understand – and mitigate – the increasing impact of human society on the environment. What skills do geoscientists need to face the challenges and opportunities of 21st century science? We describe three areas that have the potential to leverage today’s data and computing power to meet our current environmental challenges.

Machine learning
Machine learning (ML), a class of computationally intensive algorithms designed to mine big data, shows great promise in helping scientists extract patterns and meaning from big environmental datasets. The roots of ML lie in game development and expert systems. To date, ML has been most widely used to mine consumer and social network data. However, scientists are already beginning to both adapt and develop machine learning techniques for scientific work (for example, see our previous post). Although ML has a steep learning curve and presents some challenges unique to the analysis of scientific data (error estimation and propagation, visualization, and scientific insight), it represents a powerful new toolbox for big, messy scientific datasets.

Open source tools
The Linux operating system popularized a new paradigm for software development, namely open source development. Open source is marked by a collaborative development process in which the resulting programs or technologies are made freely available to the world. Open source development is a natural fit with scientific computing, and its adoption within the scientific computing community has led to a plethora of open source tools to aid the geoscientist, from ParaView to QGIS to RStudio.

Open source databases (for example, MySQL, PostGIS, and Neo4j) enable scientists to store, retrieve, and manipulate data in a variety of native structures (i.e., tabular for MySQL, spatial for PostGIS, and graph for Neo4j). Open source dynamic visualization tools such as D3 and Axiis create charts and graphics, while Leaflet provides interactive web mapping.
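To make "store, retrieve, and manipulate" concrete for the tabular case, here is a minimal sketch in Python. It uses SQLite (bundled with Python's standard library) purely as a stand-in for a server database such as MySQL, so nothing needs to be installed. The table name, column names, station labels, and readings are all hypothetical and chosen only for illustration.

```python
import sqlite3


def mean_reading_by_station(rows):
    """Store (station, parameter, value) rows in a table, then return
    the mean value per station as a list of (station, mean) tuples."""
    # An in-memory database keeps the sketch self-contained; a real
    # project would connect to a persistent database instead.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE readings (station TEXT, parameter TEXT, value REAL)"
        )
        # Store: parameterized inserts avoid hand-built SQL strings.
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
        # Retrieve and manipulate: the database does the aggregation.
        cursor = conn.execute(
            "SELECT station, AVG(value) FROM readings "
            "GROUP BY station ORDER BY station"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical sensor readings, invented for this example.
sample_rows = [
    ("A", "pm25", 10.0),
    ("A", "pm25", 14.0),
    ("B", "pm25", 7.0),
]

print(mean_reading_by_station(sample_rows))  # [('A', 12.0), ('B', 7.0)]
```

The same pattern (create a table, insert rows, query with SQL) carries over to MySQL or PostGIS, although the connection code and some SQL details differ by database.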

Leveraging open source tools poses some challenges. Finding the right tool can be difficult, as documentation for emerging tools is often sparse. Debugging can be a challenge as well, although forums are usually a good place to get answers. Leveraging open source tools therefore requires some insight into, or awareness of, the software development process, and a willingness to learn in informal ways such as MOOCs, hackathons, online tutorials, blogs, and forums. On the plus side, open source development allows engaged scientists to influence the development process. Developing the mindset and skills to use open source tools will allow scientists to leverage a huge, and rapidly expanding, range of tools at just the cost of mental sweat equity.

Participatory science
Science in the times of Newton (1642–1726) and Darwin (1809–1882) was an amateur pursuit, but in the last century or so it has become entrenched as a professional endeavor, generally limited to academics with appropriate training and credentials. However, Chandra Clarke points out that, contrary to popular belief, there remains a strong amateur interest in science in the populace at large. This can be seen in the success of participatory efforts such as the Christmas Bird Count (started by the Audubon Society in 1900) and the Cooperative Observer Program for recording meteorological data (started by the National Weather Service in 1890). The longevity of these and similar programs speaks both to the general interest in science and to the benefit that engaged and enthusiastic amateurs provide to scientists.

Today, due to a confluence of reasons, including the need for local-scale data, the human power of pattern recognition (not yet matched by machine learning), and a drop in scientific funding, scientists are increasingly turning to interested amateurs to help with the scientific observations in projects variously called citizen science, community science, crowd-sourced science or participatory science (see, for example, Zooniverse, Smithsonian Citizen Science, or Wikipedia).

Key challenges for participatory science projects include recruiting (and retaining) interested volunteers, training amateurs in protocols, and creating processes that promote adherence to protocol and have built-in error checks. Common strategies include reviews, pooling multiple independent observations, and, more innovatively, adaptive learning and evaluation.
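As an illustration of pooling multiple independent observations, the sketch below takes the labels that several volunteers assigned to the same item and accepts the most common label only if enough of them agree. This is one simple approach (majority vote with an agreement threshold), not a description of how any particular project works. The function name, the 0.6 threshold, and the species labels are assumptions made for the example.

```python
from collections import Counter


def pool_observations(labels, min_agreement=0.6):
    """Pool independent labels for one item by majority vote.

    Returns (consensus_label, agreement), where agreement is the
    fraction of observers who chose the most common label. If that
    fraction is below min_agreement, consensus_label is None, which
    flags the item for expert review.
    """
    if not labels:
        return None, 0.0
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(labels)
    if agreement < min_agreement:
        return None, agreement
    return top_label, agreement


# Hypothetical labels from four volunteers for the same photograph.
print(pool_observations(["robin", "robin", "sparrow", "robin"]))  # ('robin', 0.75)

# A split vote falls below the threshold and is flagged for review.
print(pool_observations(["robin", "sparrow"]))  # (None, 0.5)
```

Returning the agreement fraction alongside the label gives a simple built-in error check: items with low agreement can be routed to a reviewer instead of entering the dataset unexamined.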