Big Data in the Geosciences: 1

We are inundated with environmental data – Earth-observing satellites stream terabytes of data back to us daily; ground-based sensor networks track weather, water quality, and air pollution, taking readings every few minutes; and community scientists log hundreds and thousands of observations every day, recording everything from bird sightings to road closures and accidents. But this very richness of data has created a new set of problems.
This four-part post gives a high-level view of some of the challenges of big data in the geosciences – and how they might be solved – loosely based on the Earth and Space Science Informatics sessions and town halls at the AGU Fall Meeting in December 2016.

Data discovery
With so much environmental data, looking for a specific dataset for a research project can sometimes feel like looking for a needle in a haystack. How can data discovery, that is, finding the right dataset or sharing one’s own dataset with the larger research community, be made more efficient?

One way scientists are trying to make data FAIR (that is, findable, accessible, interoperable, reusable) is by developing trusted data repositories and data publication guidelines. Organizations working on guidelines for efficient data discovery include the Coalition for Publishing Data in the Earth and Space Sciences (COPDESS), the Federation of Earth Science Information Partners (ESIP), and FORCE11 (The Future of Research Communications and e-Scholarship). Data repositories can keep track of the metadata associated with a dataset, as well as the algorithms and processing that went into creating derived datasets. Because each dataset within a data repository has a unique identifier (just like a DOI for scientific papers), citing and tracking datasets becomes much simpler.
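To make the repository idea concrete, here is a minimal sketch of how a repository might store a unique identifier, metadata, and processing history for each dataset. It is an illustration only: the `DatasetRecord` and `Repository` classes, the field names, and the `example:` identifiers are all made up for this post and do not correspond to any real repository's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRecord:
    """One entry in a hypothetical data repository."""
    identifier: str                      # persistent ID, playing the role a DOI plays for papers
    title: str
    metadata: dict = field(default_factory=dict)
    derived_from: tuple = ()             # identifiers of the source datasets
    processing: str = ""                 # short note on how the dataset was produced


class Repository:
    """Toy in-memory repository that enforces unique identifiers."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        if record.identifier in self._records:
            raise ValueError(f"identifier already in use: {record.identifier}")
        self._records[record.identifier] = record

    def lineage(self, identifier):
        """Return the identifiers of all ancestor datasets, nearest first."""
        ancestors = []
        pending = list(self._records[identifier].derived_from)
        while pending:
            parent = pending.pop(0)
            if parent in ancestors:
                continue
            ancestors.append(parent)
            if parent in self._records:
                pending.extend(self._records[parent].derived_from)
        return ancestors


repo = Repository()
repo.register(DatasetRecord("example:raw", "Raw sensor readings",
                            metadata={"variable": "air temperature"}))
repo.register(DatasetRecord("example:calibrated", "Calibrated readings",
                            derived_from=("example:raw",),
                            processing="bias correction"))
repo.register(DatasetRecord("example:monthly-mean", "Monthly means",
                            derived_from=("example:calibrated",),
                            processing="monthly averaging"))

print(repo.lineage("example:monthly-mean"))
```

Because every derived dataset points back to its sources by identifier, anyone citing the monthly means can trace them to the raw readings they came from.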

Another approach being explored is returning relevant environmental datasets based on semantic networks. A search based on a semantic network can return not just the dataset requested by the user, but also other datasets that are in some way meaningfully or conceptually related to that dataset. An example of a search result from a semantic network is the Knowledge Graph returned by a Google Search on a well-known person (or entity), say, Rachel Carson. The Knowledge Graph (the right panel in the figure below) includes photographs, basic biographical information, quotes, and books, as well as links to people connected to the person.
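A toy version of a semantic-network search might look like the sketch below. It assumes the network is a small graph whose nodes are datasets or concepts and whose edges are labeled relationships; a search then walks outward from the requested dataset and collects other datasets within a few hops. The dataset names, relationship labels, and the `related_datasets` function are invented for illustration and say nothing about how any production system is built.

```python
from collections import deque

# Hypothetical semantic network: (source, relationship, target) triples.
EDGES = [
    ("sea_surface_temperature", "measures", "ocean_temperature"),
    ("argo_float_profiles", "measures", "ocean_temperature"),
    ("ocean_temperature", "influences", "coral_bleaching"),
    ("coral_bleaching_surveys", "observes", "coral_bleaching"),
    ("urban_air_quality", "measures", "air_pollution"),
]

# Nodes that are datasets; every other node is a concept.
DATASETS = {
    "sea_surface_temperature",
    "argo_float_profiles",
    "coral_bleaching_surveys",
    "urban_air_quality",
}


def build_neighbors(edges):
    """Treat relationships as undirected links for the purpose of search."""
    neighbors = {}
    for source, _label, target in edges:
        neighbors.setdefault(source, set()).add(target)
        neighbors.setdefault(target, set()).add(source)
    return neighbors


def related_datasets(query, edges=EDGES, datasets=DATASETS, max_hops=2):
    """Breadth-first search returning (dataset, hops) pairs near the query."""
    neighbors = build_neighbors(edges)
    seen = {query}
    queue = deque([(query, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in sorted(neighbors.get(node, ())):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt in datasets:
                found.append((nxt, hops + 1))
            queue.append((nxt, hops + 1))
    return found


print(related_datasets("sea_surface_temperature", max_hops=3))
```

In this sketch a search for sea surface temperature also surfaces the float profiles (which share the ocean temperature concept) and, one hop further, the coral bleaching surveys, while the unrelated air quality dataset is never reached.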

NASA is currently working to develop such knowledge networks for its vast collection of Earth Sciences and Planetary Sciences data.