Big Data in the Geosciences: 3

We are inundated with environmental data – Earth-observing satellites stream terabytes of data back to us daily; ground-based sensor networks track weather, water quality, and air pollution, taking readings every few minutes; and community scientists log hundreds and thousands of observations every day, recording everything from bird sightings to road closures and accidents. But this very richness of data has created a new set of problems.
This third post in our four-part series gives a brief summary of how deep learning is being used in the geosciences today – loosely based on the Earth and Space Science Informatics sessions and town halls at the AGU fall meeting in Dec 2016.

Deep learning
Artificial neural networks (ANNs) are already widely used in domains ranging from stock price prediction to image recognition, and from genetic sequencing to targeted marketing. Deep learning neural networks – that is, networks with multiple layers of neurons between the input and output neurons – are also beginning to be used in the geosciences to address a range of problems.
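To make "multiple layers between input and output" concrete, here is a minimal sketch of a forward pass through a small fully connected network in base R. Everything in it is an illustrative assumption: the layer sizes, the random weights, the ReLU activation and the function names are not taken from any of the AGU presentations, and the network is untrained.

```r
# Forward pass through a small fully connected network (illustrative only).
relu <- function(x) pmax(x, 0)

forward <- function(x, weights, biases) {
  a <- x
  n_layers <- length(weights)
  for (i in seq_len(n_layers)) {
    z <- weights[[i]] %*% a + biases[[i]]
    # Hidden layers use ReLU; the output layer is left linear.
    a <- if (i < n_layers) relu(z) else z
  }
  drop(a)
}

set.seed(1)
# Hypothetical architecture: 3 inputs, two hidden layers of 4 neurons, 1 output.
sizes <- c(3, 4, 4, 1)
weights <- lapply(seq_len(length(sizes) - 1), function(i) {
  matrix(rnorm(sizes[i + 1] * sizes[i]), nrow = sizes[i + 1])
})
biases <- lapply(sizes[-1], function(k) rep(0, k))

# One made-up input vector pushed through the untrained network.
y <- forward(c(0.5, -1.2, 2.0), weights, biases)
```

The two hidden layers are what make this network "deep" in the sense described above. Training the weights is a separate step that this sketch leaves out.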


Cloudiness Trends for the 2017 Solar Eclipse

Planning an excursion to see the upcoming solar eclipse? NASA can help with that! They provide two sets of data which can point you to good viewing:

A little bit of R scripting lets us combine these and put them onto a Leaflet map of the US.

Downloads

Before coding, there are two things to download:

Displaying weather data from the Global Historical Climatology Network (GHCN)

In our previous blog posts, we downloaded and analyzed the GHCN weather data. That leads us to the next step: displaying the data!

Our goal is to display the climate change data so that both the regional trends and the local weather history underpinning those trends are clearly visible. To do this, we decided on the following graphics components:

  • displaying data on a map, both in the form of point locations (the weather stations) and bitmap overlays (regional weather trends). The Leaflet library of JavaScript routines handles this part.
  • displaying graphs of the trends in local weather data. The D3.js library handles this.
  • and displaying the data history for a given weather station as a heatmap. We draw directly to

R: Analyzing weather data from the Global Historical Climatology Network (GHCN)

1. Introduction
The R code below can be used to extract some weather metrics such as maximum daily temperature, minimum daily temperature, average or total daily rainfall and other annual metrics from the GHCN weather data set. This code assumes that you have already created a dataframe of the GHCN stations of interest to you. For example, the set of GHCN stations of interest in this exercise consists of the 520 stations within the US that have data for the 80 years from 1936–2015, with less than 2% missing data (see “R: Reading & Filtering GHCN weather data” on how this set was created). This dataframe (stn80 in our case) with the stations of interest should include, at a minimum, the station ID, LAT, LON (the LAT & LON are useful for mapping the metrics).

The GHCN weather data has one data file for each station. Note that the GHCN station data files were converted from a fixed-width format to a comma-separated format. After that conversion, a station data file has the following format:

head(USC00010252)
  X.1  X          ID year month element Val1 Val2 Val3 Val4 Val5 Val6 Val7 Val8 Val9 Val10 Val11 Val12 Val13 Val14 Val15 Val16 Val17 Val18 Val19 Val20 Val21 Val22 Val23 Val24 Val25 Val26 Val27 Val28 Val29 Val30 Val31
1  42 42 USC00010252 1938     1    TMAX   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA   244   256   239   233   222   194   189   233   228   239   239   239   250   250   200    67    67    94   178   233   183
2  43 43 USC00010252 1938     1    TMIN   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    67    94   111   111   117   144   144   167   183   183   139   122   150   139    44   -72   -67   -17   -17    78    56
3  45 45 USC00010252 1938     1    PRCP    0   64   25    0    0   89  191    0    0     0     0    15     0     5     0     0     0    23     0     0     0     0     0   203     0     0     0     0     0     0    41
4  48 48 USC00010252 1938     2    TMAX  172  161  211  233  261  256  256  256  250   261   256   239   256   233   261   233   233   239   233   128   261   261   178   128   117   211   233   206    NA    NA    NA
5  49 49 USC00010252 1938     2    TMIN  -28   33   83   50   78  156   89   94  106    83    94   100    83   122    89   133    78   128    56     6    78   133   111    33     6     0    44    78    NA    NA    NA
6  51 51 USC00010252 1938     2    PRCP    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     3   686     0     0     0   114     0     0     0     0     0    NA    NA    NA

Note: TMAX and TMIN are in tenths of a degree Celsius, so 172 is 17.2 °C.
We will manipulate these station data files in R to create several different metrics and write them to their own output files.
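As a rough illustration of that kind of manipulation, the sketch below computes two annual metrics from a station table laid out like the one above. The toy values, the station ID and the function name annual_metric are made up for this example. It also assumes that PRCP is stored in tenths of a millimetre, which the note above does not state.

```r
# Toy station table in the layout shown above (all values are made up).
day_cols <- paste0("Val", 1:31)
vals <- rbind(
  c(rep(NA, 10), rep(244, 21)),    # TMAX, January
  c(rep(261, 28), NA, NA, NA),     # TMAX, February
  c(0, 64, 25, rep(0, 28)),        # PRCP, January
  c(rep(0, 27), 686, NA, NA, NA)   # PRCP, February
)
colnames(vals) <- day_cols
stn <- data.frame(ID = "TOY00000001", year = 1938, month = c(1, 2, 1, 2),
                  element = c("TMAX", "TMAX", "PRCP", "PRCP"), vals,
                  stringsAsFactors = FALSE)

# Hypothetical helper: apply `fun` to all daily values of one element,
# grouped by year, ignoring missing days, and rescale from tenths.
annual_metric <- function(stn, element, fun, scale = 10) {
  rows <- stn[stn$element == element, ]
  by_year <- sapply(split(rows[day_cols], rows$year),
                    function(d) fun(unlist(d), na.rm = TRUE))
  by_year / scale
}

tmax_annual <- annual_metric(stn, "TMAX", max)  # hottest day per year, in degrees C
prcp_annual <- annual_metric(stn, "PRCP", sum)  # total per year, in mm (assumed units)
```

Each result is a named vector with one entry per year, which could then be written out with write.csv().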

R: Reading & Filtering weather data from the Global Historical Climatology Network (GHCN)

1. Introduction
The GHCN weather data set is a great starting point for exploring trends in global or regional weather patterns. It is a publicly available observational dataset that has daily and monthly summaries of weather variables, including high temperature, low temperature and precipitation. The weather data is available for thousands of stations worldwide, and for many of these stations the weather records stretch back over a century. In this blog post, we describe how to:

  • read this fixed-width dataset into R;
  • use the metadata information to create a subset of weather stations in the US with data from 1936–2015;
  • determine the percentage of missing data for each station;

thus creating a list of weather stations in the US with at least 98% coverage of the weather variables TMAX, TMIN, and PRCP for the 80-year period from 1936 to 2015.
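The steps above can be sketched roughly as follows. This is not the post's own code. The column widths and the -9999 missing-value code are my assumptions about the GHCN-Daily station file layout, and the helper names (read_dly, pct_missing, make_line) and the toy file are invented for the example.

```r
# Assumed layout of a daily station file: ID (11 chars), year (4), month (2),
# element (4), then 31 daily fields of a 5-char value plus 3 flag characters.
# Negative widths tell read.fwf() to skip the flag characters.
widths <- c(11, 4, 2, 4, rep(c(5, -3), 31))
col_names <- c("ID", "year", "month", "element", paste0("Val", 1:31))

read_dly <- function(path) {
  read.fwf(path, widths = widths, col.names = col_names,
           na.strings = "-9999", stringsAsFactors = FALSE)
}

# Percentage of calendar days in [first_year, last_year] with no value.
# Months that are absent from the file count as missing as well.
pct_missing <- function(dly, element, first_year, last_year) {
  rows <- dly[dly$element == element &
                dly$year >= first_year & dly$year <= last_year, ]
  observed <- sum(!is.na(as.matrix(rows[paste0("Val", 1:31)])))
  expected <- as.numeric(as.Date(paste0(last_year, "-12-31")) -
                           as.Date(paste0(first_year, "-01-01"))) + 1
  100 * (1 - observed / expected)
}

# Write a made-up two-month file so the sketch runs without a download.
make_line <- function(element, month, values) {
  values[is.na(values)] <- -9999
  paste0(sprintf("%-11s%4d%02d%-4s", "TOY00000001", 1938L, month, element),
         paste0(sprintf("%5d", as.integer(values)), "   ", collapse = ""))
}
path <- tempfile(fileext = ".dly")
writeLines(c(make_line("TMAX", 1L, c(rep(NA, 10), rep(244, 21))),
             make_line("TMAX", 2L, c(rep(261, 28), NA, NA, NA))), path)

dly <- read_dly(path)
tmax_missing <- pct_missing(dly, "TMAX", 1938, 1938)
keep_station <- tmax_missing < 2  # the "less than 2% missing" filter
```

Counting observed values against the number of calendar days, rather than counting NA cells, keeps nonexistent dates such as 30 February from being treated as missing data.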


Big Data in the Geosciences: 2

We are inundated with environmental data – Earth-observing satellites stream terabytes of data back to us daily; ground-based sensor networks track weather, water quality, and air pollution, taking readings every few minutes; and community scientists log hundreds and thousands of observations every day, recording everything from bird sightings to road closures and accidents. But this very richness of data has created a new set of problems.
This second post in our four-part series gives a high-level view of the challenges of portraying and communicating big data in the geosciences – and how these challenges are being addressed – loosely based on the Earth and Space Science Informatics sessions and town halls at the AGU fall meeting in Dec 2016.

Data Visualization
One of the challenges facing geoscientists is simply how to wrangle meaning from big data and effectively communicate their findings to other interested scientists, communities, students, planners or policy-makers. Big data is challenging as it can have a large number of variables with complex, non-linear relationships among them. Scientists are turning to data visualization – which leverages the incredible pattern-recognition power of the human eye – to design graphics that effectively convey complex information.


Big Data in the Geosciences: 1

We are inundated with environmental data – Earth-observing satellites stream terabytes of data back to us daily; ground-based sensor networks track weather, water quality, and air pollution, taking readings every few minutes; and community scientists log hundreds and thousands of observations every day, recording everything from bird sightings to road closures and accidents. But this very richness of data has created a new set of problems.
This four-part post gives a high-level view of some of the challenges of big data in the geosciences – and how they might be solved – loosely based on the Earth and Space Science Informatics sessions and town halls at the AGU fall meeting in Dec 2016.

Data discovery
With so much environmental data, looking for a specific dataset for a research project can sometimes feel like looking for a needle in a haystack. How can data discovery, that is, finding the right dataset or sharing one’s own dataset with the larger research community, be made more efficient?


R: An introductory regression exercise

In this example, we will complete a linear regression in R using mtcars, one of the built-in R datasets.

1. help or ?
First, let us get acquainted with the mtcars dataset.
To do so, we look at the dataset's documentation. We can do this using either ‘?mtcars’ or ‘help(mtcars)’.
Note: this will print in the Help window, and has been pasted into the notebook for completeness.

help(mtcars)

mtcars {datasets}
R Documentation
Motor Trend Car Road Tests
Description

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Usage

mtcars
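
As a preview of where the exercise is heading, here is a minimal sketch of a linear regression on mtcars. The choice of fuel consumption (mpg) as the response and weight (wt) as the predictor is an assumption made for illustration, and the full post may use different variables.

```r
# mtcars ships with base R, so no packages or downloads are needed.
# Assumed model for illustration: miles per gallon as a linear function of weight.
fit <- lm(mpg ~ wt, data = mtcars)

coefs <- coef(fit)                 # intercept and slope
r_squared <- summary(fit)$r.squared

# Predicted mpg for a hypothetical car weighing 3,000 lbs (wt is in 1000 lbs).
pred <- predict(fit, newdata = data.frame(wt = 3))
```

The slope on wt is negative, so in this data heavier cars tend to get fewer miles per gallon. summary(fit) prints the full table of estimates, standard errors and p-values.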