R: Reading & Filtering weather data from the Global Historical Climatology Network (GHCN)

1. Introduction
The GHCN weather dataset is a great starting point for exploring trends in global or regional weather patterns. It is a publicly available observational dataset with daily and monthly summaries of weather variables, including high temperature, low temperature, and precipitation. Data are available for thousands of stations worldwide, and for many of these stations the records stretch back over a century. In this blog post, we describe how to:

  • read in this fixed-width dataset into R;
  • use the metadata information to create a subset of weather stations in the US with data from 1936-2015;
  • determine the percentage of missing data for each station;

thus creating a list of weather stations in the US with 98% coverage of the weather variables TMAX, TMIN, and PRCP for the 80-year period from 1936 to 2015.

In subsequent blog posts, we will show how to create weather data summaries for these US stations, and possible ways to visualize this data. For an example of how this R code can be used, please see our weather project.

2. Downloading the GHCN weather dataset
The GHCN daily weather dataset can be downloaded from this NOAA FTP site. Some of the files available for download are:

  • ghcnd-stations.txt, which has the lat-long for the ~100,000 weather stations in the GHCN.
  • ghcnd-inventory.txt, which, in addition to the lat-long, includes the start and end years for monitoring of each weather variable for each station.
  • readme.txt, a key file that gives the file formats, variable descriptions, units, and other useful information.
  • ghcnd-all.tar.gz, the tarred and gzipped archive containing the daily weather measurements, one file per station.

The GHCN readme.txt also gives instructions on how to cite the weather data (see excerpt below).

Version 3.22

How to cite: Note that the GHCN-Daily dataset itself now has a DOI (Digital Object Identifier) so it may be relevant to cite both the methods/overview journal article as well as the specific version of the dataset used.

The journal article describing GHCN-Daily is: Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.

To acknowledge the specific version of the dataset used, please cite: Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: Global Historical Climatology Network – Daily (GHCN-Daily), Version 3. [indicate subset used following decimal, e.g. Version 3.12].

NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ [access date].


Start by downloading “ghcnd-stations.txt,” which has metadata for all stations.
Then download one of the following TAR files:

  • “ghcnd-all.tar.gz” if you want all of GHCN-Daily, OR
  • “ghcnd-gsn.tar.gz” if you only want the GCOS Surface Network (GSN), OR
  • “ghcnd-hcn.tar.gz” if you only want the U.S. Historical Climatology Network (U.S. HCN).

Then uncompress and untar the contents of the tar file, e.g., by using the following Linux command:

tar xzvf ghcnd_xxx.tar.gz

where “xxx” stands for “all”, “hcn”, or “gsn”, as applicable. The files will be extracted into a subdirectory under the directory where the command is issued.

3. Reading in station metadata
Once the data is downloaded, we can begin identifying weather stations that might be of interest to us. There are ~100,000 weather stations – scattered all over the world – in the GHCN, each collecting data for some subset of weather variables, over different periods of operation. Given the amount and diversity of the data, before starting any analysis we are likely to identify the geographical region, the weather variables, and the time period relevant to the analysis, and to use these criteria to read in only the subset of weather stations we are interested in. We will use the metadata in ghcnd-stations.txt and ghcnd-inventory.txt to identify our weather station subset, so we begin by importing those two files into R. Since they are in fixed-width format, we use read.fortran instead of read.csv.
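
For readers new to read.fortran, here is a small self-contained illustration of its format codes ("A" reads character fields, "F" reals, "I" integers, and "X" skips columns). The file contents and column names below are invented for the demo:

```r
# Write a tiny made-up fixed-width file, then read it back.
tmp <- tempfile()
writeLines(c("AAA 12.5 1949",
             "BBB -3.0 2015"), tmp)

# A3 = 3 characters, X1 = skip one column, F4 = 4-column real, I4 = 4-column integer
demo <- read.fortran(tmp, c("A3", "X1", "F4", "X1", "I4"))
names(demo) <- c("id", "val", "year")
demo
```

The same idea, with wider fields and more of them, is all that is needed to read the GHCN metadata files below.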

We assume that the GHCN data has been downloaded to a directory whose path is given by noaa_dir, and that the csv file(s) created will be written to a directory whose path is given by noaaout.

We first read in the fixed-width file ghcnd-stations.txt, and create both a dataframe and a csv file from the data.

#noaa_dir <- "pathname of directory with GHCN data"
#noaaout <- "pathname of directory for output files"

stnscsv <- paste0(noaaout,"stations.csv")

# Read in the "ghcnd-stations.txt" file, and create a dataframe.
# Write out the dataframe as a csv file for later use.
# As per the readme.txt, the format of ghcnd-stations.txt is:
# IV. FORMAT OF "ghcnd-stations.txt"
# ------------------------------
# Variable   Columns   Type
# ------------------------------
# ID            1-11   Character
# LATITUDE     13-20   Real
# LONGITUDE    22-30   Real
# ELEVATION    32-37   Real
# STATE        39-40   Character
# NAME         42-71   Character
# GSN FLAG     73-75   Character
# HCN/CRN FLAG 77-79   Character
# WMO ID       81-85   Character
# ------------------------------
# beware:  NAME field can contain "#"
# note the blanks between columns -- ditch those 
# ("X1" is the read.fortran notation to ignore a character)

typedcols <- c( "A11", "F9", "F10", "F7", "X1","A2",
                  "X1","A30", "X1", "A3", "X1", "A3", "X1", "A5" )
stns <- read.fortran(paste0(noaa_dir,"ghcnd-stations.txt"),
                     typedcols, comment.char="")
hdrs <- c("ID", "LAT", "LON", "ELEV", "ST", "NAME","GSN", "HCN", "WMOID")
names(stns) <- hdrs
write.csv(stns, stnscsv)

The stns dataframe should look like this:


           ID     LAT      LON  ELEV ST                           NAME GSN HCN WMOID
1 ACW00011604 17.1167 -61.7833  10.1    ST JOHNS COOLIDGE FLD                       
2 ACW00011647 17.1333 -61.7833  19.2    ST JOHNS                                    
3 AE000041196 25.3330  55.5170  34.0    SHARJAH INTER. AIRP            GSN     41196
4 AEM00041194 25.2550  55.3640  10.4    DUBAI INTL                             41194
5 AEM00041217 24.4330  54.6510  26.8    ABU DHABI INTL                         41217
6 AEM00041218 24.2620  55.6090 264.9    AL AIN INTL                            41218

In addition to the LAT and LON for each weather station, the stations file includes the station's elevation, a state abbreviation for stations in the US and Canada, and the station name.
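
With the metadata loaded, quick sanity checks are easy. The snippet below uses a made-up, stns-shaped data frame so it runs on its own; with the real data you would issue the same calls on stns itself:

```r
# A made-up, stns-shaped data frame (IDs, states, and elevations invented)
stns_demo <- data.frame(
  ID   = c("USC00010008", "USC00010063", "CA001012040"),
  ST   = c("AL",          "AL",          "BC"),
  ELEV = c(140.2,         249.9,         17.0))

table(stns_demo$ST)     # number of stations per state/province
range(stns_demo$ELEV)   # elevation range (metres)
```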

Next we read in the inventory file.

inventorycsv <- paste0(noaaout,"inventory.csv")

## Again from readme.txt, we see the format of the inventory.txt file:
# VII. FORMAT OF "ghcnd-inventory.txt"
# ------------------------------
# Variable   Columns   Type
# ------------------------------
# ID            1-11   Character
# LATITUDE     13-20   Real
# LONGITUDE    22-30   Real
# ELEMENT      32-35   Character
# FIRSTYEAR    37-40   Integer
# LASTYEAR     42-45   Integer
# ------------------------------

invcols <- c( "A11", "X1", "F8", "X1", "F9", "X1","A4",
                "X1","I4", "X1", "I4" )
inv <- read.fortran(paste0(noaa_dir,"ghcnd-inventory.txt"), invcols)
invhdrs <- c("ID", "LAT", "LON", "ELEM" , "FIRST", "LAST")
names(inv) <- invhdrs
write.csv(inv, inventorycsv)

The inv(entory) dataframe should look like this:


           ID     LAT      LON ELEM FIRST LAST
1 ACW00011604 17.1167 -61.7833 TMAX  1949 1949
2 ACW00011604 17.1167 -61.7833 TMIN  1949 1949
3 ACW00011604 17.1167 -61.7833 PRCP  1949 1949
4 ACW00011604 17.1167 -61.7833 SNOW  1949 1949
5 ACW00011604 17.1167 -61.7833 SNWD  1949 1949
6 ACW00011604 17.1167 -61.7833 PGTM  1949 1949

The inventory file contains the weather station ID, the weather ‘elements’ measured by the weather station, and the first and last year for measurements on record. The first two characters of the station ID specify the country in which the weather station is located. The GHCN dataset has five core measurements:

  • PRCP = Precipitation (tenths of mm)
  • SNOW = Snowfall (mm)
  • SNWD = Snow depth (mm)
  • TMAX = Maximum temperature (tenths of degrees C)
  • TMIN = Minimum temperature (tenths of degrees C)
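
Note the units: temperatures and precipitation are stored in tenths of their conventional units, so raw values need dividing by 10. The short snippet below (raw values made up for illustration) does the conversion, and also shows how to pull the two-character country code out of a station ID:

```r
# Made-up raw GHCN values, stored in tenths of conventional units
raw_tmax <- 289                         # tenths of degrees C
raw_prcp <- 125                         # tenths of mm

tmax_c  <- raw_tmax / 10                # degrees C
prcp_mm <- raw_prcp / 10                # mm

# The first two characters of a station ID give its country
country <- substr("USC00010008", 1, 2)  # "US"
```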

4. Creating a subset of interest
Now that we have the metadata imported in R, we are ready to create the subset of weather stations we are interested in working with. For this analysis, we will focus on all weather stations in the contiguous US that have TMAX, TMIN, and PRCP data for the period 1936-2015.

Below is the R code for finding all weather stations in the US that have TMAX, TMIN, and PRCP data for the years 1936-2015.

# select all stations in the US (station IDs begin with the country code)
us1 <- inv[grep("^US", inv$ID),]

# stations that monitor max temperature readings (TMAX)
us2 <- us1[us1$ELEM == "TMAX",]
# between 1936 and 2015
us_years <- us2[(us2$FIRST <= 1936) & (us2$LAST >= 2015), c(1,5,6)]
colnames(us_years) <- c("ID", "TMAXf", "TMAXl")

# and have also been monitoring TMIN between 1936 and 2015
us3 <- us1[(us1$ELEM == "TMIN") & (us1$FIRST <= 1936) & (us1$LAST >= 2015), c(1,5,6)]
colnames(us3) <- c("ID", "TMINf", "TMINl")
us_years <- merge(us_years, us3, all=FALSE)

# in addition to monitoring PRCP between 1936 and 2015
us4 <- us1[(us1$ELEM == "PRCP") & (us1$FIRST <= 1936) & (us1$LAST >= 2015), c(1,5,6)]
colnames(us4) <- c("ID", "PRCPf", "PRCPl")
us_years <- merge(us_years, us4, all=FALSE)

# Of the stations monitoring TMAX, TMIN, PRCP between 1936 and 2015,
# select only those that are in the contiguous US

us_stns <- stns[grep("^US", stns$ID),]

# unique(us_stns$ST) gives:
# [1] "SD" "CO" "NE" "AK" "AL" "AR" "AZ" "CA" "TN" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN" "KS"
#[21] "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR"
#[41] "PA" "RI" "SC" "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY" "PI" "UM"

# PI stands for Pacific Islands, UM for US Minor Outlying Islands

conus_stns <- us_stns[!(us_stns$ST %in% c("AK", "HI", "PI", "UM")), ]
us80 <- merge(conus_stns, us_years, all=FALSE) # keep only contiguous-US stations meeting all criteria

us80 has ~2,000 stations in the contiguous US with data spanning the 80 years from 1936 to 2015.
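
The chain of merge(..., all=FALSE) calls above is really a set intersection on station ID: only stations passing every criterion survive. Here is a toy, self-contained version of the same pattern (station IDs made up):

```r
# Each data frame holds the (made-up) IDs passing one criterion
tmax_ok <- data.frame(ID = c("ID001", "ID002", "ID003"))
tmin_ok <- data.frame(ID = c("ID002", "ID003"))
prcp_ok <- data.frame(ID = c("ID003", "ID004"))

# Inner merges (all = FALSE) keep only IDs present in every data frame
keep <- merge(tmax_ok, tmin_ok, all = FALSE)
keep <- merge(keep, prcp_ok, all = FALSE)
keep$ID   # only "ID003" passes all three criteria
```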

5. Reading in the daily weather files
We can now read in the daily weather measurements for only these 2,000 or so stations, instead of all ~100,000 stations, and save them as csv files (which are much faster to read in).

numFiles <- length(us80$ID)
dirname <- paste0(noaa_dir,"ghcnd_all/ghcnd_all/")

for (i in 1:numFiles) {
  infile <- paste0(dirname, us80$ID[i], ".dly")
  outfile <- paste0(noaaout, us80$ID[i], ".csv")
#Each record in a file contains one month of daily data.  The variables on each
#line include the following:
#Variable   Columns   Type
#ID            1-11   Character
#YEAR         12-15   Integer
#MONTH        16-17   Integer
#ELEMENT      18-21   Character
#VALUE1       22-26   Integer
#MFLAG1       27-27   Character
#QFLAG1       28-28   Character
#SFLAG1       29-29   Character
#VALUE2       30-34   Integer
#MFLAG2       35-35   Character
#QFLAG2       36-36   Character
#SFLAG2       37-37   Character
#  .           .          .
#  .           .          .
#  .           .          .
#VALUE31    262-266   Integer
#MFLAG31    267-267   Character
#QFLAG31    268-268   Character
#SFLAG31    269-269   Character

    cols <- c( "A11", "I4", "I2", "A4",
            rep( c( "I5", "A1", "A1", "A1"), 31) )
    df <- read.fortran(infile, cols, na.strings="-9999") # -9999 indicates missing data

    # next, fill in the column names
    tmp <- c("Val","xxM","xxQ","xxS") # xx so we can ditch them later
    vhdrs <- paste(   rep(tmp,31),   rep(1:31,each=4), sep="")
    hdrs <- c("ID", "year", "month", "element", vhdrs)
    names(df) <- hdrs
    df <- df[df$year >= 1936 & df$year <= 2015,]
    df_out <- dplyr::select(df, -dplyr::matches("^xx")) # drop the M, Q, S flag columns
    write.csv(df_out, outfile)
}

6. Filtering on missing data
Since weather stations can have missing data, we further estimate the amount of missing data for each station, and remove any station that has > 2% data missing.

us80$natmax <- 0
us80$natmin <- 0
us80$naprcp <- 0
numFiles <- length(us80$ID)

for (i in 1:numFiles) {
  infile <- paste0(noaaout, us80$ID[i], ".csv")
  df <- read.csv(infile)
  # Columns 6:36 hold Val1..Val31 (read.csv adds a row-name column up front)
  us80$natmax[i] <- sum(is.na(df[df$element == "TMAX", c(6:36)]))
  us80$natmin[i] <- sum(is.na(df[df$element == "TMIN", c(6:36)]))
  us80$naprcp[i] <- sum(is.na(df[df$element == "PRCP", c(6:36)]))
}

# 2% of the maximum possible number of daily values: 80 yrs x 12 months x 31 days x 0.02 = 595.2
stn80 <- us80[us80$natmax <= 595 & us80$natmin <= 595 & us80$naprcp <= 595, ]
write.csv(stn80, paste0(noaaout, "stn80.csv"))

After all the filtering and merging, we finally have stn80 – our list of weather stations in the contiguous US that have TMAX, TMIN, and PRCP data for the years 1936-2015, with < 2% missing data. We will start analyzing this data in the next blog post.