R: Reading & Filtering weather data from the Global Historical Climatology Network (GHCN)

1. Introduction
The GHCN weather dataset is a great starting point for exploring trends in global or regional weather patterns. It is a publicly available observational dataset with daily and monthly summaries of weather variables, including high temperature, low temperature, and precipitation. Data are available for thousands of stations worldwide, and for many of these stations the records stretch back over a century. In this blog post, we describe how to:

  • read in this fixed-width dataset into R;
  • use the metadata information to create a subset of weather stations in the US with data from 1936-2015;
  • determine the percentage of missing data for each station;

thus creating a list of weather stations in the US with 98% coverage of the weather variables TMAX, TMIN, and PRCP for the 80-year period from 1936 to 2015.

In subsequent blog posts, we will show how to create weather data summaries for these US stations, and possible ways to visualize this data. For an example of how this R code can be used, please see our weather project.

2. Downloading the GHCN weather dataset
The GHCN daily weather dataset can be downloaded from this NOAA FTP site. Some of the files available for download are:

  • ghcnd-stations.txt, which has the lat-long for the ~100,000 weather stations in the GHCN.
  • ghcnd-inventory.txt, which, in addition to the lat-long, includes the start and end years for monitoring of each weather variable for each station.
  • readme.txt, a key file that gives the file formats, variable descriptions, units, and other useful information.
  • ghcnd-all.tar.gz, the tarred and gzipped archive containing the daily weather measurements, one file per station.

The GHCN readme.txt also gives instructions on how to cite the weather data (see excerpt below).

Version 3.22

How to cite: Note that the GHCN-Daily dataset itself now has a DOI (Digital Object Identifier) so it may be relevant to cite both the methods/overview journal article as well as the specific version of the dataset used.

The journal article describing GHCN-Daily is: Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.

To acknowledge the specific version of the dataset used, please cite: Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: Global Historical Climatology Network – Daily (GHCN-Daily), Version 3. [indicate subset used following decimal, e.g. Version 3.12].

NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ [access date].


Start by downloading “ghcnd-stations.txt,” which has metadata for all stations.
Then download one of the following TAR files:

  • “ghcnd-all.tar.gz” if you want all of GHCN-Daily, OR
  • “ghcnd-gsn.tar.gz” if you only want the GCOS Surface Network (GSN), OR
  • “ghcnd-hcn.tar.gz” if you only want the U.S. Historical Climatology Network (U.S. HCN).

Then uncompress and untar the contents of the tar file, e.g., by using the following Linux command:

tar xzvf ghcnd_xxx.tar.gz

where “xxx” stands for “all”, “hcn”, or “gsn”, as applicable. The files will be extracted into a subdirectory under the directory where the command is issued.

3. Reading in station metadata
Once the data is downloaded, we can begin identifying weather stations that might be of interest to us. There are ~100,000 weather stations – scattered all over the world – in the GHCN, each collecting data for some subset of weather variables, over different periods of operation. Given the amount and diversity of the data, before starting any analysis we are likely to identify the geographical region, the weather variables, and the time period relevant to the analysis, and to use these criteria to read in only the subset of weather stations we are interested in. We will use the metadata in ghcnd-stations.txt and ghcnd-inventory.txt to identify our weather station subset, so we begin by importing those two files into R. Since they are in fixed-width format, we use read.fortran instead of read.csv.
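
For readers new to read.fortran, here is a small self-contained illustration of its format codes ("A" reads character fields, "F" reals, "I" integers, and "X" skips columns). The file contents and column names below are invented for the demo:

```r
# Write a tiny made-up fixed-width file, then read it back.
tmp <- tempfile()
writeLines(c("AAA 12.5 1949",
             "BBB -3.0 2015"), tmp)

# A3 = 3 characters, X1 = skip one column, F4 = 4-column real, I4 = 4-column integer
demo <- read.fortran(tmp, c("A3", "X1", "F4", "X1", "I4"))
names(demo) <- c("id", "val", "year")
demo
```

The same idea, with wider fields and more of them, is all that is needed to read the GHCN metadata files below.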

We assume that the GHCN data has been downloaded to a directory whose path is given by noaa_dir, and that the csv file(s) created will be written to a directory whose path is given by noaaout.

We first read in the fixed-width file ghcnd-stations.txt, and create both a dataframe and a csv file from the data.

#noaa_dir <- "pathname of directory with GHCN data"
#noaaout <- "pathname of directory for output files"

stnscsv <- paste0(noaaout,"stations.csv")

# Read in the "ghcnd-stations.txt" file, and create a dataframe.
# Write out the dataframe as a csv file for later use.
# As per the readme.txt, the format of ghcnd-stations.txt is:
# IV. FORMAT OF "ghcnd-stations.txt"
# ------------------------------
# Variable   Columns   Type
# ------------------------------
# ID            1-11   Character
# LATITUDE     13-20   Real
# LONGITUDE    22-30   Real
# ELEVATION    32-37   Real
# STATE        39-40   Character
# NAME         42-71   Character
# GSN FLAG     73-75   Character
# HCN/CRN FLAG 77-79   Character
# WMO ID       81-85   Character
# ------------------------------
# beware:  NAME field can contain "#"
# note the blanks between columns -- ditch those 
# ("X1" is the read.fortran notation to ignore a character)

typedcols <- c( "A11", "F9", "F10", "F7", "X1","A2",
                  "X1","A30", "X1", "A3", "X1", "A3", "X1", "A5" )
stns <- read.fortran(paste0(noaa_dir,"ghcnd-stations.txt"),
                     typedcols, comment.char="")
hdrs <- c("ID", "LAT", "LON", "ELEV", "ST", "NAME","GSN", "HCN", "WMOID")
names(stns) <- hdrs
write.csv(stns, stnscsv)

The stns dataframe should look like this:


           ID     LAT      LON  ELEV ST                           NAME GSN HCN WMOID
1 ACW00011604 17.1167 -61.7833  10.1    ST JOHNS COOLIDGE FLD                       
2 ACW00011647 17.1333 -61.7833  19.2    ST JOHNS                                    
3 AE000041196 25.3330  55.5170  34.0    SHARJAH INTER. AIRP            GSN     41196
4 AEM00041194 25.2550  55.3640  10.4    DUBAI INTL                             41194
5 AEM00041217 24.4330  54.6510  26.8    ABU DHABI INTL                         41217
6 AEM00041218 24.2620  55.6090 264.9    AL AIN INTL                            41218

In addition to the LAT and LON for each weather station, the stations file includes the station's elevation, a state abbreviation for stations in the US and Canada, and the station name.
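
With the metadata loaded, quick sanity checks are easy. The snippet below uses a made-up, stns-shaped data frame so it runs on its own; with the real data you would issue the same calls on stns itself:

```r
# A made-up, stns-shaped data frame (IDs, states, and elevations invented)
stns_demo <- data.frame(
  ID   = c("USC00010008", "USC00010063", "CA001012040"),
  ST   = c("AL",          "AL",          "BC"),
  ELEV = c(140.2,         249.9,         17.0))

table(stns_demo$ST)     # number of stations per state/province
range(stns_demo$ELEV)   # elevation range (metres)
```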

Next we read in the inventory file.

inventorycsv <- paste0(noaaout,"inventory.csv")

## Again from readme.txt, we see the format of the inventory.txt file:
# VII. FORMAT OF "ghcnd-inventory.txt"
# ------------------------------
# Variable   Columns   Type
# ------------------------------
# ID            1-11   Character
# LATITUDE     13-20   Real
# LONGITUDE    22-30   Real
# ELEMENT      32-35   Character
# FIRSTYEAR    37-40   Integer
# LASTYEAR     42-45   Integer
# ------------------------------

invcols <- c( "A11", "X1", "F8", "X1", "F9", "X1","A4",
                "X1","I4", "X1", "I4" )
inv <- read.fortran(paste0(noaa_dir,"ghcnd-inventory.txt"), invcols)
invhdrs <- c("ID", "LAT", "LON", "ELEM" , "FIRST", "LAST")
names(inv) <- invhdrs
write.csv(inv, inventorycsv)

The inv(entory) dataframe should look like this:


           ID     LAT      LON ELEM FIRST LAST
1 ACW00011604 17.1167 -61.7833 TMAX  1949 1949
2 ACW00011604 17.1167 -61.7833 TMIN  1949 1949
3 ACW00011604 17.1167 -61.7833 PRCP  1949 1949
4 ACW00011604 17.1167 -61.7833 SNOW  1949 1949
5 ACW00011604 17.1167 -61.7833 SNWD  1949 1949
6 ACW00011604 17.1167 -61.7833 PGTM  1949 1949

The inventory file contains the weather station ID, the weather ‘elements’ measured by the weather station, and the first and last year for measurements on record. The first two characters of the station ID specify the country in which the weather station is located. The GHCN dataset has five core measurements:

  • PRCP = Precipitation (tenths of mm)
  • SNOW = Snowfall (mm)
  • SNWD = Snow depth (mm)
  • TMAX = Maximum temperature (tenths of degrees C)
  • TMIN = Minimum temperature (tenths of degrees C)
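
Note the units: temperatures and precipitation are stored in tenths of their conventional units, so raw values need dividing by 10. The short snippet below (raw values made up for illustration) does the conversion, and also shows how to pull the two-character country code out of a station ID:

```r
# Made-up raw GHCN values, stored in tenths of conventional units
raw_tmax <- 289                         # tenths of degrees C
raw_prcp <- 125                         # tenths of mm

tmax_c  <- raw_tmax / 10                # degrees C
prcp_mm <- raw_prcp / 10                # mm

# The first two characters of a station ID give its country
country <- substr("USC00010008", 1, 2)  # "US"
```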

4. Creating a subset of interest
Now that we have the metadata imported in R, we are ready to create the subset of weather stations we are interested in working with. For this analysis, we will focus on all weather stations in the contiguous US that have TMAX, TMIN, and PRCP data for the period 1936-2015.

Below is the R code for finding all weather stations in the US that have TMAX, TMIN, and PRCP data for the years 1936-2015.

# select all stations in the US (station IDs begin with the country code)
us1 <- inv[grep("^US", inv$ID),]

# stations that monitor max temperature readings (TMAX)
us2 <- us1[us1$ELEM == "TMAX",]
# between 1936 and 2015
us_years <- us2[(us2$FIRST <= 1936) & (us2$LAST >= 2015), c(1,5,6)]
colnames(us_years) <- c("ID", "TMAXf", "TMAXl")

# and have also been monitoring TMIN between 1936 and 2015
us3 <- us1[(us1$ELEM == "TMIN") & (us1$FIRST <= 1936) & (us1$LAST >= 2015), c(1,5,6)]
colnames(us3) <- c("ID", "TMINf", "TMINl")
us_years <- merge(us_years, us3, all=FALSE)

# in addition to monitoring PRCP between 1936 and 2015
us4 <- us1[(us1$ELEM == "PRCP") & (us1$FIRST <= 1936) & (us1$LAST >= 2015), c(1,5,6)]
colnames(us4) <- c("ID", "PRCPf", "PRCPl")
us_years <- merge(us_years, us4, all=FALSE)

# Of the stations monitoring TMAX, TMIN, PRCP between 1936 and 2015,
# select only those that are in the contiguous US

us_stns <- stns[grep("^US", stns$ID),]

# unique(us_stns$ST) gives:
# [1] "SD" "CO" "NE" "AK" "AL" "AR" "AZ" "CA" "TN" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN" "KS"
#[21] "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR"
#[41] "PA" "RI" "SC" "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY" "PI" "UM"

# PI stands for Pacific Islands, UM for US Minor Outlying Islands

conus_stns <- us_stns[!(us_stns$ST %in% c("AK", "HI", "PI", "UM")), ]
us80 <- merge(conus_stns, us_years, all=FALSE) # keep only contiguous-US stations meeting all criteria

us80 has ~2,000 stations in the contiguous US with data spanning the 80 years from 1936 to 2015.
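
The chain of merge(..., all=FALSE) calls above is really a set intersection on station ID: only stations passing every criterion survive. Here is a toy, self-contained version of the same pattern (station IDs made up):

```r
# Each data frame holds the (made-up) IDs passing one criterion
tmax_ok <- data.frame(ID = c("ID001", "ID002", "ID003"))
tmin_ok <- data.frame(ID = c("ID002", "ID003"))
prcp_ok <- data.frame(ID = c("ID003", "ID004"))

# Inner merges (all = FALSE) keep only IDs present in every data frame
keep <- merge(tmax_ok, tmin_ok, all = FALSE)
keep <- merge(keep, prcp_ok, all = FALSE)
keep$ID   # only "ID003" passes all three criteria
```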

5. Reading in the daily weather files
We can now read in the daily weather measurements for only these 2,000 or so stations, instead of all ~100,000 stations, and save them as csv files (which are much faster to read in).

numFiles <- length(us80$ID)
dirname <- paste0(noaa_dir,"ghcnd_all/ghcnd_all/")

for (i in 1:numFiles) {
  infile <- paste0(dirname, us80$ID[i], ".dly")
  outfile <- paste0(noaaout, us80$ID[i], ".csv")
#Each record in a file contains one month of daily data.  The variables on each
#line include the following:
#Variable   Columns   Type
#ID            1-11   Character
#YEAR         12-15   Integer
#MONTH        16-17   Integer
#ELEMENT      18-21   Character
#VALUE1       22-26   Integer
#MFLAG1       27-27   Character
#QFLAG1       28-28   Character
#SFLAG1       29-29   Character
#VALUE2       30-34   Integer
#MFLAG2       35-35   Character
#QFLAG2       36-36   Character
#SFLAG2       37-37   Character
#  .           .          .
#  .           .          .
#  .           .          .
#VALUE31    262-266   Integer
#MFLAG31    267-267   Character
#QFLAG31    268-268   Character
#SFLAG31    269-269   Character

    cols <- c( "A11", "I4", "I2", "A4",
            rep( c( "I5", "A1", "A1", "A1"), 31) )
    df <- read.fortran(infile, cols, na.strings="-9999") # -9999 indicates missing data

    # next, fill in the column names
    tmp <- c("Val","xxM","xxQ","xxS") # xx so we can ditch them later
    vhdrs <- paste(   rep(tmp,31),   rep(1:31,each=4), sep="")
    hdrs <- c("ID", "year", "month", "element", vhdrs)
    names(df) <- hdrs
    df <- df[df$year >= 1936 & df$year <= 2015,]
    df_out <- dplyr::select(df, -dplyr::matches("^xx")) # drop the M, Q, S flag columns
    write.csv(df_out, outfile)
}

6. Filtering on missing data
Since weather stations can have missing data, we further estimate the amount of missing data for each station, and remove any station that has > 2% data missing.

us80$natmax <- 0
us80$natmin <- 0
us80$naprcp <- 0
numFiles <- length(us80$ID)

for (i in 1:numFiles) {
  infile <- paste0(noaaout, us80$ID[i], ".csv")
  df <- read.csv(infile)
  # Columns 6:36 hold Val1..Val31 (read.csv adds a row-name column up front)
  us80$natmax[i] <- sum(is.na(df[df$element == "TMAX", c(6:36)]))
  us80$natmin[i] <- sum(is.na(df[df$element == "TMIN", c(6:36)]))
  us80$naprcp[i] <- sum(is.na(df[df$element == "PRCP", c(6:36)]))
}

# 2% of the maximum possible number of daily values: 80 yrs x 12 months x 31 days x 0.02 = 595.2
stn80 <- us80[us80$natmax <= 595 & us80$natmin <= 595 & us80$naprcp <= 595, ]
write.csv(stn80, paste0(noaaout, "stn80.csv"))

After all the filtering and merging, we finally have stn80 – our list of weather stations in the contiguous US that have TMAX, TMIN, and PRCP data for the years 1936-2015, with < 2% missing data. We will start analyzing this data in the next blog post.