R: Analyzing weather data from the Global Historical Climatology Network (GHCN)

1. Introduction
The R code below can be used to extract some weather metrics such as maximum daily temperature, minimum daily temperature, average or total daily rainfall and other annual metrics from the GHCN weather data set. This code assumes that you have already created a dataframe of the GHCN stations of interest to you. For example, the set of GHCN stations of interest in this exercise consists of the 520 stations within the US that have data for the 80 years from 1936–2015, with less than 2% missing data (see “R: Reading & Filtering GHCN weather data” on how this set was created). This dataframe (stn80 in our case) with the stations of interest should include, at a minimum, the station ID, LAT, LON (the LAT & LON are useful for mapping the metrics).

The GHCN weather data has one data file for each station. The station data file from GHCN has the following format:
Note: the GHCN station datafiles were converted from a fixed width format to a comma separated format.

head(USC00010252)
  X.1  X          ID year month element Val1 Val2 Val3 Val4 Val5 Val6 Val7 Val8 Val9 Val10 Val11 Val12 Val13 Val14 Val15 Val16 Val17 Val18 Val19 Val20 Val21 Val22 Val23 Val24 Val25 Val26 Val27 Val28 Val29 Val30 Val31
1  42 42 USC00010252 1938     1    TMAX   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA   244   256   239   233   222   194   189   233   228   239   239   239   250   250   200    67    67    94   178   233   183
2  43 43 USC00010252 1938     1    TMIN   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    67    94   111   111   117   144   144   167   183   183   139   122   150   139    44   -72   -67   -17   -17    78    56
3  45 45 USC00010252 1938     1    PRCP    0   64   25    0    0   89  191    0    0     0     0    15     0     5     0     0     0    23     0     0     0     0     0   203     0     0     0     0     0     0    41
4  48 48 USC00010252 1938     2    TMAX  172  161  211  233  261  256  256  256  250   261   256   239   256   233   261   233   233   239   233   128   261   261   178   128   117   211   233   206    NA    NA    NA
5  49 49 USC00010252 1938     2    TMIN  -28   33   83   50   78  156   89   94  106    83    94   100    83   122    89   133    78   128    56     6    78   133   111    33     6     0    44    78    NA    NA    NA
6  51 51 USC00010252 1938     2    PRCP    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     3   686     0     0     0   114     0     0     0     0     0    NA    NA    NA

Note: TMAX and TMIN are in tenths of degree Celsius, so 172 is 17.2C
We will manipulate these station data files in R to create several different metrics and write them to their own output files.

2. Creating a daily metrics dataframe for each weather station
We first create a data frame/file for each station that is in a format that can readily be used to plot a “heat map” for the 80 years of data for either max daily temperature (TMAX), min daily temperature (TMIN) or precipitation (PRCP). You can see an example of a TMAX heat map on our weather project page. The tidyr package is used below to convert from rows to columns.

## stn80 has the list of GHCN stations of interest
stn80 <- read.csv("stationlist.csv", stringsAsFactors=FALSE)

numFiles <- length(stn80$ID)

## Create a data frame labeled with the years 1936-2015- the 80 years we are interested in - 
## and the 12 months for each year.
## Each month's data is on a single line, so there is only one row for each month each year.
df <- data.frame( year = rep(c(1936:2015), each = 12), month = rep(c(1:12),80))

for (i in 1:numFiles) {
  df$ID <- stn80$ID[i]
  
  fname <- paste0(stn80$ID[i],".csv")
  stnf <- read.csv(fname, stringsAsFactors=FALSE)  ## read the file for the current station

  ## select only the TMAX elements from the incoming file, and merge with the data frame df.
  ## Use gather (from tidyr) to convert the daily data from row to columns.
  stntmax <- stnf[stnf$element == "TMAX", -c(1,2,6)]
  stntmax <- merge(df, stntmax, all.x=TRUE)
  stntmax1 <- stntmax %>% gather(day, TMAX, Val1:Val31)  ## create one line for each day
  stntmax1$day <- as.integer(substring(stntmax1$day, 4, last = 1000000L)) 
  ## label the day by the integer after "Val". last = 1000000L means go to the end of the string.
  
  ## repeat for TMIN, merge TMIN data frame with TMAX dataframe into the output dataframe
  stntmin <- stnf[stnf$element == "TMIN", -c(1,2,6)]
  stntmin <- merge(df, stntmin, all.x=TRUE)
  stntmin1 <- stntmin %>% gather(day, TMIN, Val1:Val31)
  stntmin1$day <- as.integer(substring(stntmin1$day, 4, last = 1000000L))
  outfr <- merge(stntmax1, stntmin1, all = TRUE)

  ## repeat for PRCP, and merge with output dataframe
  stnprcp <- stnf[stnf$element == "PRCP", -c(1,2,6)]
  stnprcp <- merge(df, stnprcp, all.x=TRUE)
  stnprcp1 <- stnprcp %>% gather(day, PRCP, Val1:Val31)
  stnprcp1$day <- as.integer(substring(stnprcp1$day, 4, last = 1000000L))
  outfr <- merge(outfr, stnprcp1, all=TRUE)
  
  ## save the file
  outfile <- paste0( stn80$ID[i],"_80.csv")
  write.csv(outfr, outfile)
  
}

Each station output file looks like this:

  X year month          ID day TMAX TMIN PRCP
1 1 1936     1 USC00017366   1   61   17    5
2 2 1936     1 USC00017366  10  178   11    0
3 3 1936     1 USC00017366  11  200   44    0
4 4 1936     1 USC00017366  12  217   22    0
5 5 1936     1 USC00017366  13  189  106    5
6 6 1936     1 USC00017366  14  217   22    0

3. Creating an annual average dataframe for each weather station
We next create annual metric files, one for each of our 520 stations and one for each metric. Below is the R code used to create the file with average annual TMAX for each year. Within the same loop, one could also compute the annual average TMIN and the annual average (or total) precipitation. The dplyr package is used to create annual summaries.

numFiles <- length(stn80$ID)

for (i in 1:numFiles) {
    fname <- paste0(stn80$ID[i],"_80.csv")
    tempf <- read.csv(fname, stringsAsFactors=FALSE)
    tmax_summ <- tempf %>% group_by(year) %>% summarise(ID = first(ID), metric = 1, value = mean(TMAX, na.rm=TRUE))
    outfile1 <- paste0(stn80$ID[i],"_ann_tmax.csv")
    write.csv(tmax_summ, outfile1)

}

4. Determining the number of days each year that set record highs
Another useful metric is the number of days each year which set record highs for the day, for each station. Also interesting is to note which year the record high was set for a given day. The R code (annotated) to generate this data is below.


for (i in 1:numFiles) {
  ## read the file that we saved in step 1, which has TMAX, TMIN, and PRCP.
  fname <- paste0(stn80$ID[i],"_80.csv")
  tempf <- read.csv(fname, stringsAsFactors=FALSE) 
   
  ## for convenience, drop the leading rownum as well as tmin and prcp columns
  tempf <- tempf[,c(2:6)]

  ## create a data frame to accumulate record high temperatures for each day
  ## populate it with an entry from tempf to get the types and colnames correct
  rec_hi <- tempf[1,]

  ## iterate over each calendar day to determine which of the 80 years we are analyzing had
  ## record temperature for that day. Note, the climate data has 31 entries for each month (even Feb).
  for (m in 1:12) {
    for (d in 1:31) {

      ## subset all data for the calendar day
      xx <- tempf[tempf$month == m & tempf$day == d,]
      ## this logic/magic bit of code extracts year or years with the max/record temp, 
      ## and binds it to our accumulating data frame
      rec_hi <- rbind(rec_hi, xx[which(xx$TMAX == max(xx$TMAX, na.rm=TRUE)),])
    }
  }
  ## remove the dummy record we used to create the accumulating dataframe
  rec_hi <- rec_hi[-1,]
  
  ## summarize the accumulated data frame to get the number of record highs in each year
  rechi_summ <- rec_hi[!is.na(rec_hi$TMAX),] %>% group_by(year) %>% 
                dplyr::summarise(ID = first(ID), metric = 3, value = n())

  ## the following 3 lines ensure there is an entry for each year, 
  ## with a 0 instead of an NA for years in which no record highs were set  
  df <- data.frame(year = c(1936:2015), ID = stn80$ID[i], metric = 3)
  df <- merge(df, rechi_summ, all.x=TRUE)
  df$value[is.na(df$value)] <- 0
  
  outfile3 <- paste0( stn80$ID[i],"_year_rec_hi.csv")
  outfile4 <- paste0( stn80$ID[i],"_num_rec_hi.csv")
  write.csv(df, outfile3)
  write.csv(rec_hi, outfile4)
  
}

5. Determining the number of days above a threshold value each year
Another metric that is often of interest is the number of days each year above a specified temperature. Below is the R code to determine the number of days each year with temperature > 35C (95F).

for (i in 1:numFiles) {
  fname <- paste0(stn80$ID[i],"_80.csv")
  tempf <- read.csv(fname, stringsAsFactors=FALSE)
  ## create a dataframe without NAs, as NAs mess up the min and max functions
  t.na <- tempf[!is.na(tempf$TMAX),]
  ## The temperature data in the GHCN file is in tenths of degrees Celsius: 350 => 35C
  t95 <-  t.na[t.na$TMAX >=350,]

  ## use dplyr to count the number of days over 35C
  t95_summ <- t95 %>% 
              group_by(year) %>% 
              summarise(ID = first(ID), metric = 7, value = n())
  
  ## create a "master" dataframe which has an entry for each year...
  df <- data.frame(year = c(1936:2015), ID = stn80$ID[i], metric = 7)
  ##...and merge the count of days dataframe with it.
  df <- merge(df, t95_summ, all.x=TRUE)
  ## Replace NAs with 0, as a year which did not have any days above 35C will show up as an NA after the merge.
  df$value[is.na(df$value)] <- 0
  outfile7 <- paste0(stn80$ID[i],"_days_95plus.csv")
  write.csv(df, outfile7)
  high7 <- paste0(stn80$ID[i],"_days_95_plus.csv")
  write.csv(t95[,c("ID","year","month", "day", "TMAX")], high7)

}