1. Introduction
The R code below can be used to extract some weather metrics such as maximum daily temperature, minimum daily temperature, average or total daily rainfall and other annual metrics from the GHCN weather data set. This code assumes that you have already created a dataframe of the GHCN stations of interest to you. For example, the set of GHCN stations of interest in this exercise consists of the 520 stations within the US that have data for the 80 years from 1936–2015, with less than 2% missing data (see “R: Reading & Filtering GHCN weather data” on how this set was created). This dataframe (stn80 in our case) with the stations of interest should include, at a minimum, the station ID, LAT, LON (the LAT & LON are useful for mapping the metrics).
The GHCN weather data has one data file for each station. The station data file from GHCN has the following format:
Note: the GHCN station datafiles were converted from a fixed width format to a comma separated format.
head(USC00010252)
X.1 X ID year month element Val1 Val2 Val3 Val4 Val5 Val6 Val7 Val8 Val9 Val10 Val11 Val12 Val13 Val14 Val15 Val16 Val17 Val18 Val19 Val20 Val21 Val22 Val23 Val24 Val25 Val26 Val27 Val28 Val29 Val30 Val31 1 42 42 USC00010252 1938 1 TMAX NA NA NA NA NA NA NA NA NA NA 244 256 239 233 222 194 189 233 228 239 239 239 250 250 200 67 67 94 178 233 183 2 43 43 USC00010252 1938 1 TMIN NA NA NA NA NA NA NA NA NA NA 67 94 111 111 117 144 144 167 183 183 139 122 150 139 44 -72 -67 -17 -17 78 56 3 45 45 USC00010252 1938 1 PRCP 0 64 25 0 0 89 191 0 0 0 0 15 0 5 0 0 0 23 0 0 0 0 0 203 0 0 0 0 0 0 41 4 48 48 USC00010252 1938 2 TMAX 172 161 211 233 261 256 256 256 250 261 256 239 256 233 261 233 233 239 233 128 261 261 178 128 117 211 233 206 NA NA NA 5 49 49 USC00010252 1938 2 TMIN -28 33 83 50 78 156 89 94 106 83 94 100 83 122 89 133 78 128 56 6 78 133 111 33 6 0 44 78 NA NA NA 6 51 51 USC00010252 1938 2 PRCP 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 686 0 0 0 114 0 0 0 0 0 NA NA NA
Note: TMAX and TMIN are in tenths of degree Celsius, so 172 is 17.2C
We will manipulate these station data files in R to create several different metrics and write them to their own output files.
2. Creating a daily metrics dataframe for each weather station
We first create a data frame/file for each station that is in a format that can readily be used to plot a “heat map” for the 80 years of data for either max daily temperature (TMAX), min daily temperature (TMIN) or precipitation (PRCP). You can see an example of a TMAX heat map on our weather project page. The tidyr package is used below to convert from rows to columns.
## stn80 has the list of GHCN stations of interest
stn80 <- read.csv("stationlist.csv", stringsAsFactors=FALSE)
numFiles <- length(stn80$ID)
## Create a data frame labeled with the years 1936-2015- the 80 years we are interested in -
## and the 12 months for each year.
## Each month's data is on a single line, so there is only one row for each month each year.
df <- data.frame( year = rep(c(1936:2015), each = 12), month = rep(c(1:12),80))
for (i in 1:numFiles) {
df$ID <- stn80$ID[i]
fname <- paste0(stn80$ID[i],".csv")
stnf <- read.csv(fname, stringsAsFactors=FALSE) ## read the file for the current station
## select only the TMAX elements from the incoming file, and merge with the data frame df.
## Use gather (from tidyr) to convert the daily data from row to columns.
stntmax <- stnf[stnf$element == "TMAX", -c(1,2,6)]
stntmax <- merge(df, stntmax, all.x=TRUE)
stntmax1 <- stntmax %>% gather(day, TMAX, Val1:Val31) ## create one line for each day
stntmax1$day <- as.integer(substring(stntmax1$day, 4, last = 1000000L))
## label the day by the integer after "Val". last = 1000000L means go to the end of the string.
## repeat for TMIN, merge TMIN data frame with TMAX dataframe into the output dataframe
stntmin <- stnf[stnf$element == "TMIN", -c(1,2,6)]
stntmin <- merge(df, stntmin, all.x=TRUE)
stntmin1 <- stntmin %>% gather(day, TMIN, Val1:Val31)
stntmin1$day <- as.integer(substring(stntmin1$day, 4, last = 1000000L))
outfr <- merge(stntmax1, stntmin1, all = TRUE)
## repeat for PRCP, and merge with output dataframe
stnprcp <- stnf[stnf$element == "PRCP", -c(1,2,6)]
stnprcp <- merge(df, stnprcp, all.x=TRUE)
stnprcp1 <- stnprcp %>% gather(day, PRCP, Val1:Val31)
stnprcp1$day <- as.integer(substring(stnprcp1$day, 4, last = 1000000L))
outfr <- merge(outfr, stnprcp1, all=TRUE)
## save the file
outfile <- paste0( stn80$ID[i],"_80.csv")
write.csv(outfr, outfile)
}
Each station output file looks like this:
X year month ID day TMAX TMIN PRCP 1 1 1936 1 USC00017366 1 61 17 5 2 2 1936 1 USC00017366 10 178 11 0 3 3 1936 1 USC00017366 11 200 44 0 4 4 1936 1 USC00017366 12 217 22 0 5 5 1936 1 USC00017366 13 189 106 5 6 6 1936 1 USC00017366 14 217 22 0
3. Creating an annual average dataframe for each weather station
We next create annual metric files, one for each of our 520 stations and one for each metric. Below is the R code used to create the file with average annual TMAX for each year. Within the same loop, one could also compute the annual average TMIN and the annual average (or total) precipitation. The dplyr package is used to create annual summaries.
numFiles <- length(stn80$ID)
for (i in 1:numFiles) {
fname <- paste0(stn80$ID[i],"_80.csv")
tempf <- read.csv(fname, stringsAsFactors=FALSE)
tmax_summ <- tempf %>% group_by(year) %>% summarise(ID = first(ID), metric = 1, value = mean(TMAX, na.rm=TRUE))
outfile1 <- paste0(stn80$ID[i],"_ann_tmax.csv")
write.csv(tmax_summ, outfile1)
}
4. Determining the number of days each year that set record highs
Another useful metric is the number of days each year which set record highs for the day, for each station. Also interesting is to note which year the record high was set for a given day. The R code (annotated) to generate this data is below.
for (i in 1:numFiles) {
## read the file that we saved in step 1, which has TMAX, TMIN, and PRCP.
fname <- paste0(stn80$ID[i],"_80.csv")
tempf <- read.csv(fname, stringsAsFactors=FALSE)
## for convenience, drop the leading rownum as well as tmin and prcp columns
tempf <- tempf[,c(2:6)]
## create a data frame to accumulate record high temperatures for each day
## populate it with an entry from tempf to get the types and colnames correct
rec_hi <- tempf[1,]
## iterate over each calendar day to determine which of the 80 years we are analyzing had
## record temperature for that day. Note, the climate data has 31 entries for each month (even Feb).
for (m in 1:12) {
for (d in 1:31) {
## subset all data for the calendar day
xx <- tempf[tempf$month == m & tempf$day == d,]
## this logic/magic bit of code extracts year or years with the max/record temp,
## and binds it to our accumulating data frame
rec_hi <- rbind(rec_hi, xx[which(xx$TMAX == max(xx$TMAX, na.rm=TRUE)),])
}
}
## remove the dummy record we used to create the accumulating dataframe
rec_hi <- rec_hi[-1,]
## summarize the accumulated data frame to get the number of record highs in each year
rechi_summ <- rec_hi[!is.na(rec_hi$TMAX),] %>% group_by(year) %>%
dplyr::summarise(ID = first(ID), metric = 3, value = n())
## the following 3 lines ensure there is an entry for each year,
## with a 0 instead of an NA for years in which no record highs were set
df <- data.frame(year = c(1936:2015), ID = stn80$ID[i], metric = 3)
df <- merge(df, rechi_summ, all.x=TRUE)
df$value[is.na(df$value)] <- 0
outfile3 <- paste0( stn80$ID[i],"_year_rec_hi.csv")
outfile4 <- paste0( stn80$ID[i],"_num_rec_hi.csv")
write.csv(df, outfile3)
write.csv(rec_hi, outfile4)
}
5. Determining the number of days above a threshold value each year
Another metric that is often of interest is the number of days each year above a specified temperature. Below is the R code to determine the number of days each year with temperature > 35C (95F).
for (i in 1:numFiles) {
fname <- paste0(stn80$ID[i],"_80.csv")
tempf <- read.csv(fname, stringsAsFactors=FALSE)
## create a dataframe without NAs, as NAs mess up the min and max functions
t.na <- tempf[!is.na(tempf$TMAX),]
## The temperature data in the GHCN file is in tenths of degrees Celsius: 350 => 35C
t95 <- t.na[t.na$TMAX >=350,]
## use dplyr to count the number of days over 35C
t95_summ <- t95 %>%
group_by(year) %>%
summarise(ID = first(ID), metric = 7, value = n())
## create a "master" dataframe which has an entry for each year...
df <- data.frame(year = c(1936:2015), ID = stn80$ID[i], metric = 7)
##...and merge the count of days dataframe with it.
df <- merge(df, t95_summ, all.x=TRUE)
## Replace NAs with 0, as a year which did not have any days above 35C will show up as an NA after the merge.
df$value[is.na(df$value)] <- 0
outfile7 <- paste0(stn80$ID[i],"_days_95plus.csv")
write.csv(df, outfile7)
high7 <- paste0(stn80$ID[i],"_days_95_plus.csv")
write.csv(t95[,c("ID","year","month", "day", "TMAX")], high7)
}