NAME
ghcn_fetch.pl - Fetch station and weather data from the NOAA GHCN repository
VERSION
version v0.0.002
SYNOPSIS
ghcn_fetch.pl [-gui] [-optfile <filespec>]
ghcn_fetch.pl [<report_type>]
[-country <str>] [-state <str>] [-location <str>] [-gsn]
[-gps "<lat> <long>" [-radius <n>] ]
[-range <str>] [-active <str> [-partial]] [-quality <pct>]
[-fmonth <str>] [-fday <str>]
[-anomalies] [-baseline <str>] [-precip] [-tavg] [-nogaps]
[-kml <filespec> [-color <str>] ]
[-dataonly] [-nonetwork <int>] [-performance] [-verbose]
[-outclip]
[-report <report_type>]
<report_type> ::= id | daily | monthly | yearly | ""
ghcn_fetch.pl -readme
ghcn_fetch.pl -help
ghcn_fetch.pl -usage | -?
DESCRIPTION
Fetch data from the NOAA GHCN database and output as tab-separated lines. Various options are provided to allow filtering of the NOAA stations by country, state, location name, year range, station active year range, etc. When no report type is provided, or -report is an empty string, the output is simply a list of the selected stations.
If report type 'daily', 'monthly' or 'yearly' is given, then the pages for the selected stations are scanned and the data from them aggregated and output as one row per designated period. This is followed by the station list.
If report type 'id' is given, then the daily data for each selected station id is reported, followed by the station list.
The report type can be abbreviated; e.g. d or da for daily. The report type can be provided as the first argument, or it can be provided via the -report option anywhere within the argument list.
In general, it's best to narrow your filter criteria as much as possible; otherwise it will take a very long time to load and process the station pages. A good strategy is to omit the -report option so you can see how many stations will be queried before asking for any detailed data. Then you can adjust the number of stations using other filters.
If no options are given, and stdin isn't receiving from a pipe or a file, then -gui is assumed. This launches a dialog to provide a user-friendly way to set options, and to save and reload them (if -optfile is provided).
PARAMETERS
Getopt::Long is used, so either - or -- may be used. Parameter names may be abbreviated, so long as they remain unambiguous.
Report Types
Data obtained from the GHCN database can be reported at various levels of aggregation using the -report option. The string value given to -report specifies the type and level of aggregation. Abbreviations are permitted.
- -report station
-
Generate a list of the stations which match the criteria provided (location, geo coordinates, ranges, etc.). This is the default when no report type is requested. No actual weather data is accessed; only station data.
- -report daily
-
Scan the NOAA station pages that meet all the selection criteria and aggregate the data from them by year, month and day. Output the results as a tab-separated table suitable for import into Excel for analysis.
TMAX (temperature maximum) is aggregated by maximum; TMIN by minimum; TAVG values are averaged. Note that while most stations track TMAX and TMIN, a lot fewer track TAVG. When TAVG is missing, a proxy is calculated by averaging TMAX and TMIN.
- -report monthly
-
Same as -report daily, except the output is summarized to the month level. Note that with this option, TAVG is averaged across the days of the month and may be of limited usefulness. Avg will be calculated as the average of the max and min for the month, which is what is typically used as the measure for monthly average temperature.
- -report yearly
-
Same as -report daily, except the output is summarized to the year level. See the explanation of TAVG vs Avg under -report monthly.
- -report id
-
Break the selected aggregation level down by station id and include the station id in the output. This is like -report daily, but with a separate set of rows for each station id.
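The aggregation rules described under -report daily (TMAX by maximum, TMIN by minimum, TAVG by average, with a TMAX/TMIN proxy when TAVG is missing) can be illustrated with a short Python sketch. This is not the script's own code. The dictionary layout and the function name aggregate_day are assumptions made for the example, as is the choice to compute the proxy from the already-aggregated TMAX and TMIN.

```python
from statistics import mean


def aggregate_day(readings):
    """Combine one day's readings from several stations into a single row.

    `readings` is a list of dicts with optional TMAX, TMIN and TAVG keys
    (a hypothetical structure chosen for this example only).
    """
    tmax_values = [r["TMAX"] for r in readings if r.get("TMAX") is not None]
    tmin_values = [r["TMIN"] for r in readings if r.get("TMIN") is not None]
    tavg_values = [r["TAVG"] for r in readings if r.get("TAVG") is not None]

    # TMAX is aggregated by maximum, TMIN by minimum.
    tmax = max(tmax_values) if tmax_values else None
    tmin = min(tmin_values) if tmin_values else None

    if tavg_values:
        # TAVG values are averaged when at least one station reports them.
        tavg = mean(tavg_values)
    elif tmax is not None and tmin is not None:
        # Assumption: the proxy is the midpoint of the aggregated TMAX and TMIN.
        tavg = (tmax + tmin) / 2
    else:
        tavg = None

    return {"TMAX": tmax, "TMIN": tmin, "TAVG": tavg}
```

Whether the real script computes the proxy per station or after aggregation is not stated here, so treat that detail as a guess.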
Station Filter
A list of station id's can be provided via stdin, and will be used in lieu of other filtering criteria. Each line of input will be searched for one or more station id's.
Geographic Filters
- -country <str>
-
Filter the station list to include only those from a specific country. The string can be a 2-character GEC (formerly FIPS) country code, a 3-character UN country code, or a 3-character internet country code (including the dot). Longer strings are treated as a pattern and matched (unanchored) against country names.
NOAA uses GEC codes in their database. For a full list of country codes and names see https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-countries.txt and https://www.cia.gov/library/publications/the-world-factbook/appendix/appendix-d.html
- -state <str> (or -province)
-
Filter the station list to include only those within the specified 2-character US state or Canadian province code.
- -location <str>
-
Filter the station list to include only those whose name matches the specified pattern. For a starts-with match, prefix the pattern with ^ (or \A). For an ends-with match, suffix the pattern with $ (or \Z).
You can also specify a station id (e.g. CA006105978) or a comma-delimited list of station id's (e.g. CA006105978,USC00336346).
As a handy shortcut, mappings between user-defined names and a station id or id list can be defined in the aliases section of .ghcn_fetch.yaml.
- -gsn
-
Select only GCOS Surface Network stations, which is a baseline network comprising a subset of about 1000 stations chosen mainly to give a fairly uniform spatial coverage from places where there is a good length and quality of data record. See https://www.ncdc.noaa.gov/gosic/global-climate-observing-system-gcos/gcos-surface-network-gsn-program-overview
- -gps <latitude>,<longitude>
-
Filter the station list to include only those stations that are within -radius kilometers (default 25) of the specified decimal latitude and longitude values; e.g. 45.3822 -75.7167. The two values can be delimited by spaces, or by any punctuation character (e.g. comma). If a space is used, the string must be enclosed in quotes.
- -radius <int>
-
Specify the radius, in kilometers, to be used for the -gps option.
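To make the -gps and -radius filter concrete, here is a minimal Python sketch of a radius test based on the haversine great-circle formula. The formula is an assumption: this documentation does not say how ghcn_fetch measures distance. The station dictionaries and the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_radius(stations, lat, lon, radius_km=25):
    """Keep only the stations within radius_km of the point (lat, lon).

    Each station is a dict with 'lat' and 'lon' keys (hypothetical layout).
    """
    return [
        s for s in stations
        if distance_km(lat, lon, s["lat"], s["lon"]) <= radius_km
    ]
```

The default of 25 mirrors the documented default for -radius.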
Date Filters
- -range <str>
-
Only include data from the specified range of years. The range is given as a string such as 1990-2018. Any punctuation character can be used to separate the two years. A single year may also be given. Alternatively, two discontiguous years can be given by separating the years with a comma (e.g. -range 1919,2019), although this feature cannot be combined with -active or -anomalies.
Note that if -active is specified, then -range must be a subset of -active since there's no point in asking for data that lies outside the active range of data collection for a station.
- -active <str>
-
Only include data from stations which have been fully active within the specified range. The range is given as a string such as 1990-2018. Any punctuation character can be used to separate the two years. A single year may also be given.
Instead of a year range, you can use an empty string to set the active range to match the range specified by -range.
- -partial
-
The -partial option can be used in conjunction with -active to include stations that were only active during part of the active range.
- -quality <int>
-
Only include stations which have unflagged data for at least <int>% of the days within -range. If -anomalies is given, the number of days within the -baseline range is also checked against <int>%. The default value for -quality is 90, meaning that 90% of the days found within -range (and -baseline) must be present and unflagged in order for the station's data to be included in the output.
- -fday <str>
-
Filter the data so that it includes only the days of the month which match the specified range list; e.g. 5-10,20.
- -fmonth <str>
-
Filter the data so that it includes only the months of the year which match the specified range list; e.g. 1-3,7-9 would select Jan-Mar and Jul-Sep.
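The range lists accepted by -fday and -fmonth (e.g. 5-10,20 or 1-3,7-9) can be expanded as shown in the following Python sketch. It only illustrates the notation. The function name parse_range_list is hypothetical, and the script's own parser may accept other forms.

```python
def parse_range_list(spec):
    """Expand a range list such as "1-3,7-9" into a sorted list of integers."""
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            values.update(range(low, high + 1))  # ranges are inclusive
        else:
            values.add(int(part))
    return sorted(values)


# Example: keep only rows whose month falls in Jan-Mar or Jul-Sep.
months = set(parse_range_list("1-3,7-9"))
rows = [(2020, 2, 14), (2020, 5, 1), (2020, 8, 30)]  # (year, month, day)
selected = [row for row in rows if row[1] in months]
```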
Analysis Options
- -anomalies
-
Calculate the mean temperature anomalies for each day at each station relative to a baseline year range (see -baseline). Include these in the output.
- -baseline <str>
-
Use the date range <str> to compute anomalies. Default 1971-2000.
- -precip
-
Include precipitation measures in the output, specifically SNOW, SNWD (snow depth), and PRCP (all precipitation). Values are in cm. Like TMAX, SNWD is the maximum depth recorded across stations and across time. The others are averaged across stations and then summed across time. In other words, if -report yearly is used you get the maximum snow depth for the year, and the total accumulation of snow and precipitation for the year.
- -tavg
-
Include TAVG (average daily temperature) in the output. TAVG will be averaged across stations and also across months or years if -report monthly or -report yearly is given.
- -nogaps
-
For report 'id', generate rows for those months and days where data is missing. This enables charting with a complete time x-axis. Without it, large gaps result in horizontal compression of the chart and a distorted picture across time.
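As an illustration of what -anomalies and -baseline describe, the Python sketch below computes daily anomalies for one station as the observed value minus the baseline-period mean for the same calendar day. That definition is a common convention and is assumed here. The exact method used by ghcn_fetch is not given in this documentation, and the record layout and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def daily_anomalies(records, baseline=(1971, 2000)):
    """Return (year, month, day, anomaly) tuples for a single station.

    `records` is a list of (year, month, day, temperature) tuples
    (hypothetical layout). The default baseline matches the documented
    default for -baseline.
    """
    first, last = baseline

    # Collect baseline-period temperatures for each calendar day.
    by_day = defaultdict(list)
    for year, month, day, temp in records:
        if first <= year <= last:
            by_day[(month, day)].append(temp)
    baseline_mean = {key: mean(vals) for key, vals in by_day.items()}

    # Anomaly = observed value minus the baseline mean for that calendar day.
    result = []
    for year, month, day, temp in records:
        base = baseline_mean.get((month, day))
        if base is not None:
            result.append((year, month, day, temp - base))
    return result
```

Days with no baseline data are dropped in this sketch; the real script may handle them differently.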
KML Options
- -kml <filespec>
-
Output the coordinates of the selected stations as a KML file, for import into Google Earth as placemarks. The active range of each station will be included as timespans so that you can view the placemarks across time.
- -color <color> (or -colour)
-
Color of the KML placemark pushpins. Acceptable values are red, green, blue, azure, purple, yellow and white. May be abbreviated down to one letter. Default is red.
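For readers unfamiliar with KML, the following Python sketch builds one placemark with a timespan, which is the kind of element the -kml option is described as producing. It is not the script's output format. The function name and the sample station values are hypothetical, and pushpin styling (the -color option) is left out. Note that KML lists coordinates as longitude first, then latitude.

```python
import xml.etree.ElementTree as ET


def station_placemark(name, lat, lon, first_year, last_year):
    """Build a KML Placemark element for one station (illustrative only)."""
    placemark = ET.Element("Placemark")
    ET.SubElement(placemark, "name").text = name

    # The active range becomes a TimeSpan so viewers can animate over time.
    span = ET.SubElement(placemark, "TimeSpan")
    ET.SubElement(span, "begin").text = str(first_year)
    ET.SubElement(span, "end").text = str(last_year)

    # KML coordinates are written as longitude,latitude.
    point = ET.SubElement(placemark, "Point")
    ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return placemark


# Sample values below are made up for the example.
pm = station_placemark("EXAMPLE STATION", 45.3822, -75.7167, 1990, 2018)
kml_text = ET.tostring(pm, encoding="unicode")
```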
Output Options
Misc Options
- -dataonly
-
Print only the data table. Other information, including notes, lists of stations kept and rejected, and statistics are suppressed.
- -nonetwork <int>
-
Set the NoNetwork option used in URI::Fetch in order to alter the behaviour of caching.
By default, -nonetwork is set to -1, which sets the NoNetwork option of URI::Fetch to the number of seconds in the current year at the time the script is run. This means that the HTTP server is not contacted if the page is in cache and the cached page was inserted sometime within the present year. If the cached copy is older than this year, then a normal HTTP request (full or cache check) is done.
If -nonetwork is set to 0 and the requested page is found in the cache, the HTTP server is checked for a fresher copy.
If -nonetwork is set to 1, the HTTP server is never contacted, regardless of the page being in cache or not. If the page is missing from cache, the fetch method will return undef and the script will die. If the page is in cache, that page will be returned, no matter how old it is. This is useful for situations where the NOAA HTTP server is slow or offline and the desired data is available in the cache.
- -performance
-
Include performance statistics in the output. This includes some extra timing information (labelled "(internal)" in the Time Statistics list because they are internal to the other timing metrics) as well as statistics for the memory consumption of the Data hash table. Also some memory statistics are added to some Timing subjects.
- -verbose
-
When given, warning messages about missing data are displayed to stderr.
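The default -nonetwork behaviour described above can be sketched in Python. This reads "the number of seconds in the current year at the time the script is run" as the seconds elapsed since 1 January of the current year, which is an interpretation based on the explanation that follows it. Both function names are hypothetical, and the script itself passes the value to URI::Fetch rather than to anything shown here.

```python
from datetime import datetime


def seconds_into_year(now):
    """Seconds elapsed between 1 January of now's year and now."""
    start_of_year = datetime(now.year, 1, 1)
    return int((now - start_of_year).total_seconds())


def effective_nonetwork(option, now):
    """Translate the -nonetwork value into a cache age limit (assumed logic).

    -1 (the default) becomes the seconds elapsed this year, so a page
    cached at any point in the present year is used without contacting
    the server. Other values are passed through unchanged.
    """
    return seconds_into_year(now) if option == -1 else option
```

For example, at midnight on 2 January the default works out to 86400 seconds.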
Command-Line Only Options
- -gui
-
Launch a graphical user interface that can be used to set options. Not available unless modules Tk and Tk::Getopt are installed.
- -optfile <filespec>
-
Designate a file to be used to save or load options.
- -readme
-
Launch the default web browser and display the NOAA Daily Readme.txt file, providing a description of the Daily data files and station data.
- -h | -help
-
Display this documentation.
- -usage | -?
-
Display the Synopsis section of this documentation.
CONFIGURATION FILE
At startup, ghcn_fetch will look for the file .ghcn_fetch.yaml in either %UserProfile% (Windows) or $HOME (unix/linux) in order to capture some additional options. The file content should contain something like this:
---
cache:
  root: C:/ghcn_cache
  namespace: ghcn
aliases:
  yow: CA006106000,CA006106001 # Ottawa airport
  cda: CA006105976,CA006105978 # Ottawa CDA and CDA RCS
  center: USC00326365 # geographic center of North America
Supported options are:
- cache:
-
This section defines the cache_root and namespace options for URI::Fetch. If present, then any pages which are fetched from the NOAA GHCN repository are cached in the folder and subfolder designated by root: and namespace:. This vastly improves the performance of subsequent invocations of ghcn_fetch, especially when using the same station filtering criteria.
- aliases:
-
This section provides a list of shortcut names that are mapped to station id's or id-lists and which can be used in the -location option. If a -location value matches a key in this section, the station id or id-list is substituted. Note that keys must be lowercase letters only, with or without a leading underscore.
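The alias substitution described for the aliases section can be pictured with a small Python sketch. The key rule (lowercase letters, optionally preceded by an underscore) comes from the text above. The function name resolve_location and the in-memory dictionary are assumptions for the example and do not reflect how ghcn_fetch reads its YAML file.

```python
import re

# Keys are lowercase letters only, with or without a leading underscore.
ALIAS_KEY = re.compile(r"_?[a-z]+")


def resolve_location(location, aliases):
    """Return the alias target if `location` is a known alias key.

    Otherwise the value is returned unchanged, to be treated as a
    station id, an id list, or a name pattern.
    """
    if ALIAS_KEY.fullmatch(location) and location in aliases:
        return aliases[location]
    return location


# Aliases taken from the sample configuration shown earlier.
aliases = {
    "yow": "CA006106000,CA006106001",
    "cda": "CA006105976,CA006105978",
    "center": "USC00326365",
}
```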
RELATED SCRIPTS
Additional scripts are provided for data analysis. These scripts are designed to take the output of ghcn_fetch as their input.
For Windows users, a -outclip option directs the tab-separated output to the Windows clipboard, so it can be pasted into Excel for analysis using PivotTable and PivotChart. Alternatively you can use the usual '>' method to direct the output to a file.
- ghcn_extremes.pl
-
Report patterns of temperature extremes (heatwaves or coldwaves) by analyzing daily temperature records and looking for consecutive days of extreme temperatures; e.g.
ghcn_fetch -country CA -report daily | ghcn_extremes > extremes.tsv
- ghcn_station_counts.pl
-
Report the station counts per year for a list of stations generated by this script using -report station (which is the default -report option); e.g.
ghcn_fetch -country CA | ghcn_station_counts > stn_counts.tsv
AUTHOR
Gary Puckering (jgpuckering@rogers.com)
LICENSE AND COPYRIGHT
Copyright 2022, Gary Puckering