Cloudnet NetCDF Convention

Robin Hogan
Version 1: 5 December 2001 - original specification for Level 1 radar and lidar data
Version 2: 5 August 2002 - some refinements to conform to the Climate and Forecast (CF) convention
Version 3: 23 August 2004 - more detailed specifications relevant to Level 2 and above


Motivation

It has been agreed that Cloudnet instrument data should be provided by participants in either NetCDF or ASCII format. For instruments such as radiometers that simply produce a time series of a few parameters, ASCII is sufficient. For instruments such as radar and lidar that produce large two-dimensional arrays of data, NetCDF is a much more suitable format, in large part because it is self-describing. It is also much more suitable for subsequent meteorological products. However, there is much freedom in how to arrange the data in a NetCDF file, so it makes sense to define a standard that participants should aim to conform to, both in the instrument data that they provide and the meteorological products derived from them. This allows generic programs to read and plot data produced by any participant; one such program is chilncplot, part of the chil package, which is used to produce the quicklooks currently on the Cloudnet web site.

The Cloudnet convention is applicable to any dataset on a time-height grid, including radar and lidar data, single-site model forecasts and derived meteorological products. It adopts many of the components of other NetCDF conventions, specifically the Climate and Forecast (CF) convention, which is favoured by the British Atmospheric Data Centre (BADC). Conventions generally relate to the attributes that should be supplied, or those that it is recommended to use if a certain piece of information needs to be conveyed.

Suggestions for improvements/clarifications to the convention would be welcome.

Sample radar and lidar NetCDF files (older data so may not conform fully)

Files and filenames

Each level 1 (instrumental or model data) or level 2 (meteorological product) file should contain data from a single day. The times reported in the file should be in hours UTC, so for instruments that operate continuously, each individual file should run from midnight to midnight UTC.

Filenames should be of the form YYYYMMDD_WHERE_WHAT.nc, where the fields are as follows:

YYYYMMDD
The date, UTC.
WHERE
A lower case string identifying the site (currently one of chilbolton, cabauw, palaiseau, arm-sgp, arm-nsa, arm-darwin, arm-manus and arm-nauru).
WHAT
This field either identifies the instrument (e.g. galileo for the Galileo radar), the model (met-office-global-12-35 for the 12-35 hour forecast of the Met Office global model) or the meteorological product (e.g. iwc-Z-T-method for ice water content derived using the reflectivity+temperature method).
The fields should not contain underscores (_); hyphens (-) should be used to separate information within fields. This then allows for an additional field to be added to indicate the version number, under consideration for a future revision of the convention. Thus filenames should contain only the characters [-_.a-z0-9]. Spaces are forbidden as they have the habit of breaking unix scripts.
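
As an illustration of the naming rule above, the following Python sketch splits a filename into its three fields and rejects names that break the character rules. The function name parse_cloudnet_filename and the regular expression are assumptions made for this example and are not part of the convention; in particular, the pattern makes no allowance for the possible future version field.

```python
import re
from datetime import datetime

# Assumed pattern: an 8-digit date, then two fields made only of lower-case
# letters, digits and hyphens, separated by underscores, then ".nc".
FILENAME_RE = re.compile(r"(\d{8})_([-a-z0-9]+)_([-a-z0-9]+)\.nc")

def parse_cloudnet_filename(filename):
    """Return (date, where, what) for a conforming filename (illustrative helper)."""
    match = FILENAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError("not of the form YYYYMMDD_WHERE_WHAT.nc: %r" % filename)
    # strptime also rejects impossible dates such as month 13.
    day = datetime.strptime(match.group(1), "%Y%m%d").date()
    return day, match.group(2), match.group(3)

day, where, what = parse_cloudnet_filename("20020905_chilbolton_galileo.nc")
print(day, where, what)
```

Hyphens inside a field, as in met-office-global-12-35, are accepted, while spaces, upper-case letters and extra underscores cause the name to be rejected.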

Processed data (i.e. data with the clear-sky noise removed and set to a constant value) compress to a very small size with tools such as zip because of all the repeated values in the file.

Dimensions

The NetCDF dataset should contain the dimension time, which should be the first dimension defined. The vertical dimension may be range, height or level:

time
This dimension may have the length unlimited. NetCDF permits one dimension to be unlimited, which means that variables using this dimension can grow along this dimension. However, if the data are read one variable at a time then the use of an unlimited dimension seems to slow down the read speed.
range
This dimension is used for instrumental data up to level 1b, and indicates that distance is measured from the instrument rather than from mean sea level, and also allows for instruments not pointing at zenith.
height
For level 1c and 2, ranges from instruments are converted to heights above mean sea level, and this dimension name is used.
level
For level 1 model data this is used to indicate model level rather than height, since model levels often do not correspond to unique heights.
Other dimensions may be defined. For example, the level 1b model data contains microwave propagation parameters derived from the model fields for several different frequencies, so uses the dimension frequency. The level 1c Instrument Synergy/Target Categorization dataset holds model data on the original vertical model grid (to save space), which is referenced using the model_height dimension.

Variables

Compulsory variables

The following compulsory variables are stored as variables rather than global attributes because they have a unit or other describing attribute associated with them; the attributes that should be set are shown indented after each variable name. Each NetCDF attribute consists of a "name" and a "value", where the value can be a text string or a vector of numbers. All these variables are of type float, i.e. a 4-byte floating-point number (real*4 in FORTRAN nomenclature).

latitude
units = "degrees_north"
long_name = "Latitude of site"
longitude
It is conventional to always report positive longitudes, i.e. +359.0 rather than -1.0.
units = "degrees_east"
long_name = "Longitude of site"
For each dimension a "coordinate variable" must be defined, i.e. a vector variable with the same name as the dimension. Typically these would be of type float. Thus all datasets should contain a time variable:
time(time)
Note that the float type has enough precision for time in hours to be discretised to better than 0.007 seconds.
units = "hours since YYYY-MM-DD 00:00:00 00:00"
where YYYY-MM-DD must contain the date that the data were taken (e.g. 2002-09-05). The zeros at the end indicate that the time is from midnight UTC (i.e. timezone 00:00). This reporting of time is from the CF convention. Note that reporting time in hours rather than seconds from midnight is much more convenient for the user.
long_name = "Time UTC"
axis = "T"
A range, height or level variable should then also be defined, depending on the dimensions present, e.g.
range(range)
units = "km"
Note that reporting range in metres ("m") is also permissible.
long_name = "Range from antenna to the centre of each range gate"
An example long name.
axis = "Z"
height(height)
units = "m"
long_name = "Height above mean sea level"
axis = "Z"
Note that the axis attribute is the CF way of stating the dominant temporal and vertical variables against which 2D variables in the file should be plotted. No more than one axis of a given type should be present in the file.
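
Two of the statements in this section can be checked numerically. The Python sketch below, using only the standard library, estimates the resolution of a 4-byte float holding a time in hours close to midnight, and maps a negative longitude onto the positive range. The helper names float32_spacing and positive_longitude are invented for this example.

```python
import struct

def float32_spacing(x):
    """Gap between x (rounded to float32) and the next larger float32 value.

    Intended for positive finite x only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lower = struct.unpack("<f", struct.pack("<I", bits))[0]
    upper = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return upper - lower

def positive_longitude(lon_degrees_east):
    """Map a longitude onto the range 0 to 360 degrees east."""
    return lon_degrees_east % 360.0

# Resolution of a float time stamp just before midnight, converted to seconds.
spacing_seconds = float32_spacing(23.99) * 3600.0
print(spacing_seconds)            # a little under 0.007 s
print(positive_longitude(-1.0))   # 359.0
```

The spacing is largest between 16 and 24 hours, so the figure printed here is the worst case within a single day.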

Compulsory variable attributes

All variables should set the following two attributes:
units
The units should be readable by the UDUNITS package, as required by the CF convention description. An additionally accepted unit is dBZ. If possible, units should be SI. Consider also using the units_html attribute discussed below. The main points for uniform use of units are as follows: Note that bit fields and status fields, defined below, need not use the units attribute.
long_name
This should be a concise but informative phrase describing the variable, short enough to fit comfortably in the axis or title of a plot (i.e. shorter than around 60 characters). It should start with an upper case letter.

Recommended variable attributes

The following attributes are good ways to express information about a variable. They should conform to the conventions indicated.

comment
This is by far the most important attribute that a variable can have as it describes to the user what the variable is. Do not assume that the user has a copy of documentation that should have been distributed with the file: put enough information here to explain what the variable contains, how it was derived, what the calibration convention was and things the user should be aware of when using this variable. If there are references specific to this variable (i.e. those that would be inappropriate in the global references attribute) then include them here. Ideally this attribute should start with "This variable contains...", such that it may be used as a general description of the variable for use with programs that generate automatic documentation from a NetCDF file; for example, the detailed descriptions of variables on the IWC product page were contained in comment attributes. Use complete sentences terminated with a full-stop/period so that extra comments can be easily appended. New line characters (ASCII code: decimal 10) should be used to break long lines. Note that the use of the plural comments has been deprecated.
_FillValue and missing_value
If the variable contains missing data (e.g. because an instrument was not working, or the variable indicates cloud particle size but no cloud is present) then both _FillValue and missing_value should be present to indicate which value has been used to flag that no valid data are available. They must be of the same type as the variable itself. The use of two different attributes is an unfortunate consequence of the fact that older programs may only expect missing_value while newer programs tend to use _FillValue.
units_html
If units contains subscripts or superscripts, consider adding a units_html attribute containing <sup></sup> or <sub></sub> HTML tags, which display programs (specifically chilncplot) can use to show exponents properly. So if units was "g m-3" then units_html would be "g m<sup>-3</sup>".
plot_range
A two-element vector of numbers with the same units as the variable itself, which indicate the recommended range to plot the variable over. This does not mean that variables outside this range are invalid. The attribute must be of the same type as the variable (i.e. float, short etc.). This attribute should be used in combination with plot_scale.
plot_scale
This attribute contains either "linear" or "logarithmic", indicating the best way to plot the variable. It should be used in combination with plot_range and is interpreted by programs such as chilncplot.
source
For datasets containing variables derived from different sources, it is useful to indicate the particular source here. Typically one would take the global source attribute from the dataset from which this variable was derived.
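
For units strings written in the UDUNITS style used in this document, a units_html value can often be generated automatically. The Python sketch below is a simple heuristic that wraps an integer exponent following a unit symbol in <sup> tags. The function name units_to_html is an assumption of this example, and the rule would mishandle units in which a digit is part of the symbol, so the result should be checked by eye.

```python
import re

def units_to_html(units):
    """Wrap integer exponents that directly follow a letter in <sup> tags."""
    return re.sub(r"(?<=[A-Za-z])(-?\d+)", r"<sup>\1</sup>", units)

print(units_to_html("g m-3"))
print(units_to_html("m-1 sr-1"))
```

Units without exponents, such as "dBZ" or "1", pass through unchanged.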

Variables indicating error and sensitivity

All derived meteorological products at level 2 and above should ideally be accompanied by an indication of their error. Typically errors can be divided into random error that decorrelates rapidly with time, and a bias due to the accuracy with which an instrument was calibrated and which may affect all measurements in a day uniformly. Additionally, many instruments and the products derived from them have a sensitivity, or a minimum detectable value, which should be reported in order that comparison with models be fair. Variables affected in this way should define one or more of the following attributes:

error_variable
Contains the name of the variable in the file that indicates the random error of the variable in question. Typically if the variable name were Z, then the corresponding error_variable would be Z_error.
bias_variable
As above, but for the bias. Similarly, the typical name for the bias in Z would be Z_bias.
sensitivity_variable
As above, but for the sensitivity. The typical name for the minimum detectable Z would be Z_sensitivity.
Sometimes errors can have a long (and difficult to define) decorrelation time, and it is not obvious how to differentiate between random error and bias. In this case only an error_variable need be defined. The variables used to report error and sensitivity should conform to the following conventions:

Bit fields and status fields

It is often necessary to indicate the status of a retrieval, enabling the user to distinguish pixels for which the retrieval was (for example) "reliable", "probably reliable but...", "unreliable", "not possible". Sometimes targets need to be classified into one of a number of different types, such as "liquid clouds", "ice clouds", "aerosol", "insects". In this case one can use a status field, where the integer variable will be one of a limited number of values, or a bit field, where each bit of the integer variable should be interpreted as a separate flag. Such variables should always be of NetCDF type byte, to avoid the byte-order confusion that is likely to arise with two-byte and four-byte integers due to different CPU architectures. Additionally, it is probably best not to use the most significant bit of the field, as this is used to indicate the sign of the byte and could easily be misread by badly written programs. Hence use no more than 7 bits per byte, and if you need more bits, consider providing two bit fields. Rather than use a units attribute, the variable should use a definition attribute, where each line (separated by the new-line character) indicates the meaning either of each value, or of each bit. In the case of status fields, we could have:

definition =
"0: No cloud present
1: Reliable retrieval
2: Possibly unreliable retrieval due to spiders in the waveguide
3: Unreliable retrieval
"
while in the case of bit fields we could have:
definition =
"Bit 0: Liquid droplets are present
Bit 1: Ice particles are present
Bit 2: Raindrops are present
Bit 3: Aerosol particles are present
"
Note that definition is used by programs such as chilncplot in the key at the side of the plot to indicate the meaning of each colour, so the descriptions should be fairly concise. Use of a long_definition attribute is therefore recommended where more complete descriptions may be placed, but the same format should be used, with a single line terminated by a new-line character (except the last) for each entry.
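
To show how a reading program might use the definition attribute, the Python sketch below parses the bit-field example above and lists the flags set in a given byte value. The helper names parse_definition and decode_bit_field are assumptions of this example; it relies only on the "key: description" line format described in this section.

```python
def parse_definition(definition):
    """Map each key (e.g. '0' or 'Bit 2') to its description."""
    entries = {}
    for line in definition.splitlines():
        if not line.strip():
            continue  # skip the empty line left by a trailing new-line
        key, _, text = line.partition(":")
        entries[key.strip()] = text.strip()
    return entries

def decode_bit_field(value, definition):
    """Return the descriptions of all bits that are set in value."""
    flags = []
    for key, text in parse_definition(definition).items():
        bit = int(key.split()[1])  # 'Bit 2' -> 2
        if value & (1 << bit):
            flags.append(text)
    return flags

definition = (
    "Bit 0: Liquid droplets are present\n"
    "Bit 1: Ice particles are present\n"
    "Bit 2: Raindrops are present\n"
    "Bit 3: Aerosol particles are present\n"
)
print(decode_bit_field(3, definition))  # bits 0 and 1 are set
```

For a status field, parse_definition alone is sufficient: the integer value, converted to a string, is the key.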

Typically status fields are very suitable to be plotted, so to assist plotting programs it is helpful if the following attributes are defined:

plot_range
As defined above (although it need not be accompanied by plot_scale), this attribute would be a vector of two bytes indicating the lowest (invariably 0) and highest value to be displayed.
legend_key_red, legend_key_green and legend_key_blue
Each attribute is a vector of type float with length equal to the number of categories in the status field. The numbers should lie between 0.0 and 1.0 and indicate the RGB values recommended for displaying the field.
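
A plotting program could combine the three legend_key vectors into one colour per category. The Python sketch below converts them to hexadecimal colour strings; the function name legend_colours and the sample values are invented for illustration and are not taken from any Cloudnet file.

```python
def legend_colours(red, green, blue):
    """Combine three equal-length vectors of 0.0-1.0 values into '#rrggbb' strings."""
    if not (len(red) == len(green) == len(blue)):
        raise ValueError("legend_key vectors must have the same length")
    colours = []
    for r, g, b in zip(red, green, blue):
        for component in (r, g, b):
            if not 0.0 <= component <= 1.0:
                raise ValueError("colour components must lie between 0.0 and 1.0")
        colours.append("#%02x%02x%02x" % (round(r * 255), round(g * 255), round(b * 255)))
    return colours

# Hypothetical two-category status field: white for category 0, blue for category 1.
print(legend_colours([1.0, 0.0], [1.0, 0.0], [1.0, 1.0]))
```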

Global attributes

Global attributes provide important information about the data in a NetCDF file.

Compulsory global attributes

The following attributes should be present and of type short. They replicate information present in the units attribute of the time variable, but are much easier to obtain from scalar global attributes than by parsing a string.

day
The day of the month on which the data were taken.
month
The month of the year, where January = 1 etc.
year
The year as a full four-digit number (e.g. 2001).
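
Since these attributes duplicate the date held in the units attribute of the time variable, a writing program can derive both from a single date so that they cannot disagree. The Python sketch below does this; the function names are assumptions of this example.

```python
from datetime import date

def date_attributes(day_of_data):
    """Return the day, month and year global attributes as integers."""
    return {"day": day_of_data.day,
            "month": day_of_data.month,
            "year": day_of_data.year}

def time_units(day_of_data):
    """Return the units string for the time variable, counted from midnight UTC."""
    return "hours since %s 00:00:00 00:00" % day_of_data.isoformat()

d = date(2002, 9, 5)
print(date_attributes(d))
print(time_units(d))
```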

The following attributes should be present and of type text:

Conventions = "CF-1.0"
Indicates that your data satisfies the CF conventions. If your data doesn't satisfy the CF conventions, don't include this attribute.
location
The site at which the instrument was operating, such as "Chilbolton", "Cabauw", "Palaiseau" and "ARM Southern Great Plains".
title
A suitable title for plots created from the dataset, such as "Ice water content from Chilbolton", "Chilbolton 94-GHz Cloud Radar (Galileo)" or "Cabauw 905-nm CT75K Vaisala Lidar Ceilometer".
history
Each program that acts on the file should append to this attribute a brief description of what they did, and when they did it (again using the new-line character as a separator). Extra information can include the user and the name of the machine. For example, "Wed Nov 28 18:38:12 GMT 2001 - NetCDF generated from original data by Robin Hogan <r.j.hogan@reading.ac.uk> on voldemort". If the calibration needs to be changed then it may be appended by "\nThu Nov 29 18:38:12 GMT 2001 - Recalibrated (+3 dB) by Robin Hogan <r.j.hogan@reading.ac.uk> on voldemort", where '\n' indicates the new-line character (i.e. not a backslash character followed by an "n" character).
institution
The institution that produced the data, such as "Royal Dutch Meteorological Institute (KNMI)". It may be necessary to refer to several institutes, in which case they should be separated by a new line, e.g. "Data recorded at Chilbolton Observatory (part of the Radio Communications Research Unit, RAL, UK)\nProcessed by the University of Reading".
source
In the case of instrumental data, this would contain a brief specification of the instrument. The spec of a radar should include frequency, antenna diameter, pulse repetition frequency, pulse width (in microseconds) and peak power, and the spec of a lidar should include wavelength, divergence, field of view and pulse repetition frequency. The fields would be new-line separated. In the case of model data a single-line title for the model is sufficient, e.g. "UK Met Office mesoscale model". Data derived from a variety of sources should concatenate the global source attributes from the input datasets, separated by semi-colon (;) and new-line.
references
Any web-based or published information about the data, e.g. "Information on the data is available at http://www.met.rdg.ac.uk/radar/doc/galileo.html". Obviously please ensure that the web site referred to is maintained for the likely lifetime of the data.

Recommended global attributes

comment
Any further general information for the user (that is not specific to individual variables) should be added here. Use complete sentences terminated with a full-stop/period so that extra comments can be easily appended. It is also useful to add new-line characters to break up long lines.
command_line
The full Unix (or DOS) command line used to call the program that generated the data. This is essential for Chilbolton data where the various processing options (such as the calibration figure applied) are all decided by command-line arguments, and one often needs to know exactly what processing was applied. If more than one program operates on the file (such as if the data need to be recalibrated) then each program should append its own command line, separated by a new-line character. Each element of the command_line attribute should therefore correspond to the matching element of the history attribute.
software_version
If the processing program changes over time then it is useful to store the version number (as a string) of the program here.
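
One way to keep the history and command_line attributes in step is to append to both in a single call. The Python sketch below holds the global attributes in an ordinary dictionary; the function names, the dictionary representation and the command lines in the usage example are all hypothetical, and a real program would write the result back with whichever NetCDF library it uses.

```python
from datetime import datetime, timezone

def append_entry(existing, entry):
    """Append one entry to a new-line separated attribute value."""
    return entry if not existing else existing + "\n" + entry

def record_processing(attrs, description, command_line, when=None):
    """Add matching entries to the history and command_line attributes."""
    when = when or datetime.now(timezone.utc)
    # Same layout as the history example earlier in this document.
    stamp = when.strftime("%a %b %d %H:%M:%S GMT %Y")
    attrs["history"] = append_entry(attrs.get("history", ""),
                                    stamp + " - " + description)
    attrs["command_line"] = append_entry(attrs.get("command_line", ""),
                                         command_line)
    return attrs

attrs = {}
record_processing(attrs, "NetCDF generated from original data",
                  "make_netcdf raw.dat out.nc")         # hypothetical command
record_processing(attrs, "Recalibrated (+3 dB)",
                  "recalibrate --offset 3 out.nc")      # hypothetical command
print(attrs["history"])
print(attrs["command_line"])
```

Because both attributes are extended together, the nth line of command_line always describes the same processing step as the nth line of history.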


Recommended variables for radar and lidar data

The following describes additional conventions that should make radar and lidar data from different sites as similar as possible.

Scalar variables

The following variables are single values that are stored as variables rather than global attributes because they have a unit or other describing attribute associated with them; the attributes that should be set are shown indented after each variable name. All these variables are of type float.

altitude
To get the altitude of each range gate above mean sea level, the user of this data should add this value to the values in the range variable (assuming the instrument is vertically pointing, and taking account of the fact that altitude is in metres and range is in km).
units = "m"
long_name = "Altitude of antenna above mean sea level"
elevation
Most radars will be vertically pointing, so their elevation will be 90°. Lidars may be deployed off-zenith to avoid specular reflection from horizontally aligned plate crystals, in which case the elevation will be less than 90°.
units = "degrees"
long_name = "Elevation above horizon"
azimuth
An optional variable that gives the azimuth of instruments that are not vertically pointing.
units = "degrees"
long_name = "Azimuth clockwise from due north"

For radar the following should also be defined:

frequency
units = "GHz"
long_name = "Radar frequency"

For lidar, use:

wavelength
If this is a multi-wavelength lidar, then wavelength should be a one-dimensional array containing all the wavelengths available. This requires an extra dimension, also with name wavelength.
units = "nm"
long_name = "Lidar wavelength"
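
The conversion from range to height described under altitude can be sketched in Python as follows. The vertically pointing case follows the text directly; the sine factor for off-zenith instruments is an assumption of this example based on simple flat-earth geometry, and the function name and sample numbers are hypothetical.

```python
import math

def gate_heights(altitude_m, range_km, elevation_deg=90.0):
    """Height of each range gate above mean sea level, in metres.

    altitude_m    -- altitude of the antenna above mean sea level (m)
    range_km      -- ranges from the instrument (km)
    elevation_deg -- elevation above the horizon (degrees); 90 is zenith
    """
    vertical_fraction = math.sin(math.radians(elevation_deg))
    return [altitude_m + r * 1000.0 * vertical_fraction for r in range_km]

# Hypothetical instrument 100 m above sea level with gates at 0.5 and 1.0 km.
print(gate_heights(100.0, [0.5, 1.0]))
```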

Two-dimensional variables

Most two-dimensional variables will be of type float. However, for some data it may make sense to use the short data type (a signed 2-byte integer; integer*2 in FORTRAN nomenclature). The CT75K lidar ceilometer is a good candidate as the raw data are stored to this precision so no information is lost. You may then use scale_factor and/or add_offset attributes to get the data into suitable units and to provide the correct calibration. If both are present then the data in the file should be scaled first before the offset is added. Note also that the missing_value and _FillValue attributes apply to the data before it has been scaled and shifted in this way. Usually scale_factor and add_offset would be of type float.
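
The order of operations described above can be sketched in Python as follows: the stored value is first compared with the missing-data flag, then multiplied by scale_factor, and finally add_offset is added. The function name unpack_values and the sample numbers are invented for this example, and missing data are represented here by None.

```python
def unpack_values(stored, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Convert stored (packed) integers to physical values.

    fill_value is compared with the stored value, i.e. before scaling.
    """
    unpacked = []
    for value in stored:
        if fill_value is not None and value == fill_value:
            unpacked.append(None)  # no valid data at this point
        else:
            unpacked.append(value * scale_factor + add_offset)
    return unpacked

# Hypothetical short data with -32768 used as the missing-data flag.
print(unpack_values([0, 100, -32768], scale_factor=0.5, add_offset=10.0,
                    fill_value=-32768))
```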

For some variables, notably radar reflectivity, accurate calibration can be difficult and the data may need to be recalibrated after the initial release. These variables should therefore indicate the calibration that has been applied to them in the processing stage in the calibration_applied attribute.

The following are variable names that could be used in radar data, and some of the attributes that should be present:

Z(time, range)
units = "dBZ"
long_name = "Radar reflectivity factor"
comment = "Calibration convention: in the absence of attenuation, a cloud at 273 K containing one million 100-micron droplets per cubic metre will have a reflectivity of 0 dBZ at all frequencies."
calibration_applied
...in dB.
v(time, range)
units = "m s-1"
units_html = "m s<sup>-1</sup>"
long_name = "Doppler velocity"
comment = "Positive velocities are away from the radar."
folding_velocity
This attribute indicates that the velocities may be folded, lying in the range -folding_velocity to folding_velocity.
width(time, range)
units = "m s-1"
units_html = "m s<sup>-1</sup>"
long_name = "Spectral width"
comment = "This variable is the standard deviation of the reflectivity-weighted velocities in the radar pulse volume."
sigma_v(time, range)
Level 1 data is typically averaged to 30 seconds, so the velocity variable in the NetCDF file is typically an average of a number of high-resolution mean velocity values measured in the averaging time. The sigma_v variable is the standard deviation of these high-resolution mean velocities. Spectral width is the standard deviation of actual particle velocities measured within the radar pulse volume in a short time (typically around 1 second), so tends to be dominated by the differential fall speeds of the different sized particles. This variable, on the other hand, is dominated by turbulence.
units = "m s-1"
units_html = "m s<sup>-1</sup>"
long_name = "Standard deviation of mean velocity"
comment = "The data in this file are at a lower resolution than the raw data, and this variable is the standard deviation of the raw Doppler velocities measured in each output gate and ray."
Ldr(time, range)
units = "dB"
long_name = "Linear depolarisation ratio"
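
The relationship between v and sigma_v described above can be sketched in Python as follows for a single output gate. The function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption of this example, since the text does not say which is intended.

```python
import statistics

def average_velocities(high_res_velocities):
    """Return (v, sigma_v) for one output gate.

    high_res_velocities -- the high-resolution mean Doppler velocities (m s-1)
                           measured within one averaging period
    """
    v = statistics.fmean(high_res_velocities)
    sigma_v = statistics.pstdev(high_res_velocities)
    return v, sigma_v

# Hypothetical high-resolution velocities within one averaging period.
v, sigma_v = average_velocities([1.0, 2.0, 3.0])
print(v, sigma_v)
```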

Similarly, the following are variable names that could be used with lidar data:

beta(time, range)
If attenuated backscatter coefficient is measured at more than one wavelength, then the wavelength could be indicated in the variable name, such as beta1064, beta532 etc.
units = "m-1 sr-1"
units_html = "m<sup>-1</sup> sr<sup>-1</sup>"
long_name = "Attenuated backscatter coefficient"
Ldr(time, range)
units = "1"
Lidar depolarisation ratio normally lies in the range 0 to 1.
long_name = "Linear depolarisation ratio"

If there is a need to have an unprocessed version of a variable in the file then I suggest using the names Z_raw, beta_raw and so on.