imported on: 2025-04-18
This notebook is from NSIDC's Processing Large-scale Time Series of ICESat-2 Sea Ice Height in the Cloud tutorial
The original source for this document is https://github.com/nsidc/NSIDC-Data-Tutorials
1. Tutorial Overview¶
This tutorial is designed for the “DAAC data access in the cloud hands-on experience” session at the 2023 NSIDC DAAC User Working Group (UWG) Meeting.
The NSIDC DAAC archives and distributes Daily and Monthly Gridded Sea Ice Freeboard (ATL20) and Polar Sea Surface Height Anomaly (ATL21) data sets from the ICESat-2 Mission, both derived from the lower-level ATL10 data set. However, we may want these lower-level point data gridded and averaged at a weekly cadence, or with a different projection or other gridding parameters.
This tutorial session is in two parts:
We will first guide you through this Jupyter Notebook running in the AWS us-west-2 region, where data are hosted in the NASA Earthdata Cloud. The notebook utilizes several libraries to performantly search, access, read, and grid the data, including earthaccess, h5coro, and geopandas.
This notebook will focus on the Ross Sea, Antarctica. But let’s say we want to scale this analysis to the entire continent. In the second portion, we will present how to scale and run this same workflow from a script (see workflow.py in the h5cloud folder within this notebook’s directory) that can be run from your laptop, using Coiled.
Credits¶
The notebook was created by Andy Barrett and Luis Lopez of NSIDC.
For questions regarding the notebook, or to report problems, please create a new issue in the NSIDC-Data-Tutorials repo.
Learning Objectives¶
By the end of this demonstration you will be able to:
- Use earthaccess to authenticate with Earthdata Login, search for ICESat-2 data using spatial and temporal filters, and directly access files in the cloud.
- Open data granules using h5coro to efficiently read HDF5 data from the NSIDC DAAC S3 bucket.
- Load data into a geopandas.DataFrame containing geodetic coordinates, ancillary variables, and date/time converted from ATLAS Epoch.
- Grid track data to an EASE-Grid v2 6.25 km projected grid using drop-in-the-bucket resampling.
- Calculate mean statistics and assign aggregated data to grid cells.
- Visualize aggregated sea ice height data on a map.
Prerequisites¶
- We are running this notebook in the CryoCloud JupyterHub. For more information, see the CryoCloud Getting Started documentation. It is advised that you use at least a 16GB instance for this notebook.
- An Earthdata Login is required for data access. If you don’t have one, you can register for one here.
- It is recommended that you create a .netrc file, stored in your home directory, that contains your Earthdata Login credentials (an example entry is shown below). If you do not have a .netrc file, earthaccess will prompt you to enter your Earthdata Login username and password.
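A .netrc entry for Earthdata Login looks like the single line below; replace the placeholder username and password with your own credentials:
machine urs.earthdata.nasa.gov login your_username password your_password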
Example of end product¶
At the end of this tutorial, the following figure will be generated, demonstrating a year’s worth of ATL10 Sea Ice Freeboard height data gridded over the Ross Sea, Antarctica:
Time requirement¶
Allow approximately 40 minutes to complete this tutorial.
2. Tutorial steps¶
Installing the latest version of earthaccess¶
The CryoCloud environment does not currently include the latest earthaccess version, and we also need new features in h5coro that are not yet in a released version, so we first install both manually below:
%%capture
# suppress install outputs
!pip uninstall -y earthaccess h5coro
!pip install earthaccess==0.6.1
# h5coro has new features that we need that are not released
!pip install git+https://github.com/ICESat2-SlideRule/h5coro.git@main
NOTE: Restart the kernel and clean output after running the cell above.
Import Packages¶
# To force use of shapely
import os
os.environ['USE_PYGEOS'] = '0'
# For searching NASA data
import earthaccess
# For reading data, analysis and plotting
import numpy as np
import pandas as pd
# For resampling
from affine import Affine
# For plotting
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from h5cloud.read_atl10 import read_atl10
print(f"earthaccess: {earthaccess.__version__}")
Authenticate¶
We need to authenticate with Earthdata Login so that we can later obtain temporary AWS credentials for data access.
auth = earthaccess.login()
Search for ICESat-2 ATL10 data¶
We use earthaccess to search CMR for granules in the region and time period of interest.
The region is set by name below. Currently, we have two options: the Ross Sea, and the Southern Ocean and adjoining seas.
The range of dates is set by assigning a start year and end year to year_begin and year_end. Setting year_begin and year_end to the same year retrieves data for one year.
# To avoid copying and pasting region tuples
region = "Ross Sea" # Set region to "Ross Sea" for just Ross Sea or "Antarctica" for southern ocean
ross_sea = (-180, -78, -160, -74)
antarctic = (-180, -90, 180, -60)
this_region = antarctic if region == "Antarctica" else ross_sea
year_begin = 2019
year_end = 2019
atl10 = {}
total_results = 0
approx_size = 0
for year in range(year_begin, year_end+1):
    print(f"Searching year {year} ...")
    granules = earthaccess.search_data(
        short_name = 'ATL10',
        version = '006',
        cloud_hosted = True,
        bounding_box = this_region,
        temporal = (f'{year}-09-01', f'{year}-09-30'),
    )
    total_results += len(granules)
    approx_size += sum([g.size() for g in granules])
    atl10[str(year)] = granules

print(f"Total retrieved: {total_results}, approx size: {round(approx_size, 2)} MB")
Access the Granules¶
Because the CryoCloud hub is running on servers in the AWS us-west-2 region, which is the same region as the NASA Earthdata Cloud, granules can be accessed directly without having to download the files first. This is analogous to how you would work with files on your local filesystem; however, under the hood there are differences.
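To make the difference concrete, here is a small illustrative snippet (not part of the original workflow) showing the two kinds of links earthaccess exposes for a granule. The direct s3:// links only work from compute running in us-west-2, while the HTTPS links work from anywhere:
# One granule from the 2019 search results
g = atl10["2019"][0]

# HTTPS URL, usable from anywhere (data are transferred over the public internet)
print(g.data_links(access="out_of_region")[0])

# S3 URL, usable only from compute running in AWS us-west-2
print(g.data_links(access="direct")[0])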
Initially, we load data for each year into a geopandas.DataFrame. geopandas is an extension of the pandas package. pandas is designed to work with tabular data - think data you might put into a spreadsheet. geopandas extends pandas to work with geospatial data by adding geometries (points, lines and polygons) and a coordinate reference system (CRS), so that the data in each row are associated with a geospatial feature located on Earth. ICESat-2 track data are well suited to the DataFrame data model because data are related to points or segments. Once data are in a geopandas.DataFrame, they can be reprojected and queried using methods you may be used to using in a GIS.
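As a minimal illustration of this data model (a made-up example, not ATL10 data), a GeoDataFrame can be built from longitude/latitude columns and then reprojected:
import pandas as pd
import geopandas as gpd

# Toy table with geodetic coordinates and one attribute
toy = pd.DataFrame({"longitude": [-170.0, -165.0], "latitude": [-76.0, -75.0], "freeboard": [0.25, 0.31]})

# Attach point geometries and a geographic CRS
toy_gdf = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy.longitude, toy.latitude),
    crs="EPSG:4326",
)

# Reproject to EASE-Grid 2.0 South; coordinates become x and y in meters
print(toy_gdf.to_crs(6932).geometry)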
Read data into geopandas.DataFrame¶
The first step is to read the data and put it into a DataFrame. We use h5coro, which is a package developed by the SlideRule project to efficiently read HDF5 files in the cloud. Recall from the Cloud Optimized Format presentation that the HDF5 format and the HDF5 library for reading and writing those files are not well suited to accessing data in the cloud. h5coro was developed to solve some of the problems related to the HDF5 format and tools. Using h5coro with dask, a Python package for parallel processing on multicore local machines and distributed clusters in the cloud, reading data from ATL10 files is 5x faster than using the h5py package, an HDF5 reader that uses the HDF5 library.
The code to read the data is long, so we have created the read_atl10 function and put it in a module. The function is imported into this notebook. If you are interested, take a look at read_atl10 in read_atl10.py. The main features of the function are briefly described here.
We follow the processing steps for ATL20 to generate our freeboard grids. For each grid cell that contains one or more freeboard segments, a grid cell mean freeboard is calculated as a mean of gtx/freeboard_segment/beam_fb_height from ATL10, weighted by segment length gtx/freeboard_segment/heights/height_segment_length_seg. To resample segments to grid cells, we also need the geodetic coordinates for each segment in gtx/freeboard_segment/latitude and gtx/freeboard_segment/longitude. As an additional locator, we also read gtx/freeboard_segment/delta_time. gtx is the beam number.
In addition to the segment data, we also need some ancillary data from each file. In ATL20, gridded freeboards are calculated using only the strong beams of each beam pair. Which of the six beams are strong and which are weak depends on the orientation of the ICESat-2 satellite. Satellite orientation is given in the orbit_info/sc_orient dataset. We also need to read the ATLAS Standard Data Product Epoch that is stored in ancillary_data/atlas_sdp_gps_epoch to convert delta_time from seconds since launch to date and time.
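As a rough sketch (not the read_atl10 implementation) of how these two ancillary values are typically used: sc_orient selects the strong beams, and atlas_sdp_gps_epoch shifts delta_time onto the GPS epoch so it can be converted to a datetime. The beam mapping shown is the commonly used one (forward orientation selects the right beams), and the simple time conversion ignores leap seconds, so times can be off by a few tens of seconds:
# Strong beams depend on spacecraft orientation: 0 = backward, 1 = forward
# (sc_orient = 2 indicates a yaw-flip transition and is usually skipped)
def strong_beams(sc_orient):
    """Return the names of the three strong beams for a given orientation."""
    return ["gt1r", "gt2r", "gt3r"] if sc_orient == 1 else ["gt1l", "gt2l", "gt3l"]

# Convert ATLAS delta_time (seconds since the ATLAS SDP epoch) to datetimes.
# atlas_sdp_gps_epoch is the SDP epoch expressed as seconds since the GPS epoch (1980-01-06).
def delta_time_to_datetime(delta_time, atlas_sdp_gps_epoch):
    return pd.to_datetime(atlas_sdp_gps_epoch + delta_time, unit="s", origin=pd.Timestamp("1980-01-06"))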
The datasets containing segment data are stored in the DATASETS constant, which is a Python list, in reader.py. If you want additional or different datasets, you can modify this list. See NSIDC DAAC’s ATL10 User Guide and ATL10 Data Dictionary for detailed descriptions.
An ATL10 file is read using the function read_atl10. This function encapsulates opening an HDF5 file and reading the datasets using h5coro, and then creating a geopandas.DataFrame containing the data. We parallelize the reading of all files in a year using pqdm, so files are read using different processors. Files for a given year are then concatenated into a single dataframe.
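The parallel-read pattern looks roughly like the sketch below; read_one_granule is a hypothetical stand-in for the per-file logic inside read_atl10, not the actual function:
from pqdm.processes import pqdm
import pandas as pd

def read_one_granule(file):
    """Hypothetical stand-in for the per-file reader inside read_atl10."""
    # In the real workflow this opens the HDF5 file with h5coro and returns a GeoDataFrame.
    return pd.DataFrame({"file": [file]})

# Read files in parallel on 4 worker processes, then concatenate the per-file results
per_file = pqdm(["file_a.h5", "file_b.h5"], read_one_granule, n_jobs=4)
tracks_year = pd.concat(per_file, ignore_index=True)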
%%time
env = "cloud"  # or "local"
if env == "local":
    # HTTPS links for access from outside the us-west-2 region
    files = [g.data_links(access="out_of_region")[0] for g in atl10["2019"]]
    cred = auth.token['access_token']
else:
    # Direct S3 access from within us-west-2
    files = [g.data_links(access="direct")[0].replace("s3://", "") for g in atl10["2019"]]
    aws_credentials = earthaccess.get_s3_credentials("NSIDC")
    cred = {
        "aws_access_key_id": aws_credentials["accessKeyId"],
        "aws_secret_access_key": aws_credentials["secretAccessKey"],
        "aws_session_token": aws_credentials["sessionToken"]
    }

tracks = read_atl10(files, executors=4, environment=env, credentials=cred)
tracks
Grid the track data¶
The resampling and calculation of statistics follows the processing steps described in the ATL20 - Gridded Sea Ice Freeboard - ATBD, but gridding to an EASE-Grid v2 projected grid. Any projected coordinate system or grid could be chosen, and the procedure could be extended with extra QC steps or other modifications. The world is your oyster - or Aplacophoran.
The processing steps are:
- remove non-ice and low quality segments
- resample freeboard segments to a grid
- calculate aggregate statistics
- mean segment length
- segment count
- length weighted mean freeboard
- length weighted standard deviation of freeboard
Resample Freeboard Segments to a Grid¶
Following the ATL20 ATBD, we will use a drop-in-the-bucket resampling scheme. This is simple and relatively easy to implement. More complex resampling schemes could be substituted.
To demonstrate resampling, we will resample freeboard segments to the WGS84 / NSIDC EASE-Grid v2.0 South grid. The EPSG code for the WGS84 / NSIDC EASE-Grid v2.0 South coordinate reference system is 6932.
To define the grid, we need the grid dimensions (nrow and ncol), the x and y projected coordinates of the upper-left corner of the upper-left grid cell, and the height and width of the grid cells in the same units as the projected coordinates - in this case, meters. The parameters for the standard 6.25 km grid are included below but commented out; the active parameters define a 10 km grid covering just the Ross Sea (these values are computed from an approximate GeoJSON outline of the Ross Sea near the end of this notebook).
easegrid2_epsg = 6932
# nrow = 2880
# ncol = 2880
# upper_left_x = -9000000.0
# upper_left_y = 9000000.0
# width = 6250.0
# height = -6250.0
nrow = 151
ncol = 147
width = 10000.0
height = -10000.0
upper_left_x = -1040000.0
upper_left_y = -560000.0
map_extent = [upper_left_x, (upper_left_x + (ncol*width)), (upper_left_y + (nrow*height)), upper_left_y]
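As a quick sanity check on these parameters (this snippet is illustrative and not part of the original workflow), we can compute the projected coordinates of the grid cell centers and confirm they span the intended extent:
# Projected x and y coordinates of grid cell centers, in meters
x_centers = upper_left_x + (np.arange(ncol) + 0.5) * width
y_centers = upper_left_y + (np.arange(nrow) + 0.5) * height

print(x_centers[0], x_centers[-1])  # left-most and right-most cell centers
print(y_centers[0], y_centers[-1])  # top-most and bottom-most cell centers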
The first step is to reproject the points from geodetic coordinates (latitude and longitude) to projected coordinates (x, y). Because the data are in a geopandas.DataFrame, we can use the to_crs method. This takes an EPSG code either as a numeric value (6932) or as a string ("EPSG:6932").
You can see that the POINT objects in the geometry column have changed from having latitudes and longitudes as coordinates to x and y in meters.
%%time
tracks = tracks.to_crs(easegrid2_epsg)
tracks.head()
A Drop-in-the-Bucket resampling scheme collects points into the grid cells that they intersect with, and then calculates aggregate statistics for each grid cell using attributes associated with those points.
We’ll find the grid cell that contains each segment by calculating the row and column coordinates for each segment from the projected coordinates. This is done by creating an Affine transformation matrix for the grid. The Affine matrix is just a matrix representation of the algebraic expressions to convert row and column indices of the grid to projected coordinates. The equations below give the forward transformation from (row, col) to (x, y).

$$x = upper\_left\_x + col \times width$$
$$y = upper\_left\_y + row \times height$$

These are expressed in matrix form:

$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} col \\ row \\ 1 \end{pmatrix}$$

where $a$ is width, $e$ is height, $c$ is upper_left_x, and $f$ is upper_left_y ($b$ and $d$ are zero because the grid is not rotated).

We use the affine package to create a forward transformation matrix (fwd) using the grid parameters above. To transform (x, y) projected coordinates to (row, col), we can calculate the reverse transformation matrix using ~fwd.
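A small illustrative check of the forward and inverse transforms for this grid (not part of the original notebook):
# Build the forward transform from the GDAL-style geotransform
geotransform = (upper_left_x, width, 0., upper_left_y, 0., height)
fwd = Affine.from_gdal(*geotransform)

# Centre of the upper-left grid cell is at fractional (col, row) = (0.5, 0.5)
x, y = fwd * (0.5, 0.5)
print(x, y)        # expected: upper_left_x + 0.5*width, upper_left_y + 0.5*height

# The inverse transform recovers fractional (col, row) from projected coordinates
col, row = ~fwd * (x, y)
print(col, row)    # expected: 0.5, 0.5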
The (row, col) coordinates calculated this way are still rational numbers, but we want integer row and column indices for grid cells. We can use the floor function to get integers; row and column indices are zero-based.
We want to be able to leverage the geopandas.DataFrame.groupby functionality to collect points into grid cells, so we need a unique identifier to group the data. We can calculate a unique cell index from the row and column indices as follows:

$$grid\_index = (row \times ncol) + col$$

This is encapsulated in the function get_grid_index. This function is then applied to the geometry of tracks.
def get_grid_index(xy):
    geotransform = (upper_left_x, width, 0., upper_left_y, 0., height)
    fwd = Affine.from_gdal(*geotransform)
    col, row = ~fwd * xy
    return (np.floor(row) * ncol) + np.floor(col)
%%time
tracks["grid_index"] = [get_grid_index((x, y)) for x, y in zip(tracks.geometry.x, tracks.geometry.y)]
tracks.head()
Calculate grid cell mean statistics¶
We calculate four statistics for grid cells that contain segments.
Grid Cell Mean Segment Length $\bar{L}(x,y,D)$¶

$$\bar{L}(x,y,D) = \frac{1}{N}\sum_{i=1}^{N} L_i$$

where $L_i$ is /gtx/freeboard_beam_segment/height_segments/height_segment_length_seg, $(x, y)$ are projected coordinates for grid centers, and $D$ is day.

Grid Cell Mean Freeboard $\bar{h}_f(x,y,D)$¶

$$\bar{h}_f(x,y,D) = \frac{\sum_{i=1}^{N} L_i h_{f,i}}{\sum_{i=1}^{N} L_i}$$

where $h_{f,i}$ is gtx/freeboard_beam_segment/beam_freeboard/beam_fb_height.

Grid Cell Standard Deviation of Freeboard $\sigma^2_{h_f}(x,y,D)$¶

$$\sigma^2_{h_f}(x,y,D) = \frac{\sum_{i=1}^{N} L_i h_{f,i}^2}{\sum_{i=1}^{N} L_i} - \bar{h}_f^2(x,y,D)$$
The functions to calculate these statistics are given below. These functions are applied to the grouped data. The geopandas apply method only accepts a single function when operating on multiple columns in a dataframe. We could just make a separate apply call for each aggregating function; instead, we collect the individual aggregating functions into a single function and pass that to the apply method.
def mean_segment_length(df):
    """Returns mean segment length"""
    return df["height_segment_length_seg"].mean()

def mean_freeboard(df):
    """Returns length weighted mean freeboard"""
    return (df.beam_fb_height * df.height_segment_length_seg).sum() / df.height_segment_length_seg.sum()

def stdev_freeboard(df):
    """Returns weighted standard deviation of freeboard"""
    hmean = mean_freeboard(df)
    # Length-weighted mean of squared freeboard minus the square of the weighted mean
    # (no square root is taken here, matching the ATL20 ATBD formulation above)
    stdev = (df.beam_fb_height**2 * df.height_segment_length_seg).sum() / df.height_segment_length_seg.sum()
    return stdev - hmean**2

def count_segments(df):
    """Number of segments in grid cell"""
    return df.beam_fb_height.count()

def all_funcs(x):
    """Wrapper that allows all the aggregation functions to be applied at once"""
    funcs = {
        mean_segment_length.__name__: mean_segment_length(x),  # __name__ gets the name of a function
        mean_freeboard.__name__: mean_freeboard(x),
        stdev_freeboard.__name__: stdev_freeboard(x),
        count_segments.__name__: count_segments(x),
    }
    # `apply` is expected to return a Series or a scalar, so we collect the results
    # into a Series indexed by aggregating function name
    return pd.Series(funcs, index=funcs.keys())
Testing the functions¶
It is always a good idea to test your code. Below are some test data and expected results. The functions are tested on test_df. We then use pandas.testing.assert_frame_equal to check that the result and expected dataframes are the same. In this case we are only interested in getting the same values, so we do not check the names or datatypes.
test_df = pd.DataFrame(
    {
        'grid_index': [1, 1, 1, 2, 2, 2, 2],
        "height_segment_length_seg": [1.2, 1.1, 0.7, 2.3, 1.5, .9, 1.],
        "beam_fb_height": [0., 0.2, 0.5, 1.1, 2., .9, 1.5],
    }
)

expected = pd.DataFrame(
    {
        "mean_segment_length": [1.0, 1.425],
        "mean_freeboard": [0.19000000000000003, 1.375438596491228],
        "stdev_freeboard": [0.03689999999999998, 0.17167743921206569],
        "count_segments": [3, 4],
    },
    index = [1, 2]
)
result = test_df.groupby("grid_index").apply(all_funcs)
result
pd.testing.assert_frame_equal(expected, result, check_names=False, check_dtype=False)
Now that we have functions that work we can apply them to the real data.
%%time
aggregated_data = tracks.groupby("grid_index").apply(all_funcs)
aggregated_data
Assign aggregated data to grid cells¶
We now have a dataframe that contains grid cell statistics indexed by a unique array index. We can now create a grid for each of these statistics.
The procedure is relatively straightforward:
- Create a 1D array with the same number of elements as cells in our grid.
- Use the grid_index of the dataframe as an array index to assign values to the grid cells where we have data.
- Reshape the 1D array to the dimensions of the grid.
We can encapsulate this in a series_to_grid function.
def series_to_grid(series, nrow, ncol):
    """Converts a pandas.Series to a grid, using the series index as a flat grid index"""
    # Keep only indices that fall inside the grid
    these_points = (series.index >= 0) & (series.index < (nrow * ncol))
    array_index = series[these_points].index.values.astype(int)  # the array index must be an integer
    vector = np.full(nrow * ncol, np.nan)
    vector[array_index] = series[these_points]
    return vector.reshape(nrow, ncol)
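To see the pattern in isolation, here is a tiny, made-up example of assigning values into a flat array by index and reshaping it to a 2-by-3 grid:
# Toy example of the assign-then-reshape pattern used in series_to_grid
toy_series = pd.Series([10.0, 20.0], index=[1, 4])   # values for flat cell indices 1 and 4
toy_vector = np.full(2 * 3, np.nan)                  # one element per grid cell
toy_vector[toy_series.index.values.astype(int)] = toy_series
print(toy_vector.reshape(2, 3))
# [[nan 10. nan]
#  [nan 20. nan]]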
%%time
grids = {name: series_to_grid(values, nrow, ncol) for name, values in aggregated_data.items()}
%matplotlib widget
plt.imshow(grids['count_segments'], interpolation='none')
plt.show()
Plot data on a map¶
plt.close("all")

# Plot the gridded segment counts on a south polar Lambert Azimuthal Equal-Area map
# proj = EASEGrid2South()  # alternative custom projection, defined (commented out) at the end of this notebook
proj = ccrs.LambertAzimuthalEqualArea(central_latitude=-90)

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection=proj)
ax.set_extent(map_extent, proj)
ax.add_feature(cfeature.LAND)
ax.coastlines()
plt.imshow(grids['count_segments'], interpolation='none', extent=map_extent)
# Import GeoJSON of Ross Sea - this is very approximate
import geopandas as gpd
ross_sea_gdf = gpd.read_file("ross_sea.json")
bounds = ross_sea_gdf.to_crs(easegrid2_epsg).bounds.values
# Calculate parameters for a grid with resolution that covers region
resolution = 10000.
# Floor the minimum bounds and ceil the maximum bounds to whole multiples of the grid resolution
minx, miny, maxx, maxy = [func(bound / resolution) * resolution for bound, func in zip(bounds[0], [np.floor, np.floor, np.ceil, np.ceil])]
grid_extent_x = maxx - minx
grid_extent_y = maxy - miny
width = height = resolution
ncol = grid_extent_x / width
nrow = grid_extent_y / height
upper_left_x = minx
upper_left_y = maxy
print(f"nrow = {int(nrow)}")
print(f"ncol = {int(ncol)}")
print(f"width = {width}")
print(f"height = -{height}")
print(f"upper_left_x = {upper_left_x}")
print(f"upper_left_y = {upper_left_y}")
# from cartopy.crs import AzimuthalEquidistant
# class EASEGrid2South(AzimuthalEquidistant):
#     def __init__(self):
#         super(EASEGrid2South, self).__init__(central_longitude=0.0, central_latitude=-90.0,
#                                              false_easting=0.0, false_northing=0.0,
#                                              globe=None)
#         self._bounds = [-9000000.0, -9000000.0, 9000000.0, 9000000.0]
#         self._x_limits = self._bounds[0], self._bounds[2]
#         self._y_limits = self._bounds[1], self._bounds[3]
#     @property
#     def bounds(self):
#         return self._bounds
#     @property
#     def threshold(self):
#         return 1e5
#     @property
#     def x_limits(self):
#         return self._x_limits
#     @property
#     def y_limits(self):
#         return self._y_limits