# ICESat-2 ATL10-h5coro large-scale time series

## Contents

# ICESat-2 ATL10-h5coro large-scale time series#

imported on: **2023-10-03**

This notebook is from NSIDC's Processing Large-scale Time Series of ICESat-2 Sea Ice Height in the Cloud tutorial

The original source for this document is https://github.com/nsidc/NSIDC-Data-Tutorials

**Processing Large-scale Time Series of ICESat-2 Sea Ice Height in the Cloud**#

**1. Tutorial Overview**#

This tutorial is designed for the “DAAC data access in the cloud hands-on experience” session at the 2023 NSIDC DAAC User Working Group (UWG) Meeting.

The NSIDC DAAC archives and distributes Daily and Monthly Gridded Sea Ice Freeboard (ATL20) and Polar Sea Surface Height Anomaly (ATL21) data sets from the ICESat-2 Mission, derived from the lower level ATL10 data set. However, we may want these lower level point data to be gridded and averaged at a weekly cadence, or using a different projection or other gridding parameters.

This tutorial session is in two parts:

We will first guide you through this Jupyter Notebook running in the AWS

`us-west-2`

region, where data are hosted in the NASA Earthdata Cloud. The notebook utilizes several libraries to performantly search, access, read, and grid the data including`earthaccess`

,`h5coro`

, and`geopandas`

.This notebook will focus on the Ross Sea, Antarctica. But let’s say we want to scale this analysis to the entire continent. In the second portion, we will present how to scale and run this same workflow from a script (see workflow.py in the

`h5cloud`

folder within this notebook’s directory) that can be run from your laptop, using Coiled.

**Credits**#

The notebook was created by Andy Barrett and Luis Lopez of NSIDC.

For questions regarding the notebook, or to report problems, please create a new issue in the NSIDC-Data-Tutorials repo.

**Learning Objectives**#

By the end of this demonstration you will be able to:

Use

`earthaccess`

to authenticate with Earthdata Login, search for ICESat-2 data using spatial and temporal filters, and directly access files in the cloud.Open data granules using

`h5coro`

to efficiently read HDF5 data from the NSIDC DAAC S3 bucket.Load data into a geopandas.DataFrame containing geodetic coordinates, ancillary variables, and date/time converted from ATLAS Epoch.

Grid track data to EASE-Grid v2 6.25 km projected grid using drop-in-the-bucket resampling.

Calculate mean statistics and assign aggregated data to grid cells.

Visualize aggregated sea ice height data on a map.

**Prerequisites**#

We are running this notebook in the CryoCloud JupyterHub. For more information, see the CryoCloud Getting Started documentation.

**It is advised that you use at least a 16GB instance for this notebook.**An Earthdata Login is required for data access. If you don’t have one, you can register for one here.

It is recommended that you create a .netrc file that contains your Earthdata Login credentials, stored in your home directory. If you do not have a .netrc file,

`earthaccess`

will prompt you to enter your Earthdata Login username and password.

**Example of end product**#

At the end of this tutorial, the following figure will be generated, demonstrating a year’s worth of ATL10 Sea Ice Freeboard height data gridded over the Ross Sea, Antarctica:

**Time requirement**#

Allow approximately 40 minutes to complete this tutorial.

**2. Tutorial steps**#

### Installing the latest version of earthaccess#

The CryoCloud environment currently does not have the latest `earthaccess`

version installed, along with new features in `h5coro`

that are not yet released, so we will first manually install those below:

```
%%capture
# suppress install outputs
!pip uninstall -y earthaccess h5coro
!pip install earthaccess==0.6.1
# h5coro has new features that we need that are not released
!pip install git+https://github.com/ICESat2-SlideRule/h5coro.git@main
```

**NOTE**: Restart the kernel and clean output after running the cell above.

**Import Packages**#

```
# To force use of shapely
import os
os.environ['USE_PYGEOS'] = '0'
# For searching NASA data
import earthaccess
# For reading data, analysis and plotting
import numpy as np
import pandas as pd
import datetime as dt
# For resampling
from affine import Affine
# For plotting
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from h5cloud.read_atl10 import read_atl10, get_data_links, get_credentials
print(f"earthaccess: {earthaccess.__version__}")
```

### Authenticate#

We need to authenticate and get AWS token

```
auth = earthaccess.login()
```

**Search for ICESat-2 ATL10 data**#

We use `earthaccess`

to search CMR for granules in the region of interest for the time period of interest.

The region is set by name below. Currently, we have two options: the Ross Sea, and the Southern Ocean and adjoining seas.

The range of dates is set by assigning a start year and end year to `year_begin`

and `year_end`

. Setting `year_begin`

and `year_end`

to the same year retreives data for one year.

```
# To avoid copying and pasting region tuples
region = "Ross Sea" #"Ross Sea" # Set region to "Ross Sea" for just Ross Sea or "Antarctica" for southern ocean
ross_sea = (-180, -78, -160, -74)
antarctic = (-180, -90, 180, -60)
this_region = antarctic if region == "Antarctica" else ross_sea
year_begin = 2019
year_end = 2019
month = 9
```

```
atl10 = {}
total_results = 0
approx_size = 0
for year in range(year_begin, year_end+1):
date_beg = dt.datetime(year, month, 1).strftime("%Y-%m-%d")
date_end = dt.datetime(year, month, 30).strftime("%Y-%m-%d")
print(f"Searching period {date_beg} to {date_end} ...")
granules = earthaccess.search_data(
short_name = 'ATL10',
version = '006',
cloud_hosted = True,
bounding_box = this_region,
temporal = (date_beg, date_end) #(f'{year}-09-01',f'{year}-09-30'),
)
total_results += len(granules)
approx_size += sum([g.size() for g in granules])
atl10[str(year)] = granules
print(f"Total retrieved: {total_results}, approx size: {round(approx_size, 2)} MB")
```

### Access the Granules#

Because the CryoCloud is hub is running on servers in AWS region `us-west-2`

, which is the same region as the NASA Earthdata Cloud, granules can be accessed directly without having to download the files first. This is analogous to how you would work with files on your local filesystem. However, *under the hood* there are differences.

Initially, we load data for each year into a `geopandas.DataFrame`

. `geopandas`

is an extension of the `pandas`

package. `pandas`

is designed to work with `tabular`

data - *think data you might put into a spreadsheet*. `geopandas`

, extends `pandas`

to work with geospatial data by adding geometries (points, lines and polygons) and a coordinate reference system (CRS), so that data in each row is associated with a geospatial feature located on Earth. ICESat-2 track data is well suited to the DataFrame data model because data are related to points or segments. Once data is in a `geopandas.DataFrame`

, the data can be reprojected and queried using methods you may be used to using in a GIS.

#### Read data into `geopandas.DataFrame`

#

The first step is to read the data and put it into a Dataframe. We use `h5coro`

, which is a package developed by the SlideRule project to efficiently read HDF5 files in the cloud. Recall from the Cloud Optimized Format presentation, the HDF5 format and the HDF5 library for reading and writing those files are not well suited to accessing data in the cloud. `h5coro`

was developed to solve some of the problems related to HDF5 format and tools. Using `h5coro`

with `dask`

, a python package for parallel processing on multicore local machines and distributed cluster in the cloud, reading data from ATL10 files is 5x faster than using the `h5py`

package, an HDF5 reader that uses the HDF5 library.

The code to read the data is long, so we have created the `read_atl10`

function and put it in a module. The function is imported into this notebook. If your are interested, take a look at `read_atl10`

in `read_atl10.py`

. The main features of the function are briefly described here.

We follow the processing steps for ATL20 to generate our freeboard grids. For each grid cell that contain one or more freeboard segments, a grid cell mean freeboard is calculated as a mean of `gtx/freeboard_segment/beam_fb_height`

from ATL10, weighted by segment length `gtx/freeboard_segment/heights/height_segment_length_seg`

. To resample segments to grid cells, we also need the geodetic coordinates for each segment in `gtx/freeboard_segment/latitude`

and `gtx/freeboard_segment/longitude`

. As an additional locator, we also read `gtx/freeboard_segment/delta_time`

. `gtx`

is the beam number.

In addition to the segment data, we also need some ancillary data from each file. In ATL20 gridded freeboards are calculated using only the *strong beams* of each beam pair. Which of the six beams are strong and which are weak depends on the orientation of the ICESat-2 satellite. Satellite orientation is given in the `orbit_info/sc_orient`

dataset. We also need to read the Atlas Standard Data Product Epoch that is stored in `ancillary_data/atlas_sdp_gps_epoch`

to convert `delta_time`

from seconds since launch to date and time.

Note

There are three beam pairs numbered 1, 2 and 3. Each of these beam pairs has a left and right beam. Beams are numbered `gt1l`

and `gt1r`

, `gt2l`

and `gt2r`

, and `gt3l`

and `gt3r`

. Depending on the orientation of the ICESat-2 satellite, left beams or right beams are the *strong beams*. The orientation can be *forward* or *backward*, or *transition*. We only use data in forward or backward orientations.

The datasets containing segment data are stored in the `DATASETS`

constant, which is a python `list`

, in `reader.py`

. If you want additional or different datasets, you can modify this list. See NSIDC DAAC’s ATL10 User Guide and ATL10 Data Dictionary for detailed descriptions.

A ATL10 file is read using the function `read_atl10`

. This function encapsulates opening an HDF5 file and reading the datasets using `h5coro`

, and then creating a `geopandas.DataFrame`

containing the data. We parallelize the reading of all files in a year using `pqdm`

, so files are read using different processors. File for a given year are then concatenated into a single dataframe.

```
%%time
files = get_data_links(atl10["2019"], environment="cloud")
cred = get_credentials(environment="cloud")
tracks = read_atl10(files, executors=4, environment="cloud", credentials=cred)
```

```
tracks
```

## Grid the track data#

The resampling and calculation of statistics follows the processing steps described in the ATL20 - Gridded Sea Ice Freeboard - ATBD but gridding to a EASE-Grid v2 6.25 km grid. Any projected coordinate system or grid could be chosen. The procedure could be modified with extra QC steps or modifications. **The world is your oyster - or Aplacophoran**.

The processing steps are:

remove non-ice and low quality segments

resample freeboard segments to a grid

calculate aggregate statistics

mean segment length

segment count

length weighted mean freeboard

length weighted standard deviation of freeboard

### Resample Freeboard Segments to a Grid#

Following the ATL20 ATBD, we will use a *drop-in-the-bucket* resampling scheme. This is simple and relatively easy to implement. More complex resampling schemes could be substituted.

To demonstrate resampling we will resample freeboard segments to WGS84 / NSIDC EASE-Grid v2.0 South with a grid resolution of 6.25 km. The EPSG code for the WGS84 / NSIDC EASE-Grid South coordinate reference system is 6932.

We will use the standard 6.25 km grid. To define the grid, we need the grid dimensions (nrows and ncols), the x and y projected coordinates of the upper-left corner of the upper-left grid cell, and the height and width of the grid cells in the same units as the projected coordinates. In this case, the units are meters.

```
easegrid2_epsg = 6932
# Parameters for standard NSIDC EASE Grid v2.0 South 6.25 km grid
# nrow = 2880
# ncol = 2880
# upper_left_x = -9000000.0
# upper_left_y = 9000000.0
# width = 6250.0
# height = -6250.0
# Parameters for a 10 km grid over Ross Sea Region
nrow = 151
ncol = 147
width = 10000.0
height = -10000.0
upper_left_x = -1040000.0
upper_left_y = -560000.0
map_extent = [upper_left_x, (upper_left_x + (ncol*width)), (upper_left_y + (nrow*height)), upper_left_y]
```

The first step is to reproject the points from geodetic coordinates (latitude and longitude) to projected coordinates (x, y). Because the data are in a `geopandas.DataFrame`

we can use the `to_crs`

method. This takes an EPSG code either as a numeric value (`6932`

) or as a string (`"EPSG:6932"`

).

You can see that the `POINT`

objects in the `geometry`

have changed from having latitudes and longitudes as coordinates to x and y in meters.

```
%%time
tracks = tracks.to_crs(easegrid2_epsg)
tracks.head()
```

A *Drop-in-the-Bucket* resampling scheme collects points into the grid cells that they intersect with, and then calculates aggregate statistics for each grid cell using attributes associated with those points.

We’ll find the grid cell that contains each segment by calculating the row and column coordinates for each segment from the projected coordinates. This is done by creating an *Affine* transformation matrix for the grid. The Affine matrix is just a matrix representation of the algebraic expressions to convert row and column indices of the grid to projected coordinates. The equations below give the forward transformation from `(row, col)`

to `(x, y)`

.

These are expressed in matrix form:

where \(a\) is \(\mathsf{width}\), \(c\) is \(\mathsf{upper\_left\_x}\), \(d\) is \(height\), and \(e\) is \(upper\_left\_y\).

Note

The projected coordinate system we are using is a cartesian plane with the origin at the South Pole. The `x`

coordinates increase to the right, and `y`

coordinates increase up. For raster data, which includes grids and images, have the origin at the upper-left corner of the grid. Column indices increase from right to left, and row indices increase from top to bottom.

We use the `affine`

package to create a forward transformation matrix (`fwd`

) using the grid parameters above. To transform `(x, y)`

projected coordinates to `(row, col)`

, we can calculate the reverse transformation matrix using `~fwd`

.

```
fwd = Affine(width, 0., upper_left_x, 0., height, upper_left_y)
```

`(row, col)`

coordinates are still rational numbers. We want an integer row and column indices for grid cells. We can use the `floor`

function to get integers. `row`

and `column`

indices are zero based.

We want to be able to leverage the `geopandas.Dataframe.groupby`

functionality to collect points into grid cells, so we need a unique identifier to group the data. We can calculate a unique cell index from `row`

and `column`

indices as follows:

This is encapsulated in the function `get_grid_index`

. This function is then applied to the `geometry`

of tracks.

```
def ingrid(j, i, ncol, nrow):
"""Returns True if raster coordinates fall within grid"""
return (j >= 0) & (j < ncol) & (i >= 0) & (i < nrow)
def get_grid_index(xy, shape, transform):
"""Returns array index for a map coordinate pair
xy : tuple with x and y map coordinates
shape : list-like containing raster shape (nrow, ncol), where
nrow is number of rows in grid and ncol is number of
columns in grid
transform : Affine transformation matrix to transform
raster coordinates to map coordinates
"""
nrow, ncol = shape
j, i = transform * xy
j, i = np.floor(j).astype(int), np.floor(i).astype(int)
if ingrid(j, i, ncol, nrow):
return (i * ncol) + j
return -1
```

```
%%time
tracks["grid_index"] = [get_grid_index((x, y), (nrow, ncol), ~fwd) for x, y in zip(tracks.geometry.x, tracks.geometry.y)]
tracks.head()
```

`get_grid_index`

returns `-1`

if a point is outside the grid extent, so we need to filter out points that have a `grid_index`

of `-1`

.

```
tracks = tracks[tracks.grid_index > -1]
```

### Calculate grid cell mean statistics#

We calculate four statistics for grid cells that contain segments.

#### Grid Cell Mean Segment Length \(\bar{L}\)#

where \(L_i\) is `/gtx/freeboard_beam_segment/height_segments/height_segment_length_seg`

, \(x\) and \(y\) are projected coordinates for grid centers, and \(D\) is day.

#### Grid Cell Mean Freeboard \(\bar{h}\)#

where \(h_i\) is `gtx/freeboard_beam_segment/beam_freeboard/beam_fb_height`

.

#### Grid Cell Standard Deviation of Freeboard \(\sigma^2 (x, y, D)\)#

The functions to calculate these statistics are given below. These functions are applied to the grouped data. The `geopandas.apply`

method only accepts a single method when operating on multiple columns in a dataframe. We could just have multiple calls for each aggregating function. However, we can collect the individual aggregating functions into a single function and pass that to the `apply`

method.

```
def mean_segment_length(df):
"""Returns mean segment length"""
return df["height_segment_length_seg"].mean()
def mean_freeboard(df):
"""Returns length weighted mean freeboard"""
return (df.beam_fb_height * df.height_segment_length_seg).sum() / df.height_segment_length_seg.sum()
def stdev_freeboard(df):
"""Returns weighted standard deviation of freeboard"""
hmean = mean_freeboard(df)
stdev = (df.beam_fb_height**2 * df.height_segment_length_seg).sum() / df.height_segment_length_seg.sum()
return stdev - hmean**2
def count_segments(df):
"""Number of segments in grid cell"""
return df.beam_fb_height.count()
def all_funcs(x):
"""Wrapper that allows all the aggregation functions to be applied at once"""
funcs = {
mean_segment_length.__name__: mean_segment_length(x), #__name__ gets the name of a function
mean_freeboard.__name__: mean_freeboard(x),
stdev_freeboard.__name__: stdev_freeboard(x),
count_segments.__name__: count_segments(x),
}
# `apply` is expected to return a series or a scaler so we collect the results
# into a series indexed by aggregating function name
return pd.Series(funcs, index=funcs.keys())
```

#### Testing the functions#

It is always a good idea to test your code. Below are some test data and expected results. The functions are tested on `test_df`

. We then use `pandas.testing.assert_frame_equal`

to check that the result and expected dataframes are the same. In this case we are only interested getting the same values, so we do not check the names or datatypes.

```
test_df = pd.DataFrame(
{
'grid_index': [1, 1, 1, 2, 2, 2, 2],
"height_segment_length_seg": [1.2, 1.1, 0.7, 2.3, 1.5, .9, 1.],
"beam_fb_height": [0., 0.2, 0.5, 1.1, 2., .9, 1.5],
}
)
expected = pd.DataFrame(
{
"mean_segment_length": [1.0, 1.425],
"mean_freeboard": [0.19000000000000003, 1.375438596491228],
"stdev_freeboard": [0.03689999999999998, 0.17167743921206569],
"count_segments": [3, 4],
},
index = [1, 2]
)
```

```
result = test_df.groupby("grid_index").apply(all_funcs)
result
```

```
pd.testing.assert_frame_equal(expected, result, check_names=False, check_dtype=False)
```

Now that we have functions that work we can apply them to the real data.

```
%%time
aggregated_data = tracks.groupby("grid_index").apply(all_funcs)
aggregated_data
```

### Assign aggregated data to grid cells#

We now have a dataframe that contains grid cell statistics indexed by a unique array index. We can now create a grid for each of these statistics.

The procedure is relatively straight forward.

Create an 1D array with the same number of elements as cells in our grid.

Use the

`grid_index`

of the dataframe as an array index to assign values to grid cells, where we have data.Reshape the grid to the dimension of the grid.

We can encapsulate this in a `series_to_grid`

function.

```
def series_to_grid(series, nrow, ncol):
"""Converts a geopandas.Series to a grid using the index"""
these_points = (series.index >= 0) & (series.index < (nrow*ncol - 1))
array_index = series[these_points].index.values.astype(int) # the array index must be an integer
vector = np.full(nrow*ncol, np.nan)
vector[array_index] = series[these_points]
return vector.reshape(nrow, ncol)
```

```
%%time
grids = {name: series_to_grid(values, nrow, ncol) for name, values in aggregated_data.items()}
```

```
%matplotlib widget
plt.imshow(grids['count_segments'], interpolation='none')
plt.show()
```

## Plot data on a map#

```
#proj = EASEGrid2South()
plt.close("all")
proj = ccrs.LambertAzimuthalEqualArea(central_latitude=-90)
if min(map_extent) < -3000000.0:
plot_extent = [-3000000.0, 3000000.0, -3000000.0, 3000000.0]
else:
plot_extent = map_extent
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection=proj)
ax.set_extent(plot_extent, proj)
ax.add_feature(cfeature.LAND)
ax.coastlines()
plt.imshow(grids['count_segments'], interpolation='none', extent=map_extent)
```

## Appendix#

### Get grid parameters for Ross Sea region#

```
# Import GeoJSON of Ross Sea - this is very approximate
import geopandas as gpd
ross_sea_gdf = gpd.read_file("ross_sea.json")
bounds = ross_sea_gdf.to_crs(easegrid2_epsg).bounds.values
```

```
# Calculate parameters for a grid with resolution that covers region
resolution = 10000.
minx, miny, maxx, maxy = [func(bound/resolution) * resolution for bound, func in zip(list(bounds), [np.floor, np.floor, np.ceil, np.ceil])][0]
grid_extent_x = maxx - minx
grid_extent_y = maxy - miny
width = height = resolution
ncol = grid_extent_x / width
nrow = grid_extent_y / height
upper_left_x = minx
upper_left_y = maxy
print(f"nrow = {int(nrow)}")
print(f"ncol = {int(ncol)}")
print(f"width = {width}")
print(f"height = -{height}")
print(f"upper_left_x = {upper_left_x}")
print(f"upper_left_y = {upper_left_y}")
```