Part 1: Introduction to the earthaccess python library#

Tutorial Overview#

This tutorial is designed for the “Cloud Computing and Open-Source Scientific Software for Cryosphere Communities” Learning Workshop at the 2023 AGU Fall Meeting.

This notebook demonstrates how to search for, access, and work with a cloud-hosted NASA dataset using the earthaccess package. Data in the “NASA Earthdata Cloud” are stored in Amazon Web Services (AWS) Simple Storage Service (S3) buckets. Direct access is an efficient way to work with data stored in an S3 bucket from an Amazon Elastic Compute Cloud (EC2) instance. Cloud-hosted granules can be opened and loaded into memory without the need to download them first. This allows you to take advantage of the scalability and power of cloud computing.

We use earthaccess, a package developed by Luis Lopez (NSIDC developer) to allow easy search of the NASA Common Metadata Repository (CMR) and download of NASA data collections. It supports programmatic search and access for both DAAC-hosted and cloud-hosted data, in as few as three lines of code. It also handles authentication with Earthdata Login credentials, which are then used to obtain the tokens needed for S3 direct access. See https://github.com/nsidc/earthaccess.

As an example data collection, we use ICESat-2 Land Ice Height (ATL06) granules over the Juneau Icefield, AK, for March and April 2020. ICESat-2 data granules, including ATL06, are stored in HDF5 format. We demonstrate how to open an HDF5 granule and access data variables using xarray. Land Ice Heights are then plotted using hvplot.

Figure: Example plot using tutorial data. ATL06 Land Ice Heights for the margin of the Juneau Icefield.

Learning Objectives#

In this tutorial you will learn:

  1. how to use earthaccess to search for (ICESat-2) data using spatial and temporal filters and explore the search results;

  2. how to open data granules using direct access to the appropriate S3 bucket;

  3. how to load an HDF5 group into an xarray.Dataset;

  4. how to visualize the land ice heights using hvplot.

Prerequisites#

The workflow described in this tutorial forms the initial steps of an Analysis in Place workflow that would be run on an AWS cloud compute resource. You will need:

  1. a JupyterHub, such as CryoHub, or an AWS EC2 instance in the us-west-2 region.

  2. a NASA Earthdata Login. If you need to register for an Earthdata Login, see the Getting an Earthdata Login section of the ICESat-2 Hackweek 2023 Jupyter Book.

  3. a .netrc file in your home directory that contains your Earthdata Login credentials. See Configure Programmatic Access to NASA Servers to create a .netrc file.

Credits#

This notebook is based on an NSIDC Data Tutorial originally created by Luis Lopez, NSIDC, and Mikala Beig, NSIDC, modified by Andy Barrett, NSIDC, Jennifer Roebuck, NSIDC, Amy Steiker, NSIDC, and Jessica Scheick, Univ. of New Hampshire.

Computing Environment#

The tutorial uses Python and requires the following packages:

  • earthaccess, which handles Earthdata Login authentication, retrieves AWS credentials, runs collection and granule searches, and provides S3 access;

  • xarray, used to load N-dimensional data with labeled axes;

  • hvplot, used to visualize land ice height data.

We are going to import the whole earthaccess package.

We will also import the whole xarray package but use a standard short name xr, using the import <package> as <short_name> syntax. We could use anything for a short name but xr is an accepted standard that most xarray users are familiar with.

xarray is a powerful library for working with multi-dimensional data using labeled indices (analogous to Pandas for tabular data). It leverages numpy, pandas, matplotlib and dask to build Dataset and DataArray objects with built-in methods to subset, analyze, interpolate, and plot multi-dimensional data. It makes working with multi-dimensional data cubes efficient and fun. A few great tutorials for learning Xarray are here and here.

We only need the xarray module from hvplot so we import that using the import <package>.<module> syntax.

# For searching and accessing NASA data
import earthaccess

# For reading data, analysis and plotting
import xarray as xr
import hvplot.xarray

import pprint  # For nice printing of python objects

Authenticate#

The first step is to get the correct authentication to access cloud-hosted ICESat-2 data. This is all done through Earthdata Login. The login method also gets the correct AWS credentials.

Login requires your Earthdata Login username and password. By default, the login method searches for these credentials in environment variables or in a .netrc file, and if those aren’t available it prompts you to enter your username and password. We use the interactive strategy here, which prompts for them directly. A .netrc file is a text file located in your home directory that contains login information for remote machines. If you don’t have a .netrc file, login will create one for you if you use persist=True.

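# Prompt for credentials and save them to a .netrc file
earthaccess.login(strategy='interactive', persist=True)
# Keep the returned authentication object for the later access steps
auth = earthaccess.login()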
Enter your Earthdata Login username:  amy.steiker
Enter your Earthdata password:  ········
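
To make the role of the .netrc file more concrete, the following is a minimal sketch of what such a file contains and how it can be read with Python's standard netrc module. The username and password are placeholders, the helper name read_edl_credentials is hypothetical, and the example writes to a temporary file rather than your real ~/.netrc. It assumes urs.earthdata.nasa.gov as the Earthdata Login host; it is not how earthaccess itself is implemented.

```python
import netrc
import os
import tempfile

# Placeholder credentials, for illustration only.
EXAMPLE_NETRC = (
    "machine urs.earthdata.nasa.gov\n"
    "    login my_username\n"
    "    password my_password\n"
)


def read_edl_credentials(path, host="urs.earthdata.nasa.gov"):
    """Return (username, password) for host from a .netrc-style file, or None."""
    entry = netrc.netrc(path).authenticators(host)
    if entry is None:
        return None
    login, _account, password = entry
    return login, password


# Write the example to a temporary file so the real ~/.netrc is never touched.
with tempfile.TemporaryDirectory() as tmpdir:
    example_path = os.path.join(tmpdir, "example_netrc")
    with open(example_path, "w") as f:
        f.write(EXAMPLE_NETRC)
    credentials = read_edl_credentials(example_path)

print(credentials[0])
```

Because the file stores a password in plain text, it is normally kept readable only by its owner.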

Search for ICESat-2 Collections#

earthaccess leverages the Common Metadata Repository (CMR) API to search for collections and granules. Earthdata Search also uses the CMR API.

We can use the search_datasets method to search for ICESat-2 collections by setting keyword="ICESat-2". The argument passed to keyword can be any string and can include wildcard characters ? or *.

Note

To see a full list of search parameters you can type earthaccess.search_datasets?. Using ? after a python object displays the docstring for that object.

A count of the number of data collections (Datasets) found is given.

query = earthaccess.search_datasets(
    keyword="ICESat-2",
)
Datasets found: 89

In this case, there are 89 datasets that have the keyword ICESat-2.

search_datasets returns a Python list of DataCollection objects. We can view metadata for each collection in long form by passing a DataCollection object to print, or as a summary using the summary method of the DataCollection object. Here, we use the pprint function to pretty-print each summary.

for collection in query[:10]:
    pprint.pprint(collection.summary(), sort_dicts=True, indent=4)
    print('')  # Add a space between collections for readability
{   'concept-id': 'C2559919423-NSIDC_ECS',
    'file-type': "[{'FormatType': 'Native', 'Format': 'HDF5', "
                 "'FormatDescription': 'HTTPS'}]",
    'get-data': [   'https://n5eil01u.ecs.nsidc.org/ATLAS/ATL03.006/',
                    'https://search.earthdata.nasa.gov/search?q=ATL03+V006',
                    'http://openaltimetry.org/',
                    'https://nsidc.org/data/data-access-tool/ATL03/versions/6/'],
    'short-name': 'ATL03',
    'version': '006'}

{   'cloud-info': {   'Region': 'us-west-2',
                      'S3BucketAndObjectPrefixNames': [   'nsidc-cumulus-prod-protected/ATLAS/ATL03/006',
                                                          'nsidc-cumulus-prod-public/ATLAS/ATL03/006'],
                      'S3CredentialsAPIDocumentationURL': 'https://data.nsidc.earthdatacloud.nasa.gov/s3credentialsREADME',
                      'S3CredentialsAPIEndpoint': 'https://data.nsidc.earthdatacloud.nasa.gov/s3credentials'},
    'concept-id': 'C2596864127-NSIDC_CPRD',
    'file-type': "[{'FormatType': 'Native', 'Format': 'HDF5', "
                 "'FormatDescription': 'HTTPS'}]",
    'get-data': ['https://search.earthdata.nasa.gov/search?q=ATL03+V006'],
    'short-name': 'ATL03',
    'version': '006'}

{   'concept-id': 'C2120512202-NSIDC_ECS',
    'file-type': "[{'FormatType': 'Native', 'Format': 'HDF5', "
                 "'FormatDescription': 'HTTPS'}]",
    'get-data': [   'https://n5eil01u.ecs.nsidc.org/ATLAS/ATL03.005/',
                    'https://search.earthdata.nasa.gov/search?q=ATL03+V005',
                    'http://openaltimetry.org/',
                    'https://nsidc.org/data/data-access-tool/ATL03/versions/5/'],
    'short-name': 'ATL03',
    'version': '005'}

{   'cloud-info': {   'Region': 'us-west-2',
                      'S3BucketAndObjectPrefixNames': [   'nsidc-cumulus-prod-protected/ATLAS/ATL03/005',
                                                          'nsidc-cumulus-prod-public/ATLAS/ATL03/005'],
                      'S3CredentialsAPIDocumentationURL': 'https://data.nsidc.earthdatacloud.nasa.gov/s3credentialsREADME',
                      'S3CredentialsAPIEndpoint': 'https://data.nsidc.earthdatacloud.nasa.gov/s3credentials'},
    'concept-id': 'C2153572325-NSIDC_CPRD',
    'file-type': "[{'FormatType': 'Native', 'Format': 'HDF5', "
                 "'FormatDescription': 'HTTPS'}]",
    'get-data': ['https://search.earthdata.nasa.gov/search?q=ATL03+V005'],
    'short-name': 'ATL03',
    'version': '005'}

{   'concept-id': 'C2564427300-NSIDC_ECS',
    'file-type': "[{'FormatType': 'Native', 'Format': 'HDF5', "
                 "'FormatDescription': 'HTTPS'}]",
    'get-data': [   'https://n5eil01u.ecs.nsidc.org/ATLAS/ATL06.006/',
                    'https://search.earthdata.nasa.gov/search?q=ATL06+V006',
                    'https://openaltimetry.org/',
                    'https://nsidc.org/data/data-access-tool/ATL06/versions/6/'],
    'short-name': 'ATL06',
    'version': '006'}

{   'cloud-info': {   'Region': 'us-west-2',
                      'S3BucketAndObjectPrefixNames': [   'nsidc-cumulus-prod-protected/ATLAS/ATL06/006',
                                                          'nsidc-cumulus-prod-public/ATLAS/ATL06/006'],
                      'S3CredentialsAPIDocumentationURL': 'https://data.nsidc.earthdatacloud.nasa.gov/s3credentialsREADME',
                      'S3CredentialsAPIEndpoint': 'https://data.nsidc.earthdatacloud.nasa.gov/s3credentials'},
    'concept-id': 'C2670138092-NSIDC_CPRD',
    'file-type': "[{'FormatType': 'Native', 'Format': 'HDF5', "
                 "'FormatDescription': 'HTTPS'}]",
    'get-data': ['https://search.earthdata.nasa.gov/search?q=ATL06+V006'],
    'short-name': 'ATL06',
    'version': '006'}

{   'concept-id': 'C2144439155-NSIDC_ECS',
    'file-type': "[{'FormatType': 'Native', 'Format': 'HDF5', "
                 "'FormatDescription': 'HTTPS'}]",
    'get-data': [   'https://n5eil01u.ecs.nsidc.org/ATLAS/ATL06.005/',
                    'https://search.earthdata.nasa.gov/search?q=ATL06+V005',
                    'https://openaltimetry.org/',
                    'https://nsidc.org/data/data-access-tool/ATL06/versions/5/'],
    'short-name': 'ATL06',
    'version': '005'}

{   'cloud-info': {   'Region': 'us-west-2',
                      'S3BucketAndObjectPrefixNames': [   'nsidc-cumulus-prod-protected/ATLAS/ATL06/005',
                                                          'nsidc-cumulus-prod-public/ATLAS/ATL06/005'],
                      'S3CredentialsAPIDocumentationURL': 'https://data.nsidc.earthdatacloud.nasa.gov/s3credentialsREADME',
                      'S3CredentialsAPIEndpoint': 'https://data.nsidc.earthdatacloud.nasa.gov/s3credentials'},
    'concept-id': 'C2153572614-NSIDC_CPRD',
    'file-type': "[{'FormatType': 'Native', 'Format': 'HDF5', "
                 "'FormatDescription': 'HTTPS'}]",
    'get-data': ['https://search.earthdata.nasa.gov/search?q=ATL06+V005'],
    'short-name': 'ATL06',
    'version': '005'}

{   'concept-id': 'C2565090645-NSIDC_ECS',
    'file-type': "[{'FormatType': 'Native', 'Format': 'HDF5', "
                 "'FormatDescription': 'HTTPS'}]",
    'get-data': [   'https://n5eil01u.ecs.nsidc.org/ATLAS/ATL08.006/',
                    'https://search.earthdata.nasa.gov/search?q=ATL08+V006',
                    'https://openaltimetry.org/',
                    'https://nsidc.org/data/data-access-tool/ATL08/versions/6/'],
    'short-name': 'ATL08',
    'version': '006'}

{   'cloud-info': {   'Region': 'us-west-2',
                      'S3BucketAndObjectPrefixNames': [   'nsidc-cumulus-prod-protected/ATLAS/ATL08/006',
                                                          'nsidc-cumulus-prod-public/ATLAS/ATL08/006'],
                      'S3CredentialsAPIDocumentationURL': 'https://data.nsidc.earthdatacloud.nasa.gov/s3credentialsREADME',
                      'S3CredentialsAPIEndpoint': 'https://data.nsidc.earthdatacloud.nasa.gov/s3credentials'},
    'concept-id': 'C2613553260-NSIDC_CPRD',
    'file-type': "[{'FormatType': 'Native', 'Format': 'HDF5', "
                 "'FormatDescription': 'HTTPS'}]",
    'get-data': ['https://search.earthdata.nasa.gov/search?q=ATL08+V006'],
    'short-name': 'ATL08',
    'version': '006'}

For each collection, summary returns a subset of fields from the collection metadata and Unified Metadata Model (UMM) entry.

  • concept-id is a unique identifier for the collection that is composed of an alphanumeric code and the provider-id for the DAAC.

  • short-name is the name of the dataset that appears on the dataset landing page. For ICESat-2, ShortNames are generally how different products are referred to.

  • version is the version of each collection.

  • file-type gives information about the file format of the collection files.

  • get-data is a collection of URLs that can be used to access data, dataset landing pages, and tools.

For cloud-hosted data, there is additional information about the location of the S3 bucket that holds the data and where to get credentials to access the S3 buckets. In general, you don’t need to worry about this information because earthaccess handles S3 credentials for you. Nevertheless it may be useful for troubleshooting.

Note

In Python, all data are represented by objects. These objects contain both data and methods (think functions) that operate on the data. earthaccess includes DataCollection and DataGranule objects that contain data about collections and granules returned by search_datasets and search_data respectively. If you are familiar with Python, you will see that the data in each DataCollection object is organized as a hierarchy of Python dictionaries, lists and strings. So if you know the dictionary key for the metadata entry you want, you can get that metadata using standard dictionary methods. For example, to get the concept-id of a collection from the example above, you could just use collection['meta']['concept-id']. However, the concept_id method of the DataCollection object returns the same information. Take a look at https://github.com/nsidc/earthaccess/blob/main/earthaccess/results.py#L80 to see how this is done.

For the ICESat-2 search results, the provider-id part of the concept-id is NSIDC_ECS or NSIDC_CPRD. NSIDC_ECS is for collections archived at the NSIDC DAAC and NSIDC_CPRD is for the cloud-hosted collections.
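
The dictionary access described in the note can be sketched with a hand-written stand-in for one search result. The dictionary below is an assumption that mirrors only the general layout of a DataCollection (a meta section and a umm section); real objects hold many more fields. The concept-id value is taken from the summary output above.

```python
# Hand-written stand-in for one search result (illustrative, not a real
# DataCollection object).
collection = {
    "meta": {"concept-id": "C2670138092-NSIDC_CPRD"},
    "umm": {"ShortName": "ATL06", "Version": "006"},
}

# Standard dictionary access, as described in the note.
concept_id = collection["meta"]["concept-id"]
short_name = collection["umm"]["ShortName"]

# The concept-id is an alphanumeric code and a provider-id joined by a hyphen.
code, provider_id = concept_id.split("-", 1)
is_cloud_hosted = provider_id == "NSIDC_CPRD"

print(short_name, code, provider_id, is_cloud_hosted)
```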

For ICESat-2 short-name refers to the following products.

| ShortName | Product Description |
|-----------|---------------------|
| ATL03 | ATLAS/ICESat-2 L2A Global Geolocated Photon Data |
| ATL06 | ATLAS/ICESat-2 L3A Land Ice Height |
| ATL07 | ATLAS/ICESat-2 L3A Sea Ice Height |
| ATL08 | ATLAS/ICESat-2 L3A Land and Vegetation Height |
| ATL09 | ATLAS/ICESat-2 L3A Calibrated Backscatter Profiles and Atmospheric Layer Characteristics |
| ATL10 | ATLAS/ICESat-2 L3A Sea Ice Freeboard |
| ATL11 | ATLAS/ICESat-2 L3B Slope-Corrected Land Ice Height Time Series |
| ATL12 | ATLAS/ICESat-2 L3A Ocean Surface Height |
| ATL13 | ATLAS/ICESat-2 L3A Along Track Inland Surface Water Data |

Search for cloud-hosted data#

If you only want to search for data in the cloud, you can set cloud_hosted=True.

query = earthaccess.search_datasets(
    keyword = 'ICESat-2',
    cloud_hosted = True,
)
Datasets found: 40

Search a data set using spatial and temporal filters#

Once you have identified the dataset you want to work with, you can use the search_data method to search that dataset with spatial and temporal filters. As an example, we’ll search for ATL06 granules over the Juneau Icefield, AK, for March and April 2020.

Either concept-id or short-name can be used to search for granules from a particular dataset. If you use short-name you also need to set version. If you use concept-id, this is all that is required because concept-id is unique.

The temporal range is identified with standard date strings. The corners of a bounding box are specified as longitude-latitude pairs in the order lower left, upper right. Polygons and points, as well as shapefiles, can also be specified.

This will display the number of granules that match our search.

results = earthaccess.search_data(
    short_name = 'ATL06',
    version = '006',
    cloud_hosted = True,
    bounding_box = (-134.7,58.9,-133.9,59.2),
    temporal = ('2020-03-01','2020-04-30'),
    count = 100
)
Granules found: 4
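
The values in the search above are ordered as (west longitude, south latitude, east longitude, north latitude), with the dates given as ISO strings. A small sanity check on these inputs before searching might look like the sketch below. The function describe_search_window is a hypothetical helper written for this example and is not part of earthaccess; it also does not handle boxes that cross the antimeridian.

```python
from datetime import date


def describe_search_window(bounding_box, temporal):
    """Validate a (west, south, east, north) box and a (start, end) date pair."""
    west, south, east, north = bounding_box
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be between -180 and 180")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")

    start, end = (date.fromisoformat(s) for s in temporal)
    if start > end:
        raise ValueError("start date must not be after end date")

    return {
        "width_deg": round(east - west, 6),
        "height_deg": round(north - south, 6),
        "days": (end - start).days + 1,  # inclusive of both end dates
    }


window = describe_search_window(
    (-134.7, 58.9, -133.9, 59.2),
    ("2020-03-01", "2020-04-30"),
)
print(window)
```

Passing the same numbers in latitude-first order raises a ValueError here, because -134.7 is not a valid latitude.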

We’ll get metadata for these 4 granules and display it. The rendered metadata shows a download link, granule size and two images of the data.

The download link is an https URL and can be used to download the granule to your local machine. This is similar to downloading DAAC-hosted data, but in this case the data come from the Earthdata Cloud. For NASA data in the Earthdata Cloud, there is no charge to the user for egress from AWS Cloud servers. This is not the case for other data in the cloud.

[display(r) for r in results]

Data: ATL06_20200310121504_11420606_006_01.h5

Size: 3.03 MB

Cloud Hosted: True

Data PreviewData Preview

Data: ATL06_20200312233336_11800602_006_01.h5

Size: 29.99 MB

Cloud Hosted: True

Data PreviewData Preview

Data: ATL06_20200410220936_02350702_006_02.h5

Size: 32.95 MB

Cloud Hosted: True

Data PreviewData Preview

Data: ATL06_20200412104246_02580706_006_02.h5

Size: 20.34 MB

Cloud Hosted: True

Data PreviewData Preview
[None, None, None, None]
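
As a quick check on how much data the four granules represent, the sizes listed above can be added up. This is plain arithmetic on the displayed values, copied by hand, and it assumes 1024 MB per GB for the conversion.

```python
# Granule sizes in MB, copied from the search results displayed above.
sizes_mb = {
    "ATL06_20200310121504_11420606_006_01.h5": 3.03,
    "ATL06_20200312233336_11800602_006_01.h5": 29.99,
    "ATL06_20200410220936_02350702_006_02.h5": 32.95,
    "ATL06_20200412104246_02580706_006_02.h5": 20.34,
}

total_mb = sum(sizes_mb.values())
total_gb = total_mb / 1024  # assumption: 1 GB = 1024 MB

print(f"{len(sizes_mb)} granules, {total_mb:.2f} MB, about {total_gb:.2f} GB")
```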

Use Direct-Access to open, load and display data stored on S3#

Direct-access to data from an S3 bucket is a two step process. First, the files are opened using the open method. This first step creates a Python file-like object that is used to load the data in the second step.

Authentication is required for this step. The auth object created at the start of the notebook is used to provide Earthdata Login authentication and AWS credentials “behind the scenes”. These credentials expire after one hour, so the login step must have been run within the hour before these next steps.

Note

The open step to create a file-like object is required because AWS S3, and other cloud storage systems, use object storage, but most HDF5 libraries work with POSIX-compliant file systems. POSIX stands for Portable Operating System Interface and is a set of standards that include how to interact with files and file systems. Linux, Unix and macOS (which is Unix-like) follow POSIX closely, and Windows offers similar file operations. Critically, POSIX-compliant systems allow blocks of bytes or individual bytes to be read from a file. Object storage instead treats each object as a whole and does not offer these file operations directly. To get around this limitation, an intermediary is used, in this case s3fs. This intermediary creates a virtual file system through which S3 objects can be accessed using POSIX-style file functions.
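
The kind of access an HDF5 library expects from a file-like object can be illustrated with a small sketch. An in-memory io.BytesIO buffer stands in for the object returned by the open step (an assumption made so the example runs without cloud access), and read_range is a hypothetical helper. The first eight bytes are the standard HDF5 format signature; the rest is filler.

```python
import io

# The 8-byte signature found at the start of an HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# Stand-in for a remote granule: the signature followed by filler bytes.
fake_granule = io.BytesIO(HDF5_SIGNATURE + bytes(range(256)) * 4)


def read_range(file_obj, offset, length):
    """Read `length` bytes starting at `offset` from a seekable file-like object."""
    file_obj.seek(offset)
    return file_obj.read(length)


# Read only the pieces that are needed, not the whole object.
signature = read_range(fake_granule, 0, 8)
chunk = read_range(fake_granule, 8 + 100, 4)

print(signature == HDF5_SIGNATURE, list(chunk))
```

Any object that supports seek and read in this way can be handed to a library that expects a file, which is what makes the file-like wrapper useful.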

In this example, data are loaded into an xarray.Dataset. Data could also be read into numpy arrays or a pandas.DataFrame. However, each granule would then have to be read using a package that reads HDF5 files, such as h5py. xarray does all of this under the hood in a single line, but only for a single group in the HDF5 granule, in this case land ice heights for the gt1l beam*.

*ICESat-2 measures photon returns from 3 beam pairs, numbered 1, 2 and 3, that each consist of a left and a right beam.

# Open all granules as file-like objects, then load one group from the second granule
files = earthaccess.open(results)
ds = xr.open_dataset(files[1], group='/gt1l/land_ice_segments')
Opening 4 granules, approx size: 0.08 GB
using provider: NSIDC_CPRD
ds
<xarray.Dataset>
Dimensions:                (delta_time: 24471)
Coordinates:
  * delta_time             (delta_time) datetime64[ns] 2020-03-12T23:40:48.24...
    latitude               (delta_time) float64 ...
    longitude              (delta_time) float64 ...
Data variables:
    atl06_quality_summary  (delta_time) int8 ...
    h_li                   (delta_time) float32 ...
    h_li_sigma             (delta_time) float32 ...
    segment_id             (delta_time) float64 ...
    sigma_geo_h            (delta_time) float32 ...
Attributes:
    Description:  The land_ice_height group contains the primary set of deriv...
    data_rate:    Data within this group are sparse.  Data values are provide...
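
Because only one HDF5 group is loaded at a time, reading other beams means building the corresponding group paths. The sketch below generates the six paths implied by three beam pairs with a left and a right beam each. The function name land_ice_groups is hypothetical, and it assumes the other beams follow the same naming pattern as the gt1l group used above.

```python
def land_ice_groups(pairs=(1, 2, 3), sides=("l", "r")):
    """Return HDF5 group paths for the land ice segments of each beam."""
    return [
        f"/gt{pair}{side}/land_ice_segments"
        for pair in pairs
        for side in sides
    ]


groups = land_ice_groups()
for group in groups:
    print(group)
```

Each path could then be passed as the group argument to xr.open_dataset in place of the gt1l path. A given granule may not contain every beam, so a loop over these paths would need to handle missing groups.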

hvplot is an interactive plotting tool that is useful for exploring data.

ds['h_li'].hvplot(kind='scatter', s=2)

Additional resources#

For general information about NSIDC DAAC data in the Earthdata Cloud:

FAQs About NSIDC DAAC’s Earthdata Cloud Migration

NASA Earthdata Cloud Data Access Guide

Additional tutorials and How Tos:

NASA Earthdata Cloud Cookbook