NetCDF/HDF5 Direct S3#

imported on: 2022-11-28

This notebook is from NASA Openscapes' NetCDF/HDF5 Direct S3 access notebook.

The original source for this document is https://github.com/NASA-Openscapes/earthdata-cloud-cookbook

Accessing a NetCDF4/HDF5 File - S3 Direct Access#

Summary#

In this notebook, we will access monthly sea surface height from ECCO V4r4 (10.5067/ECG5D-SSH44). The data are provided as a time series of monthly netCDFs on a 0.5-degree latitude/longitude grid.

We will access a single netCDF file from inside the AWS cloud (us-west-2 region, specifically) and load it into Python as an xarray dataset. This approach leverages S3 native protocols for efficient access to the data.

Requirements#

1. AWS instance running in us-west-2#

NASA Earthdata Cloud data in S3 can be directly accessed via temporary credentials; this access is limited to requests made within the US West (Oregon) (code: us-west-2) AWS region.
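If you are unsure which region your instance is running in, one way to check is to query the EC2 instance metadata service, sketched below (an IMDSv1-style call; instances that enforce IMDSv2 require fetching a session token first):

import requests

# Ask the instance metadata service which region this instance is in.
region = requests.get('http://169.254.169.254/latest/meta-data/placement/region', timeout=2).text
print(region)  # expect 'us-west-2'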

2. Earthdata Login#

An Earthdata Login account is required to access data, as well as to discover restricted data, from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. The account is free to create and only takes a moment to set up.

3. netrc File#

You will need a netrc file containing your NASA Earthdata Login credentials in order to execute the notebooks. A netrc file can be created manually within a text editor and saved to your home directory. For additional information see: Authentication for NASA Earthdata.
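For reference, a minimal netrc entry for Earthdata Login looks like the following, with the placeholder username and password replaced by your own credentials:

machine urs.earthdata.nasa.gov
    login your_earthdata_username
    password your_earthdata_password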

Learning Objectives#

  • how to retrieve temporary S3 credentials for in-region direct S3 bucket access

  • how to perform in-region direct access of ECCO_L4_SSH_05DEG_MONTHLY_V4R4 data in S3

  • how to plot the data


Import Packages#

%matplotlib inline
import matplotlib.pyplot as plt  # static plotting fallback
import requests                  # HTTP request to the DAAC credentials endpoint
import s3fs                      # file-system-style access to S3 objects
import xarray as xr              # load the netCDF file as a labeled dataset
import hvplot.xarray             # registers the .hvplot accessor on xarray objects

Get Temporary AWS Credentials#

Direct S3 access is achieved by passing NASA-supplied temporary credentials to AWS so we can interact with S3 objects in the applicable Earthdata Cloud buckets. For now, each NASA DAAC has its own AWS credentials endpoint. Below are the credential endpoints for several DAACs:

s3_cred_endpoint = {
    'podaac':'https://archive.podaac.earthdata.nasa.gov/s3credentials',
    'gesdisc': 'https://data.gesdisc.earthdata.nasa.gov/s3credentials',
    'lpdaac':'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials',
    'ornldaac': 'https://data.ornldaac.earthdata.nasa.gov/s3credentials',
    'ghrcdaac': 'https://data.ghrc.earthdata.nasa.gov/s3credentials'
}

Create a function that requests temporary credentials from an endpoint. Remember, each DAAC has its own endpoint, and credentials issued by one DAAC are not usable for cloud data from other DAACs.

def get_temp_creds(provider):
    # Request short-lived AWS credentials; authentication to the endpoint
    # is handled via the Earthdata Login credentials in your netrc file.
    return requests.get(s3_cred_endpoint[provider]).json()

temp_creds_req = get_temp_creds('podaac')
#temp_creds_req
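The returned JSON includes the accessKeyId, secretAccessKey, and sessionToken fields used below. These credentials are temporary and expire after a short period (on the order of an hour), so long-running workflows need to re-request them. A minimal sketch of checking the lifetime, assuming the endpoint's response also carries an expiration field:

print(temp_creds_req.get('expiration'))  # 'expiration' is assumed to be in the response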

Set up an s3fs session for Direct Access#

s3fs sessions provide authenticated access to S3 buckets and allow for typical file-system-style operations. Below we create a session by passing in the temporary credentials we received from the credentials endpoint.

fs_s3 = s3fs.S3FileSystem(anon=False, 
                          key=temp_creds_req['accessKeyId'], 
                          secret=temp_creds_req['secretAccessKey'], 
                          token=temp_creds_req['sessionToken'])
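With the session in place, typical file-system operations work directly against the bucket. As a quick sanity check, a sketch that lists a few objects under the collection's prefix (assuming the temporary credentials grant list access to it):

files = fs_s3.ls('podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4')
files[:5]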

In this example we’re interested in the ECCO data collection from NASA’s PO.DAAC in Earthdata Cloud. Below we specify the S3 URL of the data asset in Earthdata Cloud. This URL can be found via Earthdata Search or programmatically through the CMR and CMR-STAC APIs.

s3_url = 's3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_2015-01_ECCO_V4r4_latlon_0p50deg.nc'

Direct In-region Access#

Open the netCDF file using the s3fs package, then load the cloud asset into an xarray dataset.

s3_file_obj = fs_s3.open(s3_url, mode='rb')               # open the remote object as a file-like handle
ssh_ds = xr.open_dataset(s3_file_obj, engine='h5netcdf')  # read it with the h5netcdf backend
ssh_ds

Get the SSH variable as an xarray DataArray.

ssh_da = ssh_ds.SSH
ssh_da

Plot the SSH DataArray for time 2015-01-16T12:00:00 using hvplot.

ssh_da.hvplot.image(x='longitude', y='latitude', cmap='Spectral_r', geo=True, tiles='ESRI', global_extent=True)
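Note that geo=True and tiles='ESRI' rely on the optional geoviews package. If it is not installed, a plain matplotlib rendering of the file's single monthly time slice is a reasonable fallback, sketched here with xarray's built-in plotting:

ssh_da.isel(time=0).plot(cmap='Spectral_r')
plt.show()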

Resources#

  • Direct access to ECCO data in S3 (from us-west-2)

  • Data_Access__Direct_S3_Access__PODAAC_ECCO_SSH using CMR-STAC API to retrieve S3 links