Using CryoCloud S3 Scratch Bucket#

CryoCloud JupyterHub has a preconfigured S3 “Scratch Bucket” that automatically deletes files after 7 days. This is a great resource for experimenting with large datasets and for sharing data with other CryoCloud users for collaborative analysis.

Tip

This notebook walks through uploading, downloading, and streaming data from an S3 scratch bucket.

Access the scratch bucket#

The CryoCloud scratch bucket is hosted at s3://nasa-cryo-scratch. CryoCloud JupyterHub automatically sets an environment variable SCRATCH_BUCKET that appends your GitHub username as a suffix to the S3 URL. This is intended to keep track of file ownership, stay organized, and prevent users from overwriting each other’s data!

Warning

Everyone has full access to the scratch bucket, so be careful not to overwrite data from other users when uploading files. Also, any data you put there will be deleted 7 days after it is uploaded.

Hint

If you need more permanent S3 storage, refer to These Docs to configure your own S3 bucket.

We’ll use the S3FS Python package, which provides a nice interface for interacting with S3 buckets.

import os
import s3fs
import fsspec
import boto3
import xarray as xr
import geopandas as gpd
# My GitHub username is `scottyhq`
scratch = os.environ['SCRATCH_BUCKET']
scratch 
's3://nasa-cryo-scratch/scottyhq'
# Here you see I previously uploaded files
s3 = s3fs.S3FileSystem()
s3.ls(scratch)
['nasa-cryo-scratch/scottyhq/ATL03_20230103090928_02111806_006_01.h5',
 'nasa-cryo-scratch/scottyhq/IS2_Alaska.parquet',
 'nasa-cryo-scratch/scottyhq/Notes.txt',
 'nasa-cryo-scratch/scottyhq/example',
 'nasa-cryo-scratch/scottyhq/example_ATL03',
 'nasa-cryo-scratch/scottyhq/grandmesa-sliderule.parquet']
# But you can set a different S3 object prefix to use:
scratch = 's3://nasa-cryo-scratch/octocat-project'
s3.ls(scratch)
[]

Uploading data#

Storing data in S3 buckets is great because S3 offers very high network throughput. If many users simultaneously access the same file on a networked spinning hard drive (/home/jovyan/shared), performance can be quite slow. S3 performs much better in such cases.

Single file#

# I'm working with this file downloaded from NSIDC:
local_file = '/tmp/ATL03_20230103090928_02111806_006_01.h5'

remote_object = f"{scratch}/ATL03_20230103090928_02111806_006_01.h5"

s3.upload(local_file, remote_object)
[None]
s3.stat(remote_object)
{'ETag': '"489f0191a8e9c844576ff2d18adfea59-21"',
 'LastModified': datetime.datetime(2023, 7, 21, 19, 4, 55, tzinfo=tzutc()),
 'size': 1063571816,
 'name': 'nasa-cryo-scratch/octocat-project/ATL03_20230103090928_02111806_006_01.h5',
 'type': 'file',
 'StorageClass': 'STANDARD',
 'VersionId': None,
 'ContentType': 'application/x-hdf5'}
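
As a quick sanity check that the upload completed, you can compare local and remote file sizes. This is a minimal sketch using generic s3fs/fsspec methods (s3.exists, s3.size) that are not used elsewhere in this notebook, applied to the local_file and remote_object variables defined above:

# Optional sanity check: object exists and remote size matches the local file
s3.exists(remote_object)
s3.size(remote_object) == os.path.getsize(local_file)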

Directory#

local_dir = '/tmp/example'

!ls -lh {local_dir}
total 8.0K
-rw-r--r-- 1 jovyan jovyan 22 Jul 20 23:26 data.txt
-rw-r--r-- 1 jovyan jovyan 11 Jul 20 23:26 icesat.csv
s3.upload(local_dir, scratch, recursive=True)
[None, None]
s3.ls(f'{scratch}/example')
['nasa-cryo-scratch/octocat-project/example/data.txt',
 'nasa-cryo-scratch/octocat-project/example/icesat.csv']
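
To see how much data you have stored under a prefix, fsspec filesystems also provide a du method. A small sketch (with total=True it returns the combined size in bytes):

# Total bytes stored under the example/ prefix
s3.du(f'{scratch}/example', total=True)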

Accessing Data#

Some software packages allow you to stream data directly from S3 buckets, but you can always pull objects from S3 and work with local file paths.

This download-first, then analyze workflow typically works well for older file formats like HDF and netCDF that were designed to perform well on local hard drives rather than Cloud storage systems like S3.

Important

For best performance, do not work with data in your home directory. Instead, use a local scratch space like /tmp.

local_object = '/tmp/test.h5'
s3.download(remote_object, local_object)
[None]
ds = xr.open_dataset(local_object, group='/gt3r/heights')
ds
<xarray.Dataset>
Dimensions:         (delta_time: 14226389, ds_surf_type: 5)
Coordinates:
  * delta_time      (delta_time) datetime64[ns] 2023-01-03T09:09:31.975149376...
    lat_ph          (delta_time) float64 ...
    lon_ph          (delta_time) float64 ...
Dimensions without coordinates: ds_surf_type
Data variables:
    dist_ph_across  (delta_time) float32 ...
    dist_ph_along   (delta_time) float32 ...
    h_ph            (delta_time) float32 ...
    pce_mframe_cnt  (delta_time) uint32 ...
    ph_id_channel   (delta_time) uint8 ...
    ph_id_count     (delta_time) uint8 ...
    ph_id_pulse     (delta_time) uint8 ...
    quality_ph      (delta_time) int8 ...
    signal_conf_ph  (delta_time, ds_surf_type) int8 ...
    weight_ph       (delta_time) uint8 ...
Attributes:
    Description:  Contains arrays of the parameters for each received photon.
    data_rate:    Data are stored at the photon detection rate.

Tip

If you don’t want to think about downloading files, you can let fsspec handle this behind the scenes for you! This way you only need to think about remote paths.

fs = fsspec.filesystem("simplecache", 
                       cache_storage='/tmp/files/',
                       same_names=True,  
                       target_protocol='s3',
                       )
# The `simplecache` setting above will download the full file to /tmp/files
print(remote_object)
with fs.open(remote_object) as f:
    ds = xr.open_dataset(f.name, group='/gt3r/heights') # NOTE: pass f.name for local cached path
s3://nasa-cryo-scratch/octocat-project/ATL03_20230103090928_02111806_006_01.h5
ds
<xarray.Dataset>
Dimensions:         (delta_time: 14226389, ds_surf_type: 5)
Coordinates:
  * delta_time      (delta_time) datetime64[ns] 2023-01-03T09:09:31.975149376...
    lat_ph          (delta_time) float64 ...
    lon_ph          (delta_time) float64 ...
Dimensions without coordinates: ds_surf_type
Data variables:
    dist_ph_across  (delta_time) float32 ...
    dist_ph_along   (delta_time) float32 ...
    h_ph            (delta_time) float32 ...
    pce_mframe_cnt  (delta_time) uint32 ...
    ph_id_channel   (delta_time) uint8 ...
    ph_id_count     (delta_time) uint8 ...
    ph_id_pulse     (delta_time) uint8 ...
    quality_ph      (delta_time) int8 ...
    signal_conf_ph  (delta_time, ds_surf_type) int8 ...
    weight_ph       (delta_time) uint8 ...
Attributes:
    Description:  Contains arrays of the parameters for each received photon.
    data_rate:    Data are stored at the photon detection rate.
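
Because we passed same_names=True, the cached copy is stored under /tmp/files/ using the original file name, so re-opening the same remote object should read from local disk instead of downloading again. A sketch (the exact cache layout depends on your fsspec version):

# Locally cached copy (path assumes cache_storage='/tmp/files/' and same_names=True)
cached_path = '/tmp/files/ATL03_20230103090928_02111806_006_01.h5'
os.path.exists(cached_path)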

Cloud-optimized formats#

Other formats like COG, Zarr, and Parquet are ‘cloud-optimized’ and allow for very efficient streaming directly from S3. In other words, you do not need to download entire files and can instead easily read subsets of the data.

The example below reads a Parquet file directly into memory (RAM) from S3 without using a local disk:

gf = gpd.read_parquet('s3://nasa-cryo-scratch/scottyhq/IS2_Alaska.parquet')
gf.head(2)
producer_granule_id time_start time_end datetime geometry
0 ATL03_20181014015337_02360103_006_02.h5 2018-10-14 01:53:36.912 2018-10-14 01:59:02.315 2018-10-14 01:56:19.613500 POLYGON ((-166.98121 80.05247, -167.61386 80.0...
1 ATL03_20181014130413_02430105_006_02.h5 2018-10-14 13:04:12.567 2018-10-14 13:09:37.946 2018-10-14 13:06:55.256500 POLYGON ((-130.81600 80.02773, -131.44724 80.0...
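
Since Parquet is column-oriented, you can also read only the columns you need rather than the whole table. A sketch using geopandas’ columns argument with the column names shown above (keep 'geometry' in the list to get a GeoDataFrame back):

# Read a subset of columns directly from S3
gf_subset = gpd.read_parquet(
    's3://nasa-cryo-scratch/scottyhq/IS2_Alaska.parquet',
    columns=['producer_granule_id', 'geometry'],
)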

Advanced: Access Scratch bucket outside of JupyterHub#

Let’s say you have a lot of files on your laptop that you want to work with on CryoCloud. The S3 scratch bucket is a convenient way to upload large datasets for collaborative analysis. To do this, you need to copy AWS credentials from the JupyterHub to use on other machines. More extensive documentation on this workflow can be found in this repository: https://github.com/scottyhq/jupyter-cloud-scoped-creds.

The following code must be run on CryoCloud JupyterHub to get temporary credentials:

client = boto3.client('sts')

with open(os.environ['AWS_WEB_IDENTITY_TOKEN_FILE']) as f:
    TOKEN = f.read()

response = client.assume_role_with_web_identity(
    RoleArn=os.environ['AWS_ROLE_ARN'],
    RoleSessionName=os.environ['JUPYTERHUB_CLIENT_ID'],
    WebIdentityToken=TOKEN,
    DurationSeconds=3600
)

response will be a Python dictionary that looks like this:

{'Credentials': {'AccessKeyId': 'ASIAYLNAJMXY2KXXXXX',
  'SecretAccessKey': 'J06p5IOHcxq1Rgv8XE4BYCYl8TG1XXXXXXX',
  'SessionToken': 'IQoJb3JpZ2luX2VjEDsaCXVzLXdlc////0dsD4zHfjdGi/0+s3XKOUKkLrhdXgZ8nrch2KtzKyYyb...',
  'Expiration': datetime.datetime(2023, 7, 21, 19, 51, 56, tzinfo=tzlocal())},
  ...

You can copy and paste the values to another computer and use them to configure your access to S3:

s3 = s3fs.S3FileSystem(key=response['Credentials']['AccessKeyId'],
                       secret=response['Credentials']['SecretAccessKey'],
                       token=response['Credentials']['SessionToken'] )
# Confirm your credentials give you access
s3.ls('nasa-cryo-scratch', refresh=True)
['nasa-cryo-scratch/octocat-project',
 'nasa-cryo-scratch/scottyhq',
 'nasa-cryo-scratch/sliderule-example']
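
Alternatively, rather than passing the values to s3fs directly, you could export them as the standard AWS environment variables on the other machine so that any AWS-aware tool picks them up. A sketch; the values below are the placeholder credentials from the example response above, and because the role session was requested with DurationSeconds=3600 they expire after an hour:

# On another machine: export the temporary credentials as standard AWS env vars
import os
import s3fs

os.environ['AWS_ACCESS_KEY_ID'] = 'ASIAYLNAJMXY2KXXXXX'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'J06p5IOHcxq1Rgv8XE4BYCYl8TG1XXXXXXX'
os.environ['AWS_SESSION_TOKEN'] = 'IQoJb3JpZ2luX2VjEDsaCXVzLXdlc...'

s3 = s3fs.S3FileSystem()  # picks up credentials from the environment
s3.ls('nasa-cryo-scratch')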