Tutorial by Ellianna Abrahams
In this tutorial, we will cover how to integrate your CryoCloud Jupyter interface with an external cloud storage bucket, specifically an Amazon Web Services (AWS) S3 bucket. The tutorial covers interaction using the command line interface (Terminal) and Python (Jupyter notebooks, scripts, or Terminal).
Since CryoCloud is currently hosted on AWS, this tutorial covers how to integrate CryoCloud’s AWS instance with an external AWS account, perhaps through your organization, in which you have your own storage bucket. This tutorial will show you how to import data that you have stored elsewhere in the cloud and interact with it using CryoCloud’s compute, and how to export data that you have created on CryoCloud back to your bucket.
Before we begin, please create an AWS S3 bucket. You can create a bucket yourself, but you will need to pay for it personally. An alternative is to create a bucket through an organizational affiliation, so that the cost can be billed to any grant funding that you have for cloud compute. Many organizations have protocols for how they create buckets and bucket permissions, so you will need to check with your organization’s IAM protocol to ensure that you have permission to upload (in AWS, s3:PutObject) and download (in AWS, s3:GetObject) files from your bucket. Currently (Spring 2025), CryoCloud is operated off of an AWS EC2 instance in the us-west-2 region, which means that you will have lower interaction costs if your bucket is an AWS S3 bucket in the same region. If you are following this tutorial at a later date, please check the CryoCloud documentation to confirm these details.
While we go through the step-by-step process for integrating your external AWS account, similar protocols exist for integrating buckets from other providers, such as Google Cloud Storage.
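As a rough illustration, the Python side of a Google Cloud Storage workflow looks much like the s3fs workflow in Section 2, just using the gcsfs package instead. This is only a sketch: the bucket name is a placeholder, gcsfs may need to be installed in your environment, and the authentication method will depend on your organization’s setup.

```python
import gcsfs

#Connect using your default Google credentials (other token options exist,
#e.g. a service-account JSON file provided by your organization)
gcs = gcsfs.GCSFileSystem(token='google_default')

#List the contents of a (placeholder) bucket
gcs.ls('my-gcs-bucket-name')
```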
1. Configure access to Your Amazon S3 Bucket using SSO Authentication
In this section, we are going to work with Amazon’s command line interface (CLI) via the Linux terminal that is built into your hub.
Since CryoCloud keeps AWS CLI updated for you, you don’t need to install an update. If you are working on another AWS Jupyter Hub, instructions for installing the AWS CLI update are at the end of this tutorial.
Open a terminal window in CryoCloud, and we will work from there using the built-in Linux command line. You can check that your AWS CLI is updated to version 2 or higher directly from the command line.
```bash
aws --version
```
For me, this prints:
```
aws-cli/2.17.31 Python/3.11.9 Linux/5.10.227-219.884.amzn2.x86_64 source/x86_64.ubuntu.22
```
Now you are ready to start uploading data to your bucket using single sign-on (SSO) authentication. Your organization will provide you with a URL for your AWS SSO Portal.
Your SSO portal will provide you with the following information:
- SSO Start URL
- SSO Region
- AWS Access Key ID
- AWS Secret Access Key
- AWS Session Token
Your start URL and region will stay the same unless your organization changes these values. However, your access keys and session token will update on a set schedule. When these values update, you will need to refresh your SSO login from the command line.
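Once you have completed the one-time configuration below, refreshing an expired login is a single command. The profile and session names here are placeholders for whichever names you choose during configuration:

```bash
#Refresh your SSO credentials when they expire
aws sso login --profile my-profile

#Or, equivalently, log in using the SSO session name you saved during configuration
aws sso login --sso-session my-sso-session
```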
Since CryoCloud is also an AWS instance, and therefore has its own start URL, region, and access keys, you will need to create a separate profile to store the SSO login values for your bucket. That can be done directly in the CLI. We’ll start by configuring your SSO profile from the Linux command line:
```bash
aws configure sso
```
After running configure, the following prompts will pop up in the command line for you to enter text. Make sure to save your session name somewhere, as this will be needed to access your SSO configuration in the future:
```
SSO session name (Recommended): #name your SSO session, provides access to this AWS profile configuration in the future
SSO start URL: #use the start URL from the AWS SSO Portal associated with your bucket
SSO region: #use the region from the AWS SSO Portal associated with your bucket
SSO registration scopes: #unless your admin gives you instructions, just hit enter
```
After hitting enter on SSO registration scopes:, your browser will pop open a window for you to authenticate the AWS account associated with your new SSO profile. Choose your bucket’s AWS account from the pop-up and return to the CryoCloud terminal window that you were working from. Now you can complete your SSO profile:
```
CLI default client Region: #use the same region as CryoCloud, currently us-west-2
CLI default output format: #this defaults to json, you can leave blank and hit enter
CLI profile name: #name your SSO profile, this name will be used to interact with your bucket (see Note)
```
You now have the SSO profile for your bucket configured!
Let’s check that the configuration works by listing the buckets in your new SSO profile using the CLI:
```bash
aws s3 ls --profile my-new-profile
```

2. Open S3 Bucket data in CryoCloud without a local copy using Python
AWS provides multiple Python packages, including boto3 and s3fs, for interacting with AWS services like S3 buckets and EC2 compute instances directly from Python. Both of these packages are included in the default CryoCloud environment and don’t require installation. s3fs is built on boto3, and offers a more intuitive approach for loading bucket files into memory programmatically. Open a Jupyter notebook, and run the following imports in Python.
```python
import s3fs
import aiobotocore.session
```
We can use the aiobotocore package to initialize access to the S3 bucket associated with the SSO profile that you created in Section 1. The benefit of initializing with aiobotocore is that it connects to your configured AWS profile and allows asynchronous, non-blocking interactions with AWS services. This means it can run multiple AWS requests concurrently without waiting for one request to complete before starting the next.
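To make that concurrency claim concrete, here is a minimal optional sketch of issuing several S3 downloads at once with aiobotocore and asyncio. It is not needed for the rest of the tutorial, and the bucket and key names are placeholders.

```python
import asyncio
import aiobotocore.session

async def fetch_object(client, bucket, key):
    #Request a single object and read its bytes
    resp = await client.get_object(Bucket=bucket, Key=key)
    async with resp['Body'] as stream:
        return await stream.read()

async def fetch_many(profile, bucket, keys):
    session = aiobotocore.session.AioSession(profile=profile)
    #create_client is an async context manager in aiobotocore
    async with session.create_client('s3') as client:
        #gather starts all requests without waiting for each one to finish first
        return await asyncio.gather(*(fetch_object(client, bucket, k) for k in keys))

#Example usage (placeholder names); in a Jupyter notebook you can `await` this directly:
#data = asyncio.run(fetch_many('my-profile', 'my-bucket-name', ['folder/a.tif', 'folder/b.tif']))
```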
Here we’ll use the CLI profile name that you defined above when configuring your SSO. Just replace my-profile with that name here.
```python
session = aiobotocore.session.AioSession(profile='my-profile')
fs = s3fs.S3FileSystem(session=session)
```
Now this fs connection will allow you to access anything that you have stored in Amazon S3.
To access specific files, you will need to specify part of the path within your bucket. First, specify your bucket name by replacing my-bucket-name with the name of your S3 bucket.
```python
bucket_name = 'my-bucket-name'
```
Next, specify the path within your bucket to the folder that contains the files you want to access. This can have as many nested folders as needed, but shouldn’t include any file names, only the folder path.
```python
prefix = 'folder/'

#This will also work
prefix = 'path/to/your/folder/'
```
Now that we’ve defined the file path, we can find all of the files of a particular type within that folder. In this tutorial, we’ll show you how to find all of the TIFF files in your subfolder. s3fs accesses your files through the URL that specifies their location within your bucket.
We’ll start by listing the file URLs for each file in the subfolder of your bucket, using the fs connection that we made earlier. Then we can limit this list to the URLs of a particular file type.
```python
#List all of the files in your subfolder
all_files = fs.ls(f's3://{bucket_name}/{prefix}')

#Next, limit this list to only files that end with .tif
tif_files = [file for file in all_files if file.endswith('.tif')]
```
Once we have a list of all of the TIFF file URLs, we can start interacting with one of the files in Python. We do this using the built-in open function in s3fs. We’ll use rioxarray (which wraps rasterio) to open our .tif file, because it can use dask under the hood, making the data import more memory efficient.
```python
import rasterio
import xarray as xr
import rioxarray as rxr
import matplotlib.pyplot as plt

with fs.open(tif_files[0], 'rb') as f:
    tif_data = rxr.open_rasterio(f)
```
We can interact with this data entirely from memory, even though we haven’t made a local hard copy.
```python
tif_data.plot()
plt.show()
```
Suppose you want to stitch a bunch of TIFFs of different color bands into a concatenated xarray, assuming that each one contains a band dimension. You can run those functions using the data held entirely in memory without making a local copy.
```python
#Initialize list to store xarrays
xarrays = []

#Open each .tif file one by one
for tif_file in tif_files:
    with fs.open(tif_file, 'rb') as f:
        tif_data = rxr.open_rasterio(f)
        xarrays.append(tif_data)

#Stack all xarrays along the band dimension
stacked_xarray = xr.concat(xarrays, dim='band')

#Clear the temporary list `xarrays` from memory
del xarrays
```
Now you can check that stacked_xarray contains all of the .tif files from your bucket directory, by checking that the following statement is True.
```python
stacked_xarray.sizes['band'] == len(tif_files)
```
The benefit of this xarray approach is that all your .tif files are still lazily loaded, which means they haven’t been fully pulled into your local CryoCloud memory. All xarray has done is set up a task graph for reading the data in chunks in preparation for when we are ready to interact with it.
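If you want to sanity-check the stacked array before pulling pixel values out of S3, you can inspect its metadata. These are standard xarray attributes and are only a small optional sketch; they don’t trigger any additional reads.

```python
#Dimension lengths (e.g. band, y, x) and pixel data type
print(stacked_xarray.sizes)
print(stacked_xarray.dtype)

#How much memory the full array would occupy if loaded
print(f'{stacked_xarray.nbytes / 1e9:.2f} GB if fully loaded')

#Chunk layout; None unless the files were opened with dask chunking enabled
print(stacked_xarray.chunks)
```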
Even though xarray has built-in methods that interact with boto3 and s3fs to save files directly from Python to an S3 bucket, at this time they are only configured to work with the default SSO profile (which is CryoCloud’s internal profile and not the profile connected to your bucket). To save your data to your bucket, you will need to make a temporary hard copy and move that to your bucket. This could be done by running commands entirely from Python, but is more time and memory efficient from the command line. We step through how to do this in the next section.
```python
#Save your data as a temporary file
stacked_xarray.to_netcdf('/path/to/file.nc')
```

3. Move data from CryoCloud to your S3 Storage Bucket
You can move data that is stored as a hard copy on CryoCloud into your bucket in several ways using the AWS command line interface (CLI). Open a terminal window in CryoCloud, and we will work from there using the built-in Linux command line.
First, let’s determine where in your bucket you want to move the data. We can see the file structure in your bucket by running an s3 ls from the command line. Throughout this section, replace my-profile with the CLI profile name that you defined in Section 1:
```bash
aws s3 ls --profile my-profile
```
You can pick the name of the bucket that you’d like to work with from the output list.
Once you have decided which bucket to copy data into, you don’t need to create a new directory first; you can create it directly as you copy over your data using the AWS CLI. To copy a small number of files, use aws s3 cp, but for a large number of files, aws s3 sync (below) is recommended. Make sure to replace my-bucket-name with the name of your bucket throughout this section:
```bash
aws s3 cp /path/to/local/file.txt s3://my-bucket-name/new-folder/ --profile my-profile
```
AWS will automatically create a folder named new-folder in your bucket and store a copy of file.txt there. This has to do with file structure conventions in cloud storage that we won’t dive into here, but it’s convenient for moving over files. If you prefer to copy into an existing directory, you can run the following:
```bash
aws s3 cp /path/to/local/file.nc s3://my-bucket-name/folder/ --profile my-profile
```
If you want to copy over several files of the same type (for example, any NetCDF files that you might have generated in the previous section), note that aws s3 cp does not expand wildcards itself, but you can get the same effect with its --exclude and --include filters together with --recursive:
```bash
aws s3 cp /path/to/local/ s3://my-bucket-name/folder/ --recursive --exclude "*" --include "*.nc" --profile my-profile
```
And if you want to copy over an entire directory, you can do that too (though we recommend a much better method below):
```bash
aws s3 cp /path/to/local/folder/ s3://my-bucket-name/folder/ --recursive --profile my-profile
```
When copying over very large amounts of data, your SSO profile is likely to time out and require a new login session. aws s3 cp will copy over your data regardless of whether or not that data already exists in your bucket. That means if you rerun the same aws s3 cp command after renewing your login, the command will waste login time and charge you for making the same copies again before moving on to new data.
Instead, you can use aws s3 sync to replicate the full contents of your CryoCloud directory onto your bucket. The aws s3 sync command compares the contents of your local directory with your bucket’s file structure and only uploads files that are new or have been modified, making it more efficient for syncing large directories. It copies data using the same file structure that you have locally, so data sitting in a different file structure in your bucket will still be copied over. Similarly, data with the same file name, stored in the same structure, will be overwritten in your bucket if its size or modification date has changed in your local copy (i.e. if you have updated it). This makes aws s3 sync an incredibly useful tool for moving over directories in which you want your bucket to mirror your local directory.
```bash
aws s3 sync /path/to/local/folder/ s3://my-bucket-name/folder/ --profile my-profile
```
Once you are done copying over data, you can check that your data is in your bucket using the AWS interface that your organization provided you with, or through the command line:
```bash
aws s3 ls s3://my-bucket-name/folder/ --profile my-profile
```
Finally, if you have copied over all of your large data to your bucket, please remove that data from CryoCloud to help us conserve costs. If you want to interact with that data the next time you’re on CryoCloud, see how you can do that using Python in the previous section.
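For example, a temporary file or folder created earlier can be removed from your CryoCloud home directory in the terminal; the paths below are the placeholders used earlier in this tutorial.

```bash
#Remove a single temporary file
rm /path/to/file.nc

#Or remove an entire local folder once it has been synced to your bucket
rm -r /path/to/local/folder/
```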
4. How to choose an appropriate S3 Bucket for your use case
There are many cloud-based storage options available, including those from providers like AWS, Google Cloud Storage, Microsoft Azure Files, and others. In this tutorial we have specifically focused on how to interact with AWS storage options. This is because CryoCloud is currently (Spring 2025) operated off of an AWS EC2 instance in the us-west-2 region, which means that you will have lower interaction costs if you choose an AWS storage bucket located in the same region. If you are following this tutorial at a later date, please check the CryoCloud documentation to confirm these details.
AWS offers a range of S3 storage classes that you can choose from, ranging from those created for frequent access to those created for archival storage. There are three key considerations to keep in mind when selecting a bucket: your intended data access patterns (i.e. how frequently you will access your data), budget (i.e. how much you will spend for storage and interaction costs), and performance requirements (i.e. how quickly you will need to pull your data into compute memory).
As a quick overview, for large amounts of data:
- If frequent, quick access is required, S3 Standard is the simplest but most expensive option.
- If access frequency varies over time, S3 Intelligent-Tiering might be more affordable depending on your access pattern.
- If infrequent access is acceptable, S3 Standard-IA or S3 Glacier have cheaper storage cost options, but have slower retrieval times and higher interaction costs, so you pay an additional fee on top of any egress fees.
- For archival data with very rare access, S3 Glacier Deep Archive would be your most economical choice.
Here is a closer look at some of the main S3 bucket types:
- General Purpose: Amazon S3 Standard This is ideal for frequently accessed data, but it’s the most expensive storage class compared to others, especially for storing large volumes of data. If your data isn’t being accessed frequently, there are more cost-effective options. Best for: Frequently accessed data that requires high performance and availability.
- Changing Access Patterns: Amazon S3 Intelligent-Tiering S3 Intelligent-Tiering automatically moves data between two access tiers (frequent and infrequent access) based on changing access patterns, so you don’t have to manage that change manually. However, AWS charges a small fee to monitor and automate storage class changes. For very predictable access patterns, this might be more expensive than Amazon S3 Standard. Best for: Data with unpredictable access patterns where you want AWS to optimize storage costs automatically.
- Infrequent Access: Amazon S3 Standard-IA This S3 class is much cheaper than S3 Standard, but only for infrequent data access. Every time you access the data, you incur retrieval charges, which could add up. Additionally, data stored for less than 30 days is charged as if it were stored for 30 days. If you’re using this bucket to store data that needs to be accessed regularly, the retrieval charges could outweigh any storage cost savings. Best for: Data that is not frequently accessed but still needs to be quickly retrievable when required.
- Archival: Amazon S3 Glacier Archival classes provide cheaper storage for data that is rarely accessed. Retrieval can take from minutes to hours, depending on the retrieval option chosen (Expedited, Standard, or Bulk). Costs for retrieving data can be high depending on how quickly you need access. Data stored for less than 90 days is charged as if it were stored for 90 days. Best for: Archival data that you only need to access infrequently, and won’t need to access quickly.
- Archival: Amazon S3 Glacier Deep Archive This storage class is the lowest-cost storage option for long-term archival data. Retrieval costs are higher than Amazon S3 Glacier and retrieval times can range up to 12 hours, making it unsuitable for anything that needs quick access. Data stored for less than 180 days is charged as if it were stored for 180 days. Best for: Data you need to store for long-term retention, with very rare retrieval.
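If your organization lets you choose, you can also set the storage class explicitly when uploading from the CLI using the --storage-class option of aws s3 cp and aws s3 sync. The class shown below is just an example choice; the bucket, path, and profile names are the placeholders used earlier in this tutorial.

```bash
#Upload directly into the Standard-Infrequent Access storage class
aws s3 cp /path/to/local/file.nc s3://my-bucket-name/folder/ --storage-class STANDARD_IA --profile my-profile
```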
Extra: How to Update your AWS CLI
If your AWS CLI is not v2 or higher, you won’t be able to log into your bucket account using SSO. You can update your CLI by following the steps below.
To install the AWS CLI v2, run the following commands, line by line, from the Linux command line:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install -i ~/.local/aws-cli -b ~/.local/bin
export PATH=~/.local/bin:$PATHIf you are working from a 2i2c hub space, your python path will be automatically reset every time you start your instance. You can rerun the line to export your path each time you start an instance, or you can save the path to your .bashrc file so that it updates automatically on start. Here’s how:
```bash
echo 'export PATH=~/.local/bin:$PATH' >> .bashrc
source .bashrc
```
You can now check that your AWS CLI is updated to version 2. The following code will print out your newly installed version.
```bash
aws --version
```
For me, this prints:
```
aws-cli/2.17.31 Python/3.11.9 Linux/5.10.227-219.884.amzn2.x86_64 source/x86_64.ubuntu.22
```
Now you are ready to interact with your S3 bucket using SSO.