CASALS L1B Waveform Data Tutorial

This tutorial demonstrates how to access and visualize NASA CASALS (Concurrent Artificially-intelligent Spectrometry and Adaptive Lidar System) L1B geolocated waveform data from AWS S3.

About CASALS L1B Data

Data Organization:

Pulse: A single laser ‘shot’ with a unique timestamp and Tx/Rx waveforms (up to 256 possible wavelengths/channels)
Track: A collection of pulses with the same wavelength, which illuminate the surface in the along-track direction (direction of flight)
Sweep: A collection of coincident pulses spanning the available wavelengths, which illuminate the surface in the across-track direction

File Naming Convention: casals_LEVEL_UTC_VERSIONID_REV.FORMAT

Example: casals_l1b_20241118T165452_001_02.h5
LEVEL: Product level (l1a, l1b, l2a...)
UTC: J2000 UTC start time (yyyymmddThhmmss)
VERSIONID: Version number (001, 002...)
REV: Revision number (01, 02...)
FORMAT: File format (.h5)
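The naming convention above lends itself to a small parser. The following is a minimal sketch, not part of the CASALS tooling: `parse_casals_filename` and the regular expression are hypothetical helpers written from the field descriptions above, and the digit counts are only inferred from the examples.

```python
import re
from datetime import datetime, timezone

# Pattern follows casals_LEVEL_UTC_VERSIONID_REV.FORMAT (inferred from the examples)
CASALS_NAME = re.compile(
    r"^casals_(?P<level>[a-z0-9]+)_(?P<utc>\d{8}T\d{6})"
    r"_(?P<version>\d+)_(?P<rev>\d+)\.(?P<fmt>\w+)$"
)

def parse_casals_filename(name):
    """Split a CASALS file name into its documented fields (hypothetical helper)."""
    basename = name.rsplit("/", 1)[-1]  # drop any S3 prefix or folder
    match = CASALS_NAME.match(basename)
    if match is None:
        raise ValueError(f"Not a CASALS file name: {name!r}")
    fields = match.groupdict()
    # Interpret the yyyymmddThhmmss stamp as UTC
    fields["start_time"] = datetime.strptime(
        fields["utc"], "%Y%m%dT%H%M%S"
    ).replace(tzinfo=timezone.utc)
    return fields

info = parse_casals_filename("casals_l1b_20241118T165452_001_02.h5")
print(info["level"], info["start_time"].isoformat(), info["version"], info["rev"])
```

Keeping the version and revision as strings preserves the leading zeros used in the file names.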

File Sizes (per ~20-30 seconds of data):

File Type	Size	Notes
L0 (TDMS)	19 GB	Native LabVIEW format
L1A (H5)	13 GB	Internal compression
ARD (H5)	100 MB	Omits waveforms to save space
L1B (H5)	14 GB	Contains reference height and waveforms

import xarray as xr
import matplotlib.pyplot as plt
import numpy as np
import boto3
from botocore import UNSIGNED
from botocore.config import Config
import s3fs

1. List available data files

First, let’s see what L1B files are available in the S3 bucket. The request below returns at most 10 keys (MaxKeys=10), and only keys ending in .h5 are printed, so fewer than 10 files may appear.

# Configure S3 client for anonymous (public) access
s3 = boto3.client('s3', 
                  region_name='us-west-2',
                  config=Config(signature_version=UNSIGNED))

# Bucket information
bucket_name = 'casals-data'
prefix = 'lidar/'

# List files
print("Available CASALS data files:\n")
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix, MaxKeys=10)

if 'Contents' in response:
    for obj in response['Contents']:
        # Only show H5 files
        if obj['Key'].endswith('.h5'):
            print(f"{obj['Key']}")
            print(f"  Size: {obj['Size'] / (1024**3):.2f} GB")
            print(f"  Last Modified: {obj['LastModified']}\n")
else:
    print("No files found")

Available CASALS data files:

lidar/2024-11-12/casals_l1b_20241112T163923_001_02.h5
  Size: 3.04 GB
  Last Modified: 2025-12-11 15:02:45+00:00

lidar/2024-11-12/casals_l1b_20241112T163934_001_02.h5
  Size: 1.30 GB
  Last Modified: 2025-12-11 15:02:45+00:00

lidar/2024-11-12/casals_l1b_20241112T163941_001_02.h5
  Size: 1.57 GB
  Last Modified: 2025-12-11 15:02:45+00:00

lidar/2024-11-12/casals_l1b_20241112T164250_001_02.h5
  Size: 12.98 GB
  Last Modified: 2025-12-11 15:09:39+00:00

lidar/2024-11-12/casals_l1b_20241112T164315_001_02.h5
  Size: 12.95 GB
  Last Modified: 2025-12-11 15:09:39+00:00

lidar/2024-11-12/casals_l1b_20241112T164340_001_02.h5
  Size: 12.92 GB
  Last Modified: 2025-12-11 15:09:39+00:00

lidar/2024-11-12/casals_l1b_20241112T164406_001_02.h5
  Size: 13.02 GB
  Last Modified: 2025-12-11 15:09:39+00:00
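Because `MaxKeys=10` caps the response before the `.h5` filter runs, the listing above is only a sample. To walk every object under the prefix, boto3 offers a paginator for `list_objects_v2`. The sketch below is one way to use it; `iter_h5_objects` is a hypothetical helper name, and the client is passed in so the function works with the anonymous client created earlier.

```python
def iter_h5_objects(s3_client, bucket, prefix):
    """Yield (key, size in units of 1024**3 bytes) for every .h5 object under a prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent when a page holds no objects
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".h5"):
                yield obj["Key"], obj["Size"] / 1024**3

# Usage with the anonymous client `s3` from the cell above (needs network access):
# for key, size in iter_h5_objects(s3, "casals-data", "lidar/"):
#     print(f"{key}  {size:.2f} GB")
```

The generator keeps memory use flat even when the bucket holds many files, since each page is processed as it arrives.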

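Alternatively, you can list the bucket with s3fs. Note that fs.ls is not recursive: it returns only the top-level entries under the prefix, including date folders, which report a size of 0.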

# Create filesystem object for anonymous access
fs = s3fs.S3FileSystem(anon=True)

# Bucket information
bucket_name = 'casals-data'
prefix = 'lidar/'

# List files in the bucket
print("Available CASALS data files:\n")
files = fs.ls(f'{bucket_name}/{prefix}')

for file_path in files[:20]:  # Limit to first 20
    print(f"{file_path}")
    # Get file info for size
    info = fs.info(file_path)
    print(f"  Size: {info['size'] / (1024**3):.2f} GB")
    print(f"  Last Modified: {info.get('LastModified', 'N/A')}\n")

Available CASALS data files:

casals-data/lidar/
  Size: 0.00 GB
  Last Modified: N/A

casals-data/lidar/00README_CASALS_GE_Tours.txt
  Size: 0.00 GB
  Last Modified: 2026-01-20 18:35:40+00:00

casals-data/lidar/2024-11-12
  Size: 0.00 GB
  Last Modified: N/A

casals-data/lidar/2024-11-18
  Size: 0.00 GB
  Last Modified: N/A

casals-data/lidar/CASALS_2024-11-12_All_Amplitudes_Tour.m4v
  Size: 7.50 GB
  Last Modified: 2026-01-20 18:35:40+00:00

casals-data/lidar/CASALS_2024-11-12_All_Heights_Tour.m4v
  Size: 7.50 GB
  Last Modified: 2026-01-20 18:35:40+00:00

casals-data/lidar/CASALS_2024-11-18_All_Amplitudes_Tour.m4v
  Size: 14.01 GB
  Last Modified: 2026-01-20 18:35:40+00:00

casals-data/lidar/CASALS_2024-11-18_All_Heights_Tour.m4v
  Size: 14.00 GB
  Last Modified: 2026-01-20 18:35:40+00:00
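The listing above stops at the date folders. To reach the `.h5` files inside them, a recursive listing is needed; fsspec-style filesystems such as s3fs provide `find` for this. The following is a sketch under that assumption, with `find_h5_files` as a hypothetical helper name.

```python
def find_h5_files(fs, root):
    """Return {path: size_in_bytes} for every .h5 file below root.

    `fs` is any fsspec-style filesystem, e.g. s3fs.S3FileSystem(anon=True).
    """
    listing = fs.find(root, detail=True)  # recursive; maps path -> info dict
    return {
        path: listing[path]["size"]
        for path in sorted(listing)
        if path.endswith(".h5")
    }

# Usage with the anonymous filesystem `fs` from the cell above (needs network access):
# for path, size in find_h5_files(fs, "casals-data/lidar").items():
#     print(f"{path}  {size / 1024**3:.2f} GB")
```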

2. Open dataset with lazy loading

Using xarray with the h5netcdf engine lets us access specific subsets without loading the entire ~14 GB file into memory.

%%time
# S3 url of the file to read
s3url = 's3://casals-data/lidar/2024-11-18/casals_l1b_20241118T172656_001_02.h5'

# Open dataset with lazy loading - doesn't load data into memory yet
# storage_options={'anon': True} enables anonymous (public) access
ds = xr.open_dataset(s3url, engine='h5netcdf', storage_options={'anon': True})

print("Dataset opened successfully! (data not yet loaded)")

Dataset opened successfully! (data not yet loaded)
CPU times: user 7.19 s, sys: 3.87 s, total: 11.1 s
Wall time: 39.4 s

<timed exec>:6: UserWarning: The 'phony_dims' kwarg now defaults to 'access'. Previously 'phony_dims=None' would raise an error. For full netcdf equivalence please use phony_dims='sort'.

3. Inspect Dataset Structure

Let’s examine the L1B file structure and key variables without loading the full data.

# Inspect the overall structure
print("Dataset structure:")
print(ds)

print("\n" + "="*60)
print("Data variables (measurements and derived products):")
print("="*60)
for var in list(ds.data_vars):
    print(f"  {var}")

print("\n" + "="*60)
print("Dimensions:")
print("="*60)
for dim, size in dict(ds.dims).items():  # newer xarray: ds.sizes.items() avoids the FutureWarning
    print(f"  {dim}: {size}")

print("\n" + "="*60)
print("Coordinates (indices for data organization):")
print("="*60)
for coord in list(ds.coords):
    print(f"  {coord}")

Dataset structure:
<xarray.Dataset> Size: 21GB
Dimensions:                           (delta_time: 3604480, rx_bins: 2728,
                                       phony_dim_0: 14080, phony_dim_1: 256,
                                       tx_bins: 80)
Coordinates:
  * delta_time                        (delta_time) float64 29MB 2.172e+08 ......
  * rx_bins                           (rx_bins) int64 22kB 0 1 2 ... 2726 2727
  * tx_bins                           (tx_bins) int64 640B 0 1 2 3 ... 77 78 79
Dimensions without coordinates: phony_dim_0, phony_dim_1
Data variables: (12/65)
    bg_mean                           (delta_time) float64 29MB ...
    bg_std                            (delta_time) float64 29MB ...
    bin_size                          (delta_time) float64 29MB ...
    dac                               (delta_time) float64 29MB ...
    geoid                             (delta_time) float32 14MB ...
    geoid_free2mean                   (delta_time) float32 14MB ...
    ...                                ...
    tide_ocean                        (delta_time) float32 14MB ...
    tide_ocean_pole                   (delta_time) float32 14MB ...
    tide_pole                         (delta_time) float32 14MB ...
    track_num                         (delta_time) int64 29MB ...
    tracks                            (phony_dim_1) int64 2kB ...
    tx_waveform                       (delta_time, tx_bins) int16 577MB ...
Attributes: (12/25)
    ard_file:               casals_ard_20241118T172656_001_02.h5
    casals_epoch_gps_sec:   1198800018.0
    casals_epoch_gps_sow:   86418.0
    casals_epoch_gps_week:  1982
    casals_epoch_utc_a:     2018-01-01t00:00:00.000000Z
    end_delta_time:         217186041.14098355
    ...                     ...
    speed_of_light:         299792458.0
    start_delta_time:       217186016.11141512
    start_utca:             2024-11-18T17:26:56.111415Z
    style:                  1D
    tdms_file:              casals_18-11-2024_12-27-36_151.tdms
    ticks:                  8

============================================================
Data variables (measurements and derived products):
============================================================
  bg_mean
  bg_std
  bin_size
  dac
  geoid
  geoid_free2mean
  good_snr
  instrument_altitude
  instrument_altitude_error
  instrument_latitude
  instrument_latitude_error
  instrument_longitude
  instrument_longitude_error
  local_beam_azimuth
  local_beam_azimuth_error
  local_beam_elevation
  local_beam_elevation_error
  range_bias_correction
  refh
  refh_amp
  refh_bounce_time_offset
  refh_bounce_time_offset_error
  refh_error
  refh_latitude
  refh_latitude_error
  refh_longitude
  refh_longitude_error
  refh_neutat_delay_derivative
  refh_neutat_delay_total
  refh_snr
  refh_thres
  rwstart
  rwstart_bounce_time_offset
  rwstart_bounce_time_offset_error
  rwstart_error
  rwstart_latitude
  rwstart_latitude_error
  rwstart_longitude
  rwstart_longitude_error
  rwstart_neutat_delay_derivative
  rwstart_neutat_delay_total
  rwstop
  rwstop_bounce_time_offset
  rwstop_bounce_time_offset_error
  rwstop_error
  rwstop_latitude
  rwstop_latitude_error
  rwstop_longitude
  rwstop_longitude_error
  rwstop_neutat_delay_derivative
  rwstop_neutat_delay_total
  rx_waveform
  solar_azimuth
  solar_elevation
  sweep_num
  sweeps
  tide_earth
  tide_earth_free2mean
  tide_load
  tide_ocean
  tide_ocean_pole
  tide_pole
  track_num
  tracks
  tx_waveform

============================================================
Dimensions:
============================================================
  delta_time: 3604480
  rx_bins: 2728
  phony_dim_0: 14080
  phony_dim_1: 256
  tx_bins: 80

============================================================
Coordinates (indices for data organization):
============================================================
  delta_time
  rx_bins
  tx_bins

/tmp/ipykernel_127/3837883000.py:14: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
  for dim, size in dict(ds.dims).items():

4. Examine Key Variables

Important L1B datasets:

Variable	Description
`tx_waveform`	Transmit waveform data
`rx_waveform`	Receive waveform data (contains surface returns)
`refh`	Reference height - geolocated height of the waveform bin with maximum amplitude (WGS84 ellipsoidal height)
`tx_bins`	List of Tx waveform bin indices
`rx_bins`	List of Rx waveform bin indices
`delta_time`	Laser transmit time in GPS seconds since 2018-01-01
`sweep_num`	Unique 0-based counter for each sweep
`track_num`	Track (or channel) number
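Since `delta_time` counts seconds from 2018-01-01, it can be turned into a calendar time with the standard library. The sketch below uses the `start_delta_time` and `end_delta_time` attribute values shown in the dataset listing; `delta_time_to_utc` is a hypothetical helper, and it assumes no leap-second adjustment is needed (no leap second has been inserted between the 2018 epoch and these 2024 flights).

```python
from datetime import datetime, timedelta, timezone

# Epoch taken from the casals_epoch_utc_a attribute in the dataset listing
CASALS_EPOCH = datetime(2018, 1, 1, tzinfo=timezone.utc)

def delta_time_to_utc(delta_time_s):
    """Convert seconds since the CASALS epoch to a UTC datetime.

    Assumes a plain offset from the epoch with no leap-second correction.
    """
    return CASALS_EPOCH + timedelta(seconds=float(delta_time_s))

start = delta_time_to_utc(217186016.11141512)  # start_delta_time attribute
end = delta_time_to_utc(217186041.14098355)    # end_delta_time attribute
print(start.isoformat())
print(f"File spans {(end - start).total_seconds():.2f} s")
```

The converted start time agrees with the file's `start_utca` attribute (2024-11-18T17:26:56.111415Z), which is a useful sanity check on the epoch.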

# Check shapes of key waveform variables
if 'tx_waveform' in ds:
    print(f"tx_waveform shape: {ds.tx_waveform.shape}")
    print(f"tx_waveform dims: {ds.tx_waveform.dims}")
    print(f"  -> Dimensions: (pulses × bins)")

if 'rx_waveform' in ds:
    print(f"\nrx_waveform shape: {ds.rx_waveform.shape}")
    print(f"rx_waveform dims: {ds.rx_waveform.dims}")
    print(f"  -> Dimensions: (pulses × bins)")

if 'refh' in ds:
    print(f"\nrefh (reference height) shape: {ds.refh.shape}")
    print(f"refh dims: {ds.refh.dims}")
    print(f"  -> One reference height per pulse")

tx_waveform shape: (3604480, 80)
tx_waveform dims: ('delta_time', 'tx_bins')
  -> Dimensions: (pulses × bins)

rx_waveform shape: (3604480, 2728)
rx_waveform dims: ('delta_time', 'rx_bins')
  -> Dimensions: (pulses × bins)

refh (reference height) shape: (3604480,)
refh dims: ('delta_time',)
  -> One reference height per pulse
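These shapes make it easy to estimate how much memory a variable needs before loading it. The sketch below does the arithmetic with the standard library; `array_nbytes` is a hypothetical helper, and the `int16` dtype used for the Rx example is an assumption because the truncated listing only shows the dtype of `tx_waveform`.

```python
import math

# Bytes per element for dtypes that appear in the dataset listing
ITEMSIZE = {"int16": 2, "float32": 4, "int64": 8, "float64": 8}

def array_nbytes(shape, dtype):
    """Uncompressed size in bytes of an array with the given shape and dtype."""
    return math.prod(shape) * ITEMSIZE[dtype]

n_pulses = 3_604_480

# Full Tx waveform array: matches the 577MB reported in the dataset listing
tx_bytes = array_nbytes((n_pulses, 80), "int16")
print(f"tx_waveform: {tx_bytes / 1e6:.0f} MB")

# A 512-pulse Rx subset is tiny by comparison.
# int16 for rx_waveform is an assumption, not confirmed by the listing.
subset_bytes = array_nbytes((512, 2728), "int16")
print(f"rx subset:   {subset_bytes / 1e6:.1f} MB")
```

With an open dataset, the same number is available without any arithmetic through the `nbytes` property of a variable, for example `ds.tx_waveform.nbytes`.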

5. Efficient Data Subsetting

Strategy: Load only indexing variables first (lightweight), then use them to create masks for the data you actually need. This avoids loading the full ~14 GB file when you only need a small subset.

track_num = ds.track_num.load()
sweep_num = ds.sweep_num.load()

print(f"Loaded indexing variables:")
print(f"  track_num: {len(track_num)} values (track/channel numbers)")
print(f"  sweep_num: {len(sweep_num)} values (sweep numbers)")
print(f"\nTrack range: {track_num.min().values} to {track_num.max().values}")
print(f"Sweep range: {sweep_num.min().values} to {sweep_num.max().values}")

Loaded indexing variables:
  track_num: 3604480 values (track/channel numbers)
  sweep_num: 3604480 values (sweep numbers)

Track range: 0 to 255
Sweep range: 0 to 14079

6. Select Data Subset

Selection criteria:

Tracks 0-255: All available wavelengths/channels
Sweeps 1000-1001: A small subset for visualization

This demonstrates how to efficiently extract specific pulses from the large dataset.

%%time
# Set the data selection parameters
track0 = 0      # Start track (channel)
track1 = 255    # End track (all channels)
sweep0 = 1000   # Start sweep
sweep1 = 1001   # End sweep (just 2 sweeps for this example)

# Create boolean mask to identify pulses matching our criteria
mask = ((track_num >= track0) & (track_num <= track1) & 
        (sweep_num >= sweep0) & (sweep_num <= sweep1))

# Get indices where mask is True
indices = np.where(mask)[0]
print(f"Found {len(indices)} matching pulses (records)")
print(f"  Tracks: {track0} to {track1}")
print(f"  Sweeps: {sweep0} to {sweep1}")

# Now load only the waveform data we need at those specific indices
# This is much faster than loading the entire dataset!
tx_waveform = ds.tx_waveform.isel(delta_time=indices).load()
rx_waveform = ds.rx_waveform.isel(delta_time=indices).load()

print(f"\nLoaded waveform data:")
print(f"  tx_waveform: {tx_waveform.shape}")
print(f"  rx_waveform: {rx_waveform.shape}")

Found 512 matching pulses (records)
  Tracks: 0 to 255
  Sweeps: 1000 to 1001

Loaded waveform data:
  tx_waveform: (512, 80)
  rx_waveform: (512, 2728)
CPU times: user 1.17 s, sys: 469 ms, total: 1.64 s
Wall time: 8.09 s
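When the matching pulses sit next to each other in the file, the index array can be replaced by a slice, which describes one contiguous read. The sketch below illustrates the check on synthetic index arrays. The sweep-by-sweep ordering of pulses is an assumption made for the illustration, so test it on the real `track_num` and `sweep_num` before relying on it.

```python
import numpy as np

# Synthetic index arrays: 256 tracks per sweep, stored sweep by sweep.
# This ordering is an assumption for illustration, not a documented guarantee.
n_sweeps, n_tracks = 2000, 256
sweep_num = np.repeat(np.arange(n_sweeps), n_tracks)
track_num = np.tile(np.arange(n_tracks), n_sweeps)

mask = (sweep_num >= 1000) & (sweep_num <= 1001)
indices = np.flatnonzero(mask)  # same result as np.where(mask)[0]

# If the matches form one unbroken run, a slice selects the same pulses
is_contiguous = bool(indices.size > 0 and np.all(np.diff(indices) == 1))
if is_contiguous:
    selection = slice(int(indices[0]), int(indices[-1]) + 1)
else:
    selection = indices

print(len(indices), is_contiguous, selection)
```

Either form can be passed to `isel`, for example `ds.rx_waveform.isel(delta_time=selection)`, so the fallback to an index array costs nothing when the pulses are scattered.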

7. Visualize TX (Transmit) Waveforms

TX waveforms show the outgoing laser pulse characteristics. Multiple waveforms are plotted to show variation across the selected pulses.

%%time
# Plot the Tx Waveforms
plt.figure(figsize=(12, 6))
for i in range(len(indices)):
    plt.plot(ds.tx_bins, tx_waveform[i], alpha=0.7, linewidth=1)

plt.xlabel('TX Bin Number', fontsize=12)
plt.ylabel('Amplitude', fontsize=12)
plt.title(f'CASALS TX Waveforms (Sweeps {sweep0}-{sweep1}, Tracks {track0}-{track1})', 
          fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Plotted {len(indices)} TX waveforms")

Plotted 512 TX waveforms
CPU times: user 689 ms, sys: 3.67 ms, total: 692 ms
Wall time: 692 ms

8. Visualize RX (Receive) Waveforms

RX waveforms show the return signal including:

Surface returns (peaks in the waveform)
Atmospheric scattering
Background noise

%%time
# Plot the Rx Waveforms
plt.figure(figsize=(12, 6))
for i in range(len(indices)):
    plt.plot(ds.rx_bins, rx_waveform[i], alpha=0.7, linewidth=1)

plt.xlabel('RX Bin Number', fontsize=12)
plt.ylabel('Amplitude', fontsize=12)
plt.title(f'CASALS RX Waveforms (Sweeps {sweep0}-{sweep1}, Tracks {track0}-{track1})', 
          fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Plotted {len(indices)} RX waveforms")
print("\nNote: The peaks in RX waveforms represent surface returns")
print("The bin with maximum amplitude is used to determine the reference height (refh)")

Plotted 512 RX waveforms

Note: The peaks in RX waveforms represent surface returns
The bin with maximum amplitude is used to determine the reference height (refh)
CPU times: user 2.15 s, sys: 34.2 ms, total: 2.19 s
Wall time: 2.19 s

9. Zoom into RX Waveform Peaks

Key L1B variable: refh (reference height)

Defined as: Geolocated height of the waveform bin with the maximum amplitude (WGS84 ellipsoidal height)
The peaks we see here are the surface returns used to determine refh
Each pulse has an associated uncertainty estimate (refh_error)

%%time
# Zoom into the peak region to see surface returns in detail
peak0 = 1625  # Start bin for zoom
peak1 = 1700  # End bin for zoom

plt.figure(figsize=(12, 6))
for i in range(len(indices)):
    plt.plot(ds.rx_bins[peak0:peak1], rx_waveform[i, peak0:peak1], 
             alpha=0.7, linewidth=1.5)

plt.xlabel('RX Bin Number', fontsize=12)
plt.ylabel('Amplitude', fontsize=12)
plt.title(f'CASALS RX Waveform Peaks - Surface Returns (bins {peak0}-{peak1})', 
          fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Zoomed view shows surface return peaks")
print(f"The bin with maximum amplitude in this region is used for reference height")

Zoomed view shows surface return peaks
The bin with maximum amplitude in this region is used for reference height
CPU times: user 728 ms, sys: 16.2 ms, total: 744 ms
Wall time: 743 ms
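The zoom window above was chosen by eye. Since refh is tied to the bin with maximum amplitude, the window can also be derived from the data with `np.argmax`. The following is a sketch on a synthetic waveform: the noise level, peak position, and `half_width` are hypothetical values, not properties of CASALS data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 2728                  # Rx bins per waveform, as in the L1B file
bins = np.arange(n_bins)

# Synthetic stand-in for one Rx waveform: noise floor plus a Gaussian return
true_peak = 1660               # hypothetical peak location
noise = rng.normal(100.0, 2.0, n_bins)
waveform = noise + 500.0 * np.exp(-0.5 * ((bins - true_peak) / 3.0) ** 2)

peak_bin = int(np.argmax(waveform))   # bin with maximum amplitude
half_width = 40                       # hypothetical zoom half-width in bins
lo = max(peak_bin - half_width, 0)
hi = min(peak_bin + half_width, n_bins)
print(f"peak at bin {peak_bin}, zoom window {lo}-{hi}")
```

For a two-dimensional array of waveforms, `np.argmax(array, axis=1)` gives one peak bin per pulse, and the median of those values is a reasonable center for a shared zoom window.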

11. Summary of L1B CASALS data

L1B Global Metadata Attributes:

Names of input TDMS, L1A, GPS, ARD, and Geolocation files
Time epoch information (J2000 UTC-A format)
Number of Pulses, Sweeps, Tracks
Number of Tx and Rx Waveform Bins
Begin/End time of data
Processing constants (speed of light, conversions, etc.)

Key Dataset Categories:

1. Identifying Datasets (per pulse)

delta_time, sweep_num, track_num

2. Instrument Datasets (per pulse)

Instrument position and errors

3. Range Window Datasets (per pulse)

rwstart, rwstop (with position/errors)

4. Reference Height Datasets (per pulse)

refh (with position, amplitude, SNR, errors)

5. Beam Angle Datasets (per pulse)

local_beam_azimuth, local_beam_elevation (with errors)

6. Solar Angle Datasets (per pulse, at refh)

solar_azimuth, solar_elevation

7. Geophysical Datasets (per pulse, at refh)

Tide corrections, geoid, dac

8. Additional Quality Datasets

bg_mean, bg_std, bin_size, good_snr
range_bias_correction, refh_amp, refh_snr, refh_thres

12. Accessing Data from a Local Computer

If you’re running this notebook on your local machine (not in a cloud environment), the same anonymous access works because the CASALS bucket is public. AWS credentials are only needed for buckets that require authentication.

Three Access Methods

There are three different approaches:

Anonymous access (for public buckets like CASALS)
AWS credentials (for authenticated access)
Download locally first (for slower connections or repeated analysis)

Option 1: Use Anonymous (Public) Access

Since the data is in a public bucket, you can use anonymous access, the same as in cloud environments!

import xarray as xr

# For public buckets - same as cloud environment
s3url = 's3://casals-data/lidar/2024-11-18/casals_l1b_20241118T172656_001_02.h5'
ds = xr.open_dataset(s3url, engine='h5netcdf', storage_options={'anon': True})

Option 2: Configure AWS Credentials

For accessing data that requires authentication, you have several options:

Method 1: Using Environment Variables

# Set these in your terminal before running Python:
# export AWS_ACCESS_KEY_ID='your_access_key'
# export AWS_SECRET_ACCESS_KEY='your_secret_key'
# export AWS_DEFAULT_REGION='us-west-2'

import xarray as xr

s3url = 's3://casals-data/lidar/2024-11-18/casals_l1b_20241118T172656_001_02.h5'
ds = xr.open_dataset(s3url, engine='h5netcdf')

Method 2: Using AWS CLI Configuration

# Run in terminal:
aws configure

# Then provide your credentials when prompted
# This stores credentials in ~/.aws/credentials

Then use the dataset normally in Python:

import xarray as xr

s3url = 's3://casals-data/lidar/2024-11-18/casals_l1b_20241118T172656_001_02.h5'
ds = xr.open_dataset(s3url, engine='h5netcdf')

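Method 3: Specify Credentials Directly

import xarray as xr

s3url = 's3://casals-data/lidar/2024-11-18/casals_l1b_20241118T172656_001_02.h5'

# Pass credentials explicitly to s3fs (avoid committing real keys to source control)
storage_opts = {
    'key': 'your_access_key',
    'secret': 'your_secret_key',
    'client_kwargs': {'region_name': 'us-west-2'}
}
ds = xr.open_dataset(s3url, engine='h5netcdf', storage_options=storage_opts)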

Option 3: Download File Locally First

For slower internet connections or repeated analysis, download the file once:

import os
import boto3
from botocore import UNSIGNED
from botocore.config import Config
import xarray as xr

# Configure S3 client
s3 = boto3.client('s3', 
                  region_name='us-west-2',
                  config=Config(signature_version=UNSIGNED))

# Download file
bucket_name = 'casals-data'
file_key = 'lidar/2024-11-18/casals_l1b_20241118T172656_001_02.h5'
# boto3 does not expand '~', so resolve the home directory explicitly.
# Adjust this path to wherever you want the file saved on your computer.
local_filename = os.path.expanduser('~/Downloads/casals_l1b_local.h5')

print(f"Downloading {file_key}...")
print(f"This is a ~14 GB file, it may take several minutes...")
s3.download_file(bucket_name, file_key, local_filename)
print(f"Download complete! Saved as {local_filename}")

# Now open the local file
ds = xr.open_dataset(local_filename, engine='h5netcdf')
print("Dataset opened from local file")

# Use the dataset as normal
print(ds)
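Before starting a ~14 GB download, it can help to confirm that the target drive has room. The sketch below uses `shutil.disk_usage` from the standard library; `has_room_for` is a hypothetical helper, and the temporary directory stands in for whatever download folder you choose.

```python
import os
import shutil
import tempfile

def has_room_for(path, required_bytes):
    """Return True if the filesystem holding `path` has enough free space.

    The parent directory of `path` must already exist.
    """
    directory = os.path.dirname(os.path.expanduser(path)) or "."
    return shutil.disk_usage(directory).free >= required_bytes

# The temp directory is used here only so the example runs anywhere
target = os.path.join(tempfile.gettempdir(), "casals_l1b_local.h5")
needed = 14 * 1024**3  # ~14 GB, per the file sizes listed earlier

if has_room_for(target, needed):
    print("Enough free space for the download")
else:
    print("Not enough free space; choose another location")
```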

Required Python Packages

Install these packages on your local machine:

Using mamba (recommended):

mamba install -c conda-forge xarray h5netcdf s3fs boto3 matplotlib numpy

Or using pip:

pip install xarray h5netcdf s3fs boto3 matplotlib numpy

Tutorial Complete! 🎉

Happy analyzing! 🚀