DataStore Tutorial

DataStore Tutorial#

Introduction to the NoisePy DataStore class.

# Uncomment and run this line if the environment doesn't have noisepy already installed:
# ! pip install noisepy-seis 

Warning: NoisePy uses obspy as a core Python module to manipulate seismic data. Restart the runtime now for proper installation of obspy on Colab.

This tutorial should be ran after installing the noisepy package.

from noisepy.seis import  __version__       # noisepy core functions
from noisepy.seis.io.s3store import SCEDCS3DataStore # Object to query SCEDC data from on S3
from noisepy.seis.io.channel_filter_store import channel_filter
from noisepy.seis.io.channelcatalog import XMLStationChannelCatalog        # Required stationXML handling object
from datetime import datetime
from datetimerange import DateTimeRange

print(f"Using NoisePy version {__version__}")

S3_STORAGE_OPTIONS = {"s3": {"anon": True}}

/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/noisepy/seis/io/utils.py:13: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
  from tqdm.autonotebook import tqdm

Using NoisePy version 0.1.dev1

# timeframe for analysis
start = datetime(2022, 1, 2)
end = datetime(2022, 1, 4)
timerange = DateTimeRange(start, end)
print(timerange)

2022-01-02T00:00:00 - 2022-01-04T00:00:00

DataStore#

A noisepy DataStore is a set of classes to accommodate the various types of data store that are necessary because how reseachers store their data, which can be dramatically different w.r.t. formats (mSEED, SAC, SEG-Y), file system (local, S3), and naming conventions. Our noisepy team does not impose a definite data structure, but instead suggest to wrap the data storage structure into a python class. A Data Store class can be the front-end of the real back-end data storage, and return data through read_data function. It allows users to customize based on how they store the data, and leaving the rest of the workflow untouched.

S3 DataStore#

Here, we instantiate a SCEDCS3DataStore class as raw_store as an example of Data Store on the cloud. This variable allows reading data from the real data storage backend during the later processing. The initialization parameters of SCEDCS3DataStore are

S3_DATA: path to the data in the "s3://" format.
catalog: path to the station XML available in the "s3://" format.
channel_filter: channel selection, based on station name and/or channel type.
time_range: DateTimeRange of data for processing.
storage_option: optimal storage option to read S3 data. This is where you can put AWS keys/credential if applicable.

We will work with a single day worth of data on SCEDC. The continuous data is organized with a single day and channel per miniseed (https://scedc.caltech.edu/data/cloud.html). For this example, you can choose any year since 2002. We will just cross correlate a single day.

# SCEDC S3 bucket common URL characters for that day.
S3_DATA = "s3://scedc-pds/continuous_waveforms/"

# S3 storage of stationXML
S3_STATION_XML = "s3://scedc-pds/FDSNstationXML/CI/"  

stations = "SBC,RIO,DEV".split(",") # filter to these stations
catalog = XMLStationChannelCatalog(S3_STATION_XML, storage_options=S3_STORAGE_OPTIONS) # Station catalog
raw_store = SCEDCS3DataStore(S3_DATA, catalog, 
                             channel_filter(["CI"], stations, ["BH?", "EH?"]), timerange, 
                             storage_options=S3_STORAGE_OPTIONS) # Store for reading raw data from S3 bucket
raw_store.fs

<s3fs.core.S3FileSystem at 0x7f18c06192a0>

To know what method was defined under the DataStore, we can list them as follow

method_list = [method for method in dir(raw_store) if method.startswith('__') is False]
print(method_list)

['_abc_impl', '_ensure_channels_loaded', '_get_datepath', '_get_filename', '_load_channels', '_parse_channel', '_parse_timespan', 'chan_catalog', 'chan_filter', 'channels', 'date_range', 'file_re', 'fs', 'get_channels', 'get_inventory', 'get_timespans', 'path', 'paths', 'read_data']

The `get_timespan` function cuts the whole time span into each day#

span = raw_store.get_timespans()
print(span)

[2022-01-02T00:00:00+0000 - 2022-01-03T00:00:00+0000, 2022-01-03T00:00:00+0000 - 2022-01-04T00:00:00+0000]

Get metadata of available channels#

The get_channel function takes a time span, and read all stationXML for that specific day

channels = raw_store.get_channels(span[0])
channels

2025-04-17 02:01:36,246 139744393649024 INFO utils.log_raw(): TIMING:  1.428 secs for Listing 3951 files from s3://scedc-pds/continuous_waveforms/2022/2022_002/

2025-04-17 02:01:36,321 139744393649024 INFO utils.log_raw(): TIMING:  0.076 secs for Init: 1 timespans and 9 channels

2025-04-17 02:01:36,516 139742897243840 INFO channelcatalog._get_inventory_from_file(): Reading StationXML file s3://scedc-pds/FDSNstationXML/CI/CI_DEV.xml

2025-04-17 02:01:36,742 139742798677696 INFO channelcatalog._get_inventory_from_file(): Reading StationXML file s3://scedc-pds/FDSNstationXML/CI/CI_SBC.xml

2025-04-17 02:01:36,748 139742779803328 INFO channelcatalog._get_inventory_from_file(): Reading StationXML file s3://scedc-pds/FDSNstationXML/CI/CI_RIO.xml

2025-04-17 02:01:40,902 139744393649024 INFO s3store.get_channels(): Getting 9 channels for 2022-01-02T00:00:00+0000 - 2022-01-03T00:00:00+0000

[CI.DEV.BHE,
 CI.DEV.BHN,
 CI.DEV.BHZ,
 CI.RIO.BHE,
 CI.RIO.BHN,
 CI.RIO.BHZ,
 CI.SBC.BHE,
 CI.SBC.BHN,
 CI.SBC.BHZ]

Get data#

With the time and channel list, we can use read_data function to read the data. Note that the returned channel data is parsed into NoisePy ChannelData type.

The data type stream is a typical obspy stream.

d = raw_store.read_data(span[0], channels[2])
d.stream

1 Trace(s) in Stream:
CI.DEV..BHZ | 2022-01-02T00:00:00.019539Z - 2022-01-02T23:59:59.994539Z | 40.0 Hz, 3456000 samples

d.stream.plot();

_images/607f06e35a3411160f1b1e1568d440c44708c23d22f4fdb7e8698adfa4a306fa.png