# NoisePy DataStore Tutorial
Introduction to the NoisePy DataStore class.

In [None]:
# Uncomment and run this line if the environment doesn't have noisepy already installed:
# ! pip install noisepy-seis 

__Warning__: NoisePy uses ```obspy``` as a core Python module to manipulate seismic data. Restart the runtime now for proper installation of ```obspy``` on Colab.

This tutorial should be run after installing the noisepy package.

In [None]:
from noisepy.seis import __version__  # installed NoisePy version
from noisepy.seis.io.s3store import SCEDCS3DataStore  # object to query SCEDC data on S3
from noisepy.seis.io.channel_filter_store import channel_filter
from noisepy.seis.io.channelcatalog import XMLStationChannelCatalog        # Required stationXML handling object
from datetime import datetime
from datetimerange import DateTimeRange

print(f"Using NoisePy version {__version__}")

S3_STORAGE_OPTIONS = {"s3": {"anon": True}}

In [None]:
# timeframe for analysis
start = datetime(2022, 1, 2)
end = datetime(2022, 1, 4)
time_range = DateTimeRange(start, end)
print(time_range)

## DataStore

A NoisePy DataStore is a set of classes that accommodates the different ways researchers store their data, which can vary dramatically in format (mSEED, SAC, SEG-Y), file system (local, S3), and naming conventions. The NoisePy team does not impose a particular data structure; instead, we suggest wrapping the data storage structure in a Python class. A DataStore class acts as the front end of the actual storage back end and returns data through the `read_data` function. This lets users customize the class to match how they store their data while leaving the rest of the workflow untouched.

See https://github.com/noisepy/noisepy-io/blob/main/src/noisepy/seis/stores.py for more about the `DataStore` class.
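
The wrapping idea can be pictured with a small, self-contained sketch. Everything in it is a toy: `ToyDataStore` and `InMemoryStore` are hypothetical names, time spans are plain `(start, end)` tuples, and channels are plain strings. Only the three method names (`get_timespans`, `get_channels`, `read_data`) mirror the calls used later in this tutorial; the real NoisePy classes use their own types and signatures.

```python
from abc import ABC, abstractmethod
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# A time span is a (start, end) pair; a channel is a "NET.STA.CHA" string.
# Both are simplifications for this sketch, not NoisePy's own types.
TimeSpan = Tuple[datetime, datetime]


class ToyDataStore(ABC):
    """Hypothetical front end: the workflow only calls these three methods."""

    @abstractmethod
    def get_timespans(self) -> List[TimeSpan]: ...

    @abstractmethod
    def get_channels(self, timespan: TimeSpan) -> List[str]: ...

    @abstractmethod
    def read_data(self, timespan: TimeSpan, channel: str) -> List[float]: ...


class InMemoryStore(ToyDataStore):
    """Back-end stand-in: samples kept in a dict keyed by (span start, channel)."""

    def __init__(self, samples: Dict[Tuple[datetime, str], List[float]]):
        self._samples = samples

    def get_timespans(self) -> List[TimeSpan]:
        # One one-day span per distinct start time found in the stored keys.
        starts = sorted({start for start, _ in self._samples})
        return [(s, s + timedelta(days=1)) for s in starts]

    def get_channels(self, timespan: TimeSpan) -> List[str]:
        # Channels that have data starting at the beginning of this span.
        return sorted(ch for start, ch in self._samples if start == timespan[0])

    def read_data(self, timespan: TimeSpan, channel: str) -> List[float]:
        return self._samples[(timespan[0], channel)]


day = datetime(2022, 1, 2)
store = InMemoryStore({(day, "CI.SBC.BHZ"): [0.1, -0.2, 0.3], (day, "CI.RIO.BHZ"): [0.0, 0.5]})
spans = store.get_timespans()
toy_channels = store.get_channels(spans[0])
toy_data = store.read_data(spans[0], toy_channels[0])
print(toy_channels, toy_data)
```

Code that depends only on the abstract methods keeps working if `InMemoryStore` is swapped for a class that reads from local files or from S3, which is the point of the front-end/back-end split.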

### S3 DataStore
Here, we instantiate `SCEDCS3DataStore` as `raw_store`, an example of a DataStore in the cloud. This object reads data from the actual storage back end during later processing. The initialization parameters of `SCEDCS3DataStore` are:
- S3_DATA: path to the data in the `"s3://"` format.
- catalog: station catalog built from the station XML, which is available at a path in the `"s3://"` format.
- channel_filter: channel selection, based on network, station name, and/or channel type.
- time_range: DateTimeRange of data for processing.
- storage_options: optional storage options for reading S3 data. This is where you can put AWS keys/credentials if applicable.

See https://github.com/noisepy/noisepy-io/blob/main/src/noisepy/seis/io/s3store.py for `SCEDCS3DataStore`
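
The channel selection listed above can be thought of as a predicate that each candidate channel must pass. The sketch below illustrates that idea with standard-library code only. `ToyChannel` and `make_channel_filter` are hypothetical names invented for this illustration, and the matching rule (exact membership in all three lists) is an assumption, not a description of how NoisePy's `channel_filter` is implemented.

```python
from typing import Callable, List, NamedTuple


class ToyChannel(NamedTuple):
    """Hypothetical stand-in for a channel record (not NoisePy's own type)."""
    network: str
    station: str
    code: str


def make_channel_filter(
    networks: List[str], stations: List[str], codes: List[str]
) -> Callable[[ToyChannel], bool]:
    """Return a predicate that keeps a channel only if all three fields match."""
    net_set, sta_set, code_set = set(networks), set(stations), set(codes)

    def keep(ch: ToyChannel) -> bool:
        return ch.network in net_set and ch.station in sta_set and ch.code in code_set

    return keep


keep = make_channel_filter(["CI"], ["SBC", "RIO", "DEV"], ["BHE", "BHN", "BHZ"])
candidates = [
    ToyChannel("CI", "SBC", "BHZ"),   # matches network, station, and code
    ToyChannel("CI", "PASC", "BHZ"),  # station not in the list
    ToyChannel("CI", "RIO", "HHZ"),   # channel code not in the list
    ToyChannel("AZ", "DEV", "BHN"),   # network not in the list
]
selected = [ch for ch in candidates if keep(ch)]
print(selected)
```

Building the sets once and returning a closure keeps each per-channel check cheap, which matters when a catalog lists many channels.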

We will work with a single day's worth of data from SCEDC. The continuous data are organized as one miniSEED file per day and channel (https://scedc.caltech.edu/data/cloud.html). For this example, you can choose any year since 2002. We will cross-correlate just a single day.

In [None]:
# Common URL prefix of the continuous waveform data in the SCEDC S3 bucket
S3_DATA = "s3://scedc-pds/continuous_waveforms/"

# S3 storage of stationXML
S3_STATION_XML = "s3://scedc-pds/FDSNstationXML/CI/"  

stations = "SBC,RIO,DEV".split(",") # filter to these stations
catalog = XMLStationChannelCatalog(S3_STATION_XML, storage_options=S3_STORAGE_OPTIONS) # Station catalog
raw_store = SCEDCS3DataStore(S3_DATA, catalog, 
                             channel_filter(["CI"], stations, ["BHE", "BHN", "BHZ",
                                                               "EHE", "EHN", "EHZ"]), 
                             time_range, 
                             storage_options=S3_STORAGE_OPTIONS) # Store for reading raw data from S3 bucket
raw_store.fs

To see which methods are defined on the DataStore, we can list them as follows:

In [None]:
method_list = [method for method in dir(raw_store) if not method.startswith('__')]
print(method_list)

### The `get_timespans` function splits the whole time range into one-day spans
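
The daily split can be pictured with a short standard-library sketch. `split_into_days` is a hypothetical helper written for this illustration, it uses plain `(start, end)` tuples rather than `DateTimeRange` objects, and how a final partial day is handled is an assumption; the DataStore's own method may behave differently in that edge case.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def split_into_days(range_start: datetime, range_end: datetime) -> List[Tuple[datetime, datetime]]:
    """Split [range_start, range_end) into consecutive spans of at most one day."""
    day_spans = []
    cursor = range_start
    while cursor < range_end:
        # Clip the last span so it never extends past range_end.
        nxt = min(cursor + timedelta(days=1), range_end)
        day_spans.append((cursor, nxt))
        cursor = nxt
    return day_spans


day_spans = split_into_days(datetime(2022, 1, 2), datetime(2022, 1, 4))
for span_start, span_end in day_spans:
    print(span_start.isoformat(), "->", span_end.isoformat())
```

With the two-day range defined earlier in this tutorial, the sketch yields two spans, one per day.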

In [None]:
span = raw_store.get_timespans()
print(span)

## Get metadata of available channels

The `get_channels` function takes a time span and reads all StationXML files for that specific day.

In [None]:
channels = raw_store.get_channels(span[0])
channels

### Get data
With the time span and channel list, we can use the `read_data` function to read the data. Note that the returned channel data is parsed into the NoisePy `ChannelData` type.

The `stream` attribute is a standard ObsPy `Stream`.

In [None]:
d = raw_store.read_data(span[0], channels[2])
d.stream

In [None]:
d.stream.plot();