QuakeLabeler Tutorial

Overview

Preparing a training dataset is the first step of machine learning. However, annotating seismic labels is time-consuming and tedious work. QuakeLabeler (QL) produces earthquake labels automatically, saving your time for the next brilliant AI method.

QuakeLabeler works through interactive prompts, so researchers with little experience of making labels can still create training data. The rendered datasets are mainly intended for training subsequent machine learning (deep learning) applications, for example:

  1. Earthquake detection;

  2. Phase picking;

  3. Waveform classification;

  4. Magnitude prediction;

  5. Earthquake location.

QuakeLabeler is a Python package containing interactive command-line tools that help you quickly deploy your own seismic datasets. It provides a one-stop service for converting raw seismic data into training datasets for machine learning, through data collection and annotation techniques.

Data Collection

Artificial intelligence (AI) needs a significant amount of high-quality training data. Seismology does not lack raw data, but for AI research, suitable data can be difficult and time-consuming to collect, because it has to be prepared from the original seismic traces.

Unlike static seismic datasets, QL provides flexible, tailored data. QL first retrieves raw seismic data from online data centres (e.g. IRIS), then converts these data (seismograms) into standard training samples. Several signal pre-processing methods are implemented so that the final datasets meet the user’s requirements.

Data Annotation

With annotated data, models learn to handle complex scenarios. The higher the data accuracy, the better the model performance. With a wide range of data annotation tools, QL can automatically create seismic labels according to the user’s input options, so you need not worry about the exported data format being incompatible with your AI models.

Workflow

QuakeLabeler has a tight pipeline of functions that automatically builds the required seismic datasets. The main steps of the procedure are:

  1. Define research region and time range

  2. Design dataset

    • Custom waveform formats

    • Custom dataset formats

    • Custom export formats

  3. Request data from online data center

  4. Signal processing

  5. Annotation

  6. Make dataset

  7. Export statistical results

Usage

Start QuakeLabeler from any interactive shell (e.g. on macOS, open Terminal) by typing:

# start QuakeLabeler
QuakeLabeler

QuakeLabeler will initialize and prompt you to select one of the running modes:

(ql) hao@HaodeMacBook-Pro QuakeLabeler % QuakeLabeler
Welcome to QuakeLabeler----Fast AI Earthquake Dataset Deployment Tool!
QuakeLabeler provides multiple modes for different levels of Seismic AI researher

[Beginner] mode -- well prepared case studies;
[Advanced] mode -- produce earthquake samples based on Customized parameters.

Please select a mode: [1/Beginner/2/Advanced]

Beginner Mode

If you have little knowledge of how to create a training dataset, Beginner mode is the best way to get started quickly:

# you can also enter 1 or beginner for short
Beginner

The package will start Beginner mode and list several study regions for you to choose from:

Initialize Beginner Mode...
Select one of the following sample fields:  [1/2/3/4]
                   [1] 2010 Cascadia subduction zone earthquake activities
                   [2] 2011 Tōhoku earthquake and tsunami
                   [3] 2016 Oklahoma human activity-induced earthquakes
                   [4] 2018 Big quakes in Southern California
                   [0] Re-direct to Advanced mode.

For example, you can enter 1 to create training data based on the 2010 Cascadia subduction zone earthquake activities. QL will automatically search for event information from the default online data center (IRIS):

1

Note

Requesting event information (a catalog) from the online data center takes time, so you will need to wait. The script will notify you of this:

Loading time varies on your network connections, search region scale, time range, etc. Please be patient, estimated time: 3 mins
Request completed!!!
1525 events have been found!

Once you are informed that the events have been found, the script moves on to the next step. QL will ask you to enter the following settings to generate the dataset:

Please define your own expectation for Seismic labeled samples:

How many samples do you wish to create? [1- ] (input MAX for all available waveform):

The first question is about the total number (volume) of samples you wish to create. For basic machine learning methods, you could enter:

5000

Deep learning applications usually need more than 10,000 samples to avoid overfitting. QL does not have a maximum volume limit, but processing takes longer when you create a large dataset.

Caution

Make sure your local drive has enough free space to store your datasets.
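As a rough guide to how much space a dataset needs, the sketch below multiplies the number of samples by the points per trace and the bytes per point. The function name is hypothetical, and the 4 bytes per point is an assumption (32-bit floating point values); file headers and any CSV or chart output are ignored, so treat the result as a lower bound.

```python
def estimate_dataset_bytes(n_samples, points_per_trace, components=1, bytes_per_point=4):
    """Rough lower bound on dataset size in bytes (file headers are ignored).

    bytes_per_point=4 assumes 32-bit floating point samples.
    """
    return n_samples * points_per_trace * components * bytes_per_point


# 5000 single-component samples of 5000 points each
size = estimate_dataset_bytes(n_samples=5000, points_per_trace=5000)
print(f"{size / 1e6:.0f} MB")  # 100 MB
```

Saving three components per sample triples the estimate.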

The following questions all work in the same way; you only need to type in your desired options:

Do you want fixed sample length? [y/n] (default: y):y

Enter sample length (how many sample points do you wish in a trace)?(default 5000):

Select label type: [simple/specific]?
[simple]: P/S;
[specific]: P/Pn/Pb/S/Sn,etc.

Enter a fixed sampling rate(i.e.: 100.0) or skip for keep original sampling rate:
Select filter function for preprocess? [0/1/2/3]:
[0]: Do not apply filter function;
[1]: Butterworth-Lowpass;
[2]: Butterworth-Highpass;
[3]: Butterworth-Bandpass.

Do you want to detrend the waveforms ? [y/n]

Would you like random input? [y/n]n
Input waveforms start at: __ seconds before arrival.

Note that there are 2 different ways to generate a sample segment:

  1. Random input: the arrival time is placed at a random position in the waveform;

  2. Fixed input: the waveform starts __ seconds before the arrival.
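The difference between the two options can be sketched in terms of sample indices. This is only an illustration under stated assumptions, not QL's own code: `window_start` and its parameters are hypothetical names, and the arrival is assumed to be known as a sample index in the raw trace.

```python
import random


def window_start(arrival_index, window_length, sampling_rate, seconds_before=None, rng=random):
    """Return the raw-trace index at which the sample window begins.

    seconds_before=None  -> random input: the arrival lands at a random
                            position inside the window.
    seconds_before=value -> fixed input: the window begins that many
                            seconds before the arrival.
    The result can be negative if the arrival is too close to the start
    of the raw trace; the caller has to handle that case.
    """
    if seconds_before is None:
        offset = rng.randrange(window_length)  # arrival position within the window
    else:
        offset = int(round(seconds_before * sampling_rate))
        if offset >= window_length:
            raise ValueError("arrival would fall outside the window")
    return arrival_index - offset


# Fixed input: 10 s before an arrival at sample 12000 of a 20 Hz trace
print(window_start(12000, 5000, 20.0, seconds_before=10))  # 11800

# Random input: the arrival always stays inside the 5000-point window
start = window_start(12000, 5000, 20.0)
print(0 <= 12000 - start < 5000)  # True
```

Random placement keeps a model from learning that the arrival always sits at the same position, while a fixed offset gives every sample the same layout.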

For the other questions, you can leave them all blank to use the default parameters, or enter the key words that fit your preference. Note that for some questions you can enter multiple key words (e.g., SACMAT or MAT_MiniSeed).

# Leave blank if you wish to apply default options
Do you want to add random noise: [y/n] n
Select export file format: [SAC/MSEED/SEGY/NPZ/MAT]SAC
Save as single trace or multiple-component seismic data? [y/n]
Do you want to separate save traces as input and output? [y/n]
Do you want to separate save arrival information as a CSV file? [y/n]
Please input a folder name for your dataset (optional):
Do you want to generate statistical charts after creating the dataset? [y/n]

Tip

As a beginner, feel free to skip any option you do not know how to choose.

Once the questions are answered, QL will automatically deploy the customized dataset:

Processing |################################| 5/5Save to target folder: MyDataset2021-05-31T10:06
6 Trace(s) in Stream:
IU.COR.00.BH1 | 2010-09-07T11:39:49.719539Z - 2010-09-07T11:43:59.669539Z | 20.0 Hz, 5000 samples
IU.COR.00.BH2 | 2010-09-07T11:39:49.719539Z - 2010-09-07T11:43:59.669539Z | 20.0 Hz, 5000 samples
IU.COR.00.BHZ | 2010-09-07T11:39:49.719539Z - 2010-09-07T11:43:59.669539Z | 20.0 Hz, 5000 samples
IU.COR.10.BH1 | 2010-09-07T11:39:49.719538Z - 2010-09-07T11:41:54.694538Z | 40.0 Hz, 5000 samples
IU.COR.10.BH2 | 2010-09-07T11:39:49.719538Z - 2010-09-07T11:41:54.694538Z | 40.0 Hz, 5000 samples
IU.COR.10.BHZ | 2010-09-07T11:39:49.719539Z - 2010-09-07T11:41:54.694539Z | 40.0 Hz, 5000 samples

All available waveforms are ready!
5 of event-based samples are successfully downloaded!

Note

If you choose the n option for multiple-component seismic data, every Stream will hold all available components from one station at the event time. In the output printed above, the last Stream object has 6 available Trace(s) as one rendered sample.
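The start and end times in the printed Stream follow from the sample length and the sampling rate: a trace of N points spans (N - 1) / rate seconds. A minimal check of that arithmetic (the function name is only illustrative):

```python
def trace_duration_seconds(n_points, sampling_rate):
    """Time between the first and the last sample of a trace."""
    return (n_points - 1) / sampling_rate


print(trace_duration_seconds(5000, 20.0))  # 249.95
print(trace_duration_seconds(5000, 40.0))  # 124.975
```

This matches the listing above: the 20.0 Hz traces run from 11:39:49.72 to 11:43:59.67 (249.95 s), while the 40.0 Hz traces with the same 5000 points end at 11:41:54.69 (124.975 s). A fixed sample length therefore gives windows of different duration unless a fixed sampling rate is also set.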

Advanced Mode

If you are already an expert in machine learning, you can use Advanced mode to fill in all the customized options for your search fields. As in Beginner mode, you start from your interactive shell with the command:

QuakeLabeler

Enter 2 or Advanced to select Advanced mode:

# typing 2 also works
Advanced

QL will initiate Advanced mode once it receives valid input:

Initialize Advanced Mode...
Alternative region options are provided. Please select your preferred input function:

Please select one :  [STN/GLOBAL/RECT/CIRC/FE/POLY]
                     [STN]: Stations are restricted to specific station code(s);
                     [GLOBAL]: Stations are not restricted by region (i.e. all available stations);
                     [RECT]: Rectangular search of stations (recommended);
                     [CIRC]: Circular search of stations(recommended);
                     [FE]: Flinn-Engdahl region search of stations;
                     [POLY]: Customized polygon search.

Note

QL provides multiple ways to select your research region; choose the one that best fits your study case. In general, use RECT to search within a rectangular region, or STN to specify particular stations of interest. Note that a large region usually takes a long time to process.

Once you select a search option, QL runs the corresponding function and asks for your regional parameters. Taking RECT as an example, QL requests the 4 parameters of the rectangular region:

Please enter the latitudes(-90 ~ 90) at the bottom and top, the longitudes(-180 ~ 180) on the left and the right of the rectangular boundary.

Input rectangular bottom latitude: 31
Input rectangular top latitude: 46
Input rectangular left longitude: -128
Input rectangular right longitude: -114

When you finish, QL displays your input parameters so you can confirm there are no typing errors:

The input region is:
searchshape: RECT
bot_lat: 31
top_lat: 46
left_lon: -128
right_lon: -114

Input parameters confirm?  [y/n]
y
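The range limits stated in the prompt can be checked before confirming. The sketch below is a hypothetical helper, not part of QL; it reuses the parameter names from the confirmation printout and assumes that a left longitude greater than the right one means the box crosses the 180 degree meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RectRegion:
    """Rectangular search region, named after the fields QL prints back."""
    bot_lat: float
    top_lat: float
    left_lon: float
    right_lon: float

    def __post_init__(self):
        if not (-90 <= self.bot_lat < self.top_lat <= 90):
            raise ValueError("need -90 <= bot_lat < top_lat <= 90")
        for lon in (self.left_lon, self.right_lon):
            if not -180 <= lon <= 180:
                raise ValueError("longitudes must lie in [-180, 180]")

    def contains(self, lat, lon):
        """True if the point (lat, lon) lies inside the rectangle."""
        if not self.bot_lat <= lat <= self.top_lat:
            return False
        if self.left_lon <= self.right_lon:
            return self.left_lon <= lon <= self.right_lon
        # Assumption: left > right means the box crosses the 180 degree meridian
        return lon >= self.left_lon or lon <= self.right_lon


region = RectRegion(bot_lat=31, top_lat=46, left_lon=-128, right_lon=-114)
print(region.contains(44.6, -123.3))  # True: a point in western Oregon
print(region.contains(50.0, -123.3))  # False: north of the top latitude
```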

Once you have set up the research region, you set the time range in the same way:

Please enter time range:

Input start year (1900-):
2010
Input start month(1-12):
1
Input start day (1-31):
7
Input start time(00:00:00-23:59:59):
01:00:00
Input end year (1900-):
2010
Input end month(1-12):
1
Input end day (1-31):
10
Input end time(00:00:00-23:59:59):
03:00:00
start_year: 2010
start_month: 1
start_day: 7
start_time: 01:00:00
end_year: 2010
end_month: 1
end_day: 10
end_time: 03:00:00

Input parameters confirm?  [y/n]
y
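Before confirming, it can help to check that the entered values form a valid range. A small sketch with Python's standard `datetime` module (the helper name is hypothetical and not part of QL):

```python
from datetime import datetime


def build_datetime(year, month, day, time_of_day):
    """Combine the separately entered fields into one datetime.

    time_of_day uses the HH:MM:SS form requested by the prompt.
    datetime() raises ValueError for impossible dates such as 2010-02-30.
    """
    hour, minute, second = (int(part) for part in time_of_day.split(":"))
    return datetime(year, month, day, hour, minute, second)


start = build_datetime(2010, 1, 7, "01:00:00")
end = build_datetime(2010, 1, 10, "03:00:00")
if end <= start:
    raise ValueError("end time must be later than start time")
print(end - start)  # 3 days, 2:00:00
```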

Apart from the research region and time range, the following inputs are optional; e.g., you can select a magnitude range or a specific magnitude type that you are interested in. If you skip these questions, QL will use the default options:

Enter event magnitude limits (optional, enter blank space for default sets)
Input minimum magnitude (0.0-9.0 or blank space for skip this set):

Input maximum magnitude (0.0-9.0 or blank space for skip this set):

Enter specific magnitude types. Please note: the selected magnitude type will search for all possible magnitudes in that category:
                   E.g. MB will search for mb, mB, Mb, mb1mx, etc
                   Available input:
                   <Any>|<MB>|<MS>|<MW>|<ML>|<MD> or blank space for skip this set
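The category behaviour described in the prompt (MB also finding mb, mB, Mb, mb1mx) is consistent with a case-insensitive prefix match. The sketch below illustrates that reading only; the actual matching is performed by the search service and may differ, and the function name is hypothetical.

```python
def matches_magnitude_category(reported_type, category):
    """Assumed rule: case-insensitive prefix match; 'Any' accepts everything."""
    if category.lower() == "any":
        return True
    return reported_type.lower().startswith(category.lower())


for reported in ["mb", "mB", "Mb", "mb1mx", "Mw"]:
    print(reported, matches_magnitude_category(reported, "MB"))
# mb True
# mB True
# Mb True
# mb1mx True
# Mw False
```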

After the above definitions, the subsequent options are the same as in Beginner mode. The user goes through all the questions to define the dataset:

How many samples do you wish to create? [1- ] (input MAX for all available waveform):5000
Do you want fixed sample length? [y/n] (default: y):y
Enter sample length (how many sample points do you wish in a trace)?(default 5000): 5000
Select label type: [simple/specific]?
[simple]: P/S;
[specific]: P/Pn/Pb/S/Sn, etc.
specific
Enter a fixed sampling rate(i.e.: 100.0) or skip for keep original sampling rate:
Select filter function for preprocess? [0/1/2/3]:
[0]: Do not apply filter function;
[1]: Butterworth-Lowpass;
[2]: Butterworth-Highpass;
[3]: Butterworth-Bandpass. 0
Do you want to detrend the waveforms ? [y/n]n
Would you like random input? [y/n]y
Do you want to add random noise: [y/n] n
Select export file format: [SAC/MSEED/SEGY/NPZ/MAT]SAC
Save as single trace or multiple-component seismic data? [y/n]y
Do you want to separate save traces as input and output? [y/n]y
Do you want to separate save arrival information as a CSV file? [y/n]y
Please input a folder name for your dataset (optional): NewDataset
Do you want to generate statistical charts after creating the dataset? [y/n]y
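The two label types differ in how finely arrivals are named: specific keeps phase names such as Pn or Sn, while simple distinguishes only P and S. If you export specific labels and later want simple ones, a lookup table is enough. The mapping below is an assumption for illustration; the tutorial does not state how QL groups phases, and Pg, Sb, and Sg are added here only as further common crustal phase names.

```python
# Hypothetical grouping of specific phase names into the two simple labels
SIMPLE_LABEL = {
    "P": "P", "Pn": "P", "Pb": "P", "Pg": "P",
    "S": "S", "Sn": "S", "Sb": "S", "Sg": "S",
}


def simplify_phase(phase):
    """Return 'P' or 'S' for a known phase name, or None if it is not listed."""
    return SIMPLE_LABEL.get(phase.strip())


print(simplify_phase("Pn"))  # P
print(simplify_phase("Sn"))  # S
print(simplify_phase("Lg"))  # None
```

An explicit table is used instead of taking the first letter because depth phases such as sP arrive as P waves despite the leading s.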

Note

  • Processing time varies with the dataset volume.

  • Only use pre-processing options if necessary.
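To make concrete what one pre-processing option changes, the sketch below removes a least-squares straight line from a trace. It is a plain-Python illustration of linear detrending in general, not QL's implementation; whether QL's detrend option removes a linear trend or only the mean is not stated here.

```python
def detrend_linear(samples):
    """Subtract the least-squares straight line from a sequence of samples."""
    n = len(samples)
    if n < 2:
        return list(samples)
    mean_x = (n - 1) / 2                      # mean of indices 0..n-1
    mean_y = sum(samples) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    slope = sxy / sxx
    return [y - (mean_y + slope * (i - mean_x)) for i, y in enumerate(samples)]


# A pure ramp is removed completely
ramp = [0.0, 1.0, 2.0, 3.0]
print(max(abs(v) for v in detrend_linear(ramp)) < 1e-9)  # True
```

Because the operation alters every sample value, it is worth applying only when the downstream model expects trend-free input.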