NoisePy AWS Batch Tutorial#
This tutorial shows how to use AWS Batch with Fargate Spot and containers to run a job that reads from and writes to AWS S3.
1. Checklist and prerequisites#
1.1 Tools#
You are not required to run this on an AWS EC2 instance, but two tools are required for this tutorial: the AWS Command Line Interface (CLI) and jq. Note that the code cell below only works on x86_64 CentOS systems where you have sudo permission. Installation instructions for other operating systems are available in the AWS CLI and jq documentation.
# Install AWS CLI (Command line interface)
# This tool may already be installed if you are on an EC2 instance running Amazon Linux
! curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
! unzip awscliv2.zip
! sudo ./aws/install
# You can check that the CLI is installed correctly with the following command,
# which lists the files in the SCEDC public bucket.
! aws s3 ls s3://scedc-pds
# Install jq
! sudo yum install -y jq
1.2 AWS Account#
The account ID is a 12-digit number that uniquely identifies your account. You can find it on your AWS web console.
⚠️ Save the workshop <ACCOUNT_ID> here: REPLACE_ME
1.3 Role#
An AWS role is a virtual identity with specific permissions. Its ID (called an ARN) has the format arn:aws:iam::<ACCOUNT_ID>:role/<ROLE>. AWS Batch requires a role for running the jobs, which can be created from the IAM panel on the AWS web console. Depending on the type of service used, separate roles may be created. The AWS Batch service requires a role with the following settings:
Trusted Entity Type: AWS Service
Use Case: Elastic Container Service, then Elastic Container Service Task
Permission Policies, search and add: AmazonECSTaskExecutionRolePolicy and AmazonS3FullAccess
Once the role is created, one more permission is needed:
Go to: Permissions tab -> Add Permissions -> Create inline policy
Search for “batch”
Click on Batch
Select Read / Describe Jobs
Click Next
Add a policy name, e.g. “Describe_Batch_Jobs”
Click Create Policy
⚠️ Workshop participants please use arn:aws:iam::<ACCOUNT_ID>:role/NoisePyBatchRole
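As a small illustration of the ARN format described above, the following sketch assembles a role ARN from an account ID and a role name. The helper name role_arn and the account ID 123456789012 are placeholders made up for this example; they are not part of NoisePy or the AWS CLI.

```python
import re


def role_arn(account_id: str, role_name: str) -> str:
    """Build an IAM role ARN of the form arn:aws:iam::<ACCOUNT_ID>:role/<ROLE>."""
    # Account IDs are 12-digit numbers; keep them as strings so that
    # leading zeros are preserved.
    if not re.fullmatch(r"[0-9]{12}", account_id):
        raise ValueError(f"expected a 12-digit account ID, got {account_id!r}")
    if not role_name:
        raise ValueError("role name must not be empty")
    return f"arn:aws:iam::{account_id}:role/{role_name}"


# 123456789012 is a placeholder, not a real workshop account.
print(role_arn("123456789012", "NoisePyBatchRole"))
```

Validating the account ID early tends to catch copy-and-paste mistakes before they surface as permission errors in later steps.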
1.4 S3 Storage#
NoisePy uses S3 cloud storage to store the cross-correlations and stacked data. For this step, your role and the bucket must have the appropriate permissions so that users can read from and write to the bucket.
The following JSON document is called a policy. It explicitly defines which operations are allowed or denied for which user or role. The bucket policy below states that:
all operations ("s3:*") are allowed by your account with the attached role ("arn:aws:iam::<ACCOUNT_ID>:role/<ROLE>") on any file in the bucket ("arn:aws:s3:::<S3_BUCKET>/*").
anyone is allowed to read the data within the bucket ("s3:GetObject", "s3:GetObjectVersion").
anyone is allowed to list the files within the bucket ("s3:ListBucket").
{
    "Version": "2012-10-17",
    "Id": "Policy1674832359797",
    "Statement": [
        {
            "Sid": "Stmt1674832357905",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<ACCOUNT_ID>:role/<ROLE>"
            },
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::<S3_BUCKET>/*"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<S3_BUCKET>/*"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::<S3_BUCKET>"
        }
    ]
}
⚠️ Save your <S3_BUCKET> name here: REPLACE_ME
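To avoid editing the placeholders by hand, the bucket policy above could be generated from your saved values. The following is a minimal sketch: the function name bucket_policy, the account ID, and the bucket name my-noisepy-bucket are illustrative assumptions. The optional "Id" and "Sid" fields are left out.

```python
import json


def bucket_policy(account_id: str, role: str, bucket: str) -> dict:
    """Return the three-statement bucket policy described above as a dict."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role}"
    bucket_arn = f"arn:aws:s3:::{bucket}"
    objects_arn = f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # The Batch role may perform any S3 operation on the objects.
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:*",
                "Resource": objects_arn,
            },
            {   # Anyone may read objects.
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": objects_arn,
            },
            {   # Anyone may list the bucket; note the resource is the bucket itself.
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,
            },
        ],
    }


# Placeholder values for illustration only.
policy = bucket_policy("123456789012", "NoisePyBatchRole", "my-noisepy-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the bucket's permissions page on the web console, or saved to a file and applied with aws s3api put-bucket-policy.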
2. Set Up Batch Jobs#
2.1 Compute Environment#
You’ll need two pieces of information to create the compute environment: the list of subnets in your VPC and the default security group ID. You can use the following commands to retrieve them.
! aws ec2 describe-subnets | jq ".Subnets[] | .SubnetId"
! aws ec2 describe-security-groups --filters "Name=group-name,Values=default" | jq ".SecurityGroups[0].GroupId"
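If you prefer to process the CLI output in Python rather than with jq, the two filters above can be mirrored with the standard json module. This is only a sketch: the function names are made up, and the sample JSON strings are shortened, invented stand-ins for real command output.

```python
import json


def subnet_ids(describe_subnets_json: str) -> list:
    """Equivalent of the jq filter '.Subnets[] | .SubnetId'."""
    return [s["SubnetId"] for s in json.loads(describe_subnets_json)["Subnets"]]


def first_security_group_id(describe_groups_json: str) -> str:
    """Equivalent of the jq filter '.SecurityGroups[0].GroupId'."""
    return json.loads(describe_groups_json)["SecurityGroups"][0]["GroupId"]


# Invented sample output; real responses contain many more fields.
subnets_out = '{"Subnets": [{"SubnetId": "subnet-aaa111"}, {"SubnetId": "subnet-bbb222"}]}'
groups_out = '{"SecurityGroups": [{"GroupName": "default", "GroupId": "sg-ccc333"}]}'

print(subnet_ids(subnets_out))
print(first_security_group_id(groups_out))
```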
Use these values to fill in the missing subnets and securityGroupIds fields in compute_environment.yaml, then run the command below. If you have multiple subnets, choose one of them.
For HPS-book readers, the file is also available here on GitHub.
! aws batch create-compute-environment --no-cli-pager --cli-input-yaml file://compute_environment.yaml
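For orientation, the sketch below builds a request with the fields a Fargate Spot compute environment typically needs. It is an assumption about the general shape of such a request, not a copy of the compute_environment.yaml shipped with NoisePy, which may contain different or additional settings. The environment name, the IDs, and the maxvCpus value of 256 are placeholders.

```python
import json


def compute_environment_request(name, subnets, security_group_ids, max_vcpus=256):
    """Build a create-compute-environment request for Fargate Spot."""
    if not subnets or not security_group_ids:
        raise ValueError("at least one subnet and one security group are required")
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",   # AWS manages the compute resources
        "state": "ENABLED",
        "computeResources": {
            "type": "FARGATE_SPOT",
            "maxvCpus": max_vcpus,   # illustrative upper bound, adjust as needed
            "subnets": list(subnets),
            "securityGroupIds": list(security_group_ids),
        },
    }


# Placeholder name and IDs for illustration only.
request = compute_environment_request(
    "noisepy-fargate-spot", ["subnet-aaa111"], ["sg-ccc333"]
)
print(json.dumps(request, indent=2))
```

The same fields appear as keys in the YAML file; the JSON form printed here could instead be passed to the CLI with --cli-input-json.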
2.2 Create a Job Queue#
Add the computeEnvironment and the jobQueueName in job_queue.yaml, then run the following command.
For HPS-book readers, the file is also available here on GitHub.
! aws batch create-job-queue --no-cli-pager --cli-input-yaml file://job_queue.yaml
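The relationship between the queue and the compute environment can be sketched as follows. The field layout reflects the general create-job-queue request; the actual job_queue.yaml may differ, and the names and the priority value are placeholders.

```python
import json


def job_queue_request(queue_name: str, compute_environment: str, priority: int = 1) -> dict:
    """Build a create-job-queue request attached to one compute environment."""
    return {
        "jobQueueName": queue_name,
        "state": "ENABLED",
        "priority": priority,   # only matters when several queues share an environment
        "computeEnvironmentOrder": [
            # The queue tries environments in ascending "order"; here there is one.
            {"order": 1, "computeEnvironment": compute_environment},
        ],
    }


# Placeholder names for illustration only.
print(json.dumps(job_queue_request("noisepy-queue", "noisepy-fargate-spot"), indent=2))
```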
2.3 Create a Job Definition#
Update the jobRoleArn and executionRoleArn fields in the job_definition.yaml file with the ARN of the role created in the first step (they should be the same in this case). Add a name for the jobDefinition and run the code below.
For HPS-book readers, the file is also available here.
! aws batch register-job-definition --no-cli-pager --cli-input-yaml file://job_definition.yaml
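The sketch below illustrates how a single role ARN ends up in both role fields of a Fargate job definition. It is a hedged outline of the general request shape, not the contents of NoisePy's job_definition.yaml. The definition name, the <IMAGE> placeholder, and the vCPU and memory values are assumptions chosen for illustration.

```python
import json


def job_definition_request(name: str, role_arn: str, image: str) -> dict:
    """Build a register-job-definition request that reuses one role ARN."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "platformCapabilities": ["FARGATE"],
        "containerProperties": {
            "image": image,
            # The same role is used to start the container and to run the job.
            "executionRoleArn": role_arn,
            "jobRoleArn": role_arn,
            "resourceRequirements": [
                # Values are strings; memory is in MiB. Illustrative sizes only.
                {"type": "VCPU", "value": "4"},
                {"type": "MEMORY", "value": "8192"},
            ],
        },
    }


# Placeholder values; <IMAGE> stands for the container image you intend to run.
arn = "arn:aws:iam::123456789012:role/NoisePyBatchRole"
print(json.dumps(job_definition_request("noisepy-job", arn, "<IMAGE>"), indent=2))
```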
3. Submit the Job#
3.1 Cross-correlation Configuration#
Update config.yaml with your NoisePy configuration. Then copy the file to S3 so that the batch job can access it after launching. Replace <S3_BUCKET> with the bucket you just set up, and choose an intermediate <PATH> to separate your runs from others.
! aws s3 cp ./config.yaml s3://<S3_BUCKET>/<PATH>/config.yaml
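Since the same bucket and path are reused in the following steps, it can help to build the S3 locations in one place. The helper below is a hypothetical convenience function, and the bucket and path values are placeholders.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and path components into an s3:// URI."""
    # Strip stray slashes so that "run1/" and "/run1" both join cleanly,
    # and skip components that are empty after stripping.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "s3://" + "/".join([bucket.strip("/"), *cleaned])


# Placeholder bucket and path for illustration only.
config_uri = s3_uri("my-noisepy-bucket", "alice/run1/", "config.yaml")
print(config_uri)  # s3://my-noisepy-bucket/alice/run1/config.yaml
```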
3.2 Run Cross-correlation#
Update job_cc.yaml with the names of the jobQueue and jobDefinition created in the previous steps, and give your job a name in jobName. Then update the S3 bucket paths to the locations you want to use for the output and for your config.yaml file.
For HPS-book readers, the file is also available here.
! aws batch submit-job --no-cli-pager --cli-input-yaml file://job_cc.yaml
3.3 Run Stacking#
Update job_stack.yaml with the names of the jobQueue and jobDefinition created in the previous steps, and give your job a name in jobName. Then update the S3 bucket paths to the locations you want to use for your input CCFs (e.g. the output of the previous cross-correlation run) and for the stack output. By default, NoisePy looks for a config file in the --ccf_path location, so that stacking uses the same configuration as cross-correlation.
For HPS-book readers, the file is also available here.
! aws batch submit-job --no-cli-pager --cli-input-yaml file://job_stack.yaml
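Submitted jobs run asynchronously, so you may want to check their state before moving on. The sketch below summarizes the JSON that aws batch describe-jobs --jobs <JOB_ID> returns. The function names are made up, and the sample string is an invented, heavily shortened stand-in for real output.

```python
import json

# A Batch job ends in one of these two states; every other state
# (e.g. RUNNABLE, STARTING, RUNNING) means it is still in progress.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def summarize_jobs(describe_jobs_json: str) -> dict:
    """Map each job name to its status from describe-jobs output."""
    jobs = json.loads(describe_jobs_json)["jobs"]
    return {job["jobName"]: job["status"] for job in jobs}


def all_finished(statuses: dict) -> bool:
    """True only if there is at least one job and every job has ended."""
    return bool(statuses) and all(s in TERMINAL_STATES for s in statuses.values())


# Invented sample output with placeholder job names and IDs.
sample = (
    '{"jobs": ['
    '{"jobName": "noisepy-cc", "jobId": "id-1", "status": "SUCCEEDED"},'
    '{"jobName": "noisepy-stack", "jobId": "id-2", "status": "RUNNING"}]}'
)
statuses = summarize_jobs(sample)
print(statuses, all_finished(statuses))
```

A job that ends in FAILED also counts as finished here, so inspect the individual statuses before relying on the output data.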
4. Visualization#
After all jobs have finished, you can use the plot_stacks tutorial to visualize the cross-correlations.