ECS Cluster for Staging


Store Locator Plus® is being migrated to an Elastic Container Service (ECS) cluster that is expected to be active in Q4 2024. The cluster is updated automatically: changes pushed to the myslp_aws_ecs_kit git repo trigger a CodePipeline build that deploys them to the cluster.

ECS Cluster

The ECS cluster that is accessed by the pipeline is myslp-staging-cluster.
arn:aws:ecs:us-east-1:744590032041:cluster/myslp-staging-cluster

This cluster is designed to run EC2 instances that host the SLP SaaS containers.
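
The cluster details can be confirmed from the AWS CLI; a quick check, assuming credentials for account 744590032041 are configured:

# Confirm the staging cluster and see counts of registered container instances and active services.
aws ecs describe-clusters \
    --clusters myslp-staging-cluster \
    --region us-east-1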

Infrastructure

The instances are managed by the following Auto Scaling Group (ASG):

Infra-ECS-Cluster-myslp-staging-cluster-a97a9fa8-ECSAutoScalingGroup-zoFBNbZvjeFk

arn:aws:autoscaling:us-east-1:744590032041:autoScalingGroup:e0255cb5-e03b-4f35-adb4-398b947028b8:autoScalingGroupName/Infra-ECS-Cluster-myslp-staging-cluster-a97a9fa8-ECSAutoScalingGroup-zoFBNbZvjeFk

This ASG provides the compute capacity (EC2 instances in this case) that the cluster's defined services use to run their tasks.

Auto Scaling Group Details

The group should have a minimum capacity of 1.

The group uses the following launch template: lt-07e8f4ebedbe1c2ff

That launch template uses AMI ID: ami-05a490ca1a643e9ea

It runs on a Graviton compute instance type, which is ARM64 based. Currently it runs on a c6g.xlarge.

The system tags help associate any resources launched by this ASG with the ECS cluster. The special sauce is in the launch template inline scripts, however.
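
Both the ASG settings and the launch template it references can be inspected from the CLI; a quick sketch, with the same account credentials assumed:

# Show the ASG configuration, including min/max capacity and the launch template it uses.
aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names Infra-ECS-Cluster-myslp-staging-cluster-a97a9fa8-ECSAutoScalingGroup-zoFBNbZvjeFk \
    --region us-east-1

# Show the latest launch template version, including the AMI, instance type, and user data (base64 encoded).
aws ec2 describe-launch-template-versions \
    --launch-template-id lt-07e8f4ebedbe1c2ff \
    --versions '$Latest' \
    --region us-east-1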

Launch Template Details

The following “advanced details” in the launch template seem to be what registers any EC2 instances that this ASG fires up with the ECS Cluster:

User data contains scripts or other configuration that run as soon as the EC2 instance comes online.

The AMI likely has AWS software preinstalled, including the ECS container agent, which reads the /etc/ecs/ecs.config file to figure out which cluster to join on boot or when the agent service restarts.
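
For ECS-optimized AMIs the user data is usually a short shell script that writes the cluster name into that file. A minimal sketch of what this launch template's user data likely does (the actual script may contain more than this):

#!/bin/bash
# Tell the ECS agent which cluster this instance should register with.
echo "ECS_CLUSTER=myslp-staging-cluster" >> /etc/ecs/ecs.config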

Tasks

Task definitions are the ECS equivalent of Docker Compose files, with added information about what type of container to create.

The task definition on the AWS Console for the configuration below is named slp_saas_staging:3 (as of Oct 31 2024). In addition to the environment variables noted below, an additional environment variable, WORDPRESS_DB_PASSWORD, is set when creating the task definition via the console. The password is for the myslp_dashboard database (baked into the ECR image built with CodePipeline via the WORDPRESS_DB_NAME environment variable) with a user of myslp_genesis (also baked into the ECR image via the WORDPRESS_DB_USER environment variable).

From the myslp_aws_ecs_kit repo, AWS/ECS/tasks/slp_saas_staging.json:

{
    "family": "slp_saas_staging",
    "requiresCompatibilities": ["EC2"],
    "runtimePlatform": {
        "operatingSystemFamily": "LINUX",
        "cpuArchitecture": "ARM64"
    },
    "networkMode": "awsvpc",
    "cpu": "3 vCPU",
    "memory": "6 GB",
    "executionRoleArn": "arn:aws:iam::744590032041:role/ecsTaskExecutionRole",
    "containerDefinitions": [
        {
            "name": "slp_saas",
            "essential": true,
            "image": "744590032041.dkr.ecr.us-east-1.amazonaws.com/myslp2024-aarch64:staging",
            "portMappings": [
                {
                    "containerPort": 80,
                    "hostPort": 80
                }
            ],
            "environment": [
                {
                    "name": "WP_HOSTURL",
                    "value": "staging.storelocatorplus.com"
                },
                {
                    "name": "WP_HOME",
                    "value": "https://staging.storelocatorplus.com/"
                },
                {
                    "name": "WP_SITEURL",
                    "value": "https://staging.storelocatorplus.com/"
                },
                {
                    "name": "WORDPRESS_DB_HOST",
                    "value": "slp-staging-2023-aug-cluster-cluster.cluster-c0glwpjjxt7q.us-east-1.rds.amazonaws.com"
                },
                {
                    "name": "WORDPRESS_DEBUG",
                    "value": "true"
                },
                {
                    "name": "WORDPRESS_CONFIG_EXTRA",
                    "value": "define( 'WP_DEBUG_LOG', '/var/www/html/debug.log');define( 'WP_DEBUG_DISPLAY', true);define( 'WP_DEBUG_SCRIPT', true);@ini_set('display_errors',1);define('SUNRISE', true);defined('DOMAIN_CURRENT_SITE') || define('DOMAIN_CURRENT_SITE', getenv_docker('WP_HOSTURL', 'staging.storelocatorplus.com') );define('WP_ALLOW_MULTISITE', true );define('MULTISITE', true);define('SUBDOMAIN_INSTALL', false);define('PATH_CURRENT_SITE', '/');define('SITE_ID_CURRENT_SITE', 1);define('BLOG_ID_CURRENT_SITE', 1);if ( ! defined( 'WPMU_PLUGIN_DIR' ) ){define('WPMU_PLUGIN_DIR', dirname( __FILE__ ) . '/wp-content/mu-plugins' );}"
                }
            ]
        }
    ]
}
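
A new revision of this task definition can be registered directly from that file with the AWS CLI; a sketch assuming the command is run from the repo root (the CodePipeline build may handle this step differently):

# Register a new revision of the slp_saas_staging task definition from the repo file.
aws ecs register-task-definition \
    --cli-input-json file://AWS/ECS/tasks/slp_saas_staging.json \
    --region us-east-1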

Services

Services run various parts of the application. For SLP in the initial Q4 2024 state there is only one service – the SLP SaaS web service.

The staging service that runs the SaaS staging task is at:
arn:aws:ecs:us-east-1:744590032041:service/myslp-staging-cluster/myslp-staging-service

The service is set to run the slp_saas_staging task in daemon mode, which means it runs one copy of the task on each container instance (EC2 instance) in the cluster.

The service definition sets up the containers.
Container Image (on ECR): 744590032041.dkr.ecr.us-east-1.amazonaws.com/myslp2024-aarch64:staging

It also sets up the environment variables passed into the container.
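
When a new :staging image lands in ECR, the service has to roll new tasks to pick it up. A sketch of forcing that redeploy by hand (the pipeline may do this automatically):

# Force the service to launch fresh tasks that pull the latest :staging image.
aws ecs update-service \
    --cluster myslp-staging-cluster \
    --service myslp-staging-service \
    --force-new-deployment \
    --region us-east-1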


Analyzing AWS Scaling Group Traffic

With the re-listing of the Store Locator Plus® WordPress plugin in the WordPress plugin directory, there has been a notable increase in outward scaling of the Store Locator Plus® application cluster. Key Store Locator Plus® websites and services run on a horizontally scalable cluster built on AWS Scaling Groups, an AWS Load Balancer, and EC2 instances. Every night, starting around midnight EST, the scaling group adds one node per hour until 3AM EST, at which point the cluster starts scaling back.

The AWS cluster is handling the load well, but we want to investigate in case something else is going on. Scaling can be caused by a number of issues, including network attacks, application misconfiguration, coding errors, routine bot traffic, or routine customer interaction patterns. It is best to get insight into the issue and know the root cause for certain.

This article walks through a real-time analysis of the events and traffic patterns that are triggering the scaling.

Background

The AWS Scaling Group that manages the Store Locator Plus® cluster is configured to monitor average CPU usage over time and add nodes to the cluster when the servers start to climb above 80% utilization. This is often an early indicator of impending server overload and a good baseline metric on which to base scaling events.
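
One common way to express that behavior is a target tracking policy on the ASG; the sketch below is illustrative only, since the production group's actual policy (and its name) is not shown here:

# Keep average CPU across the group near 80%, adding or removing instances as needed.
# The group name below is a placeholder for the production scaling group.
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name slp-production-scaling-group \
    --policy-name cpu-80-target-tracking \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration '{"PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"}, "TargetValue": 80.0}'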

These events are triggered almost nightly at the same time. This typically indicates a routine scheduled process such as a site crawler via a bot (aka spider) or a scheduled routine running on a customer’s website.

The latest scaling event started at 11:55PM EST last night, so we’ll start there.

First Stop : AWS Dashboard

We want to verify our timestamps with more specificity as the email notifications are not necessarily precise.

AWS Scaling Groups

We’ll look at the AWS Scaling Groups first. We have monitoring enabled and can get a quick overview of the group activity.

We can see the instance count jump from our baseline of 2 nodes to 3 nodes almost exactly at 03:56 UTC. Remember that AWS mostly notes times in UTC, which puts us at 11:55PM EST. The instance count drops back to baseline at 13:34 UTC, which is around 9:34AM EST.
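
The exact scaling timestamps can also be confirmed from the CLI instead of the console graphs; a sketch, with the group name as a placeholder:

# List recent scale-out and scale-in activities with their exact start and end times (UTC).
aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name slp-production-scaling-group \
    --max-items 10 \
    --region us-east-1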

We can also look at our aggregate EC2 instance metrics for all instances that are part of this scaling group. We can see the bulk of CPU usage starts to fire up around 03:48 UTC but really kicks in an hour later around 04:48 UTC before regularly grinding away from 05:48 UTC through 07:48 UTC. The pattern looks a lot like an external process, possibly a bot.

Inbound network requests on the EC2 instances reflect the same, with a single 502Mbps spike starting at 05:03 UTC.
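
The same aggregate CPU and network views can be pulled from CloudWatch on the command line; a sketch, with the group name and the night's time window as placeholders:

# Average CPU across all instances in the scaling group, in 5-minute buckets (times in UTC).
# Adjust the group name and the start/end times to the night in question.
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=AutoScalingGroupName,Value=slp-production-scaling-group \
    --start-time 2024-11-01T03:00:00Z \
    --end-time 2024-11-01T14:00:00Z \
    --period 300 \
    --statistics Average \
    --region us-east-1

# Swap in --metric-name NetworkIn --statistics Sum to see the inbound traffic spike.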

Inspecting EC2 Logs

While our load balancer is logging access to an S3 bucket, it is often difficult to locate and parse the logs with the volume of requests being pushed to the bucket every day. While there are log parsing and reporting services out there, there is a faster way to get insight into requests — looking at the local disk logs on a running EC2 instance in the cluster.

If you are lucky, one of the current nodes was online as part of the cluster for the entire event. Given the span of time and the amount of traffic, that single node provides a reasonable cross-section of requests: it logged 50% of requests while the cluster was at its 2-node baseline, and at least 33% of requests during the initial spikes as the cluster expanded to 3 nodes.

Since our nodes are all running some form of web application, we want to check our web server (nginx) log files in /var/log/nginx. Keep in mind the EC2 servers are configured to be in the data center’s time zone, so the log files for our US-East-1 zone servers will be in EST. We want to look between 11:55PM EST and 9:34AM EST with a focus on the 1:48 – 6:48AM EST entries.
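
A couple of one-liners make quick work of that window; a sketch assuming the default combined log format in /var/log/nginx/access.log (rotated file names may differ on these instances):

# Requests per minute between 01:00 and 07:00 local time, busiest minutes first.
awk -F'[][]' '{
    split($2, t, ":");                      # t[2] = hour, t[3] = minute from [dd/Mon/yyyy:HH:MM:SS ...]
    hhmm = t[2] ":" t[3];
    if (hhmm >= "01:00" && hhmm <= "07:00") print hhmm;
}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

# Most requested URLs during the 01:00 hour, where the suspicious traffic shows up.
grep ':01:[0-5][0-9]:' /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20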

Bad Actor In Log Files

The nginx access log has the fingerprints of a brute force attack against the server around 01:28 EST (05:30 UTC). Many of the URLs here are known weak points in apps that may be running on a server (not ours, though). These can be blocked with a Web Application Firewall (WAF) rule update on AWS, a service that provides edge-of-cloud protection and can keep these requests from reaching our cluster in the first place.

A similar attack shows up in the logs on the demo site.