Analyzing AWS Scaling Group Traffic

With the re-listing of the Store Locator Plus® WordPress plugin in the WordPress plugin directory, there has been a notable increase in outward scaling of the Store Locator Plus® application cluster. Key Store Locator Plus® websites and services run on a horizontally scalable cluster built on AWS Scaling Groups, an AWS Load Balancer, and EC2 instances. Every night starting around midnight EST the scaling group adds one node per hour until 3AM EST, at which point it starts scaling back in.

The AWS cluster is handling the load well, but we want to investigate in case something else is going on. Scaling can be triggered by a number of issues, including network attacks, application misconfiguration, coding errors, routine bot traffic, or routine customer interaction patterns. It is best to get insight into the issue and know the root cause for certain.

This article walks through a real-time analysis of the events and traffic patterns that are triggering the scaling.

Background

The AWS Scaling Group that manages the Store Locator Plus® cluster is configured to monitor average CPU usage over time and add nodes to the cluster when the servers climb above 80% utilization. High average CPU is often an early indicator of impending server overload and a sensible metric on which to base scaling events.
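For readers who want to see what that kind of policy looks like in code, here is a minimal sketch using boto3 and a target tracking configuration. The group name and threshold are illustrative placeholders; our production policy may well be defined differently (for example, through the console or as a step scaling policy).

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Target tracking adds or removes nodes to keep average CPU near the target.
# The group name and target value below are placeholders, not our exact setup.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="slp-web-asg",
    PolicyName="cpu-80-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 80.0,
    },
)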

These events trigger almost nightly at roughly the same time. That typically points to a scheduled process, such as a bot (aka spider) crawling the site or a scheduled task running on a customer's website.

The latest scaling event started at 11:55PM EST last night, so we’ll start there.

First Stop: AWS Dashboard

We want to verify our timestamps with more precision, as the email notifications are not necessarily exact.

AWS Scaling Groups

We’ll look at the AWS Scaling Groups first. We have monitoring enabled and can get a quick overview of the group activity.

We can see the instance count jump from our baseline of 2 nodes to 3 nodes almost exactly at 03:56 UTC. Remember, AWS generally reports times in UTC, which puts us at 11:55PM EST. The count drops back to baseline at 13:34 UTC, which is around 9:34AM EST.
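The same activity history, with exact UTC timestamps, can be pulled with boto3 if you prefer the SDK to the console. The group name below is a placeholder.

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Pull the most recent scaling activities for the group (placeholder group name).
activities = autoscaling.describe_scaling_activities(
    AutoScalingGroupName="slp-web-asg",
    MaxRecords=20,
)["Activities"]

for activity in activities:
    # StartTime is reported in UTC, matching the dashboard graphs.
    print(activity["StartTime"], activity["StatusCode"], activity["Description"])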

We can also look at our aggregate EC2 instance metrics for all instances that are part of this scaling group. The bulk of the CPU usage starts to ramp up around 03:48 UTC, really kicks in an hour later around 04:48 UTC, and then grinds away steadily from 05:48 UTC through 07:48 UTC. The pattern looks a lot like an external process, possibly a bot.

Inbound network requests on the EC2 instances reflect the same pattern, with a single 502Mbps spike starting at 05:03 UTC.
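If you want the raw datapoints rather than reading them off the dashboard graphs, the same aggregate metrics can be queried through CloudWatch. A rough boto3 sketch follows; the group name and date window are placeholders.

from datetime import datetime, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Window spanning the scaling event, in UTC (placeholder date).
start = datetime(2024, 3, 15, 3, 0, tzinfo=timezone.utc)
end = datetime(2024, 3, 15, 14, 0, tzinfo=timezone.utc)

for metric, stat in (("CPUUtilization", "Average"), ("NetworkIn", "Maximum")):
    points = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "slp-web-asg"}],
        StartTime=start,
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=[stat],
    )["Datapoints"]
    for point in sorted(points, key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point[stat])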

Inspecting EC2 Logs

While our load balancer logs every request to an S3 bucket, it is often difficult to locate and parse those logs given the volume of requests pushed to the bucket every day. There are log parsing and reporting services out there, but there is a faster way to get insight into requests: looking at the local disk logs on a running EC2 instance in the cluster.

If you are lucky, one of the current nodes will have been online for the entire event. Given the span of time and the amount of traffic, a single node provides a reasonable cross-section of requests: it logged roughly 50% of requests while the cluster was at its 2-node baseline, and we can assume it logged at least 33% of requests during the initial spikes as the cluster expanded to 3 nodes.

Since our nodes are all running some form of web application, we want to check our web server (nginx) log files in /var/log/nginx. Keep in mind these EC2 servers are configured to use the data center's local time zone, so the log files for our US-East-1 servers will be in EST. We want to look between 11:55PM EST and 9:34AM EST, with a focus on the 1:48 – 6:48AM EST entries.
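A quick, throwaway filter can narrow the access log to that window. The sketch below assumes the default nginx combined log format and uses an illustrative date; adjust the path and window to the event you are chasing.

import re
from datetime import datetime

# Default nginx combined-format timestamps look like [15/Mar/2024:01:48:00 -0400].
TS_RE = re.compile(r"\[([^\]]+)\]")
WINDOW_START = datetime.strptime("15/Mar/2024:01:48:00 -0400", "%d/%b/%Y:%H:%M:%S %z")
WINDOW_END = datetime.strptime("15/Mar/2024:06:48:00 -0400", "%d/%b/%Y:%H:%M:%S %z")

with open("/var/log/nginx/access.log") as log:
    for line in log:
        match = TS_RE.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if WINDOW_START <= stamp <= WINDOW_END:
            print(line, end="")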

Bad Actor In Log Files

The nginx access logs have the fingerprints of a brute force attack against the server around 01:28 EST (05:30 UTC). Many of the URLs here are known weak points in apps that may be running on a server (not ours, though). These can be blocked with a Web Application Firewall (WAF) rule update on AWS, a service that provides edge-of-cloud protection and can keep these requests from reaching our cluster in the first place.
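To gauge the volume of this kind of probing, a quick tally of suspicious request paths helps. The watch-list below is purely illustrative; swap in the endpoints that actually show up in your own access logs.

import re
from collections import Counter

# Commonly probed endpoints; this list is an illustration of the kind of
# fingerprint to look for, not the exact URLs from our logs.
SUSPICIOUS = ("/wp-login.php", "/xmlrpc.php", "/.env", "/phpmyadmin", "/vendor/")
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+)')

hits = Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        match = REQUEST_RE.search(line)
        if match and any(marker in match.group(1) for marker in SUSPICIOUS):
            hits[match.group(1)] += 1

# Most-hit suspicious paths bubble to the top, a quick gauge of probe volume.
for path, count in hits.most_common(20):
    print(f"{count:6d}  {path}")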

A similar attack shows up in the logs for the demo site.