Cross Container (ECS) WordPress Session Management

Since containers are ephemeral and each instance handles requests independently, sharing session data requires centralized session storage such as AWS ElastiCache.

ElastiCache can be configured for Valkey (an open source fork of Redis) or Memcached. Valkey is lower cost.

Set up ElastiCache

  • Create a Valkey cluster.
  • Configure a publicly accessible or VPC-limited endpoint, depending on your ECS networking setup.
  • Choose three subnets on the same VPC network as the ECS containers.
  • Choose the ECS security group.
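
Once the cluster is up, a quick connectivity check from inside a container helps confirm the endpoint and security group wiring. A minimal sketch using the phpredis extension (the hostname below is a placeholder):

<?php
// Verify the container can reach the ElastiCache endpoint.
$redis = new Redis();
$redis->connect( 'my-valkey.xxxxxx.use1.cache.amazonaws.com', 6379, 2.0 );
var_dump( $redis->ping() ); // truthy (+PONG) when the endpoint is reachable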

Configure The Docker Image

Add the Redis extension to PHP and enable it in the php.ini configuration. PHP ini files support ${VAR} substitution, so the Redis endpoint can be set with environment variables for each container instance.

Create The Host Image Builder PHP Ini File

Review the PHP Runtime Configuration page for session settings.

Create ./Docker/Images/Files/php/docker-php-ext-redis.ini

extension=redis.so
session.save_handler = ${PHP_SESSION_SAVE_HANDLER}
session.save_path = ${PHP_SESSION_SAVE_PATH}

Update The Host Dockerfile

Update the host Dockerfile to install the Redis PHP extension and the libraries needed to support it. Copy the PHP ini file into conf.d so it is loaded when PHP starts. This example is from a WordPress 6 image running PHP 8 on Apache.

Create ./Docker/Images/Dockerfile

# -- base image

FROM public.ecr.aws/docker/library/wordpress:6.4.2-php8.3-apache
LABEL authors="lancecleveland" \
      image="WordPress Multisite on Apache"

# -- ports

EXPOSE 443

# -- os utilities

RUN set -eux; \
	apt-get update; \
	apt-get install -y --no-install-recommends \
		dnsutils \
		inetutils-traceroute \
		iputils-ping \
		libz-dev \
		libssl-dev \
		libmagickwand-dev \
	; \
	rm -rf \
		/var/lib/apt/lists/* \
		/usr/src/wordpress/wp-content/themes/* \
		/usr/src/wordpress/wp-content/plugins/* \
		/usr/src/wordpress/wp-config-example.php \
	;

# -- install Redis PHP extension
RUN pecl channel-update pecl.php.net \
    && pecl install redis \
    && docker-php-ext-enable redis

# -- PHP redis
COPY ./Files/php/docker-php-ext-redis.ini /usr/local/etc/php/conf.d/docker-php-ext-redis.ini

# -- apache rewrite

RUN a2enmod ssl && a2enmod rewrite; \
    mkdir -p /etc/apache2/ssl

# -- apache SSL

COPY ./Files/ssl/*.pem /etc/apache2/ssl/
COPY ./Files/apache/sites-available/*.conf /etc/apache2/sites-available/

# -- WordPress, gets copied to the Apache root /var/www/html
COPY ./Files/wordpress/ /usr/src/wordpress/

# -- php xdebug

RUN pecl channel-update pecl.php.net
RUN pecl install xdebug \
    && docker-php-ext-enable xdebug

# -- Standard WordPress Env Vars

ENV WORDPRESS_DB_USER="blah_blah_user"
ENV WORDPRESS_DB_NAME="blah_blah_database"
ENV WORDPRESS_TABLE_PREFIX="wp_"
ENV WORDPRESS_DB_CHARSET="utf8"
ENV WORDPRESS_DB_COLLATE=""

Configure The Docker Container

Update Docker Compose and ECS Task Definitions

Docker Compose is for the local development container setup. ECS task definitions are for the AWS Elastic Container Service.

For our local Docker Compose configuration we use a docker-compose secrets file, not committed to our repository, to set sensitive environment variables.

In this example the PHP_SESSION_* environment variables are read at PHP startup and substituted into the session.* settings.

./Docker/Composers/Secrets/docker-compose-secrets.yml

This configuration uses local file based session storage. This is what you’d use in a typical single-server development environment.

services:
  wp:
    environment:
      PHP_SESSION_SAVE_HANDLER: 'files'
      PHP_SESSION_SAVE_PATH: ''

For a PHP connection to a cluster, like our fault-tolerant AWS container clusters paired with fault-tolerant ElastiCache clusters, you need to set something similar in the task definition environment variables using the same names as above.

      PHP_SESSION_SAVE_HANDLER: 'redis'
      PHP_SESSION_SAVE_PATH: 'tcp://blah-saas-staging.blah.blah.blah.amazonaws.com:6379?persistent=1&failover=1&timeout=2&read_timeout=2&serialize=php&cluster=redis'
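
After the task launches it is worth confirming the variables actually landed in the session settings. A quick sketch to drop into any PHP file served by the container:

<?php
// Confirm the env-driven ini substitution took effect at runtime.
echo 'handler: ', ini_get( 'session.save_handler' ), PHP_EOL;
echo 'path:    ', ini_get( 'session.save_path' ), PHP_EOL;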

Load Balancer Sticky Sessions Option

Configure your Application Load Balancer (or Elastic Load Balancer) to enable sticky sessions to reduce the need to share session data across containers. Sticky sessions ensure that a user is always directed to the same container instance during their session.

– Application Load Balancer: Enable Session Stickiness.
– Set a **duration-based stickiness** cookie to control how long the user remains connected to the same task/container.

**Note**: Sticky sessions are not ideal for auto-scaling environments or when maintaining container independence is critical, so this should complement, not replace, shared session storage.

Additional Considerations

1. **Security**:
– Encrypt session data in transit using TLS (especially when connecting to Redis or RDS); see the sketch after this list.
– Ensure that only trusted ECS tasks and resources can access session storage by restricting permissions through IAM roles and security groups.

2. **Performance Tuning**:
– Cache session data effectively using low TTLs for Redis or Memcached.
– Monitor ElastiCache or RDS instance performance to prevent bottlenecks caused by session sharing.

3. **Scaling and Resilience**:
– Use multi-AZ configurations for Redis or RDS.
– Consider Redis Cluster for read/write scaling and high availability.
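
For the in-transit encryption point in item 1, here is a minimal sketch of pointing PHP sessions at a TLS endpoint. It assumes an ElastiCache cluster with in-transit encryption enabled and a phpredis build with TLS support; the hostname is a placeholder:

<?php
// Switch the session handler to a TLS endpoint before the session starts.
// The tls:// scheme tells phpredis to encrypt the connection.
ini_set( 'session.save_handler', 'redis' );
ini_set( 'session.save_path', 'tls://blah-saas-staging.blah.amazonaws.com:6379?timeout=2&read_timeout=2' );
session_start();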

By offloading session management to centralized storage and using ECS best practices, your WordPress instances can efficiently share session information while scaling seamlessly.

Tweaking The Configuration

The cluster is not working exactly as expected.

One container will connect and appears to work properly, but the user experience swaps from a logged-in page to a not-logged-in page mid-session. The assumption is that this is due to the user connection jumping to a different server in the container cluster.

Attempted Resolution: Set PHP session.save_handler to rediscluster

On the staging server the initial PHP session.save_handler (set via environment variable) was set to redis.

Changing this to rediscluster did not change the session switching behavior.

Attempted Resolution: Revise the PHP session_start() call

In WordPress the session_start() call was moved from its prior invocation in the init hook to the muplugins_loaded hook, which runs earlier in the process. This did not seem to have an impact on the issue. Some minor updates were made to handle configurations both with and without a Redis Cluster, as well as ensuring we check whether a session was already started.

Our Redis Cluster code, invoked during muplugins_loaded with a MySLP_RedisCluster::get_instance() call:

<?php
defined( 'MYSLP_VERSION' ) || exit;


/**
 * Stores PHP session data in a Redis Cluster via the phpredis extension.
 */
class RedisClusterSessionHandler implements SessionHandlerInterface {
	private $redis;

	public function __construct() {
		$redisClusterEndpoint = get_cfg_var( 'session.save_path' );
		if ( empty( $redisClusterEndpoint ) ) {
			throw new RuntimeException( 'No Redis Cluster endpoint configured' );
		}


		// Parse and extract host/port (handle both single node and cluster)
		$parsedUrl = parse_url( $redisClusterEndpoint );
		$redisHost = $parsedUrl['host'] ?? 'localhost';
		$redisPort = $parsedUrl['port'] ?? 6379;

		// Use an array format required by RedisCluster
		$redisClusterNodes = [ "$redisHost:$redisPort" ];

		try {
			// Initialize RedisCluster
			$this->redis = new RedisCluster( null, $redisClusterNodes, 5, 5, true );
		} catch ( RedisClusterException $e ) {
			throw new RuntimeException( 'Failed to connect to Redis Cluster: ' . $e->getMessage() );
		}

	}

	/**
	 * Initialize session
	 * @link https://php.net/manual/en/sessionhandlerinterface.open.php
	 *
	 * @param $savePath
	 * @param $sessionName
	 *
	 * @return bool <p>
	 * The return value (usually TRUE on success, FALSE on failure).
	 * Note this value is returned internally to PHP for processing.
	 * </p>
	 * @since 5.4
	 */
	public function open( $savePath, $sessionName ): bool {
		return true; // No need to do anything here
	}

	/**
	 * Close the session
	 * @link https://php.net/manual/en/sessionhandlerinterface.close.php
	 * @return bool <p>
	 * The return value (usually TRUE on success, FALSE on failure).
	 * Note this value is returned internally to PHP for processing.
	 * </p>
	 * @since 5.4
	 */
	public function close(): bool {
		return true; // No need to close anything explicitly
	}

	/**
	 * Read session data
	 * @link https://php.net/manual/en/sessionhandlerinterface.read.php
	 *
	 * @param $sessionId
	 *
	 * @return string <p>
	 * Returns an encoded string of the read data.
	 * If nothing was read, it must return false.
	 * Note this value is returned internally to PHP for processing.
	 * </p>
	 * @since 5.4
	 */
	public function read( $sessionId ): string {
		$sessionData = $this->redis->get( "PHPREDIS_SESSION:$sessionId" );

		return $sessionData ?: ''; // Return session data or empty string if not found
	}

	/**
	 * Write session data
	 * @link https://php.net/manual/en/sessionhandlerinterface.write.php
	 *
	 * @param $sessionId
	 * @param string $data <p>
	 * The encoded session data. This data is the
	 * result of the PHP internally encoding
	 * the $_SESSION superglobal to a serialized
	 * string and passing it as this parameter.
	 * Please note sessions use an alternative serialization method.
	 * </p>
	 *
	 * @return bool <p>
	 * The return value (usually TRUE on success, FALSE on failure).
	 * Note this value is returned internally to PHP for processing.
	 * </p>
	 * @since 5.4
	 */
	public function write( $sessionId, $data ): bool {
		return $this->redis->setex( "PHPREDIS_SESSION:$sessionId", 3600, $data ); // 1-hour TTL
	}

	/**
	 * Destroy a session
	 * @link https://php.net/manual/en/sessionhandlerinterface.destroy.php
	 *
	 * @param $sessionId
	 *
	 * @return bool <p>
	 * The return value (usually TRUE on success, FALSE on failure).
	 * Note this value is returned internally to PHP for processing.
	 * </p>
	 * @since 5.4
	 */
	public function destroy( $sessionId ): bool {
		return $this->redis->del( [ "PHPREDIS_SESSION:$sessionId" ] ) > 0;
	}

	/**
	 * Cleanup old sessions
	 * @link https://php.net/manual/en/sessionhandlerinterface.gc.php
	 *
	 * @param $maxLifetime
	 *
	 * @return int|false <p>
	 * Returns the number of deleted sessions on success, or false on failure. Prior to PHP version 7.1, the function returned true on success.
	 * Note this value is returned internally to PHP for processing.
	 * </p>
	 * @since 5.4
	 */
	public function gc( $maxLifetime ): int|false {
		return 0; // Redis expires keys via TTL, so nothing to purge here
	}
}

/**
 * Registers the Redis Cluster session handler and starts the session.
 */
class MySLP_RedisCluster extends MySLP_Base {
	private $redis;

	/**
	 * Catch cluster redirects (MOVED) using the built-in PHP RedisCluster lib
	 * @return void
	 * @throws RedisClusterException
	 */
	final public function initialize() {
		$redisClusterEndpoint = get_cfg_var( 'session.save_path' );
		if ( class_exists( 'RedisCluster' ) && ! empty( $redisClusterEndpoint ) ) {
			try {
				$handler = new RedisClusterSessionHandler();
				session_set_save_handler( $handler, true );

			} catch ( RuntimeException $e ) {
				error_log( 'Error initializing RedisClusterSessionHandler: ' . $e->getMessage() );
			}
		}
		if ( ! session_id() && ! headers_sent() ) {
			session_start();
		}
	}
}
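
Wiring this up from an mu-plugin might look like the sketch below (a hypothetical file; MySLP_Base is assumed to supply the get_instance() singleton):

<?php
// mu-plugin: register the cluster session handler as early as possible.
add_action( 'muplugins_loaded', function () {
	MySLP_RedisCluster::get_instance()->initialize();
} );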

SaaS WP Login Processing

  • wp-login.php
    • $reauth = empty( $_REQUEST['reauth'] ) ? false : true; is set to false.
    • $user = wp_signon( array(), $secure_cookie ) with $secure_cookie = ""
      • do_action( 'wp_login', $user->user_login, $user ) fires with $user->user_login = "lcleveland" and $user a WP_User
    • if ( empty( $_COOKIE[ LOGGED_IN_COOKIE ] ) ) does not fire because the cookie is NOT empty
      • LOGGED_IN_COOKIE is something like "wordpress_logged_in_e2ec4afff4940eebb6cd200cc8206825"
        which IS set on this session
    • $requested_redirect_to ==> 'https://staging.storelocatorplus.com/wp-admin/'
      as set in $_REQUEST['redirect_to']
    • if ( ! is_wp_error( $user ) && ! $reauth ) { // executes because $user is set and $reauth is false

The WordPress secrets (keys and salts) need to be the same on ALL nodes in the cluster that share logins. The auth (login) cookies are signed with those keys and salts; with each server generating its own, a cookie created on one node will not validate on another.

Docker has a method to pass these in via ENV settings.
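
The official WordPress image reads WORDPRESS_AUTH_KEY, WORDPRESS_AUTH_SALT, and the related variables at startup. If you manage wp-config.php yourself, a sketch of pinning the defines to the environment (same naming convention assumed) looks like:

<?php
// wp-config.php sketch: pull shared keys and salts from the environment
// so every node in the cluster signs auth cookies identically.
define( 'AUTH_KEY',         getenv( 'WORDPRESS_AUTH_KEY' ) );
define( 'SECURE_AUTH_KEY',  getenv( 'WORDPRESS_SECURE_AUTH_KEY' ) );
define( 'LOGGED_IN_KEY',    getenv( 'WORDPRESS_LOGGED_IN_KEY' ) );
define( 'NONCE_KEY',        getenv( 'WORDPRESS_NONCE_KEY' ) );
define( 'AUTH_SALT',        getenv( 'WORDPRESS_AUTH_SALT' ) );
define( 'SECURE_AUTH_SALT', getenv( 'WORDPRESS_SECURE_AUTH_SALT' ) );
define( 'LOGGED_IN_SALT',   getenv( 'WORDPRESS_LOGGED_IN_SALT' ) );
define( 'NONCE_SALT',       getenv( 'WORDPRESS_NONCE_SALT' ) );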

ECS Cluster for Staging

Store Locator Plus® is being migrated to an Elastic Container Service (ECS) cluster that is expected to be active Q4 2024. This cluster is to be automatically updated via the myslp_aws_ecs_kit git repo which triggers a CodePipeline build that deploys updates to the cluster.

ECS Cluster

The ECS cluster that is accessed by the pipeline is myslp-staging-cluster.
arn:aws:ecs:us-east-1:744590032041:cluster/myslp-staging-cluster

This cluster is designed to run EC2 instances that host the SLP SaaS containers.

Infrastructure

The instances are managed by the following Auto Scaling Group (ASG):

Infra-ECS-Cluster-myslp-staging-cluster-a97a9fa8-ECSAutoScalingGroup-zoFBNbZvjeFk

arn:aws:autoscaling:us-east-1:744590032041:autoScalingGroup:e0255cb5-e03b-4f35-adb4-398b947028b8:autoScalingGroupName/Infra-ECS-Cluster-myslp-staging-cluster-a97a9fa8-ECSAutoScalingGroup-zoFBNbZvjeFk

This provides the compute capacity (EC2 instances here) on which the defined services will run their tasks.

Auto Scaling Group Details

The group should have a minimum capacity of 1.

The group uses the following launch template: lt-07e8f4ebedbe1c2ff

That launch template runs image ID: ami-05a490ca1a643e9ea

It runs on a Graviton compute instance, which is ARM64 compatible. Currently it runs on a c6g.xlarge.

The system tags help associate any resources launched by this ASG with the ECS cluster. The special sauce is in the launch template inline scripts, however.

Launch Template Details

The following “advanced details” in the launch template seem to be what registers any EC2 instances that this ASG fires up with the ECS Cluster:

User data contains scripts and settings that run as soon as the instance comes online.

The AMI ships with AWS tooling, including the ECS container agent, which works with the AWS fabric and reads the /etc/ecs/ecs.config file to figure out how to connect the instance to the cluster on boot or on a daemon service refresh.
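
A minimal /etc/ecs/ecs.config only needs to name the cluster; the value below assumes the staging cluster from the ARN above:

ECS_CLUSTER=myslp-staging-cluster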

Tasks

These are the ECS equivalent of Docker Compose files, with added information about what type of container to create.

The task definition on the AWS Console for the configuration below is named slp_saas_staging:3 (as of Oct 31 2024). In addition to the environment variables noted below, an additional environment variable, WORDPRESS_DB_PASSWORD, is added when creating the task definitions via the console. This is set for the myslp_dashboard database (baked into the ECR image built by CodePipeline via the WORDPRESS_DB_NAME environment variable) with a user of myslp_genesis (also baked into the ECR image via the WORDPRESS_DB_USER environment variable).

From the myslp_aws_ecs_kit repo AWS/ECS/tasks/slp_saas_staging.json

{
  "family": "slp_saas_staging",
  "requiresCompatibilities": ["EC2"],
  "runtimePlatform": {
    "operatingSystemFamily": "LINUX",
    "cpuArchitecture": "ARM64"
  },
  "networkMode": "awsvpc",
  "cpu": "3 vCPU",
  "memory": "6 GB",
  "executionRoleArn": "arn:aws:iam::744590032041:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "slp_saas",
      "essential": true,
      "image": "744590032041.dkr.ecr.us-east-1.amazonaws.com/myslp2024-aarch64:staging",
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80
        }
      ],
      "environment": [
        {
          "name": "WP_HOSTURL",
          "value": "staging.storelocatorplus.com"
        },
        {
          "name": "WP_HOME",
          "value": "https://staging.storelocatorplus.com/"
        },
        {
          "name": "WP_SITEURL",
          "value": "https://staging.storelocatorplus.com/"
        },
        {
          "name": "WORDPRESS_DB_HOST",
          "value": "slp-staging-2023-aug-cluster-cluster.cluster-c0glwpjjxt7q.us-east-1.rds.amazonaws.com"
        },
        {
          "name": "WORDPRESS_DEBUG",
          "value": "true"
        },
        {
          "name": "WORDPRESS_CONFIG_EXTRA",
          "value": "define( 'WP_DEBUG_LOG', '/var/www/html/debug.log');define( 'WP_DEBUG_DISPLAY', true);define( 'WP_DEBUG_SCRIPT', true);@ini_set('display_errors',1);define('SUNRISE', true);defined('DOMAIN_CURRENT_SITE') || define('DOMAIN_CURRENT_SITE', getenv_docker('WP_HOSTURL', 'staging.storelocatorplus.com') );define('WP_ALLOW_MULTISITE', true );define('MULTISITE', true);define('SUBDOMAIN_INSTALL', false);define('PATH_CURRENT_SITE', '/');define('SITE_ID_CURRENT_SITE', 1);define('BLOG_ID_CURRENT_SITE', 1);if ( ! defined( 'WPMU_PLUGIN_DIR' ) ){define('WPMU_PLUGIN_DIR', dirname( __FILE__ ) . '/wp-content/mu-plugins' );}"
        }
      ]
    }
  ]
}

Services

Services run various parts of the application. For SLP in the initial Q4 2024 state there is only one service – the SLP SaaS web service.

The staging service that runs the SaaS staging task is at:
arn:aws:ecs:us-east-1:744590032041:service/myslp-staging-cluster/myslp-staging-service

The service is set to run the slp_saas_staging task in daemon mode. That means it will run one task per container instance (EC2 instance) in the cluster.

The service definition sets up the containers.
Container Image (on ECR): 744590032041.dkr.ecr.us-east-1.amazonaws.com/myslp2024-aarch64:staging

It also sets up the environment variables passed into the container.

Analyzing AWS Scaling Group Traffic

With the re-listing of the Store Locator Plus® WordPress plugin in the WordPress plugin directory, there has been a notable increase in outward scaling of the Store Locator Plus® application cluster. Key Store Locator Plus® websites and services run on a horizontally scalable cluster built on AWS Auto Scaling Groups, an AWS load balancer, and EC2 instances. Every night starting around midnight EST the scaling group adds one node per hour until 3AM EST, at which point it starts scaling back.

The AWS cluster is handling the load well, but we want to investigate in case there is something else going on. Scaling can be caused by a number of issues including network attacks, application misconfiguration, coding errors, routine bot traffic, or routine customer interaction patterns. It is best to get insight into the issue and know the root cause for certain.

This article walks through a real-time analysis of the events and traffic patterns that are triggering the scaling.

Background

The AWS Scaling Group that manages the Store Locator Plus® cluster is configured to monitor average CPU usage over time and add nodes to the cluster when the servers start to climb above 80% utilization. This is often an early indicator of impending server overload and a good baseline metric on which to base scaling events.

These events are triggered almost nightly at the same time. This typically indicates a routine scheduled process such as a bot crawling the site (aka a spider) or a scheduled job running on a customer’s website.

The latest scaling event started at 11:55PM EST last night, so we’ll start there.

First Stop : AWS Dashboard

We want to verify our timestamps with more specificity as the email notifications are not necessarily precise.

AWS Scaling Groups

We’ll look at the AWS Scaling Groups first. We have monitoring enabled and can get a quick overview of the group activity.

We can see the instance count jump from our baseline of 2 nodes to 3 nodes almost exactly at 03:56 UTC. Remember AWS mostly notes times in UTC, which puts us at 11:55PM EST. The count drops back to baseline at 13:34 UTC, which is around 9:34AM EST.

We can also look at our aggregate EC2 instance metrics for all instances that are part of this scaling group. The bulk of CPU usage starts to ramp up around 03:48 UTC but really kicks in an hour later around 04:48 UTC, then grinds away steadily from 05:48 UTC through 07:48 UTC. The pattern looks a lot like an external process, possibly a bot.

Inbound network requests on the EC2 instances reflect the same, with a single 502 Mbps spike starting at 05:03 UTC.

Inspecting EC2 Logs

While our load balancer logs access to an S3 bucket, it is often difficult to locate and parse those logs given the volume of requests pushed to the bucket every day. While there are log parsing and reporting services out there, there is a faster way to get insight into requests: looking at the local disk logs on a running EC2 instance in the cluster.

If you are lucky, one of the current nodes will have been online for the entire event. Given the span of time and amount of traffic, that single node provides a reasonable cross-section of requests: it logged 50% of requests while the cluster was at its 2-node baseline, and at least 33% during the initial spikes after the cluster expanded to 3 nodes.

Since our nodes are all running some form of web application, we want to check our web server (nginx) log files in /var/log/nginx. Keep in mind the EC2 servers are configured to be in the data center’s time zone, so the log files for our US-East-1 zone servers will be in EST. We want to look between 11:55PM EST and 9:34AM EST with a focus on the 1:48 – 6:48AM EST entries.
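
A rough sketch for spotting the spike window by tallying requests per minute from the access log (default log path and timestamp format assumed):

<?php
// Count nginx requests per minute and print the ten busiest minutes.
$counts = [];
foreach ( new SplFileObject( '/var/log/nginx/access.log' ) as $line ) {
	if ( preg_match( '#\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})#', (string) $line, $m ) ) {
		$counts[ $m[1] ] = ( $counts[ $m[1] ] ?? 0 ) + 1;
	}
}
arsort( $counts );
print_r( array_slice( $counts, 0, 10, true ) );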

Bad Actor In Log Files

The nginx access log has the fingerprints of a brute force attack against the server around 01:28 EST (05:30 UTC). Many of the URLs here are known weak points in apps that may be running on a server (not ours, though). These can be blocked by a Web Application Firewall update on AWS, a service that provides edge-of-cloud protection and can keep the network requests from reaching our cluster in the first place.

A similar attack shows up on the demo site.