Initial project status

The MyLO system is a Drupal 7 / PHP 5.6 web service behind an Apache 2.4 web proxy, running on Amazon Web Services (AWS) cloud infrastructure on an Amazon Linux EC2 instance.
There were continuous performance issues and whole-site downtimes, especially under higher user traffic.

Investigation

We reviewed the whole AWS infrastructure and the Linux instance itself, which hosts the website. On the infrastructure side there weren't any bottlenecks (it had some oversized resources, which are not cost-efficient but do their job well, and we found some security deficiencies, but these do not cause performance issues), so we moved on to the EC2 instance. The server's CPU usage was higher than we expected, and half of the RAM was free at that point.

After some system log reading we found the main cause of the last downtime: the instance ran out of memory, and the Linux kernel's 'oom_kill_process' terminated the httpd process, so the Apache proxy - which serves the website - stopped.
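OOM kills like this leave a distinctive trace in the kernel log. The commands below are the usual way to confirm one after the fact; the PID and score in the sample line are illustrative, not from the actual incident:

```shell
# On the live server you would typically run one of:
#   dmesg -T | grep -i 'out of memory'
#   grep -i 'oom' /var/log/messages
# The kernel log line the OOM killer leaves behind looks like this:
sample_log="Out of memory: Kill process 1234 (httpd) score 250 or sacrifice child"
echo "$sample_log" | grep -io 'kill process [0-9]* ([a-z]*)'
```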
We continued the investigation, checked the PHP and Apache configurations, and reviewed the running processes. The instance had an old, unused MySQL server and an Apache Solr server. Both of these services started at boot time and used a lot of RAM. (One of the 'out of memory' incidents - as we found in the kernel logs - was caused by the mysqld process after a server reboot.)

In the next step we checked the PHP configuration. It didn't contain any 'opcache' settings (opcache keeps compiled PHP bytecode in memory, which reduces CPU usage). And of course we checked the Apache configs too. Apache was using mpm_prefork_module. This multiprocessing module (MPM) uses a huge amount of RAM, because every connection gets its own process; it works well only with few users and lots of RAM. We checked the average RAM usage of the Apache processes, and they were around 100MB each. The config contained 'MaxRequestWorkers 240', which means Apache will handle up to 240 connections at the same time. 240 × 100MB is around 24GB, but the server has only 16GB of RAM, so in this mode it can safely handle only about 120-140 connections.
With this we had found the likely causes of both the performance and the downtime issues.
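The sizing arithmetic can be checked with a one-liner. The per-process and total RAM figures are the ones measured above; the 2GB headroom reserved for the OS and other services is an assumption:

```shell
# How many ~100MB prefork children fit into 16GB with ~2GB reserved?
per_process_mb=100      # measured average RSS of one Apache child
total_ram_mb=16384      # 16 GB instance
reserved_mb=2048        # assumed headroom for the OS and other services
echo $(( (total_ram_mb - reserved_mb) / per_process_mb ))
```

That yields roughly 143, which is why the realistic limit is 120-140 connections rather than the configured 240.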

Key challenges

  • Creating a test environment to find the best optimisation
  • Running heavy load tests
  • Using a configuration management tool to adopt an "infrastructure as code" approach
  • Researching and tuning the PHP/Apache configurations for better performance
  • Finding and disabling unnecessary, resource-hungry processes
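The configuration-management point can be illustrated with a minimal Ansible task. This is only a sketch; the file paths and handler name are hypothetical, not taken from the actual repository:

```yaml
# tasks (sketch): deploy the version-controlled Apache config
- name: Copy Apache config from the git-managed repository
  copy:
    src: files/httpd.conf
    dest: /etc/httpd/conf/httpd.conf
    backup: yes
  notify: restart httpd

# handlers (sketch): reload Apache only when the config changed
- name: restart httpd
  service:
    name: httpd
    state: restarted
```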

Implementation highlights

  • Created a new test environment with the same type of resources and configuration as the production server
  • Ran load tests with the Vegeta load testing tool before and after the changes
  • Changed the Apache multiprocessing module from prefork to event mode, with server-specific configuration
  • Replaced the Apache built-in PHP module with the php-fpm service
  • Enabled the PHP opcache module for better performance
  • Removed unnecessary services from the init process list
  • Added the PHP, Apache and php-fpm configurations to a git repository, managed by the Ansible configuration management tool
  • Stopped an unnecessary AWS EC2 instance to decrease costs
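The Apache-side changes can be sketched as a config fragment for Apache 2.4 with mod_proxy_fcgi. The thread and worker counts are illustrative starting points, not the production values:

```apache
# Use the event MPM instead of prefork
LoadModule mpm_event_module modules/mod_mpm_event.so

<IfModule mpm_event_module>
    ServerLimit            8
    ThreadsPerChild       25
    MaxRequestWorkers    200    # must not exceed ServerLimit * ThreadsPerChild
</IfModule>

# Hand .php requests to the php-fpm service instead of the built-in mod_php
<FilesMatch "\.php$">
    SetHandler "proxy:fcgi://127.0.0.1:9000"
</FilesMatch>
```

With the event MPM, connections are handled by threads rather than full processes, which is what brings the per-connection RAM cost down.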

Results

  • Improved performance
  • The average CPU usage decreased from 60% to 15%
  • The average RAM usage decreased from 7GB to 1.4GB
  • Before the changes the site couldn't serve the load tests' parallel requests because of high CPU usage, and even 10 requests/sec could cause an 'out of memory' situation. After our changes it can handle 200 requests/sec without any RAM issue. (This upper limit is set by the CPU as the bottleneck; with PHP 7.3 it could be somewhat higher.)
  • Because of the lower RAM usage we can use a smaller AWS EC2 instance, and by stopping the unneeded instance we can further decrease the AWS costs.
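A load test of the kind described above can be reproduced with Vegeta roughly like this; example.com stands in for the real site URL:

```shell
# Vegeta reads newline-separated "METHOD URL" targets:
printf 'GET https://example.com/\nGET https://example.com/?q=test\n' > targets.txt
wc -l < targets.txt
# To run the attack itself (requires the vegeta binary):
#   vegeta attack -targets=targets.txt -rate=200 -duration=60s | vegeta report
```

`vegeta report` prints latency percentiles and the success ratio, which is how the before/after figures above were compared.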

Future performance and security upgrades

  • AWS security audit
  • Tweaking Apache
  • Upgrading PHP 5 to 7
  • Moving the AWS infrastructure to a CloudFormation template
  • Creating a load-balanced, distributed system
  • Using a CDN service to serve assets
  • Using a dedicated search engine to speed up site search
  • Using a monitoring system

Technologies used

  • AWS: RDS Aurora, EFS
  • Apache httpd, php-fpm
  • Vegeta 
  • Ansible