Amazon Linux security mitigations and Postgres performance
Hi, it's me again. You might remember me as the person who broke Aurora Postgres during the beta period because I have too much data for it to index via pg_restore (over 2 billion records in just *one* of my tables).
So anyhow, while Aurora Postgres is broken for me, RDS Postgres until recently worked fine as a temporary spinup. Spin it up, dump several billion records to it over the course of about 12 hours of pg_restore, run some batch jobs against it for a week or so, tear it down. Much more convenient than spinning up my own Postgres servers. Until now.
Last week I spun up an RDS Postgres instance and ran my batch jobs against it and... a batch job that was supposed to run in 12 minutes instead ran in 21 minutes. That is, the Postgres RDS instance was running 43% slower than normal, measured as work done per minute (the job itself took 75% longer). Things I tried in order to speed it back up:
1) Upgraded the instance size of the RDS instance to two sizes larger than I usually run. No effect on performance -- still 43% slower.
2) Moved to a different availability zone in case something had failed in the original one -- spun up a replica in another AZ, failed over to the replica, and altered my compute node ASG so that the compute nodes failed over to that AZ as well (one that I'm already running things in). No improvement.
3) Upgraded to higher provisioned IOPS. No improvement.
4) Upgraded compute node instances to latest C5 instances. No improvement.
At that point I was done with RDS Postgres since clearly it was not going to be able to run my batch jobs within the designated time frame, so I spun up my own Postgres instance on a c5.2xlarge, striping data across multiple EBS volumes in a setup that I'd previously used (all this setup/configuration is puppet-driven BTW, I don't *manually* set up any of this, that'd be insane). *STILL* the 43% performance impact on my batch jobs. And this is the same configuration that I'm successfully using on the production cluster, which is churning out 12 minute processing times -- yet on the freshly built server the job was still taking 21 minutes.
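For anyone checking the arithmetic behind the 43% figure, here is a minimal sketch in Python. The function name `slowdown` is made up for illustration; the only inputs are the 12 and 21 minute run times quoted above.

```python
def slowdown(baseline_min, observed_min):
    """Return (throughput_loss, extra_runtime) as fractions of the baseline."""
    # Work done per minute drops by 1 - baseline/observed.
    throughput_loss = 1 - baseline_min / observed_min
    # Wall-clock time grows by observed/baseline - 1.
    extra_runtime = observed_min / baseline_min - 1
    return throughput_loss, extra_runtime


loss, extra = slowdown(12, 21)
print(f"throughput loss: {loss:.0%}, extra runtime: {extra:.0%}")
# prints "throughput loss: 43%, extra runtime: 75%"
```

Both numbers describe the same regression; which one you quote depends on whether you think in jobs per hour or minutes per job.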
At that point I started looking at the Linux security mitigations for the SPECTRE and MELTDOWN bugs, which have not been applied to the production cluster's Postgres server because there haven't been any service windows for our service since Amazon introduced their mitigations. First I disabled the retpoline mitigation by adding retpoline=off to the kernel command line in grub. This had minimal impact upon the performance of the Postgres server.
Then I disabled the pti mitigation using pti=off. Immediately my Postgres was running at full speed again.
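If you want to confirm which state a given server is actually in after a reboot, something like the following Python sketch may help. It assumes a Linux kernel new enough to expose /sys/devices/system/cpu/vulnerabilities/meltdown (older kernels will not have that file, in which case the script reports None), and it only looks for the pti=off and nopti spellings of the boot parameter. The function names are hypothetical.

```python
from pathlib import Path


def pti_disabled_on_cmdline(cmdline: str) -> bool:
    """True if the kernel command line asks for page table isolation to be off."""
    params = cmdline.split()
    return "pti=off" in params or "nopti" in params


def meltdown_status(path="/sys/devices/system/cpu/vulnerabilities/meltdown"):
    """Return the kernel's own Meltdown report, or None if the file is absent."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else None


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    if cmdline_file.exists():
        disabled = pti_disabled_on_cmdline(cmdline_file.read_text())
        print("pti disabled on command line:", disabled)
    print("kernel reports:", meltdown_status())
```

Checking both places is deliberate: the command line shows what was requested, while the sysfs file shows what the running kernel actually did.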
Recommendations: The MELTDOWN vulnerability allows unvetted software running on a server to read kernel memory, and through it data belonging to other processes running on the server. Unfortunately, the PTI mitigation has a severe impact upon Postgres performance.
RDS does not require the PTI mitigation for security since RDS servers do not run any unvetted software. Thus it may be worthwhile for Amazon to provide an option to disable PTI mitigation in order to restore performance for Postgres RDS instances. Without that, Postgres RDS instances will simply be too slow for many purposes.
Looking at the Postgres lists, it looks like they acknowledge that PTI will have a dire impact on Postgres servers under certain workloads, because Postgres makes far more syscalls than it really needs to. Unfortunately any mitigation on the part of the Postgres team will be some time away, and will likely *never* happen in the 9.x series. So if you are running your own Postgres servers with currently supported versions of Postgres and are going to be doing this on Amazon Linux, remember this: pti=off is your friend.
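Since the cost of PTI is paid on every transition into the kernel, a crude way to see it on your own hardware is to time a cheap syscall in a tight loop, once with the mitigation on and once with pti=off. The Python sketch below is only a rough illustration, not a Postgres benchmark: it uses os.getppid() as the syscall, the function name `ns_per_syscall` is made up, and the number it prints includes Python interpreter overhead, so only the difference between the two boots is meaningful.

```python
import os
import time


def ns_per_syscall(calls: int = 200_000) -> float:
    """Average wall-clock nanoseconds per os.getppid() call, interpreter overhead included."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        os.getppid()  # cheap call that enters the kernel each time
    elapsed = time.perf_counter_ns() - start
    return elapsed / calls


if __name__ == "__main__":
    print(f"{ns_per_syscall():.0f} ns per getppid() call")
```

Run it a few times on an otherwise idle box under each kernel setting and compare the results. A syscall-heavy workload will feel that per-call difference multiplied many times over.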