How can Envoy proxy help solve HAProxy problems in distributed applications?
Slack previously used HAProxy to load-balance requests to its WebSocket servers. (WebSocket is a communication protocol that provides simultaneous two-way communication.) Slack creates and uses a WebSocket connection for each of its users/apps to provide two-way messaging.
HAProxy hot restarts were a problem for Slack because its backend server list changes dynamically as instances are added to or removed from the WebSocket server fleet. HAProxy offers two ways to update its backend list: through its Runtime API, or by editing the backend configuration and reloading HAProxy. Slack tried both, and both carried operational overhead.
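For illustration, the two update paths might look roughly like this. This is a minimal sketch, not Slack's actual setup: the socket path, config path, and backend/server names are hypothetical, though the `set server` Runtime API command and the `-sf` soft-reload flag are standard HAProxy features:

```shell
# Path 1: update a backend server at runtime over HAProxy's admin socket.
# (Socket path and backend/server names are hypothetical.)
echo "set server websocket_backend/ws1 addr 10.0.0.12 port 443" | \
    socat stdio /var/run/haproxy.sock

# Path 2: edit the backend list in the config file, then soft-reload.
# -sf tells the new process to ask the old ones to finish their
# connections and exit -- the old processes linger until drained.
haproxy -f /etc/haproxy/haproxy.cfg -sf $(cat /var/run/haproxy.pid)
```

The draining behavior of path 2 is exactly what caused Slack's process pile-up: long-lived WebSocket connections keep old processes alive for hours.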
Updating via the Runtime API: Slack had an incident with this approach. Because of a bug, the backend server list was not being updated, leaving a smaller fleet of servers to handle incoming API requests, which resulted in a higher number of 503 errors.
Reloading HAProxy after a backend configuration update: When the backend configuration is updated and HAProxy is reloaded, HAProxy spawns new processes to handle new WebSocket connections while keeping the old processes alive, sometimes for hours, to drain existing connections. Too many processes degrade HAProxy's performance, but abruptly killing them would sever users' connections to the backend messaging servers. To mitigate the performance problem, Slack had to either periodically clean up some of the old HAProxy processes or cap the number of processes that could be created on an HAProxy instance.
Either way, each approach Slack used to update the HAProxy backend list came with operational overhead.
How did Envoy help?
Envoy comes with some useful built-in capabilities: it can discover backend endpoints dynamically through its xDS APIs (in particular the Endpoint Discovery Service, EDS), so the backend list can change without a reload, and its hot restart hands listening sockets over to the new process and drains old connections gracefully rather than accumulating long-lived processes. These capabilities helped Slack solve its backend-server-list update problem and reduced the operational overhead of a dynamically changing fleet.
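As a rough sketch of the dynamic-discovery side, an Envoy cluster that learns its WebSocket backends through EDS could be configured like this. The cluster names and the management-server reference are hypothetical; the field names follow Envoy's v3 configuration API:

```yaml
# Hypothetical Envoy cluster using EDS: the endpoint list is served by a
# management server and can change at any time -- no reload required.
clusters:
- name: websocket_backend
  type: EDS
  connect_timeout: 5s
  eds_cluster_config:
    eds_config:
      api_config_source:
        api_type: GRPC
        transport_api_version: V3
        grpc_services:
        - envoy_grpc:
            cluster_name: xds_cluster   # hypothetical management server
```

With this setup, adding or removing WebSocket servers only changes what the management server returns over EDS; Envoy itself never restarts or reloads.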