Reply performs maintenance operations on its infrastructure regularly. These interventions are required to provide new features, fix issues and improve the overall system performance. Some of these operations require changes to the database, which can result in a few seconds of downtime to guarantee data integrity.
From time to time, changes to the database require significantly more time to be applied. To maintain application availability, the team prepares these operations to run concurrently, i.e. without affecting the normal behavior of the application.
Reply v1.5.8, released on Tuesday, November 13th 2018, modified the way bot messages are stored in the database. Given the high volume of messages in Reply, this is a long-running operation. However, we failed to flag this intervention and did not take the necessary measures to ensure availability. When the release was published and the operation started, the database prevented any data access until all modifications were complete.
All releases are closely monitored by the team. When an unusually long running time was detected, the team announced an outage and started investigating. After the problem was identified, the team evaluated possible solutions that would guarantee no data loss. Once a solution was applied, the release was rolled back and system availability was verified.