Degradation of underlying hardware in Analytics frontend in Japan instance
Incident Report for Reply.ai
Postmortem

On Jun 1st, we received the following from AWS:

We have important news about your account (AWS Account ID: 866597297522). EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance (instance-ID: i-7f8940e1) in the ap-northeast-1 region. Due to this degradation, your instance could already be unreachable. After 2018-06-12 11:00 UTC your instance, which has an EBS volume as the root device, will be stopped.

The solution AWS suggested was:

You can wait for the scheduled retirement date - when the instance is stopped - or stop the instance yourself any time before then. Once the instance has been stopped, you can start the instance again at any time.

We stopped and started the instance on Jun 6th, which should have been enough to avoid any degradation. However, this was not the case. We reacted within 70 minutes and restored the Analytics service.

No data was lost, and no messages failed to send. Only the Analytics frontend was affected.

We have added extra monitoring so that we react faster next time.
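
As an illustration only (not a description of the monitoring we actually deployed), one way to catch this kind of notice automatically is to look for pending scheduled events on each instance. The sketch below assumes a response dictionary shaped like the one returned by boto3's EC2 `describe_instance_status` call; the function name `pending_scheduled_events` and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def pending_scheduled_events(response):
    """Return (instance_id, event_code, not_before) for events still pending.

    `response` is assumed to follow the shape of boto3's
    ec2.describe_instance_status() result. In real use it could come from:
        boto3.client("ec2", region_name="ap-northeast-1").describe_instance_status(
            IncludeAllInstances=True)
    """
    pending = []
    for status in response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            # EC2 prefixes the description of finished events with "[Completed]".
            if event.get("Description", "").startswith("[Completed]"):
                continue
            pending.append(
                (status["InstanceId"], event["Code"], event.get("NotBefore"))
            )
    return pending


# Made-up sample data: one pending stop event and one already completed event.
sample_response = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-7f8940e1",
            "Events": [
                {
                    "Code": "instance-stop",
                    "Description": "The instance is running on degraded hardware",
                    "NotBefore": datetime(2018, 6, 12, 11, 0, tzinfo=timezone.utc),
                }
            ],
        },
        {
            "InstanceId": "i-0example",  # hypothetical instance ID
            "Events": [
                {
                    "Code": "system-reboot",
                    "Description": "[Completed] Scheduled reboot",
                    "NotBefore": datetime(2018, 5, 1, 0, 0, tzinfo=timezone.utc),
                }
            ],
        },
    ]
}

for instance_id, code, not_before in pending_scheduled_events(sample_response):
    print(f"{instance_id}: {code} scheduled, not before {not_before}")
```

A check like this could run on a schedule and alert as soon as any pending event appears, rather than relying on someone reading the notification email.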

Posted Jun 14, 2018 - 04:27 EDT

Resolved
Our cloud provider has detected degradation of the underlying hardware of our Analytics infrastructure in the Japan region. The hardware has been replaced and the service restored after some maintenance.

There was no data loss and no impact on message delivery. The problem was limited to the Analytics frontend machine.
Posted Jun 14, 2018 - 01:05 EDT
This incident affected: Japan Region (transcosmos.reply.ai) (Analytics).