Intermittent failures with TechTime's Cloud apps
Incident Report for TechTime Initiative Group
Postmortem

To Our Cloud Customers:

On behalf of TechTime Initiative Group, I want to apologise if this incident affected the performance of our apps for you.

What Happened?

Our Cloud app hosting is clustered. During a routine update of the hosting platform, one of the nodes did not come back fully: although the hosting platform itself was alive and well, the actual apps (Google Maps Embedding Macro for Confluence Cloud and EasyTime for Jira Cloud) on that node did not re-enable successfully. Unfortunately, the clustered nature of the platform masked the problem. The load balancer is configured to distribute requests round-robin, so only the requests routed to the faulty node failed, while the rest were served by the healthy nodes.
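To illustrate why the failures looked intermittent, here is a minimal sketch of round-robin distribution across a cluster with one faulty node. The node names and the cluster size of three are assumptions made for the example; the actual cluster size is not stated in this report.

```python
from itertools import cycle

# Hypothetical node names; the real cluster layout is not part of this report.
NODES = ["node-a", "node-b", "node-c"]
# The platform on this node is up, but its apps did not re-enable.
FAULTY = {"node-b"}


def simulate(requests: int) -> list[int]:
    """Return the HTTP status each request would see under round-robin."""
    rotation = cycle(NODES)
    statuses = []
    for _ in range(requests):
        node = next(rotation)
        # A request landing on the faulty node finds no app and gets a 404.
        statuses.append(404 if node in FAULTY else 200)
    return statuses


statuses = simulate(9)
failure_rate = statuses.count(404) / len(statuses)
print(statuses, failure_rate)
```

With one faulty node out of N, roughly one request in N fails, so a check that only looks at the platform, or that samples a few requests through the load balancer, can easily miss the problem.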

Once we identified what was going on, we restarted the software on the faulty node and everything started working again.

What Are We Going to Do Now?

We have identified several areas where we can improve.

  1. The smoke tests we perform after updates will now explicitly test individual nodes and the state of the apps on them, not just the platform.
  2. Our regular health checks will be enhanced in the same way.
  3. The load balancer will be reconfigured to test the health of nodes more thoroughly, i.e. to take into account the apps, not just the platform.
  4. There will be more alerts, so that we are notified sooner and can deal with outages like this faster.
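The per-node, per-app checks described above could look something like the following sketch. It is an illustration under stated assumptions, not our actual tooling: the node addresses, the health paths and the function names are all hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical node addresses and app health paths, for illustration only.
NODES = ["http://node-a.internal", "http://node-b.internal"]
APP_PATHS = ["/maps-macro/health", "/easytime/health"]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def unhealthy_apps(nodes, app_paths, fetch: Callable[[str], int] = http_status):
    """Probe every app on every node directly, bypassing the load balancer."""
    failures = []
    for node in nodes:
        for path in app_paths:
            status = fetch(node + path)
            if status != 200:
                failures.append((node, path, status))
    return failures
```

Because each node is addressed directly, a node whose platform is up but whose apps are disabled shows up as a failure instead of being averaged away by the load balancer. A non-empty result from `unhealthy_apps(NODES, APP_PATHS)` could fail a post-update smoke test or raise an alert.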

More Information

Please feel free to contact our support if you require more information: https://techtime.co.nz/support

Sincerely,
Ed Letifov
CTO

Posted Nov 12, 2021 - 20:04 NZDT

Resolved
We consider this incident resolved
Posted Nov 12, 2021 - 19:52 NZDT
Monitoring
The node has been restarted – we continue to monitor
Posted Nov 05, 2021 - 09:54 NZDT
Identified
We have identified that one of the hosting nodes didn't come back properly after an update
Posted Nov 05, 2021 - 09:27 NZDT
Investigating
Our monitoring has identified intermittent failures with our Cloud apps – some requests are failing with 404
Posted Nov 05, 2021 - 04:45 NZDT
This incident affected: TechTime Cloud Apps (GoogleMaps Embed macro in Atlassian Cloud, EasyTime for Jira Cloud).