On March 22nd, CivicEngage Evolve experienced multiple request timeouts to the Redis database, which led to outages and intermittent availability on client sites. CivicPlus engineers implemented a fix for that specific issue by raising the minimum number of nodes in the auto-scaling configuration.
This did not entirely address the issue. CivicEngage Evolve clients continued to experience intermittent timeouts and other network-related issues when connecting from Azure Kubernetes Service (AKS) pods to Azure resources (e.g. Redis, SQL Server).
- CivicPlus Hosting and Security and the Evolve Engineering Team contacted Microsoft for help troubleshooting the issues that were causing service interruptions for our Evolve clients.
- Microsoft verified that our AKS Cluster, Nodes, and Pods were all working as expected.
- Microsoft verified that networking and SNAT appeared healthy.
- While examining the AKS worker node, Microsoft then noticed multiple VNet alerts (lost carrier).
- Microsoft investigated and quickly learned that a few subscriptions on Azure Kubernetes Service (AKS) had experienced brief, intermittent connectivity issues across multiple nodes while mounting and/or unmounting persistent volumes to and from Kubernetes pods.
- The Microsoft backend team then reported to CivicPlus that a recent deployment to a backend service had caused intermittent connectivity issues for a subset of customers, including CivicPlus. Microsoft was already aware of the issue at that time and was working on a hotfix.
- On March 31st, 2021, Microsoft deployed a hotfix for the code bug to mitigate the problem.
CivicEngage Evolve engineers continue to monitor the system and report that it is stable at this time.
This incident is not related to the Azure issues that affected Microsoft services worldwide on April 1st. More information about those issues, which have also impacted CivicPlus clients, is available here: https://status.azure.com/en-us/status