Service Disruption: January 29, 2019
Completed
CivicPlus engineers are currently investigating a disruption of service. CivicPlus is experiencing a networking issue resulting in some sites having availability issues. Engineers are working to address the issue ASAP, and are hoping to have the issue resolved by 10:45 AM CST. A status update will be added to this post by 11:00 AM CST.
Comments
10 comments
Official comment
Update Provided February 6, 2019
Following up on the January 29th outage, CivicPlus has completed the root-cause analysis with the data center and its vendors. As originally reported, a switch failed, exposing a previously unknown bug. While multiple switches exist to provide High Availability, the bug caused additional issues, resulting in a partial storage outage affecting some CivicPlus websites.
In working with the vendor, the data center engineers were able to remediate the issue, and we do not expect any additional outages related to it. On Monday, February 4th, the vendor released a patch that has passed QA and is ready for deployment. The data center will be deploying the patch during this Saturday’s regularly scheduled maintenance window.
CivicPlus architects will be working directly with the data center over the next couple of weeks to evaluate and optimize the system architecture. We will be looking at ways to ensure resiliency is maximized, as well as designing and implementing additional isolation to minimize the impact of these types of situations.
------------------------------
At approximately 9:39 AM Central Standard Time (CST), engineers were alerted to an issue affecting your CivicPlus software solution. Upon initial review, it was determined that one of the switches had failed. Due to the mesh topology of the Netsolus network, this should not have been an issue, as there is redundancy built in to increase resiliency. However, as a result of the switch failure, it appears that our Spine switches that use MCT trunking started to produce a 100Gbps looping traffic storm through a previously unknown bug that appears to be related to the MCT protocol. This overwhelmed some of the 10Gbps switches that carry traffic throughout the datacenter and cloud pods.
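As a rough illustration of why the storm described above overwhelmed the 10Gbps switches, the sketch below compares the two line rates quoted in this report. It is a back-of-the-envelope calculation only: it assumes the worst case in which the full 100Gbps of looping traffic is offered to a single 10Gbps link, which this report does not state, and the function names are hypothetical.

```python
def oversubscription_ratio(offered_gbps: float, link_gbps: float) -> float:
    """Return how many times the offered load exceeds the link capacity."""
    if link_gbps <= 0:
        raise ValueError("link capacity must be positive")
    return offered_gbps / link_gbps


def dropped_fraction(offered_gbps: float, link_gbps: float) -> float:
    """Return the fraction of offered traffic a saturated link cannot forward.

    Assumption: the link forwards at most its line rate and drops the rest.
    """
    if offered_gbps <= link_gbps:
        return 0.0  # link is not saturated, nothing is dropped
    return 1.0 - link_gbps / offered_gbps


# Worst-case assumption: the whole storm lands on one 10Gbps link.
STORM_GBPS = 100.0
LINK_GBPS = 10.0

ratio = oversubscription_ratio(STORM_GBPS, LINK_GBPS)
dropped = dropped_fraction(STORM_GBPS, LINK_GBPS)
print(f"oversubscription: {ratio:.0f}x, dropped: {dropped:.0%}")
```

Under that assumption, legitimate traffic competes with the storm for the same link, which is consistent with the heavy latency and packet loss reported.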
We immediately began working on a fix and engaged the switching vendor to find a resolution. At this point a firmware bug was suspected, and a workaround for the MCT trunking protocol settings was implemented to mitigate the issue. With heavy latency and packet loss in pod communications due to the traffic storm, VMware started treating the network issues as an HA services event, which caused additional problems and delayed recovery.
CivicPlus continues to work with the vendor to identify the root cause that led to this outage. While we do not expect any additional outages related to this event, we continue to monitor the situation and are taking steps to stage additional hardware in the event the issue returns.
An in-depth incident investigation is in progress. Once we have completed the root cause analysis with the vendor, we will share those details.
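To illustrate how packet loss alone can look like a host failure to a high-availability system, here is a minimal, generic heartbeat-timeout sketch. It is not VMware's actual algorithm: the function name, the threshold of three missed heartbeats, and the sample sequence are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def first_false_failover(heartbeats_received: Sequence[bool],
                         max_missed: int = 3) -> Optional[int]:
    """Return the interval index at which a peer would be declared failed.

    heartbeats_received[i] is True if the heartbeat for interval i arrived.
    The peer is declared failed after max_missed consecutive misses
    (hypothetical threshold). Returns None if that never happens.
    """
    missed = 0
    for index, arrived in enumerate(heartbeats_received):
        # Any heartbeat that arrives resets the consecutive-miss counter.
        missed = 0 if arrived else missed + 1
        if missed >= max_missed:
            return index
    return None


# The host is healthy throughout, but the storm drops three heartbeats in a row.
during_storm = [True, False, False, False, True]
print(first_false_failover(during_storm))
```

In this sketch the monitor cannot distinguish a dead host from a congested network, so a healthy host is flagged once enough consecutive heartbeats are lost.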
CivicPlus engineers continue their investigation to identify and resolve this issue as quickly as possible. A status update will be added to this post by 11:35 AM CST.
CivicPlus engineers continue their investigation to identify and resolve this issue as quickly as possible. A status update will be added to this post by 12:00 PM CST.
CivicPlus engineers continue their investigation to identify and resolve this issue as quickly as possible. A status update will be added to this post by 12:30 PM CST.
CivicPlus engineers continue their investigation to identify and resolve this issue as quickly as possible. A status update will be added to this post by 1:00 PM CST.
CivicPlus engineers continue their investigation to identify and resolve this issue as quickly as possible. We appreciate your patience and understanding as we continue to diligently work on this issue. A status update will be added to this post by 1:45 PM CST.
CivicPlus Systems Engineers have applied a fix and have begun seeing a restoration of service. Gradual service restoration is to be expected. An Incident Report outlining details of the issue will be added to this post as soon as it is available.
CivicPlus Systems Engineers are closely monitoring the restoration of service to CivicPlus software solutions. It is expected that recovering sites will experience degraded load speeds for the next hour as systems normalize. A status update will be added to this post by 3:00 PM CST.
CivicPlus Systems Engineers are continuing to closely monitor the restoration of service to CivicPlus software solutions. It is expected that recovering sites will experience degraded load speeds as systems normalize. Engineers will distribute additional details of the outage in a post before 5 PM CST.
CivicPlus Systems Engineers are continuing to closely monitor the restoration of service to CivicPlus software solutions. It is expected that recovering sites will experience degraded load speeds as systems normalize. Engineers will distribute additional details of the outage in a post before 8 PM CST.
Post is closed for comments.
Available Templates
No Change Update
CivicPlus Systems Engineers continue to work on resolving this issue as quickly as possible. An update will be added to this post when it is available.
Timeline Identified
CivicPlus Systems Engineers have identified the issue and expect to have a resolution deployed by ESTIMATED TIMELINE (if not a specific date, include business hours / business days clarification).
Fix Implemented
CivicPlus Systems Engineers have applied a fix and have restored service to your CivicPlus Solution. An incident report will be posted here within 24 business hours.