
Service Disruption: January 29, 2019

Completed

Comments


  • Official comment
    Constance Cooke

    Update Provided February 6, 2019

    Following up on the January 29th outage, CivicPlus has completed the root-cause analysis with the datacenter and its vendors. As originally reported, a switch failed, exposing a previously unknown bug. While multiple switches exist to provide high availability, the bug caused additional issues, resulting in a partial storage outage affecting some CivicPlus websites.

    Working with the vendor, the data center engineers were able to remediate the issue, and we do not expect any additional outages related to it. On Monday, February 4th, the vendor released a patch that has passed QA and is ready for deployment. The data center will deploy the patch during this Saturday’s regularly scheduled maintenance window.

    CivicPlus architects will work directly with the data center over the next couple of weeks to evaluate and optimize the system architecture. We will look for ways to maximize resiliency, as well as design and implement additional isolation to minimize the impact of these types of situations.

     

    ------------------------------

     

    At approximately 9:39 AM Central Standard Time (CST), engineers were alerted to an issue affecting your CivicPlus software solution. Initial review determined that one of the switches had failed. Because of the mesh topology of the Netsolus network, this alone should not have caused an outage, as redundancy is built in to increase resiliency. However, as a result of the switch failure, our spine switches that use MCT trunking began producing a 100 Gbps looping traffic storm, triggered by a previously unknown bug that appears to be related to the MCT protocol. This overwhelmed some of the 10 Gbps switches that carry traffic throughout the datacenter and cloud pods.
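    To illustrate why a forwarding loop is so damaging, the toy model below shows how a small looping broadcast stream grows geometrically until it exceeds a 10 Gbps link. All figures (seed rate, replication factor) are illustrative assumptions for demonstration, not measurements from this incident:

    ```python
    # Toy model only: how a forwarding loop turns a trickle of broadcast
    # traffic into a storm. The numbers below are assumed for illustration.

    LINK_CAPACITY_GBPS = 10.0   # capacity of the access-layer switches cited above
    SEED_TRAFFIC_GBPS = 0.001   # a single looping broadcast stream (1 Mbps, assumed)
    REPLICATION_FACTOR = 2      # each pass of the loop re-floods every frame (assumed)

    def cycles_until_saturated(seed, capacity, factor):
        """Count forwarding cycles before looped traffic exceeds link capacity."""
        traffic, cycles = seed, 0
        while traffic < capacity:
            traffic *= factor   # the loop re-forwards everything it already carries
            cycles += 1
        return cycles, traffic

    cycles, traffic = cycles_until_saturated(
        SEED_TRAFFIC_GBPS, LINK_CAPACITY_GBPS, REPLICATION_FACTOR)
    print(f"Link saturated after {cycles} forwarding cycles "
          f"({traffic:.1f} Gbps offered)")
    ```

    Because loops amplify traffic every forwarding cycle, saturation arrives in a fraction of a second, which is why redundant topologies rely on loop-prevention protocols rather than raw capacity.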

    We immediately began working on a fix and engaged the switching vendor to find a resolution. At that point a firmware bug was suspected, and a workaround for the MCT trunking protocol settings was implemented to mitigate the issue. With heavy latency and packet loss in pod communications due to the traffic storm, VMware began treating the network issues as an HA services event, which caused additional problems and delayed recovery.

    CivicPlus continues to work with the vendor to identify the root cause of this outage. While we do not expect any additional outages related to this event, we continue to monitor the situation and are staging additional hardware in the event the issue returns.

    An in-depth incident investigation is in progress. Once we have completed the root-cause analysis with the vendor, we will share those details.

  • Jackson Wright

    CivicPlus engineers continue their investigation to identify and resolve this issue as quickly as possible. A status update will be added to this post by 11:35 AM CST.

  • Morgan King

    CivicPlus engineers continue their investigation to identify and resolve this issue as quickly as possible. A status update will be added to this post by 12:00 PM CST.

  • Morgan King

    CivicPlus engineers continue their investigation to identify and resolve this issue as quickly as possible. A status update will be added to this post by 12:30 PM CST.

  • Jackson Wright

    CivicPlus engineers continue their investigation to identify and resolve this issue as quickly as possible. A status update will be added to this post by 1:00 PM CST.

  • Morgan King

    CivicPlus engineers continue their investigation to identify and resolve this issue as quickly as possible. We appreciate your patience and understanding as we continue to diligently work on this issue. A status update will be added to this post by 1:45 PM CST.

  • Morgan King

    CivicPlus Systems Engineers have applied a fix and have begun to see service restoration. Gradual restoration of service is expected. An Incident Report outlining details of the issue will be added to this post as soon as it is available.

  • Jackson Wright

    CivicPlus Systems Engineers are closely monitoring the restoration of service to CivicPlus software solutions. It is expected that recovering sites will experience degraded load speeds for the next hour as systems normalize. A status update will be added to this post by 3:00 PM CST.

  • Jackson Wright

    CivicPlus Systems Engineers are continuing to closely monitor the restoration of service to CivicPlus software solutions. It is expected that recovering sites will experience degraded load speeds as systems normalize. Engineers will distribute additional details of the outage in a post before 5 PM CST.

  • Jackson Wright

    CivicPlus Systems Engineers are continuing to closely monitor the restoration of service to CivicPlus software solutions. It is expected that recovering sites will experience degraded load speeds as systems normalize. Engineers will distribute additional details of the outage in a post before 8 PM CST.


Available Templates

No Change Update

CivicPlus Systems Engineers continue to work on resolving this issue as quickly as possible. An update will be added to this post when it is available.

Timeline Identified

CivicPlus Systems Engineers have identified the issue and expect to have a resolution deployed by ESTIMATED TIMELINE (if not a specific date, include business hours / business days clarification).

Fix Implemented

CivicPlus Systems Engineers have applied a fix and have restored service to your CivicPlus Solution. An incident report will be posted here within 24 business hours.
