
Performance Issue Identified | Nov 17 2022

Comments

11 comments

  • Official comment
    Mike Smith

    System Disruption Analysis 

    At approximately 3:45 p.m. CST on Nov 17, one of the Storage Area Networks (SAN) stopped serving data. This particular SAN serviced drives for database servers and a portion of the web servers.  

     

    After the manufacturer of the SAN received access, its staff members began their troubleshooting and repair process. By approximately 9:55 p.m., the SAN was functional again. Based on the feedback from the vendor support personnel, a software bug caused the High Availability (HA) to fail back and forth between primary and secondary heads. It is Netsolus' understanding that the storage vendor applied a software patch as part of the repair process. Netsolus does not believe there to be a high risk of a repeat. After the SAN was recovered, Netsolus engineers began the process of restoring the database servers, and they were fully functional by 10:45 p.m. CST. 

     

    Once the database servers were functional, full restoration of the affected web servers began. At this point, the normally simple VMware iSCSI path refresh failed to bring the datastores back into service, so the affected web servers could not yet be brought back online.

     

    During troubleshooting, Netsolus engineers determined that the affected VMware hosts had to be rebooted to resolve the issue. Because these hosts also had datastores and web servers on the SAN that were still functional, vMotion migrations were required to limit the outage to the web servers that were already offline. These migrations added an hour to the recovery effort.

     

    At approximately 10:45 p.m. CST, customer websites began to recover. Most websites had recovered by 1 a.m. CST, and the remaining sites were recovered by 2:45 a.m. CST. 
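    The elapsed times implied by this timeline can be worked out with a short script. This is only an illustrative sketch: the event labels, the `EVENTS` list, and the helper names are hypothetical, and the Nov 18 dates for the a.m. times are inferred from the report rather than stated in it.

```python
from datetime import datetime, timedelta

# Approximate timestamps from the report (all CST).
# Assumption: the a.m. times fall on Nov 18, 2022, the day after the outage began.
EVENTS = [
    ("SAN stopped serving data", datetime(2022, 11, 17, 15, 45)),
    ("SAN functional again", datetime(2022, 11, 17, 21, 55)),
    ("Database servers functional; websites begin to recover",
     datetime(2022, 11, 17, 22, 45)),
    ("Most websites recovered", datetime(2022, 11, 18, 1, 0)),
    ("Remaining websites recovered", datetime(2022, 11, 18, 2, 45)),
]


def elapsed_since_start(events):
    """Return (label, time elapsed since the first event) for every event."""
    start = events[0][1]
    return [(label, when - start) for label, when in events]


def format_delta(delta: timedelta) -> str:
    """Format a timedelta as hours and minutes, e.g. '6h 10m'."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


if __name__ == "__main__":
    for label, delta in elapsed_since_start(EVENTS):
        print(f"{format_delta(delta):>8}  {label}")
```

    Under these assumptions, the SAN itself was unavailable for roughly 6 hours 10 minutes, and the last websites recovered about 11 hours after the initial failure.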

     

    Future Safeguards 

    In response to the outage, Netsolus and CivicPlus are accelerating the implementation of a new architecture that was planned before the incident. The necessary hardware has begun to arrive, and we are moving forward with the implementation.  

     

    This new design decentralizes the architecture to minimize the number of websites impacted at any given time during a major outage. It also provides the ability to recover to on-premises infrastructure, an additional recovery step that the CivicPlus Hosting Team can perform while Netsolus continues to diagnose and recover from primary infrastructure outages. With this additional option, there would be little to no data loss and any lost data could be recovered more quickly, whereas a complete disaster recovery cutover could result in up to 24 hours of data loss. As part of the implementation, we will test failover within the primary data center and the full disaster recovery capability before migrating customer websites onto the new infrastructure.

     

    Additional steps will be taken to further integrate the CivicPlus Incident Response plans, both to improve response and recovery objectives and to set clearer thresholds for major decisions such as cutover to secondary data centers. In addition, CivicPlus is evaluating system options for expedited customer communications during outages so that customers receive system information and progress updates promptly.

     

    If you have any questions about this incident report, please contact security@civicplus.com

  • Carl Bowen

    CivicPlus Systems Engineers continue to work on resolving this issue as quickly as possible. An update will be added to this post when it is available.

  • Carl Bowen

    CivicPlus Systems Engineers continue to work on resolving this issue as quickly as possible. An update will be added to this post when it is available.

  • Carl Bowen

    Data Center Provider continues to work on resolving this issue as quickly as possible. We will continue to work with the Data Center to recover sites, and will post updates every 30 minutes until the issue is resolved.

  • Carl Bowen

    Data Center Provider continues to work on resolving this issue as quickly as possible. We will continue to work with the Data Center to recover sites, and will post updates every 30 minutes until the issue is resolved.

     

  • Carl Bowen

    Data Center Provider continues to work on resolving this issue as quickly as possible. We will continue to work with the Data Center to recover sites, and will post updates every 30 minutes until the issue is resolved.

  • Carl Bowen

    CivicPlus’ Data Center is experiencing an issue with the storage layer affecting availability of your site. The Data Center is working directly with the Storage Area Network (SAN) vendor to diagnose the issue. We will continue to work with the Data Center to restore service to your site ASAP.

  • Carl Bowen

    Data Center Provider continues to work on resolving this issue as quickly as possible. We will continue to work with the Data Center to recover sites, and will post updates every 30 minutes until the issue is resolved.

  • Carl Bowen

    Data Center Systems Engineers have applied a fix, and we are actively restoring sites. CivicPlus will provide an incident report by EOD Friday 11/18/22. CivicPlus will also work with the vendor on a full root cause analysis and an action plan to mitigate the risk of this type of incident.

     

  • Carl Bowen

    At approximately 4 p.m. CST on November 17, 2022, CivicPlus’ third-party hosting provider was alerted to an infrastructure disruption that impacted CivicPlus-managed websites as well as other websites hosted by the data center that are not managed by CivicPlus. Investigation determined that the disruption was caused by storage issues in the underlying infrastructure, not by a cybersecurity attack. No CivicPlus-managed data was compromised due to the outage.

    The hosting provider immediately engaged its storage vendor. At approximately 7:15 p.m. CST, a resolution was determined, and the data center engineers proceeded to restore services.

    However, during the restoration process, issues were discovered that had not previously been observed. At that point, the vendor pulled in additional engineers to diagnose the issue. At approximately 8:15 p.m. CST, the storage vendor identified the issue and applied a patch.

    The data center was able to then proceed with service recovery. At approximately 11 p.m. CST, websites began to recover. The majority of websites had recovered by 1 a.m. CST, and the remaining websites were recovered by 3 a.m. CST.

    Our hosting provider will be working with the storage provider on a root cause analysis, and it is expected that the analysis will take up to five business days. Once CivicPlus receives the analysis, we will work with the hosting provider on any action items related to this service disruption.

    Please contact your customer success manager or CivicPlus Technical Support with any questions.

  • Carl Bowen

    Data Center Provider continues to work on resolving this issue as quickly as possible. We will continue to work with the Data Center to recover sites, and will post updates every 30 minutes until the issue is resolved.


