Technical Root Cause Analysis: May 18th and May 19th, 2019 Service Disruption
Event: May 18, 2019 at 9:13 a.m. CST
On May 18, 2019, at 9:13 a.m. CST, Netsolus' incoming power feeds experienced a voltage sag (also known as a voltage dip). Under normal conditions, the full-floor UPS corrects a voltage sag by using its batteries to supplement power to the data center until the generators come online, if needed. The battery string serving this UPS failed, and the load to the data center was dropped for two minutes before power was restored.
The UPS manufacturer's technicians have scheduled an emergency battery string replacement, and the batteries are expected on site by Wednesday, May 22. The vendor's technicians will provide a more in-depth technical explanation to us at that time.
Timeline:
- 9:13 a.m.: A voltage sag occurred on phase C, and the data center load was dropped because the battery string had failed.
- 9:15 a.m.: Power was restored to normal, and colocation clients were returned to service.
- 9:20 a.m.: Networking was returned to full service, with the exception of the SD-WAN head units.
- 11:30 a.m.: Storage area networks and 15% of the hosting hardware (about 500 servers) were rebooted.
- 1:15 p.m.: The full cloud layer was restored, and client cloud instances were booted.
- 2:15 p.m.: Most clients' cloud instances were restored; engineers then addressed individual issues with systems that did not come up correctly and moved on to full verification of individual client systems.
- 3:52 p.m.: All sites and services were reported as recovered, including Netsolus Phone Systems.
Moving Forward:
Although the main UPS is partially redundant and has full bypass capabilities, it was identified as a risk in last year's risk assessment, specifically with respect to the private cloud and hosting platforms. A secondary UPS to protect the core and key hosting platforms had already been budgeted and purchased, with installation scheduled for July 2019, and it has already been delivered. This step will mitigate the battery string risk to those platforms.
The battery string was on a standard replacement schedule and was due to be replaced in June. Due to this weekend's failure, the batteries will instead be replaced this week.
Event: May 19, 2019 at 6:54 p.m. CST
On May 19, 2019, at 6:54 p.m. CST, Netsolus was performing a quality assurance (QA) assessment on all systems to catalog what was out of standard as a result of the May 18 emergency maintenance. During this QA process, one of the SANs was found to have booted in a non-HA (high-availability) mode, which is not uncommon for certain SAN models. The risk was assessed, and it was determined that leaving the device with fail-over disabled was a larger risk than re-enabling fail-over.
While fail-over was being enabled, a previously unknown bug was discovered that disrupted the connection of the volumes (drives). Given the unexpected behavior, Netsolus engaged the vendor, who confirmed that they had not seen this issue before. Without a quick way to restore the connections, Netsolus was forced to disconnect and import the volumes, which required manually re-registering all affected servers. This process was extremely lengthy, resulting in a slow recovery.
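The report does not name the SAN platform or the exact re-registration steps, so the following is only a hedged sketch of how that kind of manual work could be scripted across hosts. It assumes Linux servers attached over iSCSI and uses open-iscsi's iscsiadm to rediscover and log back in to the array after a volume import; the portal address and host names are placeholders, not values from this incident.

#!/usr/bin/env python3
# Hypothetical sketch: re-attach iSCSI volumes on affected Linux hosts after
# a SAN volume import. The SAN vendor, host OS, portal address, and host list
# are assumptions for illustration, not details from this incident report.
import subprocess

SAN_PORTAL = "192.0.2.10:3260"          # placeholder discovery portal
AFFECTED_HOSTS = ["web01", "db01"]      # placeholder list of impacted servers

def reregister(host: str) -> None:
    """Rediscover targets and log back in on a single host over SSH."""
    commands = [
        # Re-discover the targets the SAN presents after the volume import.
        f"sudo iscsiadm -m discovery -t sendtargets -p {SAN_PORTAL}",
        # Log in to all discovered targets so the volumes re-attach.
        "sudo iscsiadm -m node --login",
    ]
    for cmd in commands:
        subprocess.run(["ssh", host, cmd], check=True)

if __name__ == "__main__":
    for host in AFFECTED_HOSTS:
        reregister(host)
        print(f"re-registered volumes on {host}")

Scripting the rediscovery and login per host is mainly a way to shorten the kind of lengthy manual recovery described above; the actual procedure would depend on the specific SAN and hypervisor involved.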
Timeline:
- 7:00 p.m.: Fail-over was enabled.
- 7:15 p.m.: It was determined that volumes had to be relabeled.
- 9:00 p.m.: Manufacturer L2 support returned our support call.
- 9:30 p.m.: It was determined that re-registration would be needed, and that work began.
- 2:45 a.m.: All systems were restored to full function with no data loss.
Moving Forward:
The manufacturer will be engaged, and full diagnostics will be run, once all production load is removed from the SAN. We do not want to take the risk that their “diagnostics” will cause another relabeling event.
The SAN was already scheduled to be decommissioned, and the migration is in progress. Servers that had already been migrated were not impacted by this event. We are expediting the migration to move all remaining production load off the SAN.
If you have questions about this message, please contact our CivicEngage Technical Support Team.