Site issue Aug 6 2012
From Wikitech
Revision as of 00:18, 9 August 2012
Outage Summary
Wikimedia sites experienced an outage on 6 August 2012 that started at about 6:15am PDT (13:15 UTC). Except for the mobile site, the sites were brought back up by 7:18am PDT (14:18 UTC). Mobile site services resumed at about 8:35am PDT (15:35 UTC). The team worked around the outage by rerouting traffic to Tampa, bypassing the Ashburn site, and failing over services to the Tampa data center.
- Duration: From about 13:15 UTC to 14:18 UTC; approximately 63 minutes
- Impact: Wikimedia sites were down throughout that period. The mobile site did not come back up until 15:35 UTC.
- Cause: A fiber cut, resulting in loss of network connectivity
- Resolution: Failed over network traffic and services from Ashburn to the Tampa data center
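The durations implied by the timestamps above can be checked with a few lines of Python. This is only a sketch: the function name `minutes_between` is made up for illustration, and the times are the approximate UTC values quoted in this summary.

```python
from datetime import datetime, timezone

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two 'YYYY-MM-DD HH:MM' timestamps, both taken as UTC."""
    fmt = "%Y-%m-%d %H:%M"
    t0 = datetime.strptime(start, fmt).replace(tzinfo=timezone.utc)
    t1 = datetime.strptime(end, fmt).replace(tzinfo=timezone.utc)
    return int((t1 - t0).total_seconds() // 60)

# Approximate times from the summary above.
main_outage = minutes_between("2012-08-06 13:15", "2012-08-06 14:18")
mobile_outage = minutes_between("2012-08-06 13:15", "2012-08-06 15:35")

print(main_outage)    # main sites
print(mobile_outage)  # mobile site
```

With these inputs the main sites were down for 63 minutes and the mobile site for 140 minutes.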
Detail
- At about 6:15am PDT, we were alerted to a site issue, and our team found that network connectivity between our two data centers had been severed. When we checked with our network provider in Tampa, they informed us that a third-party crew working in the area had cut a fiber in Tampa, causing the outage.
- The data centers — one in Ashburn, Virginia and the other in Tampa, Florida — are connected by two separate fiber links (for redundancy). While Ashburn serves most of the traffic, it needs to talk to our Tampa data center for backend services (e.g. database).
- To provide network redundancy, we engaged FPL to supply us with two DWDM systems to deliver the Wikimedia services: a metro access DWDM system from the Franklin Exchange location to the FPL FiberNet POP, and a long-haul DWDM system. Each of these DWDM systems is routed over diverse fibers using the dual entrances into the FPL FiberNet Tampa POP. This design can deliver the two diversely routed 10G waves provided to Wikimedia as long as the metro segment of wave #1 and the long-haul segment of wave #1 are on the same route.
- After initial troubleshooting of the Wikimedia waves, FPL FiberNet found the root cause of the unexpected alarms to be a folded fiber segment that both of the unprotected Wikimedia services traversed. The fiber cable damage occurred within this folded fiber segment, causing the loss of service. The post-outage investigation showed that the metro access segment of wave #1 was incorrectly routed on the same side as the long-haul segment of wave #2. The fiber cut occurred along the metro access segment, which carried both wave #1 and wave #2.
- We have asked FPL to audit our waves to ensure that such single points of failure are no longer in the system. We are also in the process of replicating and migrating the rest of our backend services to Eqiad, creating full service redundancy across the two data centers. The plan is to complete that work in Q2.
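The kind of check such an audit performs can be sketched as a search for fiber segments shared by two supposedly diverse waves. The sketch below is illustrative only: the function `shared_segments` and the segment identifiers (`metro-A`, `longhaul-B`, and so on) are hypothetical and do not describe FPL's actual topology or tooling.

```python
from itertools import combinations

def shared_segments(waves):
    """Return {(wave_a, wave_b): segments both waves traverse} for every overlapping pair."""
    overlaps = {}
    for (a, segs_a), (b, segs_b) in combinations(sorted(waves.items()), 2):
        common = set(segs_a) & set(segs_b)
        if common:
            overlaps[(a, b)] = common
    return overlaps

# Hypothetical intended design: the two waves share no segment.
intended = {
    "wave1": ["metro-A", "longhaul-A"],
    "wave2": ["metro-B", "longhaul-B"],
}

# Hypothetical as-built routing: wave1's metro access segment was placed
# on the same segment as wave2, so one cut there takes down both waves.
as_built = {
    "wave1": ["metro-B", "longhaul-A"],
    "wave2": ["metro-B", "longhaul-B"],
}

print(shared_segments(intended))
print(shared_segments(as_built))
```

An empty result means no pair of waves shares a segment. Any non-empty entry names a segment whose failure would interrupt both waves at once, which is the situation described above.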