Site issue Aug 6 2012

From Wikitech

Revision as of 23:26, 8 August 2012

Outage Summary

Wikimedia sites experienced an outage on 6th August 2012 that started at about 6:15am PDT (13:15 UTC). Except for the mobile site, the sites were brought back up by 7:18am PDT (14:18 UTC). Mobile site services resumed at about 8:35am PDT (15:35 UTC). The team worked around the outage by rerouting traffic to Tampa, bypassing the Ashburn site.
  • Duration: from about 13:15 UTC to 14:18 UTC; approximately 63 minutes
  • Impact: Wikimedia sites were down throughout that period. The mobile site was not back up until 15:35 UTC.
  • Cause: Network Fiber Cut
  • Resolution: Failed over network traffic and services from Ashburn to Tampa
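
The times quoted in the summary can be cross-checked with a short calculation. The sketch below is illustrative only: the variable and function names are assumptions, the timestamps are the approximate ones given above, and PDT is modelled as a fixed UTC-7 offset rather than looked up in a time zone database.

```python
from datetime import datetime, timedelta, timezone

# PDT is UTC-7; a fixed offset is enough for a single-day check.
PDT = timezone(timedelta(hours=-7), "PDT")

# Approximate times taken from the summary above.
outage_start = datetime(2012, 8, 6, 6, 15, tzinfo=PDT)
main_restored = datetime(2012, 8, 6, 7, 18, tzinfo=PDT)
mobile_restored = datetime(2012, 8, 6, 8, 35, tzinfo=PDT)


def minutes_between(start, end):
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


def as_utc(moment):
    """Format a timezone-aware datetime as HH:MM in UTC."""
    return moment.astimezone(timezone.utc).strftime("%H:%M")


main_outage_minutes = minutes_between(outage_start, main_restored)
mobile_outage_minutes = minutes_between(outage_start, mobile_restored)

print(as_utc(outage_start), as_utc(main_restored), as_utc(mobile_restored))
print(main_outage_minutes, mobile_outage_minutes)
```

With these inputs the main-site outage works out to 63 minutes, matching the Duration figure, and the mobile site was unavailable for about 140 minutes.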

Detail

At about 6:15am PDT, we were alerted to a site issue, and our team found that network connectivity between our two data centers had been severed. When we checked with our network provider in Tampa, they informed us that a fiber cut had occurred and that it was caused by a third-party crew working in the area.
The data centers, one in Ashburn, Virginia and the other in Tampa, Florida, are connected by two separate fiber links for redundancy. While Ashburn serves most of the traffic, it needs to talk to our Tampa data center for backend services (e.g. databases).
To provide network redundancy, we engaged FPL to supply two DWDM systems to deliver the Wikimedia services: a metro access DWDM system from the Franklin Exchange location to the FPL FiberNet POP, and a long-haul DWDM system. Each of these DWDM systems is routed over diverse fibers using the dual entrances into the FPL FiberNet Tampa POP. This design can deliver the two diversely routed 10G waves provided to Wikimedia as long as the metro segment of wave #1 and the long-haul segment of wave #1 are on the same route.
After initial troubleshooting of the Wikimedia waves, FPL FiberNet found that the root cause of the unexpected alarms was a folded fiber segment that both of the unprotected Wikimedia services traversed. The fiber cable damage occurred within this folded segment, causing the loss of service.

The post-outage investigation showed that the metro access segment of wave #1 was incorrectly routed on the same side as the long-haul segment of wave #2. The fiber cut occurred along the metro access segment, which carried both wave #1 and wave #2.
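
The routing fault can be illustrated with a small model. This is a minimal sketch under stated assumptions, not a description of FPL FiberNet's actual topology: each wave is reduced to two segments (metro access and long haul), and the labels `side_A` and `side_B` are hypothetical names for the two diverse routes. The report does not say which physical route each wave used.

```python
def shared_segments(wave_a, wave_b):
    """Return the segments that two waves have in common."""
    return set(wave_a) & set(wave_b)


def waves_surviving(waves, cut_segment):
    """Return the sorted names of waves that do not traverse the cut segment."""
    return sorted(name for name, segs in waves.items() if cut_segment not in segs)


# Intended design: each wave stays on its own side end to end.
# "side_A" / "side_B" are hypothetical labels for the two diverse routes.
intended = {
    "wave1": {("metro", "side_A"), ("long_haul", "side_A")},
    "wave2": {("metro", "side_B"), ("long_haul", "side_B")},
}

# As found after the outage: the metro segment of wave #1 was on
# the same side as wave #2, so both waves shared one metro segment.
as_built = {
    "wave1": {("metro", "side_B"), ("long_haul", "side_A")},
    "wave2": {("metro", "side_B"), ("long_haul", "side_B")},
}

cut = ("metro", "side_B")

print(shared_segments(intended["wave1"], intended["wave2"]))
print(shared_segments(as_built["wave1"], as_built["wave2"]))
print(waves_surviving(intended, cut))
print(waves_surviving(as_built, cut))
```

In the intended layout the two waves share no segment, so a single cut leaves one wave up. In the as-built layout the shared metro segment is a single point of failure, and a cut there takes down both waves.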
