Site issue Aug 6 2012
From Wikitech
Revision as of 23:28, 8 August 2012
Outage Summary
Wikimedia sites experienced an outage on 6 August 2012 that started at about 6:15 am PDT (13:15 UTC). Except for the mobile site, the sites were brought back up by 7:18 am PDT (14:18 UTC). Mobile site services resumed at about 8:35 am PDT (15:35 UTC). The team worked around the outage by rerouting traffic to Tampa, bypassing the Ashburn site, and failing over services to the Tampa data center.
- Duration: from about 13:15 UTC to 14:18 UTC; approximately 63 minutes
- Impact: Wikimedia sites were down throughout that period. The mobile site was not back up until 15:35 UTC.
- Cause: Network Fiber Cut
- Resolution: Fail-over network traffic and services from Ashburn to Tampa
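The timings in this summary can be cross-checked with a short calculation. The following is a minimal sketch, assuming PDT is a fixed UTC-7 offset and treating the "about" times above as exact; the helper names (`to_utc`, `minutes_between`) are illustrative and not part of any Wikimedia tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT is modeled as a fixed UTC-7 offset for this one day.
PDT = timezone(timedelta(hours=-7), "PDT")


def to_utc(hour, minute):
    """Convert a PDT wall-clock time on 6 August 2012 to UTC."""
    local = datetime(2012, 8, 6, hour, minute, tzinfo=PDT)
    return local.astimezone(timezone.utc)


def minutes_between(start, end):
    """Whole minutes elapsed between two timezone-aware datetimes."""
    return int((end - start).total_seconds() // 60)


outage_start = to_utc(6, 15)      # about 6:15 am PDT
main_restored = to_utc(7, 18)     # main sites back by 7:18 am PDT
mobile_restored = to_utc(8, 35)   # mobile site back at about 8:35 am PDT

main_outage_min = minutes_between(outage_start, main_restored)
mobile_outage_min = minutes_between(outage_start, mobile_restored)

print(outage_start.strftime("%H:%M UTC"), main_outage_min, mobile_outage_min)
```

Under these assumptions the main outage comes to 63 minutes, matching the Duration entry, and the mobile site was unavailable for roughly 140 minutes.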
Detail
- At about 6:15 am PDT, we were alerted to a site issue, and our team found that network connectivity between our two data centers had been severed. When we checked with our network provider in Tampa, they informed us that a fiber cut had occurred and that it was caused by a third-party crew working in the area.
- The data centers — one in Ashburn, Virginia and the other in Tampa, Florida — are connected by two separate fiber links (for redundancy). While Ashburn serves most of the traffic, it needs to talk to our Tampa data center for backend services (e.g. database).
- To provide network redundancy, we engaged FPL to supply us with two DWDM systems to deliver the Wikimedia services: a metro access DWDM system from the Franklin Exchange location to the FPL FiberNet POP, and a long-haul DWDM system. Each of these DWDM systems is routed over diverse fibers using the dual entrances into the FPL FiberNet Tampa POP, making the design capable of delivering the two diversely routed 10G waves provided to Wikimedia as long as the metro segment of wave #1 and the long-haul segment of wave #1 are on the same route.
- After initial troubleshooting of the Wikimedia waves, FPL FiberNet found the root cause of the unexpected alarms to be a folded fiber segment that both of the unprotected Wikimedia services traversed. The fiber cable damage occurred within this folded fiber segment, causing the loss of service. The post-outage investigation showed that the metro access segment of wave #1 was incorrectly routed on the same side as the long-haul segment of wave #2. The fiber cut occurred along the metro access segment, which carried both wave #1 and wave #2.
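The routing fault described above can be illustrated with a small path-diversity check. This is a minimal sketch, not FPL FiberNet's or Wikimedia's tooling: the segment labels (`metro-A`, `metro-B`, `longhaul-A`, `longhaul-B`) and the two route tables are hypothetical stand-ins for the intended and as-built routes, chosen only to mirror the situation in which both waves shared one metro access segment.

```python
from collections import defaultdict


def shared_segments(paths):
    """Return segments used by more than one wave, mapped to those waves."""
    users = defaultdict(set)
    for wave, segments in paths.items():
        for segment in segments:
            users[segment].add(wave)
    return {seg: sorted(waves) for seg, waves in users.items() if len(waves) > 1}


def surviving_waves(paths, cut_segment):
    """Return the waves that do not traverse the cut segment."""
    return sorted(w for w, segs in paths.items() if cut_segment not in segs)


# Hypothetical intended design: each wave stays on its own side end to end.
intended = {
    "wave1": ["metro-A", "longhaul-A"],
    "wave2": ["metro-B", "longhaul-B"],
}

# Hypothetical as-built routing: wave #1's metro segment is on wave #2's side.
as_built = {
    "wave1": ["metro-B", "longhaul-A"],
    "wave2": ["metro-B", "longhaul-B"],
}

print(shared_segments(intended))               # no shared segment
print(shared_segments(as_built))               # metro-B carries both waves
print(surviving_waves(intended, "metro-B"))    # one wave survives the cut
print(surviving_waves(as_built, "metro-B"))    # no wave survives the cut
```

In this model, a single cut on the shared metro segment takes down both waves in the as-built routing, while the intended design would have left one wave in service. A check of this kind only helps if the route records reflect the physical routing.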