Network design

(Difference between revisions)
m (Current situation: two gateways, same router)
(+links)
 
(56 intermediate revisions by 18 users not shown)
Line 1: Line 1:
The purpose of this page is to give an overview of the current '''design of the network''' of the Wikimedia servers, and to provide a place to develop a new and improved network scheme.
+
== AS 14907 ==
 +
The US network.
  
== Current situation ==
+
=== 2011 ===
Wikimedia servers reside in two racks along with Bomis servers, hosted at [http://www.candidhosting.com Candidhosting]. Wikimedia/Bomis have a dedicated IP range, <tt>207.142.131.192/26</tt>. There are two gateways: <tt>207.142.131.193</tt> and <tt>207.142.131.225</tt>, but they both resolve to the same MAC address, so they are almost certainly the same router. Total burstable bandwidth is 200 Mbit/s, delivered through two separate 100BaseTx uplinks, connected from the same broadcast domain that is ''shared with other customers''.
+
[[File:Eqiad logical.png|thumb|400px|AS14907 Eqiad in 2011]]
 +
[[File:Wikimedia pmtpa management network.png|thumb|400px|AS14907 in 2010]]
  
Wikimedia owns three [[switches]]. As the two uplinks are not allowed to create a loop, they must be connected to different switches that are not connected to each other (when not using [[Wikipedia:Spanning Tree Protocol|STP]]), which is not an ideal situation. A third switch is currently used to connect internal servers that don't have public IPs and should not be accessible from the Internet. The IP range used for this internal network is <tt>10.0.0.0/8</tt>.
+
=== Subnets ===
  
== Problems ==
+
==== [[pmtpa]] ====
The current network setup is not optimal in many ways, as will be described here.
+
  
=== Multiple uplinks ===
+
==== [[eqiad]] ====
Recently, Wikimedia traffic spiked to 100 Mbit/s multiple times, which is the limit of a single 100BaseTx connection. Also, [http://65.59.189.201/www.bomis-total/www.bomis-total.html average outgoing traffic] at this moment is about 45 Mbit/s, so it is clear that Wikimedia is slowly becoming network limited. However, the colo provider charges $400 per month just to provide us with a Gigabit uplink, unless we commit to 60 Mbit/s average traffic or higher. Instead, they decided to give us a second 100BaseTx for free.
+
{| class="wikitable"
 +
|-
 +
!subnet          !! vlan ID !! IPv4 !! IPv6
 +
|-
 +
| public1-a-eqiad  || 1001 || 208.80.154.0/26  || 2620:0:861:1::/64
 +
|-
 +
| public1-b-eqiad  || 1002 || 208.80.154.128/26 || 2620:0:861:2::/64
 +
|-
 +
| public1-c-eqiad  || 1003 ||                  || 2620:0:861:3::/64
 +
|-
 +
| public1-d-eqiad  || 1004 ||                  || 2620:0:861:4::/64
 +
|-
 +
| private1-a-eqiad || 1017 || 10.64.0.0/22      || 2620:0:861:101::/64
 +
|-
 +
| private1-b-eqiad || 1018 || 10.64.16.0/22    || 2620:0:861:102::/64
 +
|-
 +
| private1-c-eqiad || 1019 || 10.64.32.0/22    || 2620:0:861:103::/64
 +
|-
 +
| private1-d-eqiad || 1020 || 10.64.48.0/22    || 2620:0:861:104::/64
 +
|}
  
This does pose some problems though. Because the two uplinks are connected from the same [[Wikipedia:broadcast domain|broadcast domain]], we cannot connect them internally, or we would create a loop. One solution to this problem is to connect the uplinks to different switches that are not connected, but this means that hosts on the two different switches can only exchange traffic with each other through the uplinks. This traffic is ''graphed and billed'' '''twice''', and is a ''bottleneck'', as it has to traverse both relatively slow uplinks.
 
  
=== Shared broadcast domain ===
+
== AS 43821 ==
 +
The European network.
  
It appears that, even though Wikimedia has a dedicated IP range, the broadcast domain is shared with other customers. Running <tt>tethereal</tt> shows a lot of non-Wikipedia traffic. It's odd that Wikipedia doesn't have its own broadcast domain (probably implemented as a separate [[Wikipedia:VLAN|VLAN]] at the upstream provider), as there doesn't seem to be a reason not to.
+
=== 2008 ===
 +
[[File:Knams-multihomed.png|thumb|400px|AS43821 in 2008]]
  
Within a shared broadcast domain, other customers can snoop Wikimedia traffic, spoof our IPs, and cause unnecessary traffic through our uplinks.
+
BGP default transit from AS1145 (Kennisnet), with some partial transit and peering over a 1 Gbps AMS-IX link. Everything on one core router/switch, csw1-knams (Foundry BigIron RX-8).
  
=== Inflexible internal network setup ===
+
=== 2009 ===
The Wikimedia network was recently split in two parts: the ''external'', publicly visible network containing machines that need to be accessed from the Internet (the Squids, mostly), and an ''internal'' network for machines that are only accessed by other Wikimedia servers (Apaches, DB servers, management devices). Some servers, like the Squids, need to be in both networks because they serve as gateways between the Internet and the internal machines.
+
[[File:AS43821 2009.png|thumb|400px|AS43821 in 2009]]
  
The internal network is currently implemented as a physically separate switch. This switch is not connected to the other two, and the only paths to the external network are through the servers that are on both networks. These, however, don't route traffic. These servers use separate interfaces to connect to the different networks (<tt>eth0</tt> for internal, <tt>eth1</tt> for external).
+
Temporary situation after the move from knams to esams. The network is split, with a new Foundry BigIron RX-4 as a pure router at knams for external connectivity, with Telia, DataHop, Init7 (partial) transit, and 2x 1 Gbps AMS-IX for peering. Connectivity between the two sites is supplied by a 10GBase-ER link over dark fiber, and a 3 Gbps MPLS backup link. A second dark fiber is being installed to form a ring.
  
Using physically separate switches for different networks is inflexible. This design does not permit efficient use of resources like switch ports and bandwidth. It requires extra switches when the internal network is full, even if the switches for the external network have plenty of ports free. Even the currently used switches support [[Wikipedia:Virtual LAN|VLANs]] (including '''802.1Q''') and all of their advantages, so it would be good to use them.
 
  
=== Limited switch features ===
+
=== 2010 ===
 +
[[File:AS43821 Q3 2010.png|thumb|400px|AS43821 late 2010]]
  
== Proposed solutions ==
+
The purchase of several Juniper EX4200s in a stack, for extra access ports for servers, also brings some opportunities w.r.t. the network topology. Since the EX4200s have excellent L3 support they can help create redundancy.
  
== Proposed design ==
+
The 2nd dark fiber is linked between [[br1-knams]] and [[csw2-esams]] to create a ring. [[csw1-esams]] and [[csw2-esams]] can then share responsibility as core switches, for inter-VLAN routing and switching, using VRRP. Since an EX4200 cannot install a full BGP routing table in FIB, it defaults to either of the two Foundry routers using OSPF.
  
-- [[User:Mark|Mark]] 15:46, 22 Oct 2004 (UTC)
+
Toolserver can be connected redundantly as well, using (R)STP to both core switches and VRRP, or alternatively a LAG to the EX4200 stack.
 +
 
 +
== Configuration guidelines ==
 +
* Firewall filters, policies, prefix lists, etc. that are specific to a certain protocol family (e.g. only IPv4, or only IPv6) should have a '4' or '6' appended to their name. Filters, policies and prefix lists that are protocol-family agnostic should lack this suffix.
 +
 
 +
== See also ==
 +
* [[Multicast]]
 +
* [[TCP Tuning]]
 +
 
 +
[[Category:Network]]
 +
[[Category:knams cluster| *]]
 +
[[Category:Pmtpa cluster| *]]

Latest revision as of 22:43, 20 February 2012


AS 14907

The US network.

2011

AS14907 Eqiad in 2011
AS14907 in 2010

Subnets

pmtpa

eqiad

{| class="wikitable"
|-
! subnet !! vlan ID !! IPv4 !! IPv6
|-
| public1-a-eqiad || 1001 || 208.80.154.0/26 || 2620:0:861:1::/64
|-
| public1-b-eqiad || 1002 || 208.80.154.128/26 || 2620:0:861:2::/64
|-
| public1-c-eqiad || 1003 || || 2620:0:861:3::/64
|-
| public1-d-eqiad || 1004 || || 2620:0:861:4::/64
|-
| private1-a-eqiad || 1017 || 10.64.0.0/22 || 2620:0:861:101::/64
|-
| private1-b-eqiad || 1018 || 10.64.16.0/22 || 2620:0:861:102::/64
|-
| private1-c-eqiad || 1019 || 10.64.32.0/22 || 2620:0:861:103::/64
|-
| private1-d-eqiad || 1020 || 10.64.48.0/22 || 2620:0:861:104::/64
|}
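
As a quick illustration of how the eqiad subnet table above can be used, the following Python sketch maps an address to the subnet that contains it. The prefixes are copied from the table; the dictionary layout and the function name <tt>find_subnet</tt> are assumptions made for this example and are not part of any Wikimedia tooling.

```python
import ipaddress

# Prefixes copied from the eqiad subnet table; None marks an empty IPv4 cell.
# The dictionary layout itself is only an illustration.
EQIAD_SUBNETS = {
    "public1-a-eqiad": ("208.80.154.0/26", "2620:0:861:1::/64"),
    "public1-b-eqiad": ("208.80.154.128/26", "2620:0:861:2::/64"),
    "public1-c-eqiad": (None, "2620:0:861:3::/64"),
    "public1-d-eqiad": (None, "2620:0:861:4::/64"),
    "private1-a-eqiad": ("10.64.0.0/22", "2620:0:861:101::/64"),
    "private1-b-eqiad": ("10.64.16.0/22", "2620:0:861:102::/64"),
    "private1-c-eqiad": ("10.64.32.0/22", "2620:0:861:103::/64"),
    "private1-d-eqiad": ("10.64.48.0/22", "2620:0:861:104::/64"),
}


def find_subnet(address):
    """Return the name of the subnet containing `address`, or None if no subnet matches."""
    ip = ipaddress.ip_address(address)
    for name, prefixes in EQIAD_SUBNETS.items():
        for prefix in prefixes:
            if prefix is None:
                continue  # no IPv4 range listed for this subnet
            network = ipaddress.ip_network(prefix)
            # Only compare addresses and networks of the same IP version.
            if ip.version == network.version and ip in network:
                return name
    return None
```

For example, <tt>find_subnet("10.64.17.5")</tt> falls inside 10.64.16.0/22 and so resolves to private1-b-eqiad, while an address outside every listed prefix yields <tt>None</tt>.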


AS 43821

The European network.

2008

AS43821 in 2008

BGP default transit from AS1145 (Kennisnet), with some partial transit and peering over a 1 Gbps AMS-IX link. Everything on one core router/switch, csw1-knams (Foundry BigIron RX-8).

2009

AS43821 in 2009

Temporary situation after the move from knams to esams. The network is split, with a new Foundry BigIron RX-4 as a pure router at knams for external connectivity, with Telia, DataHop, Init7 (partial) transit, and 2x 1 Gbps AMS-IX for peering. Connectivity between the two sites is supplied by a 10GBase-ER link over dark fiber, and a 3 Gbps MPLS backup link. A second dark fiber is being installed to form a ring.


2010

AS43821 late 2010

The purchase of several Juniper EX4200s in a stack, for extra access ports for servers, also brings some opportunities w.r.t. the network topology. Since the EX4200s have excellent L3 support they can help create redundancy.

The 2nd dark fiber is linked between br1-knams and csw2-esams to create a ring. csw1-esams and csw2-esams can then share responsibility as core switches, for inter-VLAN routing and switching, using VRRP. Since an EX4200 cannot install a full BGP routing table in FIB, it defaults to either of the two Foundry routers using OSPF.
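
The idea of a switch that carries only a few specific routes plus a default toward a core router can be sketched with a longest-prefix-match lookup. This is a toy model, not device configuration: the table contents, the documentation prefix 192.0.2.0/24 and the next-hop label <tt>core-router-a</tt> are all hypothetical.

```python
import ipaddress


def lookup(fib, destination):
    """Longest-prefix match: return the next hop of the most specific matching route."""
    ip = ipaddress.ip_address(destination)
    best = None  # (network, next_hop) of the most specific match so far
    for prefix, next_hop in fib.items():
        network = ipaddress.ip_network(prefix)
        if ip.version == network.version and ip in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, next_hop)
    return best[1] if best else None


# Hypothetical small forwarding table: one directly connected prefix and a
# default route pointing at one of the core routers (labels are made up).
fib = {
    "192.0.2.0/24": "connected",
    "0.0.0.0/0": "core-router-a",
}
```

With this table, a destination inside 192.0.2.0/24 is delivered locally, and every other IPv4 destination follows the default route, so the switch never needs the full BGP table.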

Toolserver can be connected redundantly as well, using (R)STP to both core switches and VRRP, or alternatively a LAG to the EX4200 stack.

Configuration guidelines

  • Firewall filters, policies, prefix lists, etc. that are specific to a certain protocol family (e.g. only IPv4, or only IPv6) should have a '4' or '6' appended to their name. Filters, policies and prefix lists that are protocol-family agnostic should lack this suffix.
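
The naming guideline above could be checked mechanically along the following lines. This is only a sketch: the family labels <tt>"ipv4"</tt> and <tt>"ipv6"</tt>, the function names and the sample object names are assumptions for illustration, and a family-agnostic name that legitimately ends in the digit 4 or 6 would be flagged by this simple check.

```python
def expected_suffix(families):
    """Return '4' or '6' for single-family objects, '' for family-agnostic ones."""
    families = set(families)
    if families == {"ipv4"}:
        return "4"
    if families == {"ipv6"}:
        return "6"
    return ""


def name_follows_guideline(name, families):
    """Check a filter/policy/prefix-list name against the suffix guideline."""
    suffix = expected_suffix(families)
    if suffix:
        # Single-family objects must end in the matching digit.
        return name.endswith(suffix)
    # Family-agnostic objects must not carry a '4' or '6' suffix.
    return not name.endswith(("4", "6"))
```

For instance, a hypothetical IPv4-only filter named <tt>border-in4</tt> passes, while the same filter named <tt>border-in</tt> does not.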

See also

  • Multicast
  • TCP Tuning
