Network design

From Wikitech
(Difference between revisions)
(Inflexible internal network setup: - smart ass comment :-))
(+links)
 
(42 intermediate revisions by 16 users not shown)
Line 1: Line 1:
The purpose of this page is to give an overview of the current '''design of the network''' of the Wikimedia servers, and to provide a place to develop a new and improved network scheme.
+
== AS 14907 ==
 +
The US network.
  
==Overall system design==
+
=== 2011 ===
The following is the general system design plan which the network layer must efficiently accommodate.
+
[[File:Eqiad logical.png|thumb|400px|AS14907 Eqiad in 2011]]
 +
[[File:Wikimedia pmtpa management network.png|thumb|400px|AS14907 in 2010]]
  
*Databases in a central pool with each serving a subset of the wikis, so each has high cache efficiency and the total number needed to handle any query load is minimised. Database servers cost US$5,000-$8,000 each, depending on exact equipment.
+
=== Subnets ===
*A central pair of old text database servers (part of the long term storage growth plan for the databases, to move this high volume and seldom accessed data off costly and comparatively small disk systems).
+
*Memcached caching spread on apaches across the whole cluster, producing one very large cache pool, accessible from any apache and stored on half or more of the apaches. Segmenting the pool would decrease the overall hit rate, increasing the number of apaches and database servers required for any given system load level.
+
*Load balancing of squids and apaches, currently expected to use two or three systems between the internet and the squids and the same set between the squids and the apaches.
+
  
A key network systems design requirement is efficient access from any apache to any apache running memcached (expected to be more than half of all apaches) and efficient access from any apache to any database server. Losing this capability would dramatically increase overall system cost.
+
==== [[pmtpa]] ====
  
== Current situation ==
+
==== [[eqiad]] ====
Wikimedia servers reside in two racks along with Bomis servers, hosted at [http://www.candidhosting.com Candidhosting]. Wikimedia/Bomis have a dedicated IP range, <tt>207.142.131.192/26</tt>. There are two gateways: <tt>207.142.131.193</tt> and <tt>207.142.131.225</tt>, but they both resolve to the same MAC address, so they are almost certainly the same router. Total burstable bandwidth is 200 Mbit/s, delivered through two separate 100BaseTx uplinks, connected from the same broadcast domain that is ''shared with other customers''.
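The address arithmetic in the paragraph above can be checked with Python's standard <tt>ipaddress</tt> module. This is only a minimal sketch: the helper name <tt>first_hosts</tt> is made up for illustration, and the split into two /27 halves is an arithmetic observation, not a statement about how the colo actually configured its router.

```python
import ipaddress

# Values quoted in the paragraph above.
block = ipaddress.ip_network("207.142.131.192/26")
gateways = [
    ipaddress.ip_address("207.142.131.193"),
    ipaddress.ip_address("207.142.131.225"),
]


def first_hosts(network, new_prefix):
    """Return the first usable host of each subnet of `network` (illustrative helper)."""
    return [next(iter(sub.hosts())) for sub in network.subnets(new_prefix=new_prefix)]


# Both gateways lie inside the dedicated /26.
all_inside = all(gw in block for gw in gateways)

# Splitting the /26 into two /27 halves and taking the first host of each half.
halves = first_hosts(block, 27)

print(block.num_addresses, all_inside, [str(h) for h in halves])
```

The two gateway addresses coincide with the first usable host of each /27 half of the range, which fits the observation that both are served by the same device.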
+
{| class="wikitable"
 +
|-
 +
!subnet          !! vlan ID !! IPv4 !! IPv6
 +
|-
 +
| public1-a-eqiad  || 1001 || 208.80.154.0/26  || 2620:0:861:1::/64
 +
|-
 +
| public1-b-eqiad  || 1002 || 208.80.154.128/26 || 2620:0:861:2::/64
 +
|-
 +
| public1-c-eqiad  || 1003 ||                  || 2620:0:861:3::/64
 +
|-
 +
| public1-d-eqiad  || 1004 ||                  || 2620:0:861:4::/64
 +
|-
 +
| private1-a-eqiad || 1017 || 10.64.0.0/22      || 2620:0:861:101::/64
 +
|-
 +
| private1-b-eqiad || 1018 || 10.64.16.0/22    || 2620:0:861:102::/64
 +
|-
 +
| private1-c-eqiad || 1019 || 10.64.32.0/22    || 2620:0:861:103::/64
 +
|-
 +
| private1-d-eqiad || 1020 || 10.64.48.0/22    || 2620:0:861:104::/64
 +
|}
  
Wikimedia owns three [[switches]]. As the two uplinks are not allowed to create a loop, they must be connected to different switches that are not connected to each other (when not using [[Wikipedia:Spanning Tree Protocol|STP]]), which is not an ideal situation. A third switch is currently used to connect internal servers that don't have public IPs and should not be accessible from the Internet. The IP range used for this internal network is <tt>10.0.0.0/8</tt>.
 
  
Future plans include remote Squid cache servers which will relieve this network of about 70% of the traffic each set of squids is configured to handle.
+
== AS 43821 ==
 +
The European network.
  
== Problems ==
+
=== 2008 ===
The current network setup is not optimal in many ways, as will be described here.
+
[[File:Knams-multihomed.png|thumb|400px|AS43821 in 2008]]
  
In December 2004, Wikimedia's [http://65.59.189.201/www.bomis-total/www.bomis-total.html average outgoing traffic] is 45 Mbit/s on a one-month average, with daily peaks above 60 Mbit/s for many hours, above 70 Mbit/s for significant periods, and occasional brief surges to 100 Mbit/s. Several months ago the colo offered gigabit for $400 per month extra unless we averaged over 60 Mbit/s, and instead provided a pair of 100 Mbit/s uplinks, which created some potential routing issues. As bandwidth rose, the colo agreed to provide a gigabit optical connection at no extra charge.
+
BGP default transit from AS1145 (Kennisnet), with some partial transit and peering over a 1 Gbps AMS-IX link. Everything on one core router/switch, csw1-knams (Foundry BigIron RX-8).
  
=== Inflexible internal network setup ===
+
=== 2009 ===
The Wikimedia network was recently split in two parts: the ''external'', publicly visible network containing machines that need to be accessed from the Internet (the Squids, mostly), and an ''internal'' network for machines that are only accessed by other Wikimedia servers (Apaches, DB servers, management devices). Some servers, like the Squids, need to be in both networks because they serve as gateways between the Internet and the internal machines.
+
[[File:AS43821 2009.png|thumb|400px|AS43821 in 2009]]
  
The internal network is currently implemented as a physically separate switch. This switch is not connected to the other two, and the only paths to the external network are through the servers that are on both networks. These servers use separate interfaces to connect to the different networks (<tt>eth0</tt> for internal, <tt>eth1</tt> for external).
+
Temporary situation after the move from knams to esams. The network is split, with a new Foundry BigIron RX-4 as a pure router at knams for external connectivity, with Telia, DataHop, Init7 (partial) transit, and 2x 1 Gbps AMS-IX for peering. Connectivity between the two sites is supplied by a 10GBase-ER link over dark fiber, and a 3 Gbps MPLS backup link. A second dark fiber is being installed to form a ring.
  
Using physically separate switches for different networks is inflexible. This design does not permit efficient use of resources like switch ports and bandwidth. It requires extra switches when the internal network is full, even if the switches for the external network have plenty of ports free. Even the switches currently in use support [[Wikipedia:Virtual LAN|VLANs]] (including '''802.1Q'''), with all their advantages, so it would be good to use them.
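To make concrete what 802.1Q adds, here is a minimal Python sketch of the 4-byte tag that a switch inserts into an Ethernet frame on a trunk port. The field layout (TPID <tt>0x8100</tt>, then 3 bits priority, 1 bit DEI, 12 bits VLAN ID) is from the standard; the function names and the VLAN numbers 10 and 20 are purely illustrative assumptions and not part of any actual configuration.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q-tagged frame


def build_dot1q_tag(vlan_id, priority=0, dei=0):
    """Pack the 4-byte 802.1Q tag: TPID followed by the TCI (PCP, DEI, VID)."""
    if not 1 <= vlan_id <= 4094:  # VLAN IDs 0 and 4095 are reserved
        raise ValueError("VLAN ID must be in 1..4094")
    if not 0 <= priority <= 7:
        raise ValueError("priority must be in 0..7")
    if dei not in (0, 1):
        raise ValueError("DEI must be 0 or 1")
    tci = (priority << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)


def parse_dot1q_tag(tag):
    """Return (priority, dei, vlan_id) from a 4-byte 802.1Q tag."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return tci >> 13, (tci >> 12) & 1, tci & 0x0FFF


# Hypothetical numbering: VLAN 10 for the internal network, VLAN 20 for the external one.
internal_tag = build_dot1q_tag(10)
external_tag = build_dot1q_tag(20, priority=5)
print(internal_tag.hex(), external_tag.hex())
```

Because the VLAN ID travels inside each frame, internal and external traffic can share one physical switch and one trunk link while staying in separate broadcast domains, which is the flexibility the paragraph above asks for.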
 
  
:Plan is to switch to a VLAN once we find out what's connected to each switch port - Kate
+
=== 2010 ===
 +
[[File:AS43821 Q3 2010.png|thumb|400px|AS43821 late 2010]]
  
::So I'm guessing if Jimbo and I like one another's body odor, that some phy mapping will be high on my list?  -- [[User:Baylink|Baylink]] 19:03, 4 Jan 2005 (UTC)
+
The purchase of several Juniper EX4200s in a stack, for extra access ports for servers, also brings some opportunities w.r.t. the network topology. Since the EX4200s have excellent L3 support they can help create redundancy.
  
=== Failover default routing using BGP ===
+
The 2nd dark fiber is linked between [[br1-knams]] and [[csw2-esams]] to create a ring. [[csw1-esams]] and [[csw2-esams]] can then share responsibility as core switches, for inter-VLAN routing and switching, using VRRP. Since an EX4200 cannot install a full BGP routing table in FIB, it defaults to either of the two Foundry routers using OSPF.
Because the internal servers are not directly connected to the Internet, both Zwinger and Albert are set up to ''Source NAT'' traffic originated by these internal servers, to allow them to access Internet servers for management purposes.
+
  
Two hosts are configured as routers, to provide failover support. This, however, is done using [[BGP]] and [[Wikipedia:Quagga|Quagga]] on all boxes. This seems to be a bit excessive, as better and easier solutions exist for this job: [[Wikipedia:VRRP|VRRP]] and [[Wikipedia:Common Address Redundancy Protocol|CARP]]. These solutions only need to be implemented on the routers, and don't require complicated daemons and protocols to run on each host.
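As a rough illustration of why VRRP needs nothing on the hosts, the sketch below models only its master election rule: among the routers that are up, the one with the highest priority owns the virtual gateway address, and ties go to the higher interface address. The router names, addresses and priorities are hypothetical, and the model leaves out advertisement timers, preemption settings and everything else a real implementation does.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Router:
    name: str        # hypothetical router name
    priority: int    # VRRP priority, higher wins
    address: str     # primary IP address of the router's interface
    alive: bool = True


def elect_master(routers):
    """Return the router that would own the virtual IP, or None if all are down."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    # Highest priority wins; equal priorities are decided by the higher IP address.
    return max(candidates, key=lambda r: (r.priority, ipaddress.ip_address(r.address)))


# Hypothetical pair of routers sharing one virtual gateway address.
router_a = Router("router-a", priority=200, address="10.0.0.2")
router_b = Router("router-b", priority=100, address="10.0.0.3")

normal = elect_master([router_a, router_b])
failed_over = elect_master(
    [Router("router-a", 200, "10.0.0.2", alive=False), router_b]
)
print(normal.name, failed_over.name)
```

The hosts keep a single static default gateway (the virtual address) in both cases; only the routers take part in the election.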
+
Toolserver can be connected redundantly as well, using (R)STP to both core switches and VRRP, or alternatively a LAG to the EX4200 stack.
  
=== Limited switch features ===
+
== Configuration guidelines ==
 +
* Firewall filters, policies, prefix lists, etc. that are specific to a certain protocol family (e.g. only IPv4, or only IPv6) should have a '4' or '6' appended to their name. Filters, policies and prefix lists that are protocol family agnostic should lack this suffix.
  
T.b.d.
+
== See also ==
 +
* [[Multicast]]
 +
* [[TCP Tuning]]
  
=== Load balancing ===
+
[[Category:Network]]
 
+
[[Category:knams cluster| *]]
T.b.d.
+
[[Category:Pmtpa cluster| *]]
 
+
== Proposed solutions ==
+
 
+
This section discusses some possible solutions to the [[#Problems|problems]] mentioned.
+
 
+
=== Separate external VLAN ===
+
 
+
Nowadays it's becoming quite standard in the colocation business to put each customer into a separate [[Wikipedia:VLAN|VLAN]], along with their own IP subnet and gateway. That way, each customer gets its own broadcast domain, which prevents IP conflicts, traffic snooping, paying for broadcast traffic generated by other customers, etc. This solution is commonly used, even for small customers with only a single server. It does, however, require some extra configuration work by the colo provider: a separate VLAN has to be created on the switch(es) and router(s), and a specific gateway IP has to be provided to each customer.
+
 
+
Wikimedia already has its own IP range and corresponding gateway(s); it isn't a small customer, and is actually the fastest growing customer at the colo. It is therefore surprising that we don't already reside in a separate VLAN, separated from other customers. We should ask the colo provider to put us into a dedicated VLAN, as this requires little effort from them and has considerable security and performance benefits. The colo has indicated that we can expect this to happen when we are switched to a gigabit fibre connection.
+
 
+
No configuration changes on our network equipment or servers are required.
+
 
+
'''Note:''' This has nothing to do with VLANs we define on our own equipment, as proposed in the next section. This is totally separate from and transparent to our network. In fact, they do not necessarily have to implement this using VLANs, it's just how it's commonly done.
+
 
+
=== Using VLANs and 802.1Q ===
+
 
+
T.b.d.
+
 
+
=== Kate's proposed solution ===
+
After some discussion:
+
 
+
I propose we consider the network in terms of blocks of (say, 46) apaches + a db slave, which can be put on one cheap switch, connected to a core managed switch, and managed as one thing.  Thus, we will purchase:
+
 
+
* 1 core switch: [http://www.cisco.com/en/US/products/hw/switches/ps5023/ps5226/index.html Cisco 3750 48-port gigabit layer 2/3 managed switch]. Cost: about $6000. (Alternatives: [http://www.cisco.com/en/US/products/ps6021/products_data_sheet0900aecd8017a72e.html 4948] (more expensive); [http://www.cisco.com/en/US/products/hw/switches/ps606/products_data_sheet09186a00801cfafe.html 2948] (cheaper).)
+
**The 3750 is stackable and can do 32Gbps over 468 ports, but I don't think we will use it in this configuration.
+
*For now, we can use existing netgear switches as access switches.  In future we can either buy more, or use 2948s - TBD.
+
 
+
This way, we can do management/access control by treating each access switch as a single unit, and simplify network management without too much extra outlay; the initial investment in the core switch will do us for a long time.
+
 
+
== Proposed design ==
+
 
+
T.b.d.
+
 
+
-- [[User:Mark|Mark]] 15:46, 22 Oct 2004 (UTC)
+

Latest revision as of 22:43, 20 February 2012

== AS 14907 ==

The US network.

=== 2011 ===

[Figure: AS14907 Eqiad in 2011]
[Figure: AS14907 in 2010]

=== Subnets ===

==== pmtpa ====

==== eqiad ====

{| class="wikitable"
|-
! subnet !! vlan ID !! IPv4 !! IPv6
|-
| public1-a-eqiad || 1001 || 208.80.154.0/26 || 2620:0:861:1::/64
|-
| public1-b-eqiad || 1002 || 208.80.154.128/26 || 2620:0:861:2::/64
|-
| public1-c-eqiad || 1003 || || 2620:0:861:3::/64
|-
| public1-d-eqiad || 1004 || || 2620:0:861:4::/64
|-
| private1-a-eqiad || 1017 || 10.64.0.0/22 || 2620:0:861:101::/64
|-
| private1-b-eqiad || 1018 || 10.64.16.0/22 || 2620:0:861:102::/64
|-
| private1-c-eqiad || 1019 || 10.64.32.0/22 || 2620:0:861:103::/64
|-
| private1-d-eqiad || 1020 || 10.64.48.0/22 || 2620:0:861:104::/64
|}
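A subnet plan like the eqiad table above is easy to sanity-check mechanically. The following Python sketch, using only the standard <tt>ipaddress</tt> module, parses the listed prefixes, verifies that each is a valid network address, and looks for overlaps. The dictionary layout and function names are assumptions made for this example; the prefixes and VLAN IDs are copied from the table.

```python
import ipaddress
from itertools import combinations

# name: (VLAN ID, IPv4 prefix or None, IPv6 prefix), copied from the eqiad table.
EQIAD_SUBNETS = {
    "public1-a-eqiad": (1001, "208.80.154.0/26", "2620:0:861:1::/64"),
    "public1-b-eqiad": (1002, "208.80.154.128/26", "2620:0:861:2::/64"),
    "public1-c-eqiad": (1003, None, "2620:0:861:3::/64"),
    "public1-d-eqiad": (1004, None, "2620:0:861:4::/64"),
    "private1-a-eqiad": (1017, "10.64.0.0/22", "2620:0:861:101::/64"),
    "private1-b-eqiad": (1018, "10.64.16.0/22", "2620:0:861:102::/64"),
    "private1-c-eqiad": (1019, "10.64.32.0/22", "2620:0:861:103::/64"),
    "private1-d-eqiad": (1020, "10.64.48.0/22", "2620:0:861:104::/64"),
}


def parse_networks(subnets):
    """Return every listed prefix as an ipaddress network object."""
    networks = []
    for _vlan_id, v4_prefix, v6_prefix in subnets.values():
        for prefix in (v4_prefix, v6_prefix):
            if prefix is not None:
                # strict=True raises ValueError if the prefix has host bits set.
                networks.append(ipaddress.ip_network(prefix, strict=True))
    return networks


def find_overlaps(networks):
    """Return all pairs of same-family networks that overlap."""
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if a.version == b.version and a.overlaps(b)
    ]


networks = parse_networks(EQIAD_SUBNETS)
overlaps = find_overlaps(networks)
private_v4_addresses = sum(
    n.num_addresses for n in networks if n.version == 4 and n.is_private
)
print(len(networks), overlaps, private_v4_addresses)
```

Note that each private /22 starts on a /20 boundary (10.64.0.0, 10.64.16.0, and so on), so the listed ranges leave unallocated space between them.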


== AS 43821 ==

The European network.

=== 2008 ===

[Figure: AS43821 in 2008]

BGP default transit from AS1145 (Kennisnet), with some partial transit and peering over a 1 Gbps AMS-IX link. Everything on one core router/switch, csw1-knams (Foundry BigIron RX-8).

=== 2009 ===

[Figure: AS43821 in 2009]

Temporary situation after the move from knams to esams. The network is split, with a new Foundry BigIron RX-4 as a pure router at knams for external connectivity, with Telia, DataHop, Init7 (partial) transit, and 2x 1 Gbps AMS-IX for peering. Connectivity between the two sites is supplied by a 10GBase-ER link over dark fiber, and a 3 Gbps MPLS backup link. A second dark fiber is being installed to form a ring.


=== 2010 ===

[Figure: AS43821 late 2010]

The purchase of several Juniper EX4200s in a stack, for extra access ports for servers, also brings some opportunities w.r.t. the network topology. Since the EX4200s have excellent L3 support they can help create redundancy.

The 2nd dark fiber is linked between br1-knams and csw2-esams to create a ring. csw1-esams and csw2-esams can then share responsibility as core switches, for inter-VLAN routing and switching, using VRRP. Since an EX4200 cannot install a full BGP routing table in FIB, it defaults to either of the two Foundry routers using OSPF.

Toolserver can be connected redundantly as well, using (R)STP to both core switches and VRRP, or alternatively a LAG to the EX4200 stack.

== Configuration guidelines ==

* Firewall filters, policies, prefix lists, etc. that are specific to a certain protocol family (e.g. only IPv4, or only IPv6) should have a '4' or '6' appended to their name. Filters, policies and prefix lists that are protocol family agnostic should lack this suffix.
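The naming guideline can be expressed as a small check, for instance as part of a script that reviews configuration before it is committed. This Python sketch is only an illustration: the function name, the family labels <tt>"ipv4"</tt>/<tt>"ipv6"</tt> and the example object names are all invented here and do not come from any real router configuration.

```python
def name_follows_guideline(name, families):
    """Check an object name against the '4'/'6' suffix guideline.

    `families` is the set of protocol families the object applies to,
    using the illustrative labels "ipv4" and "ipv6".
    """
    families = frozenset(families)
    if families == {"ipv4"}:
        return name.endswith("4")
    if families == {"ipv6"}:
        return name.endswith("6")
    # Family-agnostic objects should carry neither suffix.
    return not name.endswith(("4", "6"))


# Hypothetical object names, for illustration only.
checks = [
    name_follows_guideline("border-in4", {"ipv4"}),
    name_follows_guideline("border-in6", {"ipv6"}),
    name_follows_guideline("loopback", {"ipv4", "ipv6"}),
    name_follows_guideline("border-in", {"ipv4"}),
]
print(checks)
```

One limitation of such a simple check: a family-agnostic name that happens to end in the digit 4 or 6 for another reason (for example an AS number) would be flagged, so the result is a hint for a human reviewer rather than a hard rule.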

== See also ==

* Multicast
* TCP Tuning
