Network design

From Wikitech
Revision as of 09:22, 24 February 2005 by Tim (Talk)


The purpose of this page is to give an overview of the current design of the network of the Wikimedia servers, and to provide a place to develop a new and improved network scheme.


Automatically generated information

Generated 2005-02-24

Default gateway

[root@zwinger node_groups]# dsh -N working "route | grep default"
executing 'route | grep default'
albert:         default         207.142.131.193 0.0.0.0         UG    0      0        0 eth0
ariel:          default         207.142.131.225 0.0.0.0         UG    0      0        0 eth0
avicenna:       default         izwinger        0.0.0.0         UG    0      0        0 eth1
bacon:          default         207.142.131.193 0.0.0.0         UG    0      0        0 eth0
bart:           default         207.142.131.193 0.0.0.0         UG    0      0        0 eth0
bayle:          default         207.142.131.193 0.0.0.0         UG    0      0        0 eth0
browne:         default         207.142.131.193 0.0.0.0         UG    0      0        0 eth0
dalembert:      default         207.142.131.193 0.0.0.0         UG    0      0        0 eth0
diderot:        default         207.142.131.225 0.0.0.0         UG    0      0        0 eth0
friedrich:      default         izwinger        0.0.0.0         UG    0      0        0 eth1
goeje:          default         207.142.131.193 0.0.0.0         UG    0      0        0 eth0
harris:         default         izwinger        0.0.0.0         UG    0      0        0 eth1
suda:           default         207.142.131.225 0.0.0.0         UG    0      0        0 eth0
tingxi:         default         izwinger        0.0.0.0         UG    0      0        0 eth0
will:           default         207.142.131.225 0.0.0.0         UG    0      0        0 eth0
zwinger:        default         207.142.131.225 0.0.0.0         UG    0      0        0 eth0
hypatia:        default         207.142.131.193 0.0.0.0         UG    0      0        0 eth0
humboldt:       default         207.142.131.193 0.0.0.0         UG    0      0        0 eth0
kluge:          default         izwinger        0.0.0.0         UG    0      0        0 eth1
khaldun:        default         207.142.131.193 0.0.0.0         UG    0      0        0 eth0
larousse:       default         207.142.131.225 0.0.0.0         UG    0      0        0 eth0
webster:        default         izwinger        0.0.0.0         UG    0      0        0 eth1
holbach:        default         izwinger        0.0.0.0         UG    0      0        0 eth1
benet:          default         izwinger        0.0.0.0         UG    0      0        0 eth0
ibiruni:        default         10.255.255.254  0.0.0.0         UG    0      0        0 eth0
irose:          default         izwinger        0.0.0.0         UG    0      0        0 eth0
ismellie:       default         izwinger        0.0.0.0         UG    0      0        0 eth0
ianthony:       default         izwinger        0.0.0.0         UG    0      0        0 eth0
ennael:         default         router-wikipedi 0.0.0.0         UG    0      0        0 eth0
chloe:          default         router-wikipedi 0.0.0.0         UG    0      0        0 eth0
bleuenn:        default         router-wikipedi 0.0.0.0         UG    0      0        0 eth0

Cables connected

[root@zwinger node_groups]# dsh -N working mii-tool
executing 'mii-tool'
albert:         SIOCGMIIPHY on 'eth0' failed: Operation not supported
albert:         SIOCGMIIPHY on 'eth1' failed: Operation not supported
albert:         no MII interfaces found
alrazi:         eth0: no link
alrazi:         eth1: negotiated 100baseTx-FD, link ok
ariel:          eth0: negotiated 100baseTx-FD, link ok
ariel:          eth1: negotiated 100baseTx-FD flow-control, link ok
avicenna:       eth0: no link
avicenna:       eth1: negotiated 100baseTx-FD flow-control, link ok
bacon:          eth0: negotiated 100baseTx-FD, link ok
bacon:          eth1: negotiated 100baseTx-FD flow-control, link ok
bart:           eth0: negotiated 100baseTx-FD, link ok
bart:           eth1: no link
bayle:          eth0: negotiated 100baseTx-FD, link ok
bayle:          eth1: no link
browne:         eth0: negotiated 100baseTx-FD, link ok
browne:         eth1: no link
dalembert:      eth0: no link
dalembert:      eth1: negotiated 100baseTx-FD, link ok
diderot:        eth0: no link
diderot:        eth1: negotiated 100baseTx-FD flow-control, link ok
friedrich:      eth0: no link
friedrich:      eth1: negotiated 100baseTx-FD, link ok
goeje:          eth0: no link
goeje:          eth1: negotiated 100baseTx-FD flow-control, link ok
harris:         eth0: no link
harris:         eth1: negotiated 100baseTx-FD flow-control, link ok
suda:           eth0: negotiated 100baseTx-FD, link ok
suda:           eth1: negotiated 100baseTx-FD flow-control, link ok
tingxi:         eth0: negotiated 100baseTx-FD flow-control, link ok
tingxi:         eth1: no link
will:           eth0: negotiated 100baseTx-FD, link ok
will:           eth1: negotiated 100baseTx-FD flow-control, link ok
zwinger:        eth0: negotiated 100baseTx-FD, link ok
zwinger:        eth1: negotiated 100baseTx-FD, link ok
hypatia:        eth0: no link
hypatia:        eth1: negotiated 100baseTx-FD flow-control, link ok
humboldt:       eth0: no link
humboldt:       eth1: negotiated 100baseTx-FD flow-control, link ok
kluge:          eth0: no link
kluge:          eth1: negotiated 100baseTx-FD, link ok
khaldun:        eth0: no link
khaldun:        eth1: negotiated 100baseTx-FD flow-control, link ok
larousse:       eth0: negotiated 100baseTx-FD, link ok
larousse:       eth1: negotiated 100baseTx-FD flow-control, link ok
webster:        eth0: negotiated 100baseTx-FD, link ok
webster:        eth1: negotiated 100baseTx-FD flow-control, link ok
holbach:        eth0: negotiated 100baseTx-FD, link ok
holbach:        eth1: negotiated 100baseTx-FD flow-control, link ok
benet:          eth0: negotiated 100baseTx-FD, link ok
ibiruni:        eth0: negotiated 100baseTx-FD flow-control, link ok
ibiruni:        eth1: 10 Mbit, half duplex, no link
irose:          eth0: negotiated 100baseTx-FD, link ok
irose:          eth1: 10 Mbit, half duplex, no link
ismellie:       eth0: negotiated 100baseTx-FD, link ok
ismellie:       eth1: 10 Mbit, half duplex, no link
ianthony:       eth0: negotiated 100baseTx-FD, link ok
ianthony:       eth1: 10 Mbit, half duplex, no link
ennael:         eth0: negotiated 100baseTx-FD flow-control, link ok
ennael:         eth1: no link
chloe:          eth0: negotiated 100baseTx-FD flow-control, link ok
chloe:          eth1: no link
bleuenn:        eth0: negotiated 100baseTx-FD flow-control, link ok
bleuenn:        eth1: no link


Overall system design

The following is the general system design plan which the network layer must efficiently accommodate.

  • Databases in a central pool with each serving a subset of the wikis, so each has high cache efficiency and the total number needed to handle any query load is minimised. Database servers cost US$5,000-$8,000 each, depending on exact equipment.
  • A central pair of old text database servers (part of the long term storage growth plan for the databases, to move this high volume and seldom accessed data off costly and comparatively small disk systems).
  • Memcached caching spread on apaches across the whole cluster, producing one very large cache pool, accessible from any apache and stored on half or more of the apaches. Segmenting the pool would decrease the overall hit rate, increasing the number of apaches and database servers required for any given system load level.
  • Load balancing of squids and apaches, currently expected to use two or three systems between the internet and the squids and the same set between the squids and the apaches.

A key network systems design requirement is efficient access from any apache to any apache running memcached (expected to be more than half of all apaches) and efficient access from any apache to any database server. Losing this capability would dramatically increase overall system cost.

Current situation

Wikimedia servers reside in two racks along with Bomis servers, hosted at Candidhosting. Wikimedia/Bomis have a dedicated IP range, 207.142.131.192/26. There are two gateways, 207.142.131.193 and 207.142.131.225, but both resolve to the same MAC address, so they are almost certainly the same router. Total burstable bandwidth is 1000 Mbit/s, delivered over a single 1000Base-SX fiber-optic link.
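The claim that both gateways share a MAC address can be re-checked from any host on the public subnet; a minimal diagnostic sketch (run as root, using the addresses above):

```shell
# Populate the ARP cache, then compare the hardware addresses of the
# two gateway IPs. Identical MACs imply a single physical router.
ping -c 1 207.142.131.193 > /dev/null
ping -c 1 207.142.131.225 > /dev/null
arp -n 207.142.131.193
arp -n 207.142.131.225
```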

The info here is probably obsolete.

Wikimedia owns three switches. As the two uplinks are not allowed to create a loop, they must be connected to different switches that are not connected to each other (when not using STP), which is not an ideal situation. A third switch is currently used to connect internal servers that don't have public IPs and should not be accessible from the Internet. The IP range used for this internal network is 10.0.0.0/8.

Future plans include remote Squid cache servers which will relieve this network of about 70% of the traffic each set of squids is configured to handle.

Problems

The current network setup is suboptimal in several ways, as described below.

In December 2004, Wikimedia's outgoing traffic averaged 45 Mbit/s over one month, with daily peaks above 60 Mbit/s for many hours, above 70 Mbit/s for significant periods, and occasional brief surges to 100 Mbit/s. Several months ago the colo offered gigabit for an extra $400 per month unless we averaged over 60 Mbit/s, and instead provided a pair of 100 Mbit/s uplinks, which created some potential routing issues. As bandwidth rose, the colo agreed to provide a gigabit optical connection at no extra charge.

Inflexible internal network setup

The Wikimedia network was recently split into two parts: the external, publicly visible network containing machines that need to be accessed from the Internet (the Squids, mostly), and an internal network for machines that are only accessed by other Wikimedia servers (Apaches, DB servers, management devices). Some servers, like the Squids, need to be in both networks because they serve as gateways between the Internet and the internal machines.

The internal network is currently implemented as a physically separate switch. This switch is not connected to the other two, and the only paths to the external network are through the servers that are on both networks. These servers use separate interfaces to connect to the different networks (eth0 for internal, eth1 for external).

Using physically separate switches for different networks is inflexible. This design does not permit efficient use of resources such as switch ports and bandwidth: it requires extra switches when the internal network is full, even if the switches for the external network have plenty of ports free. Even the currently used switches support VLANs (including 802.1Q) and all of their advantages, so it would be good to use them.

Plan is to switch to a VLAN once we find out what's connected to each switch port - Kate
So I'm guessing if Jimbo and I like one another's body odor, that some phy mapping will be high on my list? -- Baylink 19:03, 4 Jan 2005 (UTC)

Failover default routing using BGP

Because the internal servers are not directly connected to the Internet, both Zwinger and Albert are set up to source-NAT traffic originated by these internal servers, allowing them to reach Internet servers for management purposes.
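The source-NAT rule involved would look roughly like the following iptables sketch (the public source address 207.142.131.200 is a placeholder for illustration, not the actual address of Zwinger or Albert):

```shell
# Rewrite the source address of internal (10/8) traffic leaving via the
# public interface (eth1), so that replies can find their way back.
iptables -t nat -A POSTROUTING -s 10.0.0.0/8 -o eth1 \
         -j SNAT --to-source 207.142.131.200
```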

Two hosts are configured as routers to provide failover support. This, however, is done using BGP and Quagga on all boxes. That seems excessive, as better and simpler solutions exist for this job: VRRP and CARP. These only need to be implemented on the routers, and don't require complicated daemons and protocols running on each host.
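As a sketch of the VRRP alternative, a keepalived configuration on the primary router might look like this (the interface name and the virtual gateway address 10.0.0.1 are assumptions for illustration):

```shell
# /etc/keepalived/keepalived.conf on the primary router
vrrp_instance INTERNAL_GW {
    state MASTER            # the backup router would use: state BACKUP
    interface eth0          # internal-facing interface (assumed)
    virtual_router_id 51
    priority 150            # backup router uses a lower priority, e.g. 100
    advert_int 1
    virtual_ipaddress {
        10.0.0.1/8          # internal hosts point their default route here
    }
}
```

If the master fails, the backup takes over the virtual address within a few seconds, and the internal hosts need no routing daemon at all.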

Limited switch features

T.b.d.

Load balancing

T.b.d.

Proposed solutions

This section discusses some possible solutions to the problems mentioned.

Separate external VLAN

Nowadays it is becoming quite standard in the colocation business to put each customer into a separate VLAN, along with their own IP subnet and gateway. Separating customers into distinct broadcast domains prevents IP conflicts, traffic snooping, paying for broadcast traffic generated by other customers, and so on. This solution is commonly used even for small customers with only a single server. It does, however, require some extra configuration work by the colo provider: a separate VLAN has to be created on the switch(es) and router(s), and a specific gateway IP has to be provided to each customer.

Wikimedia already has its own IP range and corresponding gateway(s), and far from being a small customer, is actually the fastest-growing customer at the colo. It is therefore surprising that we don't already reside in a separate VLAN, isolated from other customers. We should ask the colo provider to put us into a dedicated VLAN, as this requires little effort on their part and brings considerable security and performance benefits. The colo has indicated that we can expect this to happen when we are switched to a gigabit fibre connection.

No configuration changes on our network equipment or servers are required.

Note: this has nothing to do with the VLANs we define on our own equipment, as proposed in the next section. This is totally separate from, and transparent to, our network. In fact, the colo does not necessarily have to implement this using VLANs at all; that is just how it is commonly done.

Using VLANs and 802.1Q

Linux now supports 802.1Q right on the wire, so it is possible for a machine with only one Ethernet interface to participate in more than one VLAN on a switch (or switch group) simultaneously.
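As a sketch (VLAN IDs and addresses are purely illustrative), a single-NIC host could be joined to two VLANs using the 8021q kernel module and the vconfig tool:

```shell
modprobe 8021q
# Create tagged sub-interfaces for VLAN 2 (internal) and VLAN 3 (public)
vconfig add eth0 2
vconfig add eth0 3
# Address each sub-interface for its network (example addresses only)
ifconfig eth0.2 10.0.2.5 netmask 255.255.255.0 up
ifconfig eth0.3 207.142.131.230 netmask 255.255.255.192 up
```

The switch port the host connects to must be configured as an 802.1Q trunk carrying both VLANs.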

This would permit attaching up to 46 servers of all types to a 48-port switch simultaneously, while preserving one port for the colo uplink (or an uplink to a core switch) and one port for a subsidiary switch serving devices such as the SCS and the power-strip controller.

One possible approach to VLAN grouping would be:

Group 1 - Management 
A network to which all machines have an interface, with RFC 1918 addressing; it carries link and backplane priority in the 802.1Q setup, so that DDoS attacks and the like can't make management impossible.
Group 2 - Internal 
This network connects the back side of all the Apaches to the DBMS machines, and also has RFC 1918 addresses.
Group 3 - Public 
This network is where the front sides of the Squids live, in public address space provided by the colo. Things like our mail server also live here, and possibly some firewalling or a centralized admin server into which everyone who wants to talk to the management network must ssh, for central logging purposes.
Group 4-n - Block 
The Block networks would connect the back sides of the squids to the fronts of their related apaches.

The major advantages of this level of grouping granularity are that it would be easier to see internal traffic statistics, and easier to control things when necessary.
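On the switch side, a port facing a multi-VLAN host would be configured as an 802.1Q trunk; a Cisco IOS-style sketch (port and VLAN numbers are hypothetical):

```shell
interface FastEthernet0/1
 description trunk to apache host
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport trunk allowed vlan 1-4
```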

Kate's proposed solution

(Kate notes on IRC that these comments do not reflect her current opinion as of Jan 05 -- Baylink) After some discussion:

I propose we consider the network in terms of blocks of (say) 46 apaches plus a DB slave, which can be put on one cheap switch, connected to a core managed switch, and managed as one unit. Thus, we will purchase:

  • 1 core switch: a Cisco 3750 48-port gigabit layer 2/3 managed switch. Cost: about $6000. (Alternatives: the 4948, more expensive, or the 2948, cheaper.)
    • The 3750 is stackable and can do 32 Gbit/s across up to 468 ports, but I don't think we will use it in this configuration.
  • For now, we can use the existing Netgear switches as access switches. In future we can either buy more, or use 2948s - TBD.

This way, we can do management/access control by treating each access switch as a single unit, and simplify network management without too much extra outlay; the initial investment in the core switch will serve us for a long time.

Proposed design

T.b.d.

-- Mark 15:46, 22 Oct 2004 (UTC)

A very old proposition by ashar on meta

Current proposal

http://noc.wikimedia.org/~kate/network-design2.png
