Projects
Revision as of 23:18, 26 February 2013
This page is meant to be used to coordinate projects within the Ops team. The focus of this page should be coordinating non-geographically based sprints. The ideal length of a project is 1-2 weeks. If the scope of a project grows beyond this, it will probably require more iterations; if not, a larger discussion within the ops team should occur.
Please feel free to add your projects. Please include at least some of the following: description/motivation/dependencies, spec, links to bugzilla/RT tickets, duration, start and end dates, interested parties. Also, this is not meant to replace, but supplement, bug trackers.
Volunteers are also welcome! Please feel free to contact other people working on these projects and help out!
DNS
- Team: faidon, mark
- Duration: 1-2 weeks
- Master RT ticket: TBD
- Dates: during the SF Hackathon, 2/2013
The project is to create the next generation of the DNS infrastructure. The goals of this are:
- To upgrade DNS servers to modern software (precise among others);
- To use MaxMind databases for GeoIP lookups instead of the outdated DNSBL list;
- To provide support for IPv6 GeoIP and hence be able to add AAAA records to our NS records;
- To support the draft edns-client-subnet extension and hence provide better geolocation to Google Public DNS and OpenDNS;
- To support more granular GeoIP than per-country, e.g. to be able to direct US traffic per coast or state;
- To support more failure scenarios instead of ${site}-down and hence to be able to scale to more datacenters;
- To move zones from Subversion to Git and to use Gerrit for handling changesets and enable normal ops review processes and contributions from volunteers;
- To support linting and provide more resiliency to the DNS infrastructure from e.g. typos.
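As a rough illustration of the linting goal above, the following sketch checks that A and AAAA records carry addresses of the right family. It assumes a simplified, hypothetical one-record-per-line format ("name type value"); the format and the `lint_records` name are made up for illustration and are not the actual zone format or tooling.

```python
import ipaddress


def lint_records(lines):
    """Return a list of (line_number, message) for malformed A/AAAA records.

    Assumes a simplified, hypothetical format: "<name> <type> <value>".
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.split(";", 1)[0].strip()  # drop ';' comments
        if not text:
            continue
        fields = text.split()
        if len(fields) != 3:
            problems.append((number, "expected 3 fields, got %d" % len(fields)))
            continue
        _name, rtype, value = fields
        rtype = rtype.upper()
        if rtype not in ("A", "AAAA"):
            continue  # other record types are not checked in this sketch
        try:
            address = ipaddress.ip_address(value)
        except ValueError:
            problems.append((number, "invalid address %r" % value))
            continue
        expected = 4 if rtype == "A" else 6
        if address.version != expected:
            problems.append(
                (number, "%s record holds an IPv%d address" % (rtype, address.version))
            )
    return problems
```

A check like this could run as a pre-merge step on proposed changesets, so that a typo in an address is caught in review rather than in production.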
Switch "text" to Varnish
- Team: TBD
- Duration: TBD
- Master RT ticket: TBD
- Dates: TBD
Text is one of the few services that haven't migrated to Varnish. The project is about collecting the missing bits for supporting that, such as support for (X-)Vary-Options and communicating to Platform the requirements from their side for this to happen. The project largely depends on hardware procurement and hence might be stalled from that side.
Basic monitoring & alerting
- Team: TBD
- Duration: TBD
- Master RT ticket: TBD
- Dates: TBD
The project is about adding more Nagios checks across the board. We currently average 4.4 checks per host while our goal should be closer to 50 checks per host. We lack checks for very fundamental problems (such as a full disk). A sprint to collect all those needs and then add checks in puppet across the infrastructure should happen, and possibly be repeated in the future.
The same applies to a lesser extent to Ganglia and the metrics that are being collected there.
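To make the "disk full" example concrete, here is a minimal sketch of such a check written as a Nagios-style plugin. It relies on the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function names and the 80%/90% thresholds are assumptions chosen for illustration, not existing checks or agreed values.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_percent, warn=80.0, crit=90.0):
    """Map a used-space percentage to an (exit_code, message) pair."""
    if used_percent >= crit:
        return CRITICAL, "DISK CRITICAL - %.1f%% used" % used_percent
    if used_percent >= warn:
        return WARNING, "DISK WARNING - %.1f%% used" % used_percent
    return OK, "DISK OK - %.1f%% used" % used_percent


def check_path(path="/", warn=80.0, crit=90.0):
    """Measure usage of the filesystem holding `path` and evaluate it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as error:
        return UNKNOWN, "DISK UNKNOWN - %s" % error
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    used_percent = 100.0 * usage.used / usage.total
    return evaluate_disk(used_percent, warn, crit)
```

A wrapper script would print the message returned by `check_path` and exit with the returned code, which is how a monitoring system would pick up the result.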
Scaling of monitoring infrastructure
- Team: TBD
- Duration: TBD
- Master RT ticket: TBD
- Dates: TBD
Nagios (and Icinga) currently complain about the maximum number of checks being reached. Additionally, we currently lack the infrastructure to do per-DC monitoring and to distinguish signal from noise when e.g. we lose an entire datacenter. The infrastructure will need to scale up, especially if we intend to add more checks (see above); possibilities include expanding the use of passive checks, using multiple Nagios boxes in a hierarchy, using check_mk or using something like mod-gearman.
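For the passive-check option, results are submitted to Nagios rather than polled by it, typically by writing a `PROCESS_SERVICE_CHECK_RESULT` line to the external command file. The sketch below only formats such a line; the host and service names are hypothetical, and the location of the command file is site-specific and deliberately left out.

```python
import time


def format_passive_result(host, service, return_code, output, timestamp=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK) to 3 (UNKNOWN)")
    if timestamp is None:
        timestamp = int(time.time())
    # The command is line-oriented, so keep the plugin output on one line.
    clean_output = output.replace("\n", " ")
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, clean_output)
```

Because the monitored host (or a per-datacenter aggregator) does the work and only reports the outcome, this approach moves load off the central scheduler.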
Logging infrastructure
- Team: TBD
- Duration: TBD
- Master RT ticket: TBD
- Dates: TBD
The project is about organizing the handling of misc logs (i.e. not HTTP access logs) across the Wikimedia infrastructure. Most logs currently go to user.log or some other centralized location, sometimes not being logrotated at all or being inconsistently logrotated. Puppetized rsyslog definitions based on process names or facilities should be provided that redirect services to well-known locations across the infrastructure; puppetized logrotate definitions per such file should also be provided, so as to have consistent retention of such logs.
A centralized rsyslog (for ops) should be installed to collect those and archive them. A blind log collector box for security purposes might also be a good idea.
Some fancier log collection tools (such as logstash/kibana, Graylog or a combination of the two) could be installed to provide easily searchable logs and log trends to facilitate monitoring and troubleshooting.
Performance monitoring
- Team: TBD
- Duration: TBD
- Master RT ticket: TBD
- Dates: TBD
The project is to expand statistics collection from the production site and to collect more performance metrics from various pieces of infrastructure (Varnish, Ceph, Swift etc.). Expand the deployment of Graphite and possibly accompany it with software like statsd. Integrate performance trend lines/forecasts with alerting.
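To give an idea of how lightweight statsd-style instrumentation is, the sketch below formats metrics in the plain-text statsd line protocol (`name:value|type`) and sends them over UDP. The metric names used in the usage note are made up, and the default host and port (127.0.0.1:8125, the conventional statsd port) are assumptions about a deployment that does not exist yet.

```python
import socket


def format_metric(name, value, metric_type):
    """Format one metric in the plain-text statsd line protocol."""
    if metric_type not in ("c", "ms", "g"):  # counter, timer, gauge
        raise ValueError("unsupported metric type: %r" % metric_type)
    return "%s:%s|%s" % (name, value, metric_type)


def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Send one metric to a statsd daemon over UDP (fire and forget)."""
    payload = format_metric(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

For example, `send_metric("swift.get.latency", 320, "ms")` would report a 320 ms timing. Since the transport is UDP, a slow or absent collector does not block the service being measured.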
Network-based security
- Team: TBD
- Duration: TBD
- Master RT ticket: TBD
- Dates: TBD
Add network-based security to parts of the infrastructure, including but not limited to: moving more servers to internal IPs, splitting/segmenting networks per role, isolating such networks from each other via router ACLs, and adding host-based firewalls to misc (and possibly other) hosts.
Expand use of AppArmor
- Team: TBD
- Duration: TBD
- Master RT ticket: TBD
- Dates: TBD
Create AppArmor profiles for all core components of the infrastructure, starting with image scalers, application servers and caching proxies.
Backup infrastructure
- Team: TBD
- Duration: TBD
- Master RT ticket: TBD
- Dates: TBD
Design a new-generation backup architecture and data retention plan, and recommend hardware or service procurement and a transition plan. (This may be a larger project than the scope intended for the projects on this page.)
Netflow collector
- Team: TBD
- Duration: TBD
- Master RT ticket: TBD
- Dates: TBD
Set up a NetFlow collector or two and point sampled NetFlow version 9 or IPFIX from all routers to those (or to a multicast group that the collectors will listen to). The goal would be to detect DoS or DDoS attacks more effectively, to get per-AS traffic statistics, and to help with peering & routing decisions.
pmacct is an excellent piece of software for this purpose, although the less complex nfdump could also be used.
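The per-AS statistics mentioned above boil down to a simple aggregation once flow records are available. The sketch below assumes flow records have already been decoded into dictionaries with hypothetical `src_as` and `bytes` keys (this is not the pmacct or nfdump output format), and scales sampled counts by the sampling rate to estimate real traffic.

```python
from collections import Counter


def bytes_per_as(flows, sampling_rate=1):
    """Sum estimated bytes per source AS from sampled flow records."""
    totals = Counter()
    for flow in flows:
        # With 1-in-N packet sampling, each sampled byte count stands
        # for roughly N times as much real traffic.
        totals[flow["src_as"]] += flow["bytes"] * sampling_rate
    return totals


def top_talkers(flows, sampling_rate=1, limit=3):
    """Return the `limit` source ASes with the most estimated traffic."""
    return bytes_per_as(flows, sampling_rate).most_common(limit)
```

A sudden change in the top talkers could serve as a first hint of a DoS, while longer-term totals per AS would feed into peering decisions.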