Event logging

From Wikitech
Revision as of 01:32, 28 February 2013 by SPage (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search
Event logging sequence diagram

This article describes the network architecture for event logging, or the recording of user interactions with web content for analytics purposes. An event is the anonymized record of a user action on the site, such as previewing an edit, or collapsing the "Toolbox" section in the sidebar. We capture and analyze event data in aggregate to better understand how readers and editors interact with our site, to identify usability problems, and to provide feedback for features engineers.

Contents

TL;DR

Data flows in one direction: front ⇒ back.

  1. JavaScript code in the browser captures a UI event as a key-value map.
  2. Browser calls mw.eventLog.logEvent() (provided by mw:Extension:EventLogging to log this event against a particular Schema (name and revision).
  3. See mw:Extension:EventLogging/Schemas for more detail of this client-side activity.
  4. EventLogging adds this as a request to http[s]://bits.wikimedia.org/event.gif?encoded JSON structure
  5. Varnish recognizes URL, returns empty response, and logs request.
  6. Each varnish host uses UDP to transmit to endpoint
    used to be:On each Bits, varnishlog filters for /event.gif? requests, piping them into zpub.
  7. Eventually (WMF Network diagram of UDP log flows goes here) log lines end up on machine "vanadium"
  8. vanadium writes events to local log files, and publishes them over ZeroMQ
  9. ZeroMQ subscribers process these events. Most importantly, subscribers store event log data for analysis:
    • One subscriber takes each event, flattens it, and logs it to a SQL table with a field for each element of the schema, e.g. AccountCreation_590433, automatically creating tables as required.
    • Another subscriber takes each event and loads it into MongoDB tables.

Also server-side events, logged by PHP code. The PHP function efLogServerSideEvent() writes these to a "log". The log endpoint is configured to be a UDP stream, and then the rest of the processing is similar to the above.

Client-side JavaScript

out of date, now implemented by EventLogging
Many user interactions do not result in page navigation, and thus cannot be gleaned from server request logs. To record these interactions, we rely on client-side JavaScript code; JavaScript event handlers represent each event of interest as a simple key-value map, and pass it to mw.eventLog.logEvent(). track() serializes each event into a percent-encoded query string, which it affixes to the base URL for event logging (bits.wikimedia.org/event.gif). It then creates an <img> element and sets the full URL as its source, triggering an asynchronous request.

We use this transport technique to transfer a key-value map that we validate against a schema, but the schema details are out of scope for this article

Varnish

The Varnish configuration for bits machines recognizes (and special-cases) requests for /event.gif. Rather than attempt to resolve the request to an object, Varnish bombs out with an HTTP 204 ("No Content") response, indicating the request was received and understood but that no message-body is forthcoming (see RFC 2616, section 10.2.5). This response is semantically correct, widely supported by browsers old and new, and minimizes bandwidth and latency.

The behavior is specified using VCL ("Varnish Configuration Language") and is managed by Puppet (see source link above):

if (req.url ~ "^/event\.gif") {
        error 204;
}

You can test this configuration easily using curl -i, or in-browser using HTTP Viewer:

$ curl -i http://bits.wikimedia.org/event.gif?foo=bar
HTTP/1.1 204 No Content
Server: Varnish
Accept-Ranges: bytes
Date: Sat, 18 Aug 2012 05:23:59 GMT
X-Varnish: 547318191
Age: 0
Via: 1.1 varnish
Connection: close

Note that the request requires no cache or back-end look-ups: it is handled entirely within Varnish, with microsecond performance. Moreover, Varnish does not parse, aggregate or transport event data, and it has no direct knowledge of the components that do these things. All that Varnish has to do is record the request in its log, which is its default behavior.

Loose coupling provides a measure of safety and security for Varnish, isolating it from failures occurring in other subsystems. Varnish will not block client requests while waiting for log data to be collected. When its log buffer is exhausted, Varnish simply rewinds its pointer and starts writing from the top. It is the responsibility of interested subscribers to keep up.

A consequence of this approach is that errors occurring further back in the event logging data pipeline cannot propagate back to the user. This is desirable, however, since event-logging is orthogonal to the user's immediate interests and experience. Event logging ought never interfere with or complicate the servers' primary function, which is to serve content.

The absence of forward dependencies (front-facing components needing to know what happens behind them) isolates failure and promotes modularity. It is possible to thoroughly alter how we aggregate or store event data without changing the configuration of Varnish. We are therefore free to develop these subsystems iteratively, in response to evolving needs, without compromising the availability of critical services.

VSM

Varnish assumes as little as possible about what log data you'll need and what you'll want to do with it. Rather than log to a file, Varnish logs to a shared memory segment. The mechanism and internal API for reading and writing from this memory segment is known as VSM ("Varnish Shared Memory"). The VSM log can be tailed using a suite of tools distributed with Varnish, the most useful of which are varnishlog, which simply streams the contents of the log buffer to standard output, and varnishncsa, which provides finer controls over the output format of Varnish's request log. Both varnishlog and varnishncsa can optionally filter for log records matching a particular pattern or type.

We can pass varnishncsa the following command-line arguments to stream tracked events:

  $ /usr/bin/varnishncsa -m RxURL:^/event\.gif -F '%q'

The format placeholder "%q" is rendered as the request's query string (or an empty line). Output is unbuffered.

The output of this exact invocation is piped to another tool, zpub, which makes log data available as a streaming network service using ZeroMQ. 2013-01: No, I think the bits servers just route the log data stream over UDP; then some server receiving the UDP re-publishes over ZeroMQ.

ZeroMQ

ZeroMQ is a free and open source networking library that lets you write brokerless message-oriented middleware using a socket-like API. It is performant, well-tested, and broadly used. It exposes common messaging patterns (like pub/sub, fanout, etc.) as different "types" of sockets. zpub is a small application that reads lines from stdin and streams them using a ZeroMQ publisher socket in a fan-out fashion to all connected peers.

A subscriber may identify itself to the publisher using a short identifier string called a socket identity. If the connection between a named subscriber and a publisher drops, the publisher will buffer messages for that subscriber until the subscriber re-connects or the number of messages in the buffer reaches the high-water mark (by default, 1,000 messages).

ZeroMQ's pub/sub pattern reproduces Varnish's message-oriented architecture on the network:

Varnish ZeroMQ
Publishers varnishd zpub, many others
Subscribers varnishlog, varnishncsa, etc. zsub, many others
Transport shared memory segment network link
I/O non-blocking; unidirectional
Filters tags, regexps string prefix
Max. queue SHM_WORKSPACE (default: 8 kB) ZMQ_HWM (default: 1 k messages)
Queue exhaustion Silently overwrite Silently drop

TL;DR: varnishd : varnishlog : : zpub : zsub

zpub/sub has been packaged for both Lucid and Precise and is available in the Wikimedia repository (apt-get install zpubsub).

zpub usage

Source code: zpubsub/zpub.c
$ zpub -h
zpub -- pipe lines from stdin to zmq pub socket
 
Usage: zpub [options]
  -p <port>        Server port (default: 8006)
  -i <interface>   Interface to bind (default: *)
  -m <count>       ZeroMQ high-water mark (default: 1000)
  -h               Show this message and exit
Admonition: zpub performance may increase if it is refactored to read the VSM log directly using the public API (see include/vapi/vsm.h), rather than rely on the output of varnishlog or varnishncsa. The cost of this optimization would be the forfeiture of a supported and production-hardened tool in favor of a custom implementation. We should do this only if this a bottleneck.

zsub usage

Source code: zpubsub/zsub.c
Used by a subscriber to subscribe to data distributed by a publisher.
$ zsub -h
zsub -- pipe zmq sub socket to stdout
 
Usage: zsub [options] host:port
  -h               Show this message and exit
  -f <filter>      Subscribe to filter
  -s <identity>    Set socket identity

Debugging and benchmarking

zpub/sub

The following script will stream a file in a loop using zpub, and listen to it using zsub. It uses pv (pipe viewer), which you may have to install.

#!/usr/bin/env bash
 
# zbench
#
# Benchmark throughput of ZeroMQ pub/sub over local TCP/IP
# socket.
#
# Usage: zbench [/dev/shm/inputfile]
# Reads input file in a loop; outputs LINES / second.
 
set -em
trap 'kill %% &>/dev/null' EXIT
 
[[ "$#" == 1 ]] || { echo >&2 "Usage: ${0} inputfile" ; exit 1 ; }
 
while :; do cat "$1" ; done | zpub &
    zsub localhost:8006 | pv -Walr -N "Messages/sec (cur, avg)" >/dev/null
wait

On vanadium, a modest machine, results have ranged from 70k to 115k messages/sec, depending on load. Keep in mind that in production, a machine is only ever responsible for one role (publisher or subscriber), not both.

Varnish

Benchmark

To measure the amount of time Varnish is taking to service event logging requests, run varnishlog with the following arguments and curl the URL:

$ varnishlog -I^/event\.gif -iReqEnd -iRxURL
   11 SessionOpen  c 75.101.52.39 57737 :80
   11 RxURL        c /event.gif
   11 ReqEnd       c 1886474086 1345278127.152711868 1345278127.152907133 0.000069857 0.000093460 0.000101805

Subtract the second number of the ReqEnd line from the third to get the total time (in decimal seconds) it took Varnish to service the request. (In this case, 195 microseconds.) For additional guidelines on interpreting the output of varnishlog, see the page Varnishlog tag explanation on the Varnish wiki.

Compile VCL

To see the C code a VCL file compiles to, run:

 $ varnishd -d -f bits.inc.vcl -C

Useful Stats

Statistics about the shared memory segment can be viewed using varnishstat:

$ varnishstat -1 -f shm*
shm_records             121368         0.67 SHM records
shm_writes              120391         0.67 SHM writes
shm_flushes                  0         0.00 SHM flushes due to overflow
shm_cont                     0         0.00 SHM MTX contention
shm_cycles                   0         0.00 SHM cycles through buffer

If the shm_flushes count is high, the size of the shared memory segment can be increased by upping the value of the shm_workspace run-time parameter, or by specifying it in the varnishd command-line, using -l. See section "Run-Time Parameters" in the varnishd documentation.

Further reading

Varnish

ZeroMQ

Personal tools
Namespaces

Variants
Actions
Navigation
Ops documentation
Wiki
Toolbox