Event logging

From Wikitech
Revision as of 20:34, 28 August 2012 by Ori.livneh

[Figure: Event tracking sequence diagram]

This article describes the network architecture for event tracking: the recording of user interactions with web content for analytics purposes. An event is an anonymized record of a user action on the site, such as previewing an edit or collapsing the "Toolbox" section in the sidebar. We capture and analyze event data in aggregate to better understand how readers and editors interact with the site, to identify usability problems, and to provide feedback to feature engineers.


TL;DR

Data flows in one direction: front ⇒ back.

  1. The browser captures a UI event as a key-value map.
  2. The browser requests http[s]://bits.wikimedia.org/event.gif?key=val&foo=bar.
  3. Varnish recognizes the URL, returns an empty response, and logs the request.
  4. On each bits server, varnishlog filters for /event.gif requests and pipes them into zpub.
  5. zpub uses ZeroMQ to publish event log data to subscribers on port 8006/tcp.
  6. Subscribers store event log data for analysis.

Client-Side JavaScript

Source code: lib/track.js
Many user interactions do not result in page navigation and thus cannot be gleaned from server request logs. To record these interactions, we rely on client-side JavaScript: event handlers represent each event of interest as a simple key-value map and pass it to mw.experiments.track(). track() serializes each event into a percent-encoded query string, which it affixes to the base URL for event tracking (bits.wikimedia.org/event.gif). It then creates an <img> element and sets the full URL as its source, triggering an asynchronous request.
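The serialization step can be sketched as follows. This is an illustrative Python model of what track() does, not the actual track.js implementation; the base URL is the one named above.

```python
# Sketch (in Python, for illustration; the real code lives in lib/track.js)
# of how an event map becomes a beacon URL: serialize the key-value pairs
# into a percent-encoded query string and affix it to the event.gif base URL.
from urllib.parse import urlencode

BASE_URL = "//bits.wikimedia.org/event.gif"

def beacon_url(event):
    """Serialize an event dict into a percent-encoded beacon URL."""
    return BASE_URL + "?" + urlencode(event)

print(beacon_url({"action": "preview", "page": "Main Page"}))
# //bits.wikimedia.org/event.gif?action=preview&page=Main+Page
```

Setting this URL as an <img> source (rather than using XMLHttpRequest) sidesteps same-origin restrictions and keeps the request fire-and-forget.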

Varnish

The Varnish configuration for bits machines recognizes (and special-cases) requests for /event.gif. Rather than attempt to resolve the request to an object, Varnish bombs out with an HTTP 204 ("No Content") response, indicating the request was received and understood but that no message-body is forthcoming (see RFC 2616, section 10.2.5). This response is semantically correct, widely supported by browsers old and new, and minimizes bandwidth and latency.

The behavior is specified using VCL ("Varnish Configuration Language") and is managed by Puppet (see source link above):

if (req.url ~ "^/event\.gif") {
        error 204;
}

You can test this configuration easily using curl -i, or in-browser using HTTP Viewer:

$ curl -i http://bits.wikimedia.org/event.gif?foo=bar
HTTP/1.1 204 No Content
Server: Varnish
Accept-Ranges: bytes
Date: Sat, 18 Aug 2012 05:23:59 GMT
X-Varnish: 547318191
Age: 0
Via: 1.1 varnish
Connection: close

Note that the request requires no cache or back-end look-ups: it is handled entirely within Varnish, with microsecond performance. Moreover, Varnish does not parse, aggregate or transport event data, and it has no direct knowledge of the components that do these things. All that Varnish has to do is record the request in its log, which is its default behavior.

Loose coupling provides a measure of safety and security for Varnish, isolating it from failures occurring in other subsystems. Varnish will not block client requests while waiting for log data to be collected. When its log buffer is exhausted, Varnish simply rewinds its pointer and starts writing from the top. It is the responsibility of interested subscribers to keep up.

A consequence of this approach is that errors occurring further back in the event tracking data pipeline cannot propagate back to the user. This is desirable, however, since event-tracking is orthogonal to the user's immediate interests and experience. Event tracking ought never interfere with or complicate the servers' primary function, which is to serve content.

The absence of forward dependencies (front-facing components needing to know what happens behind them) isolates failure and promotes modularity. It is possible to thoroughly alter how we aggregate or store event data without changing the configuration of Varnish. We are therefore free to develop these subsystems iteratively, in response to evolving needs, without compromising the availability of critical services.

VSM

Varnish assumes as little as possible about what log data you'll need and what you'll want to do with it. Rather than log to a file, Varnish logs to a shared memory segment. The mechanism and internal API for reading and writing from this memory segment is known as VSM ("Varnish Shared Memory"). The VSM log can be tailed using a suite of tools distributed with Varnish, the most useful of which are varnishlog, which simply streams the contents of the log buffer to standard output, and varnishncsa, which provides finer controls over the output format of Varnish's request log. Both varnishlog and varnishncsa can optionally filter for log records matching a particular pattern or type.

We can pass varnishncsa the following command-line arguments to stream tracked events:

  $ /usr/bin/varnishncsa -m RxURL:^/event\.gif -F '%q'

The format placeholder "%q" is rendered as the request's query string (or an empty line). Output is unbuffered.
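On the consuming end, each %q line can be parsed back into the original key-value event map. This is a stdlib Python sketch of what a downstream subscriber might do; the actual consumers are not specified here.

```python
# Sketch of a downstream consumer: parse one of varnishncsa's %q lines
# (a percent-encoded query string, possibly with a leading '?') back into
# the key-value event map the client originally serialized.
from urllib.parse import parse_qs

def parse_event(line):
    """Turn one query-string line into an event dict."""
    return {k: v[0] for k, v in parse_qs(line.lstrip("?")).items()}

print(parse_event("?action=preview&page=Main+Page"))
# {'action': 'preview', 'page': 'Main Page'}
```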

The output of this exact invocation is piped to another tool, zpub, which makes log data available as a streaming network service using ZeroMQ.

ZeroMQ

TL;DR: varnishd : varnishlog :: zpub : zsub

ZeroMQ is a free and open source networking library that lets you write brokerless message-oriented middleware using a socket-like API. It is performant, well-tested, and broadly used. It exposes common messaging patterns (like pub/sub, fanout, etc.) as different "types" of sockets. zpub is a small application that reads lines from stdin and streams them using a ZeroMQ publisher socket in a fan-out fashion to all connected peers.

A subscriber may identify itself to the publisher using a short identifier string called a socket identity. If the connection between a named subscriber and a publisher drops, the publisher will buffer messages for that subscriber until the subscriber re-connects or the number of messages in the buffer reaches the high-water mark (by default, 1,000 messages).
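The high-water-mark behavior described above can be illustrated with a stdlib sketch (this is a simulation of the semantics, not real ZeroMQ): messages for a disconnected named subscriber accumulate in a per-peer buffer until the HWM is reached, after which new messages are silently dropped.

```python
# Simulation of ZMQ_HWM semantics for one named subscriber (illustrative
# only; real ZeroMQ manages these buffers inside the library).
from collections import deque

class SubscriberBuffer:
    def __init__(self, hwm=1000):       # 1000 matches ZeroMQ's default HWM
        self.hwm = hwm
        self.queue = deque()
        self.dropped = 0

    def publish(self, msg):
        if len(self.queue) < self.hwm:
            self.queue.append(msg)      # buffered for the absent subscriber
        else:
            self.dropped += 1           # HWM reached: silently drop

buf = SubscriberBuffer(hwm=3)
for i in range(5):
    buf.publish(i)
print(list(buf.queue), buf.dropped)
# [0, 1, 2] 2
```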

ZeroMQ's pub/sub pattern reproduces Varnish's message-oriented architecture on the network:

                       Varnish                       ZeroMQ
Publishers             varnishd                      zpub (and many others)
Subscribers            varnishlog, varnishncsa       zsub (and many others)
Transport              shared memory segment         network link
Filters                tags, regexps                 string prefix
Max buffer             shm_workspace (default: 8kb)  ZMQ_HWM (default: 1k messages)
On buffer exhaustion   silently overwrite            silently drop

zpub/sub has been packaged for both Lucid and Precise and is available in the Wikimedia repository (apt-get install zpubsub).

zpub usage

Source code: zpubsub/zpub.c
Note: zpub's performance might improve if it were refactored to read the VSM log directly using the public API (see include/vapi/vsm.h), rather than relying on the output of varnishlog or varnishncsa. The cost of this optimization would be forfeiting a supported, production-hardened tool in favor of a custom implementation. We should do this only if it becomes a bottleneck.

$ zpub -h
zpub -- pipe lines from stdin to zmq pub socket
 
Usage: zpub [options]
  -p <port>        Server port (default: 8006)
  -i <interface>   Interface to bind (default: *)
  -m <count>       ZeroMQ high-water mark (default: 1000)
  -h               Show this message and exit

zsub usage

Source code: zpubsub/zsub.c
Used to subscribe to data distributed by a publisher.

$ zsub -h
zsub -- pipe zmq sub socket to stdout
 
Usage: zsub [options] host:port
  -h               Show this message and exit
  -f <filter>      Subscribe to filter
  -s <identity>    Set socket identity

Debugging and Benchmarking

zpub/sub

The following script will stream a file in a loop using zpub, and listen to it using zsub. It uses pv (pipe viewer), which you may have to install.

#!/usr/bin/env bash
 
# zbench
#
# Benchmark throughput of ZeroMQ pub/sub over local TCP/IP
# socket.
#
# Usage: zbench [/dev/shm/inputfile]
# Reads input file in a loop; outputs LINES / second.
 
set -em
trap 'kill %% &>/dev/null' EXIT
 
[[ "$#" == 1 ]] || { echo >&2 "Usage: ${0} inputfile" ; exit 1 ; }
 
while :; do cat "$1" ; done | zpub &
zsub localhost:8006 | pv -Walr -N "Messages/sec (cur, avg)" >/dev/null
wait

On vanadium, a modest machine, results have ranged from 70k to 115k messages/sec, depending on load. Keep in mind that in production, a machine is only ever responsible for one role (publisher or subscriber), not both.

Varnish

Benchmark

To measure the amount of time Varnish is taking to service event tracking requests, run varnishlog with the following arguments and curl the URL:

$ varnishlog -I^/event\.gif -iReqEnd -iRxURL
   11 SessionOpen  c 75.101.52.39 57737 :80
   11 RxURL        c /event.gif
   11 ReqEnd       c 1886474086 1345278127.152711868 1345278127.152907133 0.000069857 0.000093460 0.000101805

Subtract the second number of the ReqEnd line from the third to get the total time (in decimal seconds) it took Varnish to service the request. (In this case, 195 microseconds.) For additional guidelines on interpreting the output of varnishlog, see the page Varnishlog tag explanation on the Varnish wiki.
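The arithmetic above, worked through with the ReqEnd line from the transcript (Decimal avoids float rounding error on the long timestamps):

```python
# Second and third fields of the ReqEnd line: request start and request end,
# as Unix timestamps. Their difference is the time Varnish took to service
# the request.
from decimal import Decimal

start = Decimal("1345278127.152711868")  # second field of ReqEnd
end   = Decimal("1345278127.152907133")  # third field of ReqEnd
service_time = end - start
print(service_time)
# 0.000195265 seconds, i.e. ~195 microseconds
```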

Compile VCL

To see the C code a VCL file compiles to, run:

 $ varnishd -d -f bits.inc.vcl -C

Useful Stats

Statistics about the shared memory segment can be viewed using varnishstat:

$ varnishstat -1 -f shm*
shm_records             121368         0.67 SHM records
shm_writes              120391         0.67 SHM writes
shm_flushes                  0         0.00 SHM flushes due to overflow
shm_cont                     0         0.00 SHM MTX contention
shm_cycles                   0         0.00 SHM cycles through buffer

If the shm_flushes count is high, the size of the shared memory segment can be increased by upping the value of the shm_workspace run-time parameter, or by specifying it in the varnishd command-line, using -l. See section "Run-Time Parameters" in the varnishd documentation.
