Search

From Wikitech
Revision as of 04:12, 5 November 2008 by Grawp


Note: this page is about Wikimedia's Lucene implementation, not Lucene in general.


LuceneSearch is a search extension for MediaWiki based on the Apache Lucene search engine. This page gives an overview of the extension, how it is set up in the Wikimedia cluster, and some details about the Lucene search engine itself. The source code is in SVN at lucene-search-2.

It runs in three parts:

MediaWiki Lucene Extension 
implements Lucene-backed search in MediaWiki, overriding the default MySQL-based search.
LSearch daemon 
a daemon acting either as a searcher (listening on port 8123 and handling search queries) or as an indexer (via Java RMI).
Indexing tools 
the incremental updater, index builder, and article rank builder (all working from XML dumps), plus a MySQL storage backend (for storing article ranks).

For load balancing we use LVS running on diderot, controlled by PyBal. The VIPs are 10.0.5.9 for enwiki (pool 1), 10.0.5.10 for pool 2 (frwiki/dewiki/jawiki), and 10.0.5.11 for pool 3 (eswiki/itwiki/ptwiki/jawiki/plwiki/nlwiki/ruwiki/svwiki/zhwiki). The smaller wikis are distributed by destination hash by code in /h/w/c/p/lucene.php.
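The destination-hash idea can be sketched roughly as follows. The function name and the four-pool modulus here are assumptions for illustration only; the real logic lives in /h/w/c/p/lucene.php.

```shell
# Sketch only: pick a search pool for a small wiki by hashing its
# database name (pool_for_db and the 4-pool layout are assumptions,
# not the actual lucene.php code).
pool_for_db() {
    # cksum gives a stable CRC of the name; reduce it modulo the
    # number of hash-distributed pools
    n=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo "pool$(( n % 4 + 1 ))"
}

pool_for_db aawiki
```

The point of hashing on the database name is that every request for the same wiki lands on the same pool, so that pool's hosts keep that wiki's index warm.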

LSearch daemon

The daemon is written in Java 5, and uses Java RMI for communication.

To start the daemon, use the init.d script:

 /etc/init.d/lsearchd start

A PID file is kept at /var/run/lsearchd.pid.

The daemon appends its logs to /usr/local/search/log/log.

Once launched, the search daemon listens on port 8123. You can check whether it is up by doing a query with curl:

$ curl http://localhost:8123/search/enwiki/test

You should get back search results.

Installation

Use the scripts in /home/rainman/scripts. To install search group X (1-4), use:

 search-install-groupX 

or to install remotely on $hostname:

 search-install-host $hostname

Installation cleans up any previous installation; if you just want to update, use the search-update-groupX and search-update-host scripts instead.

For the LVS-balanced pools, set up arp_ignore/arp_announce and add the VIP as described on LVS.
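For reference, LVS-DR real servers typically get sysctl settings along these lines; this is a sketch, and the authoritative values are on the LVS page:

```
# Typical LVS-DR real-server ARP settings: do not answer ARP for the
# VIP bound on lo, and announce only the real interface address.
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.lo.arp_ignore = 1
net.ipv4.conf.lo.arp_announce = 2
```

The VIP itself (e.g. 10.0.5.9 for a pool 1 host) is then added on the loopback interface, so the host accepts traffic for it without advertising it via ARP.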

Compiling lsearchd

The needed libraries should already be in the libs dir in SVN, except for mwdumper.jar.

mwdumper.jar can be built by checking out mwdumper from SVN and compiling it with ant (requires a Java VM and Ant).

Note: Xerces-J 2.9.0 has a buggy UTF-8 reader, and some imports may fail because of it.

The latest version of the lsearch daemon is in /home/rainman/lucene-search-2.x.tar.gz, where x is the latest minor version. This package contains everything needed for installation, including a bug-fixed Xerces.

Configuration

There is a shared configuration file, /home/wikipedia/conf/lucene/lsearch-global.conf, that describes the roles assigned to hosts in the search cluster. This lets daemons communicate with each other to obtain the latest index versions, forward requests if necessary, search across many hosts if the index is split, and so on.

The per-host local configuration file is /etc/lsearch.conf. Most importantly, it defines SearcherPool.size, which should be set to the local number of CPUs + 1 when only one index is searched; this prevents CPUs from locking each other out. The other important property is Search.updatedelay, which staggers searchers' updates of their working copies of the index so they do not all update at the same time and cause noticeable performance degradation.
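As a sketch, the relevant lines of a local /etc/lsearch.conf on a 4-CPU host searching a single index might look like this; only the property names come from the text above, and the values are illustrative:

```
# Searcher pool size: local CPUs + 1 (a 4-CPU host in this example)
SearcherPool.size=5
# Delay before a searcher updates its working copy of the index, so
# that hosts do not all update at once (illustrative value)
Search.updatedelay=300
```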


Initial index build

First, set up a MySQL database to hold the article rank data. Be sure to use a UTF-8 database. Put the database name, user, and password into the local lsearch.conf.
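A minimal sketch of the database setup; the database name lsearch_ranks is an assumption, so use whatever name you then put into lsearch.conf:

```shell
# The statement to run against MySQL; lsearch_ranks is a made-up name.
SQL='CREATE DATABASE lsearch_ranks CHARACTER SET utf8;'
printf '%s\n' "$SQL"
# Then execute it with your admin credentials, e.g.:
#   mysql -u root -p -e "$SQL"
```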

The initial versions of the index are built from xml dumps. You can use the script:

 /home/rainman/scripts-old/search-import-db $dbname

to build an index for a given db. This will also calculate and store article ranks.

Incremental indexing

After all the indexes are built, you can start the indexer on srv56, along with the incremental updaters, via:

 /home/rainman/scripts-old/search-restart-indexer

This starts the lsearch daemon as an indexer, plus two incremental updaters. The updaters use Special:OAIRepository to obtain the latest articles and send them via RMI to the indexer.

Note: please run the above script on srv56 as user rainman so I can later restart the process myself.

For incremental indexing to work, you need to define an OAI user and put the credentials into the local lsearch.conf (currently /usr/local/search/lsearch.conf).

Restricted-access wikis

We set up a daily cron job, search-import-private-cron, on srv56 to rebuild the private wikis' indexes from scratch.

Index Snapshots

To make index snapshots for searchers to pick up, put search-snapshot into the srv56 crontab. It is currently in user rainman's crontab, triggered each day at 04:00.
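An illustrative crontab entry for this; the script path is an assumption, so adjust it to wherever search-snapshot actually lives:

```
# m h dom mon dow command
0 4 * * * /home/rainman/scripts/search-snapshot
```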

Article rank rebuild

Todo: search-rebuild-ranks.py should be put into a crontab. It checks for newly finished db dumps (on download.wikimedia.org) and recalculates article ranks.

LuceneSearch Extension

For the Wikimedia cluster, the extension is configured via /h/w/c/p/lucene.php. This is where you can add or remove Lucene hosts (the $wgLuceneHost array) or change the search update host ($mwSearchUpdateHost).

Stopping

To disable Lucene and fall back to MediaWiki's built-in search, set $wgUseLuceneSearch = false in CommonSettings.php.

If Lucene search is enabled but the daemon is not running, a Google fallback search form is displayed. To kill the daemon, log into the server and run service lsearchd stop.

Restarting

To restart lsearchd on a local box, use:

 service lsearchd restart

To restart all the searcher hosts and the indexer, use:

 /home/rainman/scripts-old/search-restart-all

Refreshing

When new wikis are added, the indexer and group 4 of the searchers need to be restarted so that they become aware of the new wikis. Use:

 /home/rainman/scripts-old/search-refresh

Todo: maybe put this into a weekly crontab.
