Search
Revision as of 04:12, 5 November 2008
Note: this page is about Wikimedia's Lucene implementation, not Lucene in general.
LuceneSearch is a search extension for MediaWiki based on the Apache Lucene search engine. This page gives an overview of the extension, how it is set up in the Wikimedia cluster, and some details about the Lucene search engine itself. You can find the source code in SVN at lucene-search-2.
It runs in three parts:
- MediaWiki Lucene Extension: implements Lucene search in MediaWiki and overrides the default (MySQL-based) search.
- LSearch daemon: acts either as a searcher (listening on port 8123 and handling search queries) or as an indexer (via Java RMI).
- Indexing tools: the incremental updater, index builder and article rank builder (working from XML dumps), and a MySQL storage backend (for storing article ranks).
For load balancing we use LVS running on diderot, controlled by pybal. The VIPs are 10.0.5.9 for pool 1 (enwiki), 10.0.5.10 for pool 2 (frwiki/dewiki/jawiki) and 10.0.5.11 for pool 3 (eswiki/itwiki/ptwiki/jawiki/plwiki/nlwiki/ruwiki/svwiki/zhwiki). The smaller wikis are distributed by destination hash using some code in /h/w/c/p/lucene.php.
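The destination-hash idea for the smaller wikis can be sketched in shell. This is purely illustrative: the real selection logic lives in /h/w/c/p/lucene.php, and the host names and hash function below are made up.

```shell
# Illustrative sketch of destination-hash host selection for small wikis.
# The real logic is in /h/w/c/p/lucene.php; these host names are hypothetical.
pick_host() {
    dbname="$1"
    set -- searchA searchB searchC      # hypothetical searcher hosts
    # Hash the db name to a number (cksum is stable across runs and machines)
    hash=$(printf '%s' "$dbname" | cksum | cut -d' ' -f1)
    shift $(( hash % $# ))              # pick a host by hash modulo pool size
    echo "$1"
}
pick_host abwiki
pick_host zuwiki
```

The useful property is that a given wiki always maps to the same searcher host, so (presumably) each host only needs to hold the indexes for the wikis hashed to it.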
LSearch daemon
The daemon is written in Java 5, and uses Java RMI for communication.
To start the daemon, use the init.d script:
/etc/init.d/lsearchd start
A pid file is kept at /var/run/lsearchd.pid.
The daemon appends its logs to /usr/local/search/log/log.
Once launched, the search daemon listens on port 8123. You can check whether it's up by doing a query with curl:
$ curl http://localhost:8123/search/enwiki/test
You should get back search results.
Installation
Use scripts in /home/rainman/scripts. To install search group X (1-4), use:
search-install-groupX
or to install remotely on $hostname:
search-install-host $hostname
Installation cleans up any previous installation; to update an existing one instead, use the search-update-groupX and search-update-host scripts.
For the LVS-balanced pools, set up arp_ignore/arp_announce and add the VIP as described on LVS.
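For reference, a typical LVS-DR real-server setup is sketched below. Treat it as an assumption-laden sketch (the VIP shown is pool 1's, and the exact steps for these hosts may differ); the LVS page is authoritative.

```shell
# Sketch only: typical LVS-DR real-server settings (run as root).
# The VIP 10.0.5.9 is pool 1 (enwiki); substitute the VIP for your pool.
sysctl -w net.ipv4.conf.all.arp_ignore=1     # only answer ARP for addresses on the receiving interface
sysctl -w net.ipv4.conf.all.arp_announce=2   # use the best local source address in ARP requests
# Add the VIP on loopback so the host accepts traffic for it
# without answering ARP for it on the LAN:
ip addr add 10.0.5.9/32 dev lo
```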
Compiling lsearchd
The needed libraries should be in the libs dir in SVN already, except for mwdumper.jar.
mwdumper.jar can be built by checking out mwdumper from SVN and compiling it with 'ant' (requires a Java VM and Ant).
Note: Xerces-J 2.9.0 has a buggy UTF-8 reader, and some imports might fail because of it.
The latest version of lsearch daemon is in /home/rainman/lucene-search-2.x.tar.gz where x is the latest minor version. This package contains everything needed for installation, and a bug-fixed Xerces.
Configuration
There is a shared configuration file, /home/wikipedia/conf/lucene/lsearch-global.conf, that describes the roles hosts are assigned in the search cluster. This lets daemons communicate with each other to obtain the latest index versions, forward requests if necessary, search across many hosts if the index is split, etc.
The per-host local configuration file is /etc/lsearch.conf. Most importantly it defines SearcherPool.size, which should be set to the local number of CPUs + 1 if only one index is searched; this prevents CPUs from locking each other out. The other important property is Search.updatedelay, which staggers searchers' updates of their working copies of the index so they do not all update at once and cause noticeable performance degradation.
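As an illustration, a minimal /etc/lsearch.conf fragment along these lines might look like the following. The two property names are those mentioned above; the values are invented examples, not the production settings.

```ini
# Example values only, not the production settings.
# A host with 4 CPUs searching a single index: pool size = CPUs + 1
SearcherPool.size=5
# Stagger index-update checks (units assumed to be seconds) so searchers
# do not all refresh their working copy of the index at the same time
Search.updatedelay=30
```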
Initial index build
First, set up a MySQL database to hold the article rank data; be sure to use a UTF-8 database. Put the database name, user and password into the local lsearch.conf.
The initial versions of the index are built from the XML dumps. You can use the script:
/home/rainman/scripts-old/search-import-db $dbname
to make an index for a certain db. This will also calculate and store article ranks.
Incremental indexing
After all the indexes are built, you can start the indexer on srv56, alongside the incremental updaters, via:
/home/rainman/scripts-old/search-restart-indexer
This starts the lsearch daemon as an indexer, plus two incremental updaters. The updaters use Special:OAIRepository to obtain the latest articles and send them via RMI to the indexer.
Note: please run the above script on srv56 as user rainman so I can later restart the process myself.
For incremental indexing to work, you need to define an OAI user and put the credentials into the local lsearch.conf (currently /usr/local/search/lsearch.conf).
Restricted-access wikis
We set up a daily cronjob, search-import-private-cron, on srv56 to rebuild the indexes for the private wikis from scratch.
Index Snapshots
To make index snapshots for searchers to pick up, put search-snapshot into the srv56 crontab. It is currently in user rainman's crontab, triggered each day at 4am.
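For illustration, the crontab entry would look something like this; the script path is an assumption based on the other script locations on this page.

```
# Hypothetical crontab entry (user rainman on srv56); the path is assumed.
0 4 * * * /home/rainman/scripts-old/search-snapshot
```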
Article rank rebuild
Todo: search-rebuild-ranks.py should be put into a crontab. It checks for newly finished db dumps (on download.wikimedia.org) and recalculates article ranks.
LuceneSearch Extension
For the Wikimedia cluster, the extension is configured via /h/w/c/p/lucene.php. This is where you can add or remove Lucene hosts (the $wgLuceneHost array) or change the search update host ($mwSearchUpdateHost).
Stopping
To disable lucene and fall back to MediaWiki's search, set $wgUseLuceneSearch = false in CommonSettings.php.
If Lucene search is on but the daemon is not running, the Google fallback search form is displayed. To kill the daemon, go to the server and run service lsearchd stop.
Restarting
To restart lsearchd on a local box, use:
service lsearchd restart
To restart all the searcher hosts and the indexer, use:
/home/rainman/scripts-old/search-restart-all
Refreshing
When new wikis are added, the indexer and search group 4 need to be restarted so that they become aware of the new wiki. Use:
/home/rainman/scripts-old/search-refresh
Todo: maybe put this into a weekly crontab.