Revision as of 22:59, 15 February 2012
Note: this page is about Wikimedia's Lucene implementation, not Lucene generally.
Usage
lucene-search is a search extension for MediaWiki based on the Apache Lucene search engine. This page describes the extension, how it is set up in the Wikimedia cluster, and some details of the Lucene search engine itself.
Overview
Software
The system has two major software components, Extension:MWSearch and lsearchd.
Extension:MWSearch
Extension:MWSearch is a MediaWiki extension that overrides the default search backend and sends requests to lsearchd.
lsearchd
lsearchd (Extension:Lucene-search) is a versatile Java daemon that can act as a frontend, backend, searcher, indexer, highlighter, spellchecker, and more. We use it for searching, highlighting, and spell-checking, and as an incremental indexer.
Essentials
- configured by /home/wikipedia/conf/lucene/lsearch-global-2.1.conf and /etc/lsearch.conf
- started via /etc/init.d/lsearchd
- search frontend on port 8123, index frontend on port 8321; backend communication over RMI (RMI registry on port 1099)
- logs in /a/search/logs
- indexes in /a/search/indexes
- jar in /a/search/lucene-search
- test with curl http://localhost:8123/search/enwiki/test
Installation
Scripts in /home/rainman/build:
- build - run on searchidx2 to build jar from sources
- deploy - run on target host to make directory structure, deploy jar, copy config template and start lsearchd
- deploy-jar - run on target host to only update jar, start lsearchd
Configuration
There is a shared configuration file, /home/wikipedia/conf/lucene/lsearch-global-2.1.conf, that describes the roles assigned to each host in the search cluster. This lets lsearchd daemons communicate with each other to obtain the latest index versions, forward requests if necessary, search over many hosts if the index is split, etc.
The per-host local configuration file is /etc/lsearch.conf. Most importantly it defines SearcherPool.size, which should be set to the local number of CPUs + 1 if only one index is searched; this prevents CPUs from locking each other out. The other important property is Search.updatedelay, which staggers index updates so that all searchers don't try to update their working copies of the index at the same time and cause noticeable performance degradation.
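As a hedged illustration only (the two property names come from this page, but the values, units, and exact file syntax here are assumptions, e.g. for an 8-CPU host serving a single index):

```
# /etc/lsearch.conf (fragment; values illustrative, not the production ones)
# One searcher per CPU plus one, when only a single index is searched
SearcherPool.size=9
# Stagger working-copy index updates so hosts don't all fetch at once
# (units assumed to be seconds)
Search.updatedelay=300
```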
Search Cluster: Shards, Pools, and Load Balancing Oh My!
We shard data across a mixture of individual hosts and multi-host pools in the search cluster. Where multi-host pools are employed we use pybal/LVS load balancing (running on lvs3) or in-code load balancing. As of February 2012 we have the following cluster configuration:
- Big RAM pool 1, via LVS search-pool1.svc.pmtpa.wmnet ('enwiki')
search1.pmtpa.wmnet
search3.pmtpa.wmnet
search4.pmtpa.wmnet
search9.pmtpa.wmnet
- Big RAM pool 2, via LVS search-pool2.svc.pmtpa.wmnet ('dewiki', 'frwiki', 'jawiki')
search6.pmtpa.wmnet
- Pool 3, via LVS search-pool3.svc.pmtpa.wmnet ('itwiki', 'ptwiki', 'plwiki', 'nlwiki', 'ruwiki', 'svwiki', 'zhwiki')
search7.pmtpa.wmnet
- eswiki pool, single host ('eswiki')
search14.pmtpa.wmnet
- Others pool, split by db hash in PHP (everything else)
search11.pmtpa.wmnet
search12.pmtpa.wmnet (currently commented out?)
This section has been derived from the following configuration:
- /home/wikipedia/conf/lucene/lsearch-global-2.1.conf
- /home/w/conf/pybal/*/search_pool
- http://noc.wikimedia.org/conf/highlight.php?file=lucene.php
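The "Others" pool above picks a backend by hashing the wiki database name in PHP. The real selection logic lives in lucene.php and is not reproduced here; the following is only a hypothetical shell sketch of the general idea (the hash function, the modulo scheme, and the use of md5 are all assumptions, not the actual implementation):

```shell
# Hypothetical sketch of hash-based host selection for the "Others" pool.
# The real logic is in the PHP config (lucene.php) and may differ.
pick_search_host() {
    wikidb="$1"
    hosts="search11.pmtpa.wmnet search12.pmtpa.wmnet"
    n=$(echo "$hosts" | wc -w)
    # Derive a small integer from the db name: first two hex digits of its md5
    h=$(printf '%s' "$wikidb" | md5sum | cut -c1-2)
    idx=$((0x$h % n))
    # Select the idx-th host from the list
    set -- $hosts
    shift "$idx"
    echo "$1"
}

pick_search_host "wikimania2012wiki"
```

The point of hashing is that the mapping is deterministic: the same database name always resolves to the same host, so no shared state or external balancer is needed.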
Indexing
- searchidx2 serves as the indexer for the cluster
- searchidx2's lsearchd daemon is configured to act as indexer
- other indexing jobs, such as indexing private wikis and spell-check rebuilds, are in rainman's crontab on searchidx2
- searchidx2 runs rsyncd to allow cluster members to fetch indexes
- other cluster hosts fetch indexes by rsync every (what is the interval?)
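The fetch presumably amounts to an rsync pull of roughly this shape (hypothetical: the rsync module name and flags are assumptions, and whether the pull is driven by cron or by lsearchd itself is not stated on this page; only the indexer host and the /a/search/indexes path come from this page):

```
rsync -a rsync://searchidx2.pmtpa.wmnet/search/indexes/ /a/search/indexes/
```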
Administration
Dependencies
[content needed]
Health/Activity Monitoring
[content needed]
Software Updates
The following script will build the latest version of lucene-search and deploy it to all searchers:
/home/rainman/salsa
(sync-all-lucene-search)
Stopping and fall back to MediaWiki's search
To disable lucene and fall back to MediaWiki's search, set $wgUseLuceneSearch = false in CommonSettings.php.
Main indexer on searchidx2 is stuck
Run this script as user rainman (so that he can restart it later if necessary):
root@searchidx2:~# sudo -u rainman /home/rainman/scripts/search-restart-indexer
Adding new wikis
When a new wiki is created, an initial index build needs to be made. First restart the indexer on searchidx2 to make sure it knows about the new wiki, then run the build-new script on the appropriate wiki database name (i.e. replace wikidb below with the wiki's database name, e.g. wikimania2012wiki).
Run on searchidx2 as user rainman:
root@searchidx2:~# sudo -u rainman nohup /home/rainman/scripts/search-restart-indexer
root@searchidx2:~# sudo -u rainman /home/rainman/scripts/build-new wikidb
Space issues
Probably deprecated because searchidx1 is no longer in use.
The primary indexer searchidx1 has been low on space. To help it limp along you can do:
cd /a/search/log/
rm -rf log-all log-prefix
cd /a/search/indexes/import/
rm -rf *.spell
rm -rf *.prefix (citation needed; this ideally would be done after the file has been processed by the updater...)
cd /a/search/indexes/snapshot/
rm -rf *.spell
rm -rf *.prefix (citation needed; this ideally would be done after the file has been processed by the updater...)
su - rainman
nohup /home/rainman/scripts/search-restart-indexer &
The last restart makes the search indexer write to new log files so the old unlinked ones are actually removed. We do it as rainman so he can shoot and restart the process later if needed.
New install of searchidx1
To set up a new indexer we need:
- /etc/rsyncd.conf, /etc/lsearch.conf, and /etc/default/rsync will all be pushed out via puppet.
- the contents of /a/search copied over. This includes a local copy of Sun Java; Sun Java is required (according to rainman) because OpenJDK can corrupt indexes. Sun Java may need to be recompiled.
- /home, /mnt/thumbs, and /mnt/upload6 mounted and added to /etc/fstab. All the scripts run out of rainman's home dir, so he should have an account too.
- (probably no longer the case) /home/wikipedia/conf/lucene/lsearch-global-2.1.conf may need updating
- steal the crontab for rainman.
- ( probably no longer the case due to /home/ all being on nfs) copy over (keeping perms etc) the stuff in /home/ariel/searchidx on fenari. The bad thing about this stuff is that the most recent of any two subdirs in import/* or snapshot/* are really supposed to be hardlinks to the same dir in /a/search/indexes/index, so that needs to be fixed up after the copy. Before going through this see if we can get by without that, ask rainman what he thinks. We need the files no matter what though.
Once the service is up and running the lsearchds on the other search hosts will all have to be restarted. (They run from the script in /home so they will automagically be aware of the new host.)
Rainman has said he'll likely be around for the install.