User:Ram/Search

From Wikitech

NOTE: Areas that I'm still unclear about are marked by a bracketed comment: [?Add ...]

Overview of MediaWiki search

  • The Apache Lucene project provides the search capability in MediaWiki. The Lucene daemon, a Java program, runs with identical configuration on a cluster of 25 machines at our data center in Ashburn, Virginia (a.k.a. eqiad); a similar cluster at the data center in Tampa, Florida (a.k.a. pmtpa) serves as a hot failover standby.
  • Each search server listens on (configurable) port 8123 for search queries.
  • Clustering uses LVS (Linux Virtual Server); further details about that tool at: http://www.linuxvirtualserver.org/whatis.html

[? Add details on how clustering works in our case.]

  • Each server is configured automatically using Puppet; the Puppet code can be cloned from (replace xyz with your user name):
   ssh://xyz@gerrit.wikimedia.org:29418/operations/puppet.git

The config files are under templates/lucene; similarly, LVS clustering is configured via Puppet using files under templates/lvs. [?Add details on how Puppet uses these files.]

  • The status of the various servers can be seen at: http://ganglia.wikimedia.org/latest/. From the Choose source dropdown, select Search eqiad. Click on the Physical View button at top right to see details like the amount of RAM, number of cores, etc.
  • The MediaWiki extension MWSearch (PHP code) receives search queries and routes them to a search server [?Add details on how this is done.]
  • The file operations/mediawiki-config/wmf-config/lucene.php defines a number of globals to configure search; these include the port number, LVS cluster IP addresses, timeout, cache-expiry, etc. The main search file extensions/MWSearch/MWSearch.php is also require'd here. NOTE: The timeout is defined as 10s: $wgLuceneSearchTimeout = 10; which may be too small when servers are busy.

Search details (PHP)

Some important classes and the files that define them:
Class                   File
LuceneSearch            extensions/MWSearch/MWSearch_body.php
LuceneSearchResult      extensions/MWSearch/MWSearch_body.php
LuceneSearchSet         extensions/MWSearch/MWSearch_body.php
ApiQuerySearch          core/includes/api/ApiQuerySearch.php
ApiQueryGeneratorBase   core/includes/api/ApiQueryBase.php
ApiQueryBase            core/includes/api/ApiQueryBase.php
ApiBase                 core/includes/api/ApiBase.php
ContextSource           core/includes/context/ContextSource.php
IContextSource          core/includes/context/IContextSource.ph
Http                    core/includes/HttpFunctions.php
MWHttpRequest           core/includes/HttpFunctions.php
CurlHttpRequest         core/includes/HttpFunctions.php

ApiQuerySearch seems to be the main class handling search requests. Its inheritance hierarchy looks like this: ApiQuerySearch → ApiQueryGeneratorBase → ApiQueryBase → ApiBase → ContextSource/IContextSource.

ApiQuerySearch.run() starts query processing [? Where is this called from ?] and does the following:

  • Invokes ApiBase.extractRequestParams() to get the parameter list.
  • Creates a new LuceneSearch object and invokes its searchText() method, which invokes LuceneSearchSet::newFromQuery().
  • That routine does the following:
    • Creates the search URL like this: $searchUrl = "http://$host:$wgLucenePort/$method/$wgDBname/$enctext?" to which a few parameters, such as namespaces, are appended.
    • Invokes Http::get(), which invokes MWHttpRequest::factory() to get a new request object (probably a CurlHttpRequest) and then invokes execute() on it.
    • That method uses the native PHP functions curl_init(), curl_setopt(), curl_exec() and curl_close() to make the HTTP call to the Java engine; the results are saved in the request object.
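
To make the URL layout above concrete, here is a minimal Java sketch that assembles a URL of the same shape. It is not the MWSearch code: the class SearchUrlSketch, its methods, and the parameter name "limit" are assumptions made for illustration, and URLEncoder applies form encoding (a space becomes '+'), which may differ from what MWSearch actually emits.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MWSearch: mirrors the URL layout quoted above.
class SearchUrlSketch {

    // URL-encode one component as UTF-8 (form encoding: space becomes '+').
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Build http://host:port/method/dbName/encodedText?key=value&...
    static String buildSearchUrl(String host, int port, String method,
                                 String dbName, String text,
                                 Map<String, String> params) {
        StringBuilder url = new StringBuilder();
        url.append("http://").append(host).append(':').append(port)
           .append('/').append(method)
           .append('/').append(dbName)
           .append('/').append(enc(text))
           .append('?');
        boolean first = true;
        for (Map.Entry<String, String> p : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(enc(p.getKey())).append('=').append(enc(p.getValue()));
            first = false;
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // Parameter names and values here are assumptions for the example.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("namespaces", "0");
        params.put("limit", "20");
        System.out.println(buildSearchUrl("localhost", 8123, "search",
                                          "my_wiki", "hello world", params));
    }
}
```

A LinkedHashMap is used so that the parameters appear in insertion order, which keeps the generated URL predictable when comparing it against log entries on the search server.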

Search details (Java)

Most of the code is in subdirectories of src/org/wikimedia/lsearch/. The main class dealing with search itself is search/SearchServer.java; classes interfacing with PHP are in frontend and those dealing with networking are in interoperability. The main entry point is config/StartupManager.java.

Some important classes are described below.

StartupManager
Performs these steps:
  1. Get local and global configurations and retrieve various parameters (language codes, localization data, etc.)
  2. Invoke static methods createRegistry() and bindRMIObjects() in RMIServer (see below for more on this class).
  3. If this is an indexer machine, start new HTTPIndexServer [default] or RPCIndexServer [? Is this ever used?].
  4. If this is a search machine:
    • Start new SearchServer.
    • Create singleton SearcherCache.
    • Start singleton threads UpdateThread and NetworkStatusThread.
HttpHandler
This is an abstract class (with processRequest() the only abstract method) that extends Thread; it is extended by HTTPIndexDaemon (handles index update requests) and SearchDaemon (handles search requests).
SearchDaemon
Extends HttpHandler; one of these is created for each incoming search request and run by the thread-pool in SearchServer (see below). Provides a definition of processRequest() which does the following:
  1. If non-search request (e.g. /robots.txt, /stats, /status), return relevant data.
  2. Otherwise:
    • Create new SearchEngine (top-level search class) and invoke it to get search results.
    • Return results in one of 3 formats: Standard, JSON, or OPENSEARCH.
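
The relationship between HttpHandler and SearchDaemon described above is a base class with a single abstract hook. The sketch below shows that shape under heavy simplification: socket handling is omitted, and the class names (HandlerSketch, SearchDaemonSketch, HandlerDemo) and the placeholder reply strings are hypothetical, not taken from the lsearch source.

```java
// Simplified stand-ins for HttpHandler and SearchDaemon; socket I/O is omitted
// and every name and return value here is hypothetical.
abstract class HandlerSketch extends Thread {
    private final String path;      // request path, e.g. "/status"
    private volatile String reply;  // filled in by run()

    HandlerSketch(String path) {
        this.path = path;
    }

    // The only abstract method: each subclass decides how to answer.
    protected abstract String processRequest(String path);

    @Override
    public void run() {
        reply = processRequest(path);
    }

    String getReply() {
        return reply;
    }
}

class SearchDaemonSketch extends HandlerSketch {
    SearchDaemonSketch(String path) {
        super(path);
    }

    @Override
    protected String processRequest(String path) {
        // Step 1: non-search requests are answered directly.
        if (path.equals("/robots.txt") || path.equals("/stats") || path.equals("/status")) {
            return "builtin:" + path;
        }
        // Step 2: everything else would be passed to the search engine.
        return "search:" + path;
    }
}

class HandlerDemo {
    public static void main(String[] args) throws InterruptedException {
        HandlerSketch h = new SearchDaemonSketch("/search/my_wiki/hello");
        h.start();   // the real server hands handlers to a thread pool instead
        h.join();
        System.out.println(h.getReply());
    }
}
```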
HTTPIndexDaemon
Similar to SearchDaemon (above); extends HttpHandler; one of these is created for each incoming index request and run by the thread-pool in HTTPIndexServer (see below). Provides a definition of processRequest() that handles index update requests. [?Add details on what it does.]
SearchServer
Extends Thread. Though not defined as a singleton, appears to be so in practice. Started by StartupManager (see above). Does the following:
  1. Create Statistics and StatisticsThread objects to supply stats to Ganglia.
  2. Create thread-pool of maxThreads [default: 80] threads.
  3. Listen on ServerSocket [default port: 8123]; when a connection is made, create new SearchDaemon object and run it in the pool if the pool is not full. If the pool is full, log an error and simply close the socket! [?NOTE There may be an off-by-one error in the check to see if the pool is full.]
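
The "run it in the pool if the pool is not full" step can be sketched as an admission check in front of a fixed thread pool. The code below is an illustration only: BoundedAcceptSketch and its methods are hypothetical names, sockets are left out, and it makes no claim about how SearchServer implements its check. Whether the real check has the suspected off-by-one is not established here; the sketch only shows one way to make the boundary explicit, admitting exactly maxThreads concurrent requests.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: admit at most maxThreads concurrent tasks, reject the rest.
class BoundedAcceptSketch {
    private final int maxThreads;
    private final AtomicInteger open = new AtomicInteger();
    private final ExecutorService pool;

    BoundedAcceptSketch(int maxThreads) {
        this.maxThreads = maxThreads;
        this.pool = Executors.newFixedThreadPool(maxThreads);
    }

    // Returns true if the task was handed to the pool, false if it was rejected.
    boolean submit(Runnable task) {
        // Reserve a slot first; back out if that would exceed the limit.
        if (open.incrementAndGet() > maxThreads) {
            open.decrementAndGet();
            return false;   // a server would log an error and close the socket here
        }
        pool.execute(() -> {
            try {
                task.run();
            } finally {
                open.decrementAndGet();   // free the slot when the task ends
            }
        });
        return true;
    }

    void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) {
        BoundedAcceptSketch server = new BoundedAcceptSketch(2);
        CountDownLatch gate = new CountDownLatch(1);
        Runnable waitForGate = () -> {
            try {
                gate.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println(server.submit(waitForGate));   // slot 1 of 2
        System.out.println(server.submit(waitForGate));   // slot 2 of 2
        System.out.println(server.submit(waitForGate));   // pool full, rejected
        gate.countDown();
        server.shutdown();
    }
}
```

Reserving the slot with incrementAndGet() before comparing avoids a gap between "check" and "count", so two connections arriving together cannot both claim the last slot.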
HTTPIndexServer
Similar to SearchServer above; extends Thread. Though not defined as a singleton, appears to be so in practice. Started by StartupManager. Does the following:
  1. Create thread-pool of 25 (hardcoded) threads.
  2. Listen on ServerSocket [default port: 8321]; when a connection is made, create new HTTPIndexDaemon object and run it in the pool if the pool is not full. If the pool is full, log an error and simply close the socket! [?NOTE There may be an off-by-one error in the check to see if the pool is full. There is also a potential issue if both servers are run in the same Java process: the count of open requests is a static member of the common base class, so it becomes a combined count of both search and index requests even though the thread pools are separate.]
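
The shared-counter concern in the note above follows from how static fields work in Java: a static member declared in a base class exists once, not once per subclass. The sketch below demonstrates only that language behaviour; the class names are hypothetical and the numbers reuse the pool sizes mentioned above (80 and 25).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical classes illustrating the concern; they are not the lsearch code.
abstract class CountedHandlerBase {
    // Static, so a single counter is shared by every subclass.
    static final AtomicInteger openRequests = new AtomicInteger();

    CountedHandlerBase() {
        openRequests.incrementAndGet();
    }
}

class SearchRequestSketch extends CountedHandlerBase { }

class IndexRequestSketch extends CountedHandlerBase { }

class SharedCounterDemo {
    // Opens the given number of requests of each kind and returns the shared count.
    static int openBoth(int searches, int indexUpdates) {
        CountedHandlerBase.openRequests.set(0);
        for (int i = 0; i < searches; i++) {
            new SearchRequestSketch();
        }
        for (int i = 0; i < indexUpdates; i++) {
            new IndexRequestSketch();
        }
        return CountedHandlerBase.openRequests.get();
    }

    public static void main(String[] args) {
        // 60 searches and 25 index updates fit within pools of 80 and 25 threads,
        // yet the shared counter reports the combined total.
        System.out.println(openBoth(60, 25));
    }
}
```

If a pool-full check compared such a combined count against the size of one pool, it could reject requests while that pool still had free threads; keeping one counter per server instance would avoid this.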
IndexDaemon
Simple class that functions as interface adapter to present a much simpler interface to clients of the somewhat complex IndexThread class. Not clear why this is done via a concrete class rather than an interface implemented by IndexThread.
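
The two designs contrasted above can be sketched side by side. Everything in this example is hypothetical (BusyIndexerSketch stands in for the IndexThread role, IndexFacadeSketch for the IndexDaemon role) and none of the method names come from the lsearch source; the point is only to show what each option looks like to a client.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a class with a broad, detailed interface (the IndexThread role).
class BusyIndexerSketch {
    final List<String> queue = new ArrayList<>();

    void enqueue(String db, String title, String action, int priority) {
        queue.add(action + " " + db + ":" + title + " (priority " + priority + ")");
    }
}

// Option A: a concrete adapter class that exposes only the simple operations.
class IndexFacadeSketch {
    private final BusyIndexerSketch indexer;

    IndexFacadeSketch(BusyIndexerSketch indexer) {
        this.indexer = indexer;
    }

    void update(String db, String title) {
        indexer.enqueue(db, title, "update", 0);
    }

    void delete(String db, String title) {
        indexer.enqueue(db, title, "delete", 0);
    }
}

// Option B: the simple operations as an interface implemented by the indexer itself.
interface SimpleIndexApi {
    void update(String db, String title);

    void delete(String db, String title);
}

class SelfServingIndexerSketch extends BusyIndexerSketch implements SimpleIndexApi {
    @Override
    public void update(String db, String title) {
        enqueue(db, title, "update", 0);
    }

    @Override
    public void delete(String db, String title) {
        enqueue(db, title, "delete", 0);
    }
}
```

One possible reason to prefer option A is that clients holding the adapter cannot reach the wider interface at all, whereas with option B a client can still downcast to the implementing class; whether that motivated the actual design is not known.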
HttpMonitor
Coming soon
RPCIndexDaemon
Coming soon
RPCIndexServer
Coming soon

Installing MediaWiki and lucene-search-2 for debugging

These instructions are targeted at developers who want to set up an instance of MediaWiki and the Lucene-based search functionality for testing and debugging; setting up a production system is not the intent here.

Details on how to install MediaWiki are at: http://www.mediawiki.org/wiki/Installation. A summary appears below along with some additional details.

  • Download the latest release from: http://www.mediawiki.org/wiki/Download and extract the archive; then rename the top-level directory to 'core' or something similar for ease of typing, e.g.
      cd ~/src
      tar xvf mediawiki-1.20.2.tar.gz
      mv mediawiki-1.20.2 core
  • Install prerequisites (if you prefer MySQL to SQLite3, replace the sqlite packages below with the corresponding MySQL packages: mysql-server, php5-mysql):
    list="php5 php5-sqlite sqlite3 apache2 git default-jdk ant debhelper javahelper"
    list="$list liblog4j1.2-java libcommons-logging-java libslf4j-java "
    sudo apt-get install $list
  • Checkout the MWSearch extension from the git repository, e.g.:
    cd ~/src
    mkdir extensions; cd extensions
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/MWSearch.git
  • Make sure the apache/php combo is working by creating a file named info.php at /var/www containing:
   <?php phpinfo(); ?>

Now point your browser (or use wget/curl to fetch the page) at http://localhost/info.php; you should see lots of tables with PHP configuration info (replace localhost with appropriate host name or IP if necessary).

  • Create a data directory somewhere and make it world read/write (this is to allow apache to create an sqlite DB file under it). Also make sure that all the directories from your home directory to the MediaWiki root are world readable and searchable (otherwise you'll get errors from Apache as it searches for .htaccess files), e.g.
    mkdir ~/data; chmod 777 ~/data
    chmod 755 ~ ~/src ~/src/core
  • Reconfigure apache by editing /etc/apache2/sites-available/default to remove unnecessary stuff and also set DocumentRoot to point to the freshly unpacked MediaWiki root above. Something close to this should work (replace xyz by a proper user name):
   <VirtualHost *:80>
       ServerAdmin webmaster@localhost
       DocumentRoot /home/xyz/src/core
       Alias /extensions /home/xyz/src/extensions
       ErrorLog ${APACHE_LOG_DIR}/error.log
       LogLevel warn
       CustomLog ${APACHE_LOG_DIR}/access.log combined
       php_admin_flag engine on
       <Directory /home/xyz/src/core/images>
           php_admin_flag engine off
       </Directory>
       <Directory /home/xyz>
           AllowOverride All
       </Directory>
   </VirtualHost>
  • Reload apache configuration with:
   sudo /etc/init.d/apache2 reload

Now point your browser at http://localhost/ (replace localhost with the appropriate host name or IP if necessary); you should see a page saying:

   LocalSettings.php not found.
   Please set up the wiki first.

Follow the on-screen instructions to configure MediaWiki; at the end you'll be prompted to download the generated LocalSettings.php file and place it in the MediaWiki root directory to complete the configuration step.

  • The previous step can also be done from the commandline, e.g.
      php core/maintenance/install.php --help

Documentation on the various parameters is at: http://www.mediawiki.org/wiki/Manual:Config_script. A sample invocation for a MySQL install might look like this (change dbxyz and DbXyzPass to a suitable user name and password; likewise, change wiki_admin and WikiAdminPass to suitable values for the wiki administrator, and RootPass to the password for the root user of your MySQL installation) [?What's the difference between dbuser and admin user?]:

   #!/bin/bash
   # Install MediaWiki from the commandline; run from the MediaWiki root (e.g. ~/src/core)
   opt='--dbtype mysql '
   opt+='--dbuser dbxyz '
   opt+='--dbpass DbXyzPass '
   opt+='--installdbuser root '
   opt+='--installdbpass RootPass '
   opt+='--pass WikiAdminPass'
   # $opt is deliberately left unquoted so that it splits into separate arguments
   php maintenance/install.php $opt my_wiki wiki_admin

This will generate a new LocalSettings.php file, create a new database named my_wiki and create a number of tables within it; the user table will have a row for dbxyz. You can edit the generated LocalSettings.php file manually to add additional configuration options as needed; for example, some of these may be useful (replace xyzhost with your hostname):

      require( "$IP/../extensions/MWSearch/MWSearch.php" );
      $wgLuceneHost = 'xyzhost';
      $wgLucenePort = 8123;
      $wgLuceneSearchVersion = '2.1';
      $wgLuceneUseRelated = true;
      $wgEnableLucenePrefixSearch = false;
      $wgSearchType = 'LuceneSearch';
  • Checkout lucene-search-2:
     cd ~/src
     git clone https://gerrit.wikimedia.org/r/p/operations/debs/lucene-search-2.git

There is a top-level README.txt file that describes how to build it; we summarize the steps below.

  • Run ant to build everything; the result should be a local file named LuceneSearch.jar:
     cd lucene-search-2; ant
  • The README.txt file mentions running the configure script but that script is missing in the git checkout. Create it to contain:
     #!/bin/bash
     # Resolve the MediaWiki root (first argument) to an absolute path
     dir=$(cd "$1" && pwd) || exit 1
     java -cp LuceneSearch.jar org.wikimedia.lsearch.util.Configure "$dir"

Now run it with the full path to the MediaWiki root directory as an argument, e.g.:

     bash configure ~/src/core

It will examine your MediaWiki configuration and generate these matching configuration files for search:

     lsearch.log4j  lsearch-global.conf  lsearch.conf  config.inc
  • The generated lsearch.log4j uses ScribeAppender, which requires installation of additional packages (without them you'll get Java exceptions when you run the lsearchd daemon); one way to get around this is to remove those references and use a RollingFileAppender:
       log4j.rootLogger=INFO, R
       log4j.appender.R=org.apache.log4j.RollingFileAppender
       log4j.appender.R.File=logs/test.log
       log4j.appender.R.MaxFileSize=10MB
       log4j.appender.R.MaxBackupIndex=2
       log4j.appender.R.layout=org.apache.log4j.PatternLayout
       log4j.appender.R.layout.ConversionPattern=%d{ISO8601} %-5p %c %m%n
       log4j.logger.org.wikimedia.lsearch.interoperability=DEBUG
  • Now get an XML dump (replace /var/tmp with a different location if you prefer; the path to the dump file as well as the name of the file itself may change over time):
     pushd /var/tmp
     file='simplewiktionary-20130113-pages-meta-current.xml.bz2'
     wget http://dumps.wikimedia.org/simplewiktionary/20130113/$file
     popd

and build Lucene indexes from it (the last argument is the name of your wiki as defined in LocalSettings.php by the $wgDBname global variable):

     java -cp LuceneSearch.jar org.wikimedia.lsearch.importer.BuildAll /var/tmp/$file my_wiki

This last command is equivalent to running the build script mentioned in README.txt; it creates a new directory named indexes and a number of directories and index files under it. For the dump file mentioned above, it should take around 5 minutes to complete on a modern machine.

  • Finally, you can run the search daemon:
     ./lsearchd &

It listens for search queries on port 8123, so you can test it like this:

       wget http://localhost:8123/search/my_wiki/hello

Logs can be found under the logs directory.

References

These links have useful info about search:

  1. mw:Extension:Lucene-search
  2. Lucene
  3. mw:User:Rainman/search internals