Sartoris
Deployment location
- MediaWiki:
  - /srv/deployment/mediawiki/common
  - /srv/deployment/mediawiki/slot0
  - /srv/deployment/mediawiki/slot1
- Parsoid:
  - /srv/deployment/parsoid/Parsoid
  - /srv/deployment/parsoid/config
Deploying
Initialize the git deploy environment:
$ git deploy start
You can now update the code base using git pull, checkout, cherry-pick, commit, or whatever other repo changes you need to make. Once you have finished updating the code base, ask git deploy to actually deploy the modifications:
$ git deploy sync
If you screwed up something during the code update, you can abort your current work using:
$ git deploy abort
Design
Basic design
Git repositories sit on the deployment system behind a web server. Users initiate a deployment using git-deploy (git deploy start). It writes out a lock file to allow only a single deploy at a time, and adds a tag to the repo as a rollback point in case of a deploy abort.

At this point the user updates the repo as necessary, or aborts the deploy. If the user aborts the deploy, it rolls back the repo to the start tag and removes the lock file.

Once the deploy is ready, the user completes the deploy (git deploy sync), which causes git-deploy to write out a sync tag, then trigger a sync script. It also adds a .deploy file to the repo root, which describes the currently deployed code. The sync script updates the repo and submodules so that the application servers can fetch properly. After doing so it calls a salt run for fetch, then a salt run for checkout to the deploy tag. The sync script reports success or failure. After the sync script has run, git-deploy removes the lock file.
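The start/abort portion of the lifecycle above can be sketched as follows. This is a minimal illustration, not git-deploy's actual implementation: the lock file location and the tag naming scheme are assumptions.

```python
import os
import subprocess
import time

# Assumed lock file location, for illustration only
LOCK_FILE = ".git/deploy.lock"

def git(repo, *args):
    """Run a git command inside the repo and return its output."""
    return subprocess.check_output(("git",) + args, cwd=repo).decode()

def deploy_start(repo):
    """Take the deploy lock and tag HEAD as the rollback point."""
    lock = os.path.join(repo, LOCK_FILE)
    if os.path.exists(lock):
        raise RuntimeError("another deploy is already in progress")
    with open(lock, "w") as f:
        f.write(str(os.getpid()))
    # Tag the current state so an abort can roll back to it
    tag = "deploy-start-%d" % int(time.time())
    git(repo, "tag", tag)
    return tag

def deploy_abort(repo, start_tag):
    """Roll the repo back to the start tag and release the lock."""
    git(repo, "reset", "--hard", start_tag)
    os.remove(os.path.join(repo, LOCK_FILE))
```

A sync step would additionally write the sync tag and the .deploy file before invoking the sync hook.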
Sync hook
- Location (on the deployment host): /var/lib/git-deploy/sync/shared.py
- Managed in the puppet deployment module
- Get the repo and submodules ready for fetching:
- Update the repo: git update-server-info
- Tag all submodules with the same tag as parent repo: git submodule foreach "git tag <tag>"
- Update all submodules: run git update-server-info for each extension in <repo>/.git/modules
- Make the application servers do a fetch (via a salt runner)
- Make the application servers do a checkout (via a salt runner)
- Switch core to the git tag
- Update the submodules
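The repo-preparation steps above can be sketched as a function that builds the command list. The (working-directory, argv) shape and the function name are illustrative, not the actual hook's code.

```python
def sync_hook_commands(repo, tag, extensions):
    """Build the git commands the sync hook runs to prepare a repo and its
    submodules for fetching. Returns (working-directory, argv) pairs."""
    cmds = [
        # Make the parent repo fetchable
        (repo, ["git", "update-server-info"]),
        # Tag every submodule with the same tag as the parent repo
        (repo, ["git", "submodule", "foreach", "git tag %s" % tag]),
    ]
    # Make each extension submodule fetchable as well
    for ext in extensions:
        cmds.append(("%s/.git/modules/%s" % (repo, ext),
                     ["git", "update-server-info"]))
    return cmds
```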
Salt deploy runner
- Location (on the salt master): /srv/runners/deploy.py
- Managed in the puppet deployment module
A salt runner is a script that runs on the salt master and can combine many salt calls into a single function.
The salt deploy runner can be called from the deployment server via sudo salt-call publish.runner deploy.<function>. It is called by the git-deploy sync hook. It has two functions:
- deploy.fetch(repo)
- calls fetch (via a salt module) on all application servers for the specified repo
- deploy.checkout(repo)
- calls checkout (via a salt module) on all application servers for the specified repo
Each function returns a JSON report on which minions returned successfully, which failed, and which didn't return.
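A report of that shape could be assembled like the sketch below. The field names and the convention that a result of True means success are assumptions, not the runner's actual output format.

```python
import json

def build_report(targeted, returns):
    """Summarize a salt run. `targeted` lists the minion ids the runner
    addressed; `returns` maps minion id -> result, where True means success
    and anything else counts as a failure. Minions absent from `returns`
    didn't return at all."""
    success = sorted(m for m, r in returns.items() if r is True)
    failed = {m: r for m, r in returns.items() if r is not True}
    no_return = sorted(set(targeted) - set(returns))
    return json.dumps({"success": success,
                       "failed": failed,
                       "no_return": no_return})
```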
Salt deploy module
- Location (on the salt master): /srv/salt/_modules/deploy.py
- Managed in the deployment puppet module
A salt module lives on every salt minion and can be called from the salt master or from any peer which is allowed access.
The salt deploy module is called via salt <matching-criteria> deploy.<function>. It has the following functions:
- deploy.sync_all
- sync all configured repositories. This will also fully clone repositories if they are missing.
- deploy.fetch(repo)
- do a git fetch based on the repo location (repo_locations) and URL (repo_urls) defined via salt pillars.
- deploy.checkout(repo,reset=False)
- do a checkout of a repo based on the repo location (repo_locations) and URL (repo_urls) from salt pillars, and the .deploy file defined on the deployment host. Checkout also modifies the .gitmodules file based on sed-style configuration defined in salt pillars (repo_regex).
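The .gitmodules rewriting step can be illustrated as follows. The pillar is assumed here to hold (pattern, replacement) pairs, which may not match the actual repo_regex format; the URLs in the example are placeholders.

```python
import re

def rewrite_gitmodules(text, repo_regex):
    """Apply sed-style substitutions from the repo_regex pillar to the
    contents of .gitmodules, so that submodule URLs point at the deployment
    server instead of the canonical origin. Pillar shape is an assumption."""
    for pattern, replacement in repo_regex:
        text = re.sub(pattern, replacement, text)
    return text
```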
Salt deployment pillars
- Location (on the salt master): /srv/pillars
- Managed in the puppet repo: role::deployment::salt_masters::production
Salt pillars are a set of configuration data available on every salt minion (via salt-call pillar.data). Pillars are managed on the master and are distributed to all minions on update.
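As an illustration, the pillar keys named above (repo_locations, repo_urls, repo_regex) might look like the fragment below. The exact structure and the hostnames are assumptions, not the production pillar data.

```yaml
# Illustrative pillar data only; key structure and hosts are assumptions.
repo_locations:
  mediawiki/common: /srv/deployment/mediawiki/common
repo_urls:
  mediawiki/common: http://deployment-host.example/mediawiki/common
repo_regex:
  mediawiki/common:
    - pattern: 'https://gerrit\.example\.org/r/p'
      replace: 'http://deployment-host.example/mediawiki'
```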
Naming
- slot0 <- current
- slot1 <- next
- slot2 <- next + 1
- ...
On the deployment system, we should symlink version numbers to the slots, so that it's easy to tell which version we are on, for instance:
/srv/deployment/mediawiki/common/php-1.20wmf1 -> /srv/deployment/mediawiki/slot0
/srv/deployment/mediawiki/common/php-1.20wmf2 -> /srv/deployment/mediawiki/slot1
...
Discussion:
- This may be a good place for something like Perl's Storable, which allows you to serialize/deserialize complex data structures for writing to disk or transfer. Depending on what we use slots for, it's an efficient way to store more data, e.g. metadata about deployment versions
- Python's equivalent is pickle, and in PHP we're already using cdb for version info (hetdeploy). The slots scheme would need to work with our hetdeploy stuff, which I think assumes versions. Either we'd need to sync the symlinks to the versions, or do a lot of work on hetdeploy.
Timeline for slots
- slot0=wmf1, slot1=wmf2
- move all wmf1 wikis to wmf2 over time
- once all are moved, switch slot0 to wmf2, move wikis to slot0
- rinse/repeat for next cycle
Examples
Example deploy of a core change
$ cd /srv/deployment/mediawiki/common/php-1.20wmf1
$ git deploy start
$ git pull
$ git deploy sync
In the above scenario, 1.20wmf1 is the current version of MediaWiki we are running. /srv/deployment/mediawiki/common/php-1.20wmf1 is a symlink to /srv/deployment/mediawiki/slot0. When it syncs to the application servers, git deploy runs git fetch and switches to a git tag at /srv/deployment/mediawiki/slot0. After switching to the git tag, git deploy also updates all submodules to the versions listed at the tag point.
Example of changing versions of mediawiki
$ cd /srv/deployment/mediawiki/common
$ ln -s /srv/deployment/mediawiki/slot1 php-1.20wmf2
$ cd php-1.20wmf2
$ git deploy start
$ git branch --track wmf/1.20wmf2 origin/wmf/1.20wmf2
$ git checkout wmf/1.20wmf2
$ git submodule update --init
$ git deploy sync
This example does the same thing as the previous one, but it updates /srv/deployment/mediawiki/slot1 rather than /srv/deployment/mediawiki/slot0.
Example of an emergency live hack
$ cd /srv/deployment/mediawiki/common/php-1.20wmf1
$ git deploy start
<make changes>
$ git commit
$ git deploy sync
Reporting
The salt-runner returns a report about the number of minions that returned successfully, timed out, or failed. For minions that time out or fail, the runner returns a list of those minions. For minions that fail, it also returns each minion's failure status.
In addition to the report, all minions update redis with information about the current tag deployed, and the status (and timestamp) of fetch and checkout. The status of what's deployed on the cluster can be found using the deploy-info script.
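A minion-side status update could look like the sketch below. The key layout and field names are assumptions; `client` is anything exposing redis-py's hset(name, key, value) interface, so the real redis connection can be swapped in.

```python
import time

def record_status(client, repo, minion, step, status, tag):
    """Record the result of a fetch or checkout step for one minion in a
    redis hash. `step` is "fetch" or "checkout"; the key layout here is an
    assumption, not the actual schema used by the deploy module."""
    key = "deploy:%s:%s" % (repo, minion)
    client.hset(key, step + "_status", status)
    client.hset(key, step + "_timestamp", int(time.time()))
    # Track which tag this minion currently has deployed
    client.hset(key, "tag", tag)
```

A deploy-info style script would then read these hashes back to show what is deployed across the cluster.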
Scaling
Salt runs git fetch and git checkout in parallel on all minions. This could possibly cause a high amount of network load. There are a couple of possible ways of handling this:
- Short-term fix: have the fetch portion of the salt runner do a batched run. The checkout will still occur in parallel.
- Long-term fix: rack-aware deployment. Each rack would have one application server as a designated deployment node, which would also be defined via a grain. The runner would be changed to push to the deployment nodes. The deployment nodes would update themselves, then send a peer command for their rack's nodes to update from the deployment node in the same rack. Reporting is more difficult in this setup; using salt returners would be ideal.
- Speculative fix: have the server copy the files using BitTorrent.
Trying it
tin.eqiad.wmnet is the eqiad deployment host. There are a few eqiad mw hosts configured and ready to be tested. Simply go into /srv/deployment/mediawiki/<repo> and try it out. It is not necessary to forward your ssh agent to this host.
Issues
- WARNING
- the issues below should be filed as bug reports in our Bugzilla
- When calling the salt-runner from the deployment host, the runner is being called by a peer. When a runner is called directly on the salt-master, it displays progress as it occurs. When called from a peer it only displays the end-result. This can take a while and doesn't indicate if the deployment is working or hung.
- When any application server salt minion is down, the salt-run calls will take the entirety of their timeout value. fetch is currently set at 2 minutes and checkout is set at 1 minute. This means all deploys will likely take 3 minutes with virtually no feedback (due to the above issue).
- The runner currently is not properly returning results about minions that didn't report back.
- i18n is not being deployed in the new system
TODOs
Required
- Create a sudo policy for wikidev users to be able to call the salt runners
- When 0.10.3 is released, also add an ACL so that we can run this without a sudo policy
- There's a sudo policy temporarily in place on tin. This needs to be puppetized
- Add a finish script to git-deploy to write out to IRC
- Add puppet exec to initialize repo, for new hosts
- There's a function in the salt deploy module for this, but it needs to be puppetized: salt-call deploy.sync_all
- Add puppet exec to bring repos up to date before apache starts
- The above sync_all needs to replace the scap call
- Rewrite git-deploy in python
- We really need fetch and checkout to be separate manual steps.
- We also need to be able to retry a fetch or checkout without starting a completely new deploy
- When retrying a checkout, we should be able to either force a reset or just let it retry the checkout