making things better, making better things

Saturday, June 27, 2009

deploying Thinking Sphinx on DreamHost PS

The time had come to add a search engine to Touring Machine. I went pretty far down the road with Xapian/Xapit, but:

You may want to trigger [reindexing] via a cron job on a recurring schedule (i.e. every day) to update the Xapian database. However it will only take effect after the Rails application is restarted because the Xapian database is stored in memory.

I see the Xapit Sync project to fix this has since ceased to be vaporware. Well, maybe next time.

Anyway, so RubyTrends tells me the cool kids use Thinking Sphinx, and I want to be cool, but I’m running Touring Machine on the cheap – on a shared DreamHost server – and they don’t want me to run server processes, and although some guy on the Internet says it’s probably fine, I’m leery of defying them. But last week they were running a discount offer on DreamHost PS, their quasi-VPS service – no root, but you can run whatever you want, within the resources (memory, CPU) you pay for.

Sounds like a fine place to run the search engine, but I didn’t want to run the whole Rails app there – since I’ve already got a place to run it that is effectively free. (DreamHost is very cheap, and I run a bunch of sites on it.) So I set out to run a distributed site, with the web app running on DreamHost shared hosting, and the search engine running on DreamHost PS.

It took some setup, but I think I’ve got it working. Here are some notes. As usual, this isn’t a tutorial. You should read everyone else’s instructions – especially the official Thinking Sphinx documentation, and J. Wade Winningham’s post about Capistrano tasks. (I didn’t use his deploy.rb, though.)

The Approach

As I said, I’ve got software installed on two machines – call them app (the web application) and search (the Sphinx server). (app has no special access to search – it’s open to DoS and other attacks; I’ll fix that later.) So Capistrano must be told about the new server and its new role. In deploy.rb:
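A minimal sketch of those role declarations – the hostnames here are placeholders, not the real ones:

```ruby
# deploy.rb – one host runs the web app, the other runs Sphinx.
# Hostnames are placeholders.
role :web, "app.example.com"
role :app, "app.example.com"
role :db,  "app.example.com", :primary => true
role :search, "search.example.com"
```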


To keep things simple (sort of), I deploy the whole codebase to both hosts. app is configured via the DreamHost control panel to run Passenger out of the Rails directory; search is configured not to, and I manage the search server via Capistrano.

There doesn’t seem to be a way not to run a web server on the PS host, so I created a public/redirect directory, told DreamHost to use that as my web directory, and put this in public/redirect/.htaccess:

Redirect 301 /
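The Redirect directive needs a target URL – the main site – so the full file looks something like this (the domain is a placeholder):

```apache
# public/redirect/.htaccess – bounce all requests to the real app host.
# The target domain is a placeholder.
Redirect 301 / http://www.example.com/
```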

The standard Capistrano recipes assume you want to deploy to the same directory on each host. Unfortunately DreamHost won’t let you have the same username on both a shared and a PS host – and you can’t install your app anywhere outside of your home directory. Stuff breaks if you use a relative path for installation. This worked for me, at the top of deploy.rb:

set :application, ""
set :applicationdir, "`(cd ~; pwd)`/#{application}"
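Capistrano’s deploy path can then use the expanded per-host directory – presumably via something like this (my assumption, a sketch rather than the exact line):

```ruby
# deploy.rb – deploy into the home-relative directory computed above.
set :deploy_to, applicationdir
```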

Thinking Sphinx

Thinking Sphinx’s Sphinx installer can’t be used as-is on DreamHost PS, for two reasons: it fails if PostgreSQL isn’t available, and it assumes you have sudo permission. I’ve submitted a revised version that fixes both problems. (It’ll still use sudo if you set :use_sudo to true.) For now, you’ll need to copy my capistrano.rb into your project if you want to use it (Update: it’s now in the master repository, and should be in the next gem version), and add this to deploy.rb:

set :thinking_sphinx_configure_args, "--prefix=$HOME/software" # where to install Sphinx
require "#{File.dirname(__FILE__)}/thinking_sphinx_capistrano"

You’ll also need this in config/sphinx.yml (the settings are keyed by Rails environment):

production:
  bin_path: '/home/USERNAME/software/bin'

Thinking Sphinx lets you specify a path to keep the indexes in, and comes with a task to create the path when you set up the server. Unfortunately the task is assigned to the app role. I haven’t patched Thinking Sphinx for this yet – I just put this in deploy.rb (after the above snippet):

namespace :thinking_sphinx do
  # Override TS's task to use :search instead of :web, and not use sudo
  desc "Add the shared folder for sphinx files for the production environment"
  task :shared_sphinx_folder, :roles => :search do
    run "mkdir -p #{shared_path}/db/sphinx/#{rails_env}"
  end
end
after "deploy:setup", "thinking_sphinx:shared_sphinx_folder"

The other tasks assume you want them run on all machines; for now I’ve dealt with that by using, e.g., cap ROLES=search thinking_sphinx:install:sphinx to do Sphinx stuff.

Delayed Deltas

I used delayed deltas to keep the index up to date. The delayed_job plugin also comes with Capistrano tasks for managing a DJ daemon. (DJ is packaged with Thinking Sphinx, but you should install the latest version as a plugin.) cap delayed_job:start runs rake delayed_job:start on the server. This Rake task loads the whole Rails environment, then forks itself to run as a daemon. Combine this with Sphinx itself and an ssh session, and suddenly you’re using too much memory for a minimal DreamHost PS server (150M + swap). The forked daemon is killed before it can start.

I could upgrade, but it’s pretty ridiculous: You don’t need a whole Rails environment to start a daemon. I wrote a lo-fi replacement for rake delayed_job:* that does only what I need:

#!/bin/sh
command=$1
PIDFILE=tmp/pids/delayed_job.pid  # delayed_job's default pid location (an assumption)
cd `dirname $0`/..
case $command in
  start)
    nohup script/delayed_job run >>log/delayed_job.log 2>&1 &
    ;;
  stop)
    if [ -f $PIDFILE ]; then
      pid=`cat $PIDFILE`; rm $PIDFILE; kill $pid 2>/dev/null
    fi
    ;;
  restart)
    $0 stop; $0 start
    ;;
esac

Put that in script/dj, and add another chunk to deploy.rb:

namespace :delayed_job do
  desc "Stop the delayed_job process"
  task :stop, :roles => :search do
    run "cd #{current_path}; env RAILS_ENV=#{rails_env} script/dj stop"
    sleep 1
  end
  desc "Start the delayed_job process"
  task :start, :roles => :search do
    run "cd #{current_path}; env RAILS_ENV=#{rails_env} script/dj start"
  end
  desc "Restart the delayed_job process"
  task :restart, :roles => :search do
    run "cd #{current_path}; env RAILS_ENV=#{rails_env} script/dj restart"
  end
end
before "thinking_sphinx:configure", "delayed_job:stop"  # limited resources
after  "thinking_sphinx:configure", "delayed_job:start" # limited resources
after "deploy:stop",    "delayed_job:stop"
after "deploy:start",   "delayed_job:start"
after "deploy:restart", "delayed_job:restart"

(Note that you can’t get by on delta indexes alone; I have a cron job that rebuilds the Sphinx index every now and then.)
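A crontab entry for that rebuild might look like this – the schedule, path, and log file are my assumptions:

```
# Rebuild the full Sphinx index nightly at 3am; path is a placeholder.
0 3 * * * cd $HOME/app/current && RAILS_ENV=production rake thinking_sphinx:index >> log/cron.log 2>&1
```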

Of course, you also don’t need a whole Rails environment to wake up every 5 seconds, see if there are tables that need reindexing, and fire off a request to the Sphinx server. If I want to keep this site cheap, I may need to optimize delayed_job out of the way. But for now, this is working, and it’s time to get back to making the app cool.

posted by erik at 5:47 pm  
