Echographia

making things better, making better things

Monday, February 2, 2009

backup from dreamhost

Yesterday was the first day of February Album Writing Month, so of course I kicked things off by sinking three hours into writing a script that I could have found on the Internet somewhere. See, last week I cleaned out my Gmail inbox. Since I hadn't done that, like, ever, that meant reading a lot of old and unimportant mail – like, for example, the July 2008 DreamHost newsletter, which mentioned their Entire Account Backup service, which gives you a downloadable copy of all your data for a given account. I love backups, so in the interest of being able to archive that newsletter, I kicked off a backup right then and there.

And yesterday the bill came due, in the form of a download link with a five-day expiration date. Well, right away I set to it! Unfortunately the download link sent me to a directory with one file for each database or home directory – tedious! And I thought, well, rather than spend 90 seconds clicking each link and telling Firefox where to put them, why don’t I just dash off a quick script to do it. That way it’ll just be automatic next time I think to back up my web hosting.

The thing is, whenever I take on a project – at least if I’m not being paid by the hour – I try to use it as an opportunity to learn something new. That’s where the three hours came in: learning. Mostly in the form of wrong turns and blind alleys.

For example, I tried scRUBYt!, the mystical screen scraper I keep wanting to use, but its support for HTTP basic authentication was only implemented using the Firewatir back end, which means firing up a copy of Firefox and steering it to web sites, which seemed like overkill, but on the other hand kind of cool, so I installed that, but I use multiple Firefox profiles, and Firewatir doesn’t want to wait around for me to pick one, and I don’t want to let the script take over my regular Firefox session, because dur, I’m using it, so then I added basic authentication to the Mechanize back end, but scRUBYt! still crashed, even, it turned out, on their sample scripts….
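
For reference, the Mechanize side of that is only a few lines on its own – roughly this sketch, written against a current Mechanize gem, with a made-up URL and credentials:

require 'mechanize'
 
agent = Mechanize.new
# Register HTTP basic auth credentials for the backup host.
agent.add_auth('http://backup.example.com/', 'username', 'password')
 
# Fetch the directory listing and print every link in it.
page = agent.get('http://backup.example.com/2009-01-31/')
page.links.each { |link| puts link.href }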

I also tried Rio, which looks like an elegant I/O abstraction – copying each file from the web server should have been as simple as source > dest – but again it doesn’t support HTTP authentication, and when I did the authentication myself and just handed Rio an IO stream, it crashed trying to parse the stream’s URI (?!), instead of just, you know, reading from the stream. And without the file copying, Rio just didn’t provide much benefit (for this project) over the standard library.
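
For the curious, the one-liner I was after is Rio's copy operator – something like this sketch (made-up URL again), which is exactly where the missing HTTP authentication support bites:

require 'rubygems'
require 'rio'
 
# Rio's '>' copy operator: read the remote file, write it to a local path.
rio('http://backup.example.com/2009-01-31/home.tar.gz') > rio('home.tar.gz')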

So in the end, the only new tool I used was Nokogiri for HTML parsing – which, for my current purposes, wasn’t different enough from Hpricot to feel like I was doing anything new. To copy the big files from the server, I ended up calling out to curl, because at that point I’d wasted too much time to bother looking up the elegant pure Ruby solution.
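
For the record, the elegant pure Ruby solution I couldn't be bothered to look up is probably just open-uri plus IO.copy_stream – a sketch, assuming a reasonably modern Ruby and a made-up URL and credentials:

require 'open-uri'
 
url  = 'http://backup.example.com/2009-01-31/home.tar.gz'
dest = 'home.tar.gz'
 
# open-uri handles the basic auth; IO.copy_stream writes the response body to disk.
URI.open(url, :http_basic_authentication => ['username', 'password']) do |remote|
  File.open(dest, 'wb') { |file| IO.copy_stream(remote, file) }
end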

Lives of the coders.

Anyway, here’s the script, if anyone wants it. Tested only on OS X (Leopard), but it’s pretty generic. Depends on the Nokogiri and Trollop (command line option parsing) gems.
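If you don't have them, a single gem install nokogiri trollop should take care of both.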

#!/usr/bin/ruby
 
require 'rubygems'
require 'open-uri'
require 'fileutils'
require 'nokogiri'
require 'trollop'
 
include FileUtils
 
opts = Trollop::options do
  opt :source, 'Source URL', :type => String, :required => true
  opt :username, 'Username', :type => String, :required => true
  opt :password, 'Password', :type => String, :required => true
  opt :destination, 'Destination folder', :default => "~/Backup/dreamhost"
end
 
# Normalize the source URL, pull the backup date out of its path, and build a dated destination folder.
opts[:source] += '/' unless opts[:source][-1] == ?/
source = URI.parse(opts[:source])
date = (source.path.match /[0-9]+-[0-9]+-[0-9]+/).to_s
destination = File::expand_path(date, opts[:destination])
mkdir_p destination
 
auth = {
  :http_basic_authentication => [opts[:username], opts[:password]]
}
 
uris = [source]
processed = []
 
while !uris.empty?
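  # Grab the next directory page off the stack and note that we've visited it.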
  uri = uris.pop
  processed << uri
 
  doc = Nokogiri::HTML.parse(uri.open(auth))
 
  doc.css('a').each do |link|
    href = link['href']
    abs = uri.merge(href)
    path = abs.to_s[source.to_s.length..-1]
    if href.match /\.(zip|gz)$/
      # Archive file: shell out to curl to fetch it into the backup folder.
      puts "Downloading #{path}..."
      destfile = File::join(destination, path)
      `curl "#{abs}" -u "#{opts[:username]}:#{opts[:password]}" -o "#{destfile}"`
    elsif href.match /^[a-zA-Z]/
      # Subdirectory: mirror it locally and queue its listing for a later pass.
      mkdir_p File::join(destination, path)
      uris.push abs unless processed.include? abs
    end
  end
end
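
To run it, the invocation looks something like this (the script name and URL here are made up – point --source at whatever your backup notification email links to):

ruby dreamhost_backup.rb --source http://backup.example.com/2009-01-31/ \
  --username myuser --password secret --destination ~/Backup/dreamhost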

Update: wget -r --user=USER --password=PASS -P DESTINATION_FOLDER SOURCE_URL

posted by erik at 9:48 am  
