How do I do statistical computation in the cloud?

Cloud Computing

Cloud resources provide the opportunity to do massive statistical calculations on virtual clusters hosted by organizations such as Amazon (EC2), Google (Google Compute Engine), and Rackspace (Rackspace Cloud).

The basic steps are: set up an account with the provider; start a virtual cluster with one or more nodes for computation and configure it with the software you need; run your computations; and return the results to your local machine.
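As a rough illustration of the last two steps (running a computation and returning the results), here is a minimal Python sketch that builds the standard `scp` and `ssh` commands for a node that is already running. It does not cover account setup or starting the cluster, which are provider-specific and handled in the tutorials. The host name, user name, file names, and the use of `Rscript` are all placeholder assumptions, and the function name `build_remote_run_commands` is hypothetical.

```python
import shlex


def build_remote_run_commands(host, user, script, result_file, interpreter="Rscript"):
    """Build commands to copy a script to a node, run it, and fetch the result.

    Assumes `script` is a bare file name in the current directory and that
    the script writes `result_file` into the remote home directory.
    """
    target = f"{user}@{host}"
    return [
        # 1. Copy the analysis script to the remote node's home directory.
        ["scp", script, f"{target}:{script}"],
        # 2. Run the script on the remote node.
        ["ssh", target, f"{interpreter} {shlex.quote(script)}"],
        # 3. Copy the result file back to the local working directory.
        ["scp", f"{target}:{result_file}", "."],
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the address and user of your own node.
    commands = build_remote_run_commands(
        "HOSTNAME", "ubuntu", "analysis.R", "results.csv"
    )
    # Print the commands rather than executing them (a dry run).
    for cmd in commands:
        print(shlex.join(cmd))
```

To actually execute the commands, each list could be passed to `subprocess.run(cmd, check=True)`. Printing them first makes it easy to check what would be sent to the node.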

We have prepared tutorials on using Amazon's EC2 and on using Google Compute Engine, including template code for starting and configuring virtual clusters and for running computations. You can also download shell scripts containing just the code (extracted from the tutorial PDFs) for starting a cluster on either EC2 or Google Compute Engine.

For general information on parallel programming, please see this SCF FAQ, which provides information on both distributed memory (multiple nodes) and shared memory (single node) parallel computation.
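To give a flavor of single-node parallel computation, here is a minimal Python sketch that splits a bootstrap of the sample mean across several worker processes on one machine. It is an illustration under stated assumptions and is not taken from the FAQ: the data, the function names `bootstrap_means` and `parallel_bootstrap`, and the choice of Python's standard `multiprocessing` module are all ours. Note that `multiprocessing` uses separate processes on a single node, not threads sharing memory.

```python
import random
import statistics
from multiprocessing import Pool


def bootstrap_means(args):
    """Compute bootstrap replicates of the mean for one worker.

    `args` is a (seed, data, n_reps) tuple; each worker gets its own seed
    so that the workers draw different resamples.
    """
    seed, data, n_reps = args
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_reps)]


def parallel_bootstrap(data, n_reps_per_worker, n_workers):
    """Run the bootstrap across n_workers processes on a single node."""
    tasks = [(seed, data, n_reps_per_worker) for seed in range(n_workers)]
    with Pool(processes=n_workers) as pool:
        chunks = pool.map(bootstrap_means, tasks)
    # Flatten the per-worker lists into one list of replicates.
    return [m for chunk in chunks for m in chunk]


if __name__ == "__main__":
    data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]  # made-up example data
    means = parallel_bootstrap(data, n_reps_per_worker=250, n_workers=4)
    # Number of replicates and the bootstrap standard error of the mean.
    print(len(means), statistics.stdev(means))
```

On a cloud node, `n_workers` would typically be set to the number of cores available. Spreading work across multiple nodes (distributed memory) requires different tools.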

If you're interested in Hadoop, it's possible to configure a virtual cluster with Hadoop. Please email us (consult [at] stat [dot] berkeley [dot] edu) if you'd like more information. We also have a small Hadoop cluster within the SCF, but it's not built to handle sizeable jobs.