MapReduce and Distributed Computing Using Spark

MapReduce, and the Hadoop framework that implements it, provide an approach for working with extremely large datasets distributed across a network of machines. Spark, developed by the AMPLab here at Berkeley, is a more recent implementation of these ideas that tries to keep data in the collective memory of the machines, which speeds up computation, particularly for iterative algorithms.
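To make the MapReduce idea concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not taken from the tutorial and does not use Spark or Hadoop; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are hypothetical and chosen only for illustration. In a real cluster the framework would run the map and reduce steps in parallel on different machines and handle the shuffle over the network.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group values by key (done across machines in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases, as a MapReduce framework would."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input; each string stands in for one chunk of a large dataset.
lines = [
    "spark keeps data in memory",
    "the data is split across machines",
    "spark is fast",
]
counts = word_count(lines)
```

In Spark's Python API the same pattern is typically written with the RDD methods `flatMap`, `map`, and `reduceByKey`, with Spark taking care of distributing the work.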

We have prepared a tutorial, including template code, on using Spark for data processing, simulation, and statistical model fitting. The tutorial includes information about starting up an Amazon virtual cluster using Spark's EC2 script.

The tutorial and template code are available on GitHub. The HTML of the tutorial is also available here. You can clone the repository from the Linux/Mac command line with:

git clone https://github.com/berkeley-scf/spark-workshop-2014

Here are screencasts of the first session and the second session of Chris Paciorek's workshop on Spark.

In addition, Spark is available on the campus-wide Savio cluster. Materials from a workshop given by Chris Paciorek, which covers the use of Spark on both AWS and Savio but in less detail than the materials above, are available on GitHub. The PDF of the tutorial is also available here. You can clone the repository from the Linux/Mac command line with:

git clone https://github.com/berkeley-scf/spark-cloudwg-2015