In the previous blogs, we've discussed how distributed multiprocessing architectures dominate the supercomputing arena. Smaller, somewhat less than "super," clusters continue to find practical applications. As we mentioned, Hadoop and Spark clusters are designed to map code to multiple computers so the code can be applied to a great deal of data distributed across many physical servers.
In a Beowulf-style cluster, the data is often minimal, and the code is distributed so that multiple CPUs can work on the problem in parallel. While such clusters are famous for their application to high-power math problems, they are also helpful in some simple but practical applications. Web scraping, for example, is very time-consuming; easy, immediate benefits are achieved from setting multiple machines in a cluster lose on web-scraping tasks.
Many amateur and homebrew clusters include Windows machines, Linux desktops, and a few Raspberry Pis, all working in a single cluster.
The term "Beowulf" has evolved to refer to an ecosystem of applications, not just the computer cluster itself. Cluster computing involves management tools, security, and job scheduling applications to be truly useful. We shall not review such tasks here but will focus on the distributed execution of Python code. Check out Learning Tree's Introduction to Python On-Demand Training to learn Python in your own schedule. For those interested in implementing an actual Beowulf cluster, the beowulf-python provides much functionality for management and security.
We shall focus on the minimum essentials: how to execute parallel code on distributed processors.
Message Passing Interface (MPI)
No law requires us to use MPI for a cluster, and alternative networking protocols exist for distributed processing. However, we will stick to MPI because it is trendy and has many resources to draw from. There are two freely available MPI implementations, MPICH and OpenMPI; we will illustrate MPICH. Any individual cluster, however, must commit to one or the other. Furthermore, MPICH and OpenMPI cannot be mixed in a single cluster.
Regardless of the implementation, some infrastructure must be in place to handle the network connections and security that MPI requires. For example, MPICH and OpenMPI work with the secure shell ssh. In addition, all the machines participating in a cluster will need their own set of keys so that individual "slave" machines can accept instructions from the "master" and then return results to the master.
The greatest challenge in programming for a distributed environment is managing how data is passed from one participating cluster node to another. However, we will ignore that complexity and focus on the simplest possible case, executing independent, self-standing code on different machines.
When you install an implementation of MPI, you will get some variant of mpiexec. (Some platforms continue to use the classical command mpirun.)This command-line program initiates the execution of code on one or more machines in a cluster. For our example, we will execute a Python program that prints the name of the machine it is running. That way, we can easily confirm that the code execution has been distributed.
Note that since our simple example does not need to communicate with code on other machines, the Python code does not need to reference the mpi4py library.
We could specify the machines on which we want our code to run right in the commandline itself. However, it is easier in the long run to use machinefiles, which are nothing more than lists specifying cluster nodes. A command line merely invokes the appropriate machinefile to specify the nodes. Here is an example machinefile.
Note that the machine can be specified by IP address or by machine name. In this example, Hypatia is the master, and 10.0.0.195 (Lovelace) is enslaved; this distinction is not made in the machinefile. If need be, we can specify multiple processes on each node, which would use more available processor cores.
This machinefile specifies running two processes in the first machine and three in the second. There is no space on either side of the colon. Here is the command line and output:
The -f parameter is the machinefile name, which in this example is in the same folder from which mpiexec is being run. The program commandline to run on multiple machines is python ~/mpistuff/mpi-001.py. It is critical to note that mpiexec does not copy the program code to the slave machine; the code must already be installed on the slave machine, and the full path of that code is included in the command.
As a historical aside, the machines on my network are all named after mathematicians. (Yes, I'm a geek. I admit it.) Hypatia was a famous mathematician in ancient Alexandria, and Ada Lovelace was the world's first computer programmer.
Setting up a distributed processing cluster is pretty straightforward. Fun, too. The most challenging part is ensuring that the cluster nodes all have the correct security credentials to communicate with each other.
This piece was originally posted on April 13, 2020, and has been refreshed with updated styling and links.